the old world: policy in, scalar out

for the last decade, most RL research has followed the same template. you have an agent, it has a policy, and it interacts with an environment that hands back a scalar reward. the agent's job is to maximize cumulative reward. clean, elegant, mathematically tractable. also increasingly disconnected from what we actually want AI systems to do.

the classic setup — think Atari, MuJoCo, Go — assumes a fixed, fully specified reward function. the environment tells you exactly how well you did. +1 for winning, -1 for losing, maybe some shaped intermediate rewards if you're feeling generous. the agent doesn't need to understand anything. it just needs to find the policy that collects the most points.
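that loop is small enough to write out. here's a minimal sketch with a made-up two-state environment and tabular Q-learning — nothing below comes from a real benchmark, it just shows the shape: policy in, scalar out, no reasoning anywhere.

```python
import random

# toy environment: two states, two actions, scalar reward from the env.
# action 1 in state 0 moves to state 1; action 1 in state 1 wins (+1, episode ends).
def step(state, action):
    if state == 0:
        return (1, 0.0, False) if action == 1 else (0, 0.0, False)
    return (0, 1.0, True) if action == 1 else (1, 0.0, False)

# tabular Q-learning: the "agent" is nothing but a table of action values
Q = [[0.0, 0.0], [0.0, 0.0]]
alpha, gamma, eps = 0.5, 0.9, 0.1

random.seed(0)
for episode in range(200):
    state, done, t = 0, False, 0
    while not done and t < 20:
        # epsilon-greedy action selection
        if random.random() < eps:
            action = random.randrange(2)
        else:
            action = max((0, 1), key=lambda a: Q[state][a])
        nxt, reward, done = step(state, action)
        # standard Q-learning update toward the bootstrapped target
        target = reward + (0.0 if done else gamma * max(Q[nxt]))
        Q[state][action] += alpha * (target - Q[state][action])
        state, t = nxt, t + 1

# the learned greedy policy: take action 1 in both states
print([max((0, 1), key=lambda a: Q[s][a]) for s in (0, 1)])
```

the agent never "understands" the environment; it just shifts numbers in a table until the greedy policy collects the most reward.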

this paradigm produced incredible results. AlphaGo, AlphaStar, OpenAI Five. but here's the thing: none of these agents reason. they don't plan in the way you and i plan. they don't try something, realize it's not working, switch tools, and try again. they're reactive. brilliant, superhuman pattern matchers — but reactive.

what's different now

agentic RL is a fundamentally different proposition. the agent isn't just a policy network outputting actions. it's a system that reasons about its situation, decomposes problems into steps, uses external tools, evaluates its own progress, and retries when things go wrong. the "agent" in agentic RL is closer to what you'd recognize as intelligent behavior.

what changed? LLMs. specifically, LLMs gave us something RL never had before: a general-purpose reasoning backbone. pre-LLM, if you wanted an RL agent that could plan and reason, you had to hand-engineer the planning module, the state representation, the action space. now the LLM handles all of that. it can read instructions, write code, call APIs, interpret results, and decide what to do next — all in natural language.

RL then provides the optimization loop. the LLM gives you the capacity to act intelligently; RL gives you the pressure to act well. without RL, you have a capable but aimless system. it can reason but doesn't know what to optimize for. without the LLM, you have a well-optimized but narrow system. it knows its objective but can't generalize.
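the reason/act/evaluate/retry loop described above can be sketched with stubs standing in for the LLM and the tools. every name here is invented for illustration — the point is the control flow, not any real API:

```python
def propose_fix(attempt):
    # stand-in for an LLM reasoning step: pick a candidate plan given history
    candidates = ["reverse", "sort", "dedupe"]
    return candidates[attempt % len(candidates)]

def run_tool(name, data):
    # stand-in for external tools the agent can call
    tools = {"reverse": lambda xs: xs[::-1],
             "sort": sorted,
             "dedupe": lambda xs: list(dict.fromkeys(xs))}
    return tools[name](data)

def agent(data, goal, max_tries=5):
    for attempt in range(max_tries):
        plan = propose_fix(attempt)       # reason: choose a next step
        result = run_tool(plan, data)     # act: call a tool
        if goal(result):                  # evaluate its own progress
            return plan, result
        # otherwise: realize it's not working, switch tools, try again
    return None, None

plan, result = agent([3, 1, 2, 1], goal=lambda xs: xs == sorted(xs))
print(plan, result)  # → sort [1, 1, 2, 3]
```

a real agentic system replaces the stubs with an LLM and actual tools, but the structure — propose, act, check, retry — is exactly this.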

the combination is something new. not just "RL with better function approximation." a qualitatively different kind of agent.

reward is no longer a number from the environment

here's the key insight, and it's the one i think most people underestimate: in agentic RL, reward is no longer a scalar handed to you by the environment. it's a learned judgment.

RLHF was the first big demonstration of this. instead of coding up a reward function, you collect pairwise comparisons, train a reward model to predict which of two outputs a human would prefer, and use its scores as the training signal for the policy. the reward is subjective, contextual, and impossible to specify by hand.
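concretely, the standard objective for such a reward model is a Bradley-Terry style pairwise loss: push the score of the preferred output above the rejected one. a minimal numeric sketch, with the scores standing in for reward-model outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise objective: -log sigmoid(r_chosen - r_rejected).
    # minimized when the reward model scores the human-preferred output higher.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# the loss shrinks as the margin between chosen and rejected grows
print(round(preference_loss(2.0, 0.0), 4))  # small: model agrees with the human
print(round(preference_loss(0.0, 2.0), 4))  # large: model disagrees
```

training on many such comparisons turns subjective human judgment into a differentiable signal the policy can be optimized against.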

RLAIF takes it further — use an AI model itself as the judge. constitutional AI, self-play evaluation, automated red-teaming. the reward signal becomes a conversation between models rather than a lookup table in the environment.

this shift matters enormously for domains where quality is subjective. writing, design, conversation — and video generation. you can't write a mathematical function that captures "does this video look physically plausible while being aesthetically pleasing and narratively coherent." but you can train a model to judge it. maybe.

the video AI connection

video generation is where agentic RL gets really interesting — and really hard. models like Sora and Seedance can generate impressive clips, but evaluating and improving them is an open problem.

the quality of a generated video isn't one thing. it's physics fidelity (does the ball bounce right?), temporal coherence (does the scene flow smoothly?), aesthetic quality (is it beautiful?), and semantic accuracy (does it match the prompt?). these axes are often orthogonal. a video can be physically accurate but ugly, or gorgeous but physically nonsensical.

traditional metrics like FID and FVD capture distributional similarity to real video but miss most of what humans actually care about. you need a learned reward model that understands physics, aesthetics, and narrative simultaneously. and you need an RL loop that can use that reward signal to actually improve the generator.
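one way to see the problem: a learned video reward is more naturally a vector than a scalar. a toy sketch with invented axes, scores, and weights (not any real model's output) showing what a weighted collapse keeps and what it throws away:

```python
# hypothetical per-axis scores from learned judges, each in [0, 1]
scores = {"physics": 0.9, "temporal": 0.8, "aesthetics": 0.4, "semantics": 0.7}

def scalarize(scores, weights):
    # weighted sum: simple to optimize against, but hides which axis failed
    return sum(weights[k] * scores[k] for k in scores)

weights = {"physics": 0.4, "temporal": 0.2, "aesthetics": 0.2, "semantics": 0.2}
reward = scalarize(scores, weights)

# the vector keeps the diagnostic information the scalar throws away:
# the collapsed reward looks decent while one axis is clearly weak
worst_axis = min(scores, key=scores.get)
print(round(reward, 2), worst_axis)
```

the scalar here reads as "pretty good video" while the vector tells you exactly where to improve — that tension between optimizability and information is the core of the multi-dimensional reward problem discussed below.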

this is the frontier. not just generating video, but building the evaluation and optimization infrastructure that makes video generation systematically improvable. it's the same paradigm shift that RLHF brought to language — applied to a modality that's orders of magnitude more complex.

why this is genuinely hard

i don't want to undersell the difficulty. agentic RL for video faces challenges that don't exist in text:

sparse rewards over long horizons. a 10-second video at 24fps is 240 frames. the agent needs to make good decisions at every frame, but you might only get meaningful feedback on the final result. credit assignment over 240 steps is brutal.

multi-dimensional quality. the reward isn't a single number. it's a vector across physics, aesthetics, coherence, and more. collapsing it to a scalar loses critical information. keeping it as a vector makes optimization much harder.

reward hacking. in text, reward hacking produces verbose, sycophantic outputs. annoying but recoverable. in video, reward hacking can produce mode collapse — beautiful but completely static frames, or perfectly smooth but semantically empty motion. catastrophic in ways that are hard to detect automatically.

data scarcity. expert video evaluations are expensive. crowdsourced annotations disagree wildly. the data flywheel that powered RLHF for text doesn't spin up easily for video.
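the sparse-reward arithmetic from the first challenge is easy to make concrete. assuming a single reward at the final frame and a standard per-frame discount factor (the numbers are illustrative, not from any real system):

```python
# one scalar reward at the end of a 240-frame episode, discounted back per frame
T, gamma, terminal_reward = 240, 0.99, 1.0

# the discounted return seen from frame t is gamma^(T - t) * terminal_reward
signal_at_first_frame = gamma ** T * terminal_reward
signal_at_last_frame = gamma ** 0 * terminal_reward

print(round(signal_at_first_frame, 4))  # ~0.09: most of the signal has decayed away
print(round(signal_at_last_frame, 4))   # 1.0 at the frame that got the reward
```

with gamma = 0.99, over 90% of the learning signal has decayed before it reaches the first frame — and that's before accounting for variance in the reward model itself. this is the "brutal" credit assignment in concrete terms.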

why i work on this

i've spent years at the intersection of RL, NLP, and production systems at Microsoft. what draws me to agentic RL for video isn't just the technical challenge — it's that it sits at the exact convergence of everything i care about.

my work on entity-aware summarization was fundamentally about alignment: making sure AI outputs are faithful to ground truth. the same principle applies here, just in a harder modality. when a video generator shows a ball falling upward, that's an alignment failure. when a reward model doesn't notice, that's a deeper one.

i think the next few years will be defined by whoever figures out the reward modeling problem for video. not the generation architecture — transformers and diffusion models are converging on good solutions there. the bottleneck is evaluation. how do you build a system that knows what good looks like?

that's the question. and i think agentic RL — agents that can reason about quality, decompose evaluation into rubrics, and learn from expert feedback — is how we get there.