the problem: evaluation beyond FID
video generation has gotten impressively good. Sora, Seedance, Kling, Runway Gen-3 — the raw generation quality is approaching usability. but there's a gap between "can produce impressive demos" and "can be systematically improved via RL." that gap is evaluation.
the standard metrics in the generative modeling world — FID (Fréchet Inception Distance) for images, FVD (Fréchet Video Distance) for video — measure distributional similarity between generated and real data. they answer the question: "does this batch of generated videos look statistically similar to real videos?" useful for benchmarking, terrible for RL.
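to make the distributional framing concrete, here's the computation at the heart of both metrics: fit a Gaussian to feature embeddings of real and generated clips, then take the Fréchet distance $\|\mu_1 - \mu_2\|^2 + \text{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2})$. a minimal sketch (the feature arrays stand in for Inception features for FID, or video-model features for FVD):

```python
# Fréchet distance between Gaussian fits of two feature sets, the core
# of FID/FVD. note it scores a *batch*, never an individual sample.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """feats_real, feats_gen: (N, D) arrays of clip embeddings."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can introduce tiny imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2 * covmean))
```

a single video contributes one row to the feature matrix and then vanishes into the mean and covariance. that's exactly the information loss that makes these metrics unusable as RL rewards.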
RL needs a per-sample reward signal. not "does this distribution look right?" but "is this specific video good?" and "good" itself is multi-dimensional. a video can be physically plausible but aesthetically boring. it can be beautiful but violate basic physics. it can match the prompt perfectly but have jarring temporal artifacts.
to apply RL to video generation, we need reward models that capture what humans actually care about. and building those models is — i think — the central technical challenge of the next few years.
challenge 1: reward model design
what makes a "good" video? ask ten experts and you'll get ten different decompositions. but roughly, there are a few orthogonal axes:
physics fidelity. does the ball bounce correctly? does water flow downhill? does cloth drape naturally? this is partially checkable against learned physics simulators, but the space of physical interactions in open-domain video is enormous.
temporal coherence. do objects persist across frames? does the lighting stay consistent? does the camera motion feel natural? temporal artifacts — flickering textures, morphing objects, sudden jumps — are the most common failure mode of current generators.
aesthetic quality. composition, color grading, visual appeal. this is inherently subjective but not random — there's high inter-rater agreement on extreme cases and meaningful disagreement in the middle.
semantic accuracy. does the video match the text prompt? "a cat playing piano in a jazz bar" should show a cat, a piano, and a jazz bar, with the cat actually interacting w/ the instrument.
the naive approach is to train a single reward model that outputs a scalar. but collapsing these axes into one number loses critical information. if your model scores 0.7, is that because physics is perfect but aesthetics are mediocre? or because everything is okay but nothing is great?
formally, we want a reward function $R: \mathcal{V} \times \mathcal{T} \rightarrow \mathbb{R}^k$ where $\mathcal{V}$ is the video space, $\mathcal{T}$ is the text prompt space, and $k$ is the number of quality dimensions. the RL objective then becomes:
$$\max_\theta \mathbb{E}_{v \sim \pi_\theta(\cdot | t)} \left[ \sum_{i=1}^{k} w_i \cdot R_i(v, t) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]$$
where $w_i$ are weighting coefficients across quality dimensions and $\beta$ controls the KL penalty against a reference policy $\pi_{\text{ref}}$ (to prevent reward hacking). the challenge: the $w_i$ values are themselves context-dependent. a physics simulation video should weight physics fidelity highly; an artistic music video should weight aesthetics more.
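one way to make the weights context-dependent is to select a weight profile per prompt category before composing the per-dimension scores. a minimal sketch, where the profiles and category names are illustrative assumptions, not a proposed taxonomy:

```python
# context-dependent composition of multi-dimensional rewards: the
# weights w_i are chosen per prompt category rather than fixed globally.
# the profile values below are made up for illustration.
WEIGHT_PROFILES = {
    "simulation": {"physics": 0.5, "temporal": 0.3, "aesthetic": 0.1, "semantic": 0.1},
    "artistic":   {"physics": 0.1, "temporal": 0.2, "aesthetic": 0.5, "semantic": 0.2},
    "default":    {"physics": 0.25, "temporal": 0.25, "aesthetic": 0.25, "semantic": 0.25},
}

def composite_reward(dim_scores, prompt_category="default"):
    """dim_scores: dict mapping dimension name -> scalar reward R_i(v, t)."""
    w = WEIGHT_PROFILES.get(prompt_category, WEIGHT_PROFILES["default"])
    return sum(w[d] * dim_scores[d] for d in dim_scores)
```

in practice the routing from prompt to profile would itself be learned, but even a crude keyword router beats a single global weighting.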
challenge 2: credit assignment over long sequences
credit assignment is RL's oldest problem. in video, it's especially nasty.
a 10-second video at 24fps is 240 frames. if the video is "bad," where did the model go wrong? maybe it's frame 47, where an object starts morphing. maybe it's the global planning stage, where the model decided on a camera trajectory that makes the scene incoherent. maybe it's nowhere specific — the overall motion dynamics just feel slightly off.
in text RL (like RLHF for chat), the sequence length is typically dozens to hundreds of tokens, and there's a natural structure (sentences, paragraphs) that helps localize quality. video has no such structure. quality is a continuous, spatiotemporal property.
one approach: hierarchical credit assignment. decompose the video into temporal segments and spatial regions, evaluate each independently, then compose:
$$R(v) = \sum_{s=1}^{S} \gamma^s \left[ R_{\text{segment}}(v_s) + \lambda \sum_{r \in \text{regions}(s)} R_{\text{local}}(v_s^r) \right]$$
where $v_s$ is the $s$-th temporal segment and $v_s^r$ is a spatial region within that segment. this gives you a richer signal than a single scalar, but it requires reward models that can operate at multiple granularities. training such models is itself a significant challenge — you need annotations at the segment and region level, not just clip-level.
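the decomposition above, written out as code. `segment_rm` and `local_rm` stand in for segment- and region-level reward models (assumed interfaces), and a video is represented as a list of temporal segments, each a list of spatial regions:

```python
# hierarchical credit assignment: discounted sum over temporal segments,
# each combining a segment-level score with lambda-weighted local
# (spatial-region) scores, matching the equation above.
def hierarchical_reward(video, segment_rm, local_rm, gamma=0.95, lam=0.5):
    total = 0.0
    for s, segment in enumerate(video, start=1):
        seg_score = segment_rm(segment)
        local_score = sum(local_rm(region) for region in segment)
        total += gamma ** s * (seg_score + lam * local_score)
    return total
```

the useful byproduct is diagnostic: the per-segment and per-region terms tell you *where* a low score came from, not just that the clip is bad.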
challenge 3: training instability and reward hacking
reward hacking in video is catastrophic in ways that text reward hacking isn't.
in language, reward hacking typically produces text that's verbose, sycophantic, or superficially impressive but vacuous. annoying, but the output is still text. you can read it and see the problem.
in video, reward hacking can produce:
- mode collapse to static frames. the model learns that a single beautiful frame, held for 10 seconds, scores well on aesthetics and semantic accuracy. technically matches the prompt, technically looks good. completely useless as video.
- texture smoothing. the model discovers that blurring fine details reduces temporal flickering artifacts. the reward model, which penalizes flickering, rewards this. result: everything looks like a watercolor painting.
- physics shortcuts. instead of learning realistic physics, the model learns to avoid situations where physics matters. prompt says "ball bouncing" — the model generates the ball at rest. no bouncing = no physics errors = high physics score.
the KL penalty helps but isn't sufficient. you need multi-faceted reward models that can't all be gamed simultaneously, plus adversarial probing to identify failure modes before they compound during training.
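one concrete example of a guard that can't be gamed jointly with aesthetics: a cheap motion-energy check that zeroes out the reward for near-static clips, blocking the static-frame hack directly. a sketch, with an illustrative (untuned) threshold:

```python
# cheap guard against static-frame mode collapse: measure motion energy
# as the mean absolute frame-to-frame difference and gate the reward on
# a minimum threshold. min_motion here is illustrative, not tuned.
import numpy as np

def motion_energy(frames):
    """frames: (T, H, W, C) array with values in [0, 1]."""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return float(diffs.mean())

def gated_reward(frames, base_reward, min_motion=0.01):
    # a single beautiful frame held for 10 seconds gets zero, no matter
    # how well it scores on aesthetics and semantic accuracy
    return base_reward if motion_energy(frames) >= min_motion else 0.0
```

hard gates like this are blunt, but stacking several of them (motion, sharpness, object persistence) shrinks the space of jointly-gameable rewards.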
training instability is also worse because the generation process is expensive. each sample from a video diffusion model takes orders of magnitude more compute than a text sample. you can't just "run more rollouts" the way you can in language RL. every training step is precious, which means instability is costly.
challenge 4: the data flywheel cold start
RLHF for text had a crucial advantage: human annotations are cheap and fast. anyone can read two chatbot responses and say which one they prefer. the preference data flywheel spins up quickly.
video evaluation is different:
- expert annotations are expensive. evaluating physics fidelity requires domain expertise. evaluating cinematic quality requires trained eyes. you can't crowdsource "does this ball bounce correctly?" to random annotators — the inter-rater reliability is too low.
- evaluation takes time. you need to watch the full clip. for temporal coherence, you often need to watch multiple times. a single annotation might take 2-5 minutes vs. seconds for text.
- disagreement is structured, not random. experts disagree on aesthetics but agree on physics. novices agree on aesthetics but miss physics errors. collapsing across annotators loses information; modeling annotator expertise adds complexity.
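the simplest way to use that structure instead of collapsing it: weight each annotator's label by a per-dimension expertise score. a sketch, with made-up expertise values:

```python
# expertise-weighted label aggregation: annotators get different weights
# per quality dimension, so a physics expert dominates physics labels
# while contributing normally on aesthetics.
def aggregate_labels(labels, expertise, dim):
    """labels: {annotator: score}; expertise: {annotator: {dim: weight}}."""
    num = sum(expertise[a][dim] * s for a, s in labels.items())
    den = sum(expertise[a][dim] for a in labels)
    return num / den
```

estimating the expertise weights is the hard part; agreement with held-out expert labels per dimension is one workable proxy.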
the cold start problem: you need a good reward model to train the generator, but you need a decent generator to produce samples worth annotating, and you need annotations to train the reward model.
breaking this cycle probably requires synthetic data — using existing video understanding models (VideoMAE, InternVideo, etc.) as weak labelers to bootstrap the reward model, then refining w/ targeted expert annotations on the hardest cases.
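a sketch of that bootstrap loop: average several weak labelers into pseudo-labels, and spend the expert annotation budget on the clips where the weak labelers disagree most. the labeler functions are assumed stand-ins for models like the ones named above:

```python
# weak-labeler bootstrapping: cheap pseudo-labels from an ensemble, with
# disagreement-based routing of the hardest clips to human experts.
def pseudo_label(clip, weak_labelers):
    scores = [labeler(clip) for labeler in weak_labelers]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var  # var is the disagreement signal

def route_for_annotation(clips, weak_labelers, budget):
    """return indices of the `budget` clips with highest disagreement."""
    scored = [(pseudo_label(c, weak_labelers)[1], i) for i, c in enumerate(clips)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:budget]]
```

the pseudo-labels train the first reward model; the routed expert annotations then correct it exactly where the weak ensemble is least trustworthy.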
a GRPO-style training loop for video
Group Relative Policy Optimization (GRPO) offers a promising framework for video RL because it avoids learning a separate value function — which would be extremely expensive to train for video. here's a sketch of what a video-adapted version might look like:
```python
# GRPO-style training for video generation. a sketch: `generator`,
# `reward_models`, and `ref_generator` are assumed interfaces, not a
# specific library; `w` maps reward dimension -> weight.
import torch

def train_step(prompts, generator, reward_models, ref_generator, w,
               G=16, beta=0.04, eps=1e-8, eps_clip=0.2):
    for prompt in prompts:
        # 1. sample a group of candidate videos
        videos = [generator.sample(prompt) for _ in range(G)]

        # 2. score each video across reward dimensions
        rewards = {}
        for dim, rm in reward_models.items():
            scores = torch.tensor([rm.score(v, prompt) for v in videos])
            # group-relative normalization
            rewards[dim] = (scores - scores.mean()) / (scores.std() + eps)

        # 3. composite reward w/ dimension weights
        composite = sum(w[dim] * rewards[dim] for dim in rewards)

        # 4. advantages relative to group (already normalized)
        advantages = composite

        # 5. policy gradient update w/ KL constraint
        for video, adv in zip(videos, advantages):
            log_prob = generator.log_prob(video, prompt)
            ref_log_prob = ref_generator.log_prob(video, prompt)
            kl = log_prob - ref_log_prob
            # clipped objective (PPO-style within GRPO); with a single
            # update per rollout, the "old" policy is the current one
            # detached, so the ratio is 1 in value but carries gradient
            ratio = torch.exp(log_prob - log_prob.detach())
            loss = -torch.min(
                ratio * adv,
                torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv,
            ) + beta * kl
            loss.backward()
```
the key ideas: (1) group-relative normalization removes the need for a value baseline, (2) multi-dimensional rewards are composed with learned or scheduled weights, (3) KL penalty against the reference generator prevents reward hacking, and (4) the clipped objective stabilizes training.
the main practical issue: $G$ (group size) needs to be large enough for stable normalization, but each sample is a full video generation pass. at current costs, $G = 16$ might be the practical limit, which is much smaller than what's typical in text GRPO ($G = 64$ or more). this makes the gradient estimates noisier and training less stable.
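the noise claim is easy to check numerically: the standard error of the group-mean baseline scales as $1/\sqrt{G}$, so $G = 16$ is roughly twice as noisy as $G = 64$. a quick monte carlo over standard-normal rewards:

```python
# standard error of the group-mean baseline as a function of group size.
# for i.i.d. unit-variance rewards this should be ~1/sqrt(G).
import random

def baseline_se(G, trials=20000, seed=0):
    rng = random.Random(seed)
    means = [sum(rng.gauss(0, 1) for _ in range(G)) / G for _ in range(trials)]
    mu = sum(means) / trials
    return (sum((m - mu) ** 2 for m in means) / trials) ** 0.5

# baseline_se(16) ≈ 0.25, baseline_se(64) ≈ 0.125
```

every advantage estimate inherits this baseline noise, which is why shrinking $G$ for cost reasons directly degrades training stability.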
what's actually working
despite these challenges, there's real progress:
DPO for video preference alignment. Direct Preference Optimization sidesteps explicit reward modeling entirely. given pairs of videos (one preferred, one not), DPO directly optimizes the generator to produce preferred outputs. the advantage: no reward model to hack. the limitation: you still need preference pairs, and generating them is expensive.
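the DPO objective for a single preferred/rejected pair, written out. the log-probabilities are assumed to come from the generator being trained and a frozen reference copy:

```python
# DPO loss for one preference pair:
# -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
# where w = preferred video, l = rejected video.
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

when the policy matches the reference the margin is zero and the loss sits at log 2; pushing probability mass toward the preferred video drives it down. no reward model appears anywhere, which is the whole point.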
hierarchical reward decomposition. several groups are training separate reward models for physics, aesthetics, and temporal coherence, then combining scores. this works better than monolithic scoring because each sub-model can be trained on targeted data. physics reward models can leverage simulation data; aesthetic models can leverage photography datasets.
synthetic reward bootstrapping. using CLIP scores, VQA models, and optical flow analysis as cheap proxy rewards to bootstrap training, then fine-tuning w/ human preferences. the proxy rewards are imperfect, but they provide enough signal to get the generator into a regime where human annotation is efficient.
temporal attention probing. using attention maps from the generation process itself as a diagnostic tool. if the model isn't attending to the right regions at the right times, that's a signal for the reward model — even without explicit supervision.
the honest summary: we're at the "it works in controlled settings but breaks in the wild" stage. the theoretical framework is solidifying — GRPO, DPO, multi-dimensional rewards, hierarchical decomposition. the engineering challenge is making it work at scale, with noisy data, on open-domain prompts. that's where the next breakthroughs will come from.