the enormous gap

here's something that doesn't get discussed enough: the gap between "this model generates text well" and "this model evaluates video well" is enormous. like, qualitatively-different-problem enormous.

text reward models work because language has structure. grammar, semantics, logic — there are rules, and violations of those rules are detectable. you can check if a summary is faithful to a source document by comparing entities and claims. you can check if code runs by executing it. the evaluation is often reducible to something verifiable.

video doesn't have that luxury. "good video" is a judgment that integrates physics, aesthetics, temporal dynamics, narrative logic, and prompt adherence simultaneously. there's no compiler you can run. there's no fact-checking database. the ground truth is: does this look right to a human who understands how the world works?

and that's the core challenge of building reward models for video AI. you're trying to distill human perceptual judgment — the kind that takes years of visual experience to develop — into a differentiable function.

the proxy metric trap

the current state of the art in video evaluation is honestly kind of embarrassing. here's what we've been using:

CLIP-based metrics. CLIPScore measures alignment between text and an image by embedding both into a shared space and computing cosine similarity; for video, this usually means averaging per-frame scores, which throws away temporal order entirely. it's fast, cheap, differentiable. it also misses basically everything that matters. CLIP can't tell you if a ball bounces correctly. it can't detect temporal flickering. it can tell you "this frame contains a cat and a piano" but not "the cat's paw is moving across the keys in a physically plausible way."
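to make that blindness concrete, here's a toy sketch (pure python, made-up embeddings — not real CLIP, and `clip_style_score` is a name i invented) showing that a frame-averaged similarity score is invariant to frame order by construction:

```python
import math

def cosine(a, b):
    """cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def clip_style_score(text_emb, frame_embs):
    """average per-frame alignment — order-invariant by construction."""
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)

# toy embeddings: one "prompt" and three "frames"
prompt = [1.0, 0.2, 0.0]
frames = [[0.9, 0.3, 0.1], [0.8, 0.4, 0.0], [1.0, 0.1, 0.2]]

# playing the video backwards doesn't change the score at all —
# the metric literally cannot see temporal structure
forward = clip_style_score(prompt, frames)
reversed_ = clip_style_score(prompt, frames[::-1])
assert abs(forward - reversed_) < 1e-12
```

a ball falling up and a ball falling down get identical scores. that's the whole problem in two assertions.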

FID/FVD. distributional metrics. they tell you whether your generated distribution matches the real distribution statistically. useful for benchmarking model architectures. useless for RL, which needs per-sample rewards. also: FVD was designed when generated videos were 16 frames at 64x64 resolution. applying it to 240-frame 1080p videos is... a stretch.
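the "distributional, not per-sample" point is easiest to see in the formula itself. here's the scalar (1-D gaussian) version of the Fréchet distance that FID/FVD computes over feature statistics — a sketch to show the shape of the metric, not a real FVD implementation (the real thing uses means and covariance matrices of deep features):

```python
import math

def frechet_distance_1d(mu1, var1, mu2, var2):
    """squared Fréchet distance between two 1-D gaussians —
    the scalar analogue of the FID/FVD formula
    d^2 = |mu1 - mu2|^2 + Tr(C1 + C2 - 2(C1 C2)^(1/2))."""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

# identical feature statistics -> distance 0, regardless of the samples.
# this is exactly why it can't serve as an RL reward: it scores the
# *distribution*, so two generators with matching statistics tie even
# if individual videos differ wildly in quality.
assert frechet_distance_1d(0.0, 1.0, 0.0, 1.0) == 0.0
assert frechet_distance_1d(0.0, 1.0, 1.0, 1.0) == 1.0
```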

VQA scores. Visual Question Answering models can be prompted w/ evaluation questions ("is there a cat in this video?" "is the cat touching the piano?"). clever, compositional, and surprisingly useful for semantic accuracy. but VQA models have their own failure modes, and they can't evaluate aesthetics or physics — only semantic content.

the problem with all of these is Goodhart's Law: when the metric becomes the target, it ceases to be a good metric. optimize CLIPScore and you get videos that are semantically recognizable but physically nonsensical. optimize FVD and you get videos that look like the training set (which might be what you want, or might be the opposite of creative generation). optimize VQA accuracy and you get videos that answer questions correctly but look terrible.

what expert evaluators actually look at

i've spent time watching how expert video evaluators — cinematographers, VFX artists, animation directors — assess generated video. their process is nothing like computing a CLIP score. it's closer to:

physics scan. first pass: does anything violate basic physics? does the ball bounce right? does hair flow naturally? do shadows match the light source? experts catch physics errors in milliseconds that automated metrics completely miss. a cloth draped over a chair that clips through the surface. water flowing slightly uphill. gravity that's 80% of what it should be — not obviously wrong, but felt as wrong.

temporal coherence check. second pass: scrub through the video. do objects maintain identity? does the texture of a wall stay consistent? does the color grade shift unexpectedly? temporal coherence is about consistency over time, and it requires watching the video multiple times at different speeds.

narrative logic. does the cut make sense? if someone throws a ball, does the next shot show it landing? does the camera motion tell a story? this is the highest-level evaluation, and it's where current metrics are most helpless. narrative logic requires understanding causality, intention, and visual storytelling.

the "uncanny" detector. the hardest thing to formalize. experts have an intuition for when something is almost right but not quite. faces that are 95% correct but feel wrong. motion that's smooth but somehow lifeless. this is the frontier — the stuff that separates current generators from truly convincing output.

the path forward

so how do we get from "CLIP score" to "expert-calibrated reward model"? i think the path has four stages:

stage 1: entity-aware evaluation. before you can judge quality, you need to verify that the right things are in the video. this is basically VQA on steroids — not just "is there a cat?" but "is this the same cat in every frame? does it have consistent markings? does it persist when occluded?" entity tracking + attribute consistency as a foundation for everything else.
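a minimal sketch of the persistence half of stage 1, assuming you already have per-frame detections from some detector (`entity_persistence` and the detection format here are hypothetical; a real system would use an actual tracker, not set membership):

```python
def entity_persistence(frames, entity, max_gap=2):
    """check that `entity` appears across the video, tolerating short
    occlusion gaps of at most `max_gap` consecutive frames.
    `frames` is a list of per-frame detection sets — a stand-in for
    real detector/tracker output."""
    gap = 0
    seen = False
    for dets in frames:
        if entity in dets:
            seen = True
            gap = 0
        else:
            gap += 1
            if seen and gap > max_gap:
                return False  # entity vanished for too long
    return seen and gap <= max_gap

frames = [{"cat", "piano"}, {"cat", "piano"}, {"piano"}, {"cat", "piano"}]
assert entity_persistence(frames, "cat")      # one-frame occlusion is fine
assert not entity_persistence(frames, "dog")  # never present at all
```

the real version also needs identity matching (same cat, same markings), but the skeleton — track, tolerate occlusion, fail on disappearance — is the foundation the later stages build on.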

stage 2: multi-rubric scoring. decompose "video quality" into independent rubrics: physics ($R_{\text{phys}}$), temporal coherence ($R_{\text{temp}}$), aesthetics ($R_{\text{aes}}$), semantic accuracy ($R_{\text{sem}}$), and narrative logic ($R_{\text{nar}}$). train a separate model for each rubric on targeted data. physics models train on simulation data + expert annotations of physical plausibility. aesthetic models train on photography datasets + professional evaluations. each model can be small and focused.
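the decomposition is easy to state as a data structure. everything below is illustrative naming, not a real API — the point is that each rubric is a separate scalar from a separate small model:

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    """one score per rubric, each in [0, 1], each produced by its own
    small, focused model (names are illustrative)."""
    physics: float      # R_phys: plausible dynamics, collisions, gravity
    temporal: float     # R_temp: identity/texture consistency over time
    aesthetics: float   # R_aes: composition, lighting, color grade
    semantic: float     # R_sem: prompt entities present and correct
    narrative: float    # R_nar: cuts and causality make sense

def naive_composite(s: RubricScores) -> float:
    """the baseline stage 3 improves on: a fixed uniform average."""
    return (s.physics + s.temporal + s.aesthetics
            + s.semantic + s.narrative) / 5

scores = RubricScores(0.9, 0.8, 0.6, 0.95, 0.7)
assert 0.0 <= naive_composite(scores) <= 1.0
```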

stage 3: learned composition. learn how to weight and combine the rubric scores into a holistic quality judgment. this isn't just $w_1 R_1 + w_2 R_2$ — the weights should be context-dependent. a simulation should weight physics heavily; a music video should weight aesthetics. a meta-model that takes the rubric scores + prompt context and outputs a calibrated composite score.
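a sketch of what context-dependent weighting looks like, with hand-set logits standing in for the meta-model (in the real thing, `context_logits` would be the output of a model reading the prompt):

```python
import math

def softmax(xs):
    """numerically stable softmax — turns logits into weights."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def contextual_composite(rubric_scores, context_logits):
    """weight rubric scores by context-dependent softmax weights —
    the same scores compose differently for different prompts."""
    weights = softmax(context_logits)
    return sum(w * r for w, r in zip(weights, rubric_scores))

# same rubric scores, two contexts, two different verdicts:
scores       = [0.9, 0.8, 0.4, 0.95, 0.7]  # [phys, temp, aes, sem, nar]
sim_logits   = [2.0, 1.0, 0.0, 1.0, 0.0]   # simulation: weight physics
music_logits = [0.0, 1.0, 2.0, 0.0, 1.0]   # music video: weight aesthetics

# strong physics + weak aesthetics scores well as a simulation,
# poorly as a music video — exactly the behavior a fixed w_i can't give you
assert contextual_composite(scores, sim_logits) > contextual_composite(scores, music_logits)
```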

stage 4: calibration on expert data. the final step: calibrate the composite model against actual expert preferences. this is where exclusive, high-quality annotation data becomes irreplaceable. you show the model pairs of videos, collect expert preferences, and fine-tune the composition to match. the rubric models provide structure; the expert data provides ground truth.
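the standard machinery for fitting a scorer to pairwise preferences is a Bradley-Terry model, the same loss used for text reward models. a sketch of the loss (just the objective — no training loop, no real data):

```python
import math

def bradley_terry_loss(score_preferred, score_rejected):
    """negative log-likelihood that the expert-preferred video wins
    under a Bradley-Terry model: p(A beats B) = sigmoid(s_A - s_B).
    minimizing this over expert preference pairs is what calibrates
    the composite score to match human judgment."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# a model that already ranks the preferred video higher pays less loss
assert bradley_terry_loss(0.9, 0.3) < bradley_terry_loss(0.3, 0.9)
# perfectly tied scores cost exactly log(2)
assert abs(bradley_terry_loss(0.5, 0.5) - math.log(2)) < 1e-12
```

note what the loss consumes: pairs plus a preference bit. that's why the expert annotation data is the irreplaceable ingredient — the loss function itself is three lines.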

connection to entity-aware evaluation

this might seem like a completely different problem from what i worked on at Microsoft w/ entity-aware summarization, but the core principle is the same: align AI output with ground truth by tracking the entities that matter.

in entitySum, the problem was: given a webpage, generate a summary that doesn't hallucinate — that only includes claims supported by the source. the solution: track entities (products, prices, features) and verify that every entity mention in the summary has a corresponding grounding in the source.
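the core check reduces to a set difference over extracted entities. a deliberately tiny sketch — real entity extraction needs NER and normalization, and `ungrounded_entities` is a name i made up, not the entitySum API:

```python
def ungrounded_entities(summary_entities, source_entities):
    """return summary entities with no grounding in the source —
    the candidate hallucinations. entities are passed in directly
    here; a real pipeline would extract them with NER first."""
    return sorted(set(summary_entities) - set(source_entities))

source      = {"WidgetPro", "$49.99", "bluetooth", "12-hour battery"}
summary_ok  = {"WidgetPro", "$49.99", "bluetooth"}
summary_bad = {"WidgetPro", "$39.99", "waterproof"}  # wrong price, invented feature

assert ungrounded_entities(summary_ok, source) == []
assert ungrounded_entities(summary_bad, source) == ["$39.99", "waterproof"]
```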

in video reward modeling, the analogous problem is: given a text prompt, does the generated video contain the right entities behaving in the right ways? "a golden retriever catching a frisbee in a park" — is there a golden retriever (not a labrador)? is there a frisbee (not a ball)? is it a park (not a beach)? is the dog catching the frisbee (not ignoring it)?

entity-aware evaluation is stage 1 — the foundation. you can't judge physics fidelity if you haven't first verified that the right objects are present and persistent. you can't judge narrative logic if you haven't tracked which entities interact with which. the entity graph is the scaffold on which all higher-level evaluation hangs.

the modality changed — text to video — but the principle didn't. ground your evaluation in verifiable entities, then build up to subjective quality judgments on top of that grounding.

why data is the real moat

there's a popular narrative that the moat in AI is the model architecture. i think that's wrong, especially for video RL. architectures converge. transformers, diffusion, flow matching — the generation side is becoming commoditized. everyone will have a good video generator within a year or two.

the moat is the reward model. and the moat around the reward model is the calibration data.

you can replicate a model architecture from a paper. you can re-implement a training loop from pseudocode. you can't replicate the thousands of hours of expert video evaluations that calibrate a reward model to actually see what humans see. that data is expensive, slow to collect, and requires domain expertise to annotate well.

this is why data partnerships matter more than ever. access to professional evaluators — cinematographers, VFX supervisors, animation directors — is the bottleneck. not compute, not algorithms. eyes.

the teams that build exclusive relationships w/ professional evaluation communities will have reward models that see things other models miss. and those reward models will compound: better evaluation → better generation → better samples to evaluate → even better evaluation.

that flywheel, once it spins up, is very hard to replicate. and it starts w/ data, not models.