PhyGround

Benchmarking Physical Reasoning in Generative World Models

1Northeastern University, 2Tulane University, 3University of Washington, 4Carnegie Mellon University
Overview of PhyGround

Overview of PhyGround. PhyGround decomposes each video model's holistic physical reasoning score into scores for 13 physical laws. We recruited 459 annotators to conduct a large-scale, quality-controlled human study. Based on these human annotations, we released PhyJudge-9B, a fine-tuned judge model that supports reproducible automated evaluation.

Abstract

Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges: coarse evaluation frameworks that hide law-specific failures; response biases and annotator fatigue that undermine the validity of human judgments; and automated evaluators that are insufficiently physics-aware or difficult to audit. To address these challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study grounded in social-science laboratory-experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's ρ > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page.

Human Evaluation Leaderboard

Eight modern video generation models, scored by 459 human annotators on the full 250-prompt benchmark.

#   Model              General quality        Physical reasoning by domain   Overall
                       SA    PTV   Persist.   Solid-Body  Fluid  Optical
1   Wan2.2-27B-A14B    3.10  3.37  3.50       3.23        3.18   3.55       3.28
2   Veo-3.1 (closed)   3.26  3.29  3.42       3.12        3.65   3.69       3.28
3   OmniWeaving        2.97  3.17  3.34       2.98        3.26   3.22       3.10
4   Cosmos-14B         2.66  2.98  3.20       2.81        2.98   3.38       2.91
5   LTX-2.3-22B        2.58  2.74  2.72       2.62        3.07   2.98       2.69
6   Wan2.2-TI2V-5B     2.44  2.68  2.78       2.58        2.71   2.99       2.63
7   Cosmos-2B          2.33  2.56  2.77       2.53        2.77   3.35       2.58
8   LTX-2-19B          2.56  2.64  2.48       2.46        3.04   2.78       2.56

Scores: 1–5 scale (higher is better). 250 prompts. ■ best   ■ second.

Method

Benchmark design

PhyGround consists of 250 curated prompts, each paired with an expected physical outcome and a taxonomy of 13 physical laws spanning solid-body mechanics, fluid dynamics, and optics. Every law is operationalized through observable sub-questions, which lets us decompose a model's holistic physical reasoning score into per-law diagnostics rather than collapsing failures into a single number.
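The per-law decomposition can be sketched as follows. This is an illustrative aggregation only: the record layout, the sub-question scores, and the law-to-domain mapping shown here are hypothetical, not the released PhyGround schema. Each law's score is the mean over all of its sub-question labels, and a domain score averages the laws it contains.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotation records: (law, [sub-question scores on a 1-5 scale])
annotations = [
    ("collision", [4, 5, 4]),
    ("collision", [3, 4, 4]),
    ("gravity",   [5, 5, 4]),
    ("shadow",    [2, 3, 2]),
]

# Illustrative law -> domain mapping (a subset of the 13-law taxonomy)
DOMAIN = {"collision": "solid-body", "gravity": "solid-body", "shadow": "optics"}

def per_law_scores(records):
    """Average all sub-question labels within each physical law."""
    by_law = defaultdict(list)
    for law, subs in records:
        by_law[law].extend(subs)
    return {law: mean(scores) for law, scores in by_law.items()}

def per_domain_scores(law_scores):
    """Average law-level scores within each physical domain."""
    by_domain = defaultdict(list)
    for law, score in law_scores.items():
        by_domain[DOMAIN[law]].append(score)
    return {dom: mean(vals) for dom, vals in by_domain.items()}

laws = per_law_scores(annotations)        # e.g. laws["collision"] == 4.0
domains = per_domain_scores(laws)
```

Keeping the law-level means around (rather than only the domain or overall averages) is what makes failures on a single law, such as shadow consistency, visible in the diagnostics.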

The human evaluation is grounded in social-science lab-experiment design: 459 annotators contributed 5,796 complete annotations and 37.4K fine-grained labels under quality control. The retained annotations exhibit high split-half model-ranking correlations (Spearman's ρ > 0.90), giving us a stable ground truth to train and audit automated judges.
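The split-half check can be sketched as below. This is a minimal stdlib-only version under assumed inputs (each model's retained annotation scores as a list); the exact splitting protocol used in the study may differ. Each model's annotations are randomly split in half, per-model means are computed in each half, and Spearman's ρ is the Pearson correlation of the two rank vectors.

```python
import random
from statistics import mean

def rank(values):
    """Ranks with ties averaged, as required for Spearman's rho."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def split_half_rho(scores_by_model, rng):
    """Split each model's annotations in half; correlate the two rankings."""
    half_a, half_b = [], []
    for model_scores in scores_by_model:
        shuffled = model_scores[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(mean(shuffled[:mid]))
        half_b.append(mean(shuffled[mid:]))
    return spearman(half_a, half_b)
```

With the PhyGround annotations, `scores_by_model` would hold each model's retained per-annotation scores; a ρ above 0.90 across random splits indicates that the model ranking is stable rather than an artifact of which annotators were sampled.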

PhyJudge-9B

To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%), making it usable as a drop-in scorer for new video models.
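The page does not spell out the exact formula for aggregate relative bias, so the sketch below is one plausible reading: the mean, over models, of the absolute judge-versus-human score gap relative to the human score. The per-model numbers are hypothetical placeholders.

```python
from statistics import mean

def aggregate_relative_bias(judge_scores, human_scores):
    """Mean of |judge - human| / human across models (assumed definition)."""
    return mean(abs(j - h) / h for j, h in zip(judge_scores, human_scores))

# Hypothetical per-model overall scores on the 1-5 human scale
human = [3.28, 3.28, 3.10, 2.91]
judge = [3.35, 3.20, 3.05, 3.00]

bias = aggregate_relative_bias(judge, human)  # a small fraction, e.g. ~2-3%
```

Under this reading, a 3.3% aggregate relative bias means the judge's per-model scores sit, on average, within a few percent of the human scores, which is what makes it trustworthy as an automated stand-in.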

Sample videos by physical law

For each of the 13 laws in the PhyGround taxonomy, we sample one prompt and show two of the eight evaluated models side by side.

Collision

“Arrows are shot at a moving target, with some arrows hitting and some missing visibly.”

OmniWeaving
Cosmos-14B

Momentum

“A steel ball bearing rolls across a flat surface and collides with another ball bearing, both ceasing motion.”

Wan2.2-27B-A14B
Cosmos-2B

Gravity

“A player checks another player, sending them sprawling onto the ice.”

LTX-2.3-22B
Cosmos-14B

Impenetrability

“A black balloon is sitting on a wooden table next to a small rotating platform with a lit matchstick taped to it. The match rotates clockwise and touches the balloon. The balloon immediately pops upon contact with the flame. Static shot with no camera movement.”

Cosmos-14B
Wan2.2-TI2V-5B

Material

“A shuttlecock bounces on the court after being hit by a racquet.”

Cosmos-14B
Veo-3.1

Inertia

“Two soccer balls are simultaneously juggled, one with the feet, the other with the head, before both are briefly balanced on a knee.”

Cosmos-2B
LTX-2.3-22B

Boundary Interaction

“A glass bottle filled with dark amber syrup sits on a rustic wooden table. In the foreground, a wooden spoon containing the same thick liquid tilts downward. The viscous syrup flows smoothly off the edge of the spoon in a slow, continuous stream.”

LTX-2-19B
Veo-3.1

Flow Dynamics

“A stream of water from a handheld nozzle hits a small oil fire, extinguishing the blaze.”

Cosmos-14B
Wan2.2-27B-A14B

Shadow

“A person walks confidently across a rocky desert. Under the warm sunlight, the person casts a long, distinct shadow that stretches and shifts dynamically across the uneven ground as they move.”

Veo-3.1
Cosmos-2B

Fluid Continuity

“Hot chocolate is poured from a mug into a saucer, a small amount sloshing over the edge.”

Wan2.2-TI2V-5B
Veo-3.1

Displacement

“The person gently places an egg yolk in a measuring cup into a pot of boiling water. The egg yolk quickly cooks and solidifies due to the heat.”

Wan2.2-TI2V-5B
Cosmos-2B

Buoyancy

“A ski jet is being towed by a larger boat on calmer waters.”

Cosmos-14B
OmniWeaving

Reflection

“A teapot on a rotating display base that rotates clockwise in front of a mirror reflecting the teapot’s image. The teapot’s reflection in the mirror rotates in the opposite direction. Static shot with no camera movement.”

LTX-2-19B
OmniWeaving

BibTeX

@article{lin2026phyground,
  title  = {{PhyGround}: Benchmarking Physical Reasoning in Generative World Models},
  author = {Lin, Juyi and Akbari, Arash and He, Yumei and Zhao, Lin and Zhang, Haichao and
            Akbari, Arman and Xu, Xingchen and Lu, Zoe Y. and Nan, Enfu and Deng, Hokin and
            Yeh, Edmund and Ostadabbas, Sarah and Fu, Yun and Dy, Jennifer and
            Zhao, Pu and Wang, Yanzhi},
  year   = {2026}
}