PhyGround

Benchmarking Physical Reasoning in Generative World Models

1Northeastern University, 2Tulane University, 3University of Washington, 4Carnegie Mellon University
Overview of PhyGround

Overview of PhyGround. PhyGround decomposes each video model's holistic physical reasoning score into scores for 13 physical laws. We recruited 459 annotators to conduct a large-scale, quality-controlled human study. Based on these human annotations, we released PhyJudge-9B, a fine-tuned judge model that supports reproducible automated evaluation.

Abstract

Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges: coarse evaluation frameworks that hide law-specific failures; response biases and annotator fatigue that undermine the validity of human judgments; and automated evaluators that are insufficiently physics-aware or difficult to audit. To address these challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study grounded in social-science laboratory-experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's ρ > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page.

Human Evaluation Leaderboard

Eight modern video generation models, scored by 459 human annotators on the full 250-prompt benchmark.

#   Model              General quality        Physical reasoning by domain   Overall
                       SA    PTV   Persist.   Solid-Body  Fluid  Optical
1   Wan2.2-27B-A14B    3.10  3.37  3.50       3.23        3.18   3.55       3.28
2   Veo-3.1 (closed)   3.26  3.29  3.42       3.12        3.65   3.69       3.28
3   OmniWeaving        2.97  3.17  3.34       2.98        3.26   3.22       3.10
4   Cosmos-14B         2.66  2.98  3.20       2.81        2.98   3.38       2.91
5   LTX-2.3-22B        2.58  2.74  2.72       2.62        3.07   2.98       2.69
6   Wan2.2-TI2V-5B     2.44  2.68  2.78       2.58        2.71   2.99       2.63
7   Cosmos-2B          2.33  2.56  2.77       2.53        2.77   3.35       2.58
8   LTX-2-19B          2.56  2.64  2.48       2.46        3.04   2.78       2.56

Scores: 1–5 scale (higher is better). 250 prompts. ■ best   ■ second.

Method

Benchmark design

PhyGround consists of 250 curated prompts, each paired with an expected physical outcome and a taxonomy of 13 physical laws spanning solid-body mechanics, fluid dynamics, and optics. Every law is operationalized through observable sub-questions, which lets us decompose a model's holistic physical reasoning score into per-law diagnostics rather than collapsing failures into a single number.
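The per-law decomposition can be sketched as follows. This is an illustrative aggregation only: the record layout, the sub-question scores, and the law-to-domain mapping shown here are hypothetical, not the released PhyGround schema. Each law's score is the mean over all of its sub-question labels, and a domain score averages the laws it contains.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotation records: (law, [sub-question scores on a 1-5 scale])
annotations = [
    ("collision", [4, 5, 4]),
    ("collision", [3, 4, 4]),
    ("gravity",   [5, 5, 4]),
    ("shadow",    [2, 3, 2]),
]

# Illustrative law -> domain mapping (a subset of the 13-law taxonomy)
DOMAIN = {"collision": "solid-body", "gravity": "solid-body", "shadow": "optics"}

def per_law_scores(records):
    """Average all sub-question labels within each physical law."""
    by_law = defaultdict(list)
    for law, subs in records:
        by_law[law].extend(subs)
    return {law: mean(scores) for law, scores in by_law.items()}

def per_domain_scores(law_scores):
    """Average law-level scores within each physical domain."""
    by_domain = defaultdict(list)
    for law, score in law_scores.items():
        by_domain[DOMAIN[law]].append(score)
    return {dom: mean(vals) for dom, vals in by_domain.items()}

laws = per_law_scores(annotations)        # e.g. laws["collision"] == 4.0
domains = per_domain_scores(laws)
```

Keeping the law-level means around (rather than only the domain or overall averages) is what makes failures on a single law, such as shadow consistency, visible in the diagnostics.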

The human evaluation is grounded in social-science lab-experiment design: 459 annotators contributed 5,796 complete annotations and 37.4K fine-grained labels under quality control. The retained annotations exhibit high split-half model-ranking correlations (Spearman's ρ > 0.90), giving us a stable ground truth to train and audit automated judges.
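The split-half check can be sketched as below. This is a minimal stdlib-only version under assumed inputs (each model's retained annotation scores as a list); the exact splitting protocol used in the study may differ. Each model's annotations are randomly split in half, per-model means are computed in each half, and Spearman's ρ is the Pearson correlation of the two rank vectors.

```python
import random
from statistics import mean

def rank(values):
    """Ranks with ties averaged, as required for Spearman's rho."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def split_half_rho(scores_by_model, rng):
    """Split each model's annotations in half; correlate the two rankings."""
    half_a, half_b = [], []
    for model_scores in scores_by_model:
        shuffled = model_scores[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(mean(shuffled[:mid]))
        half_b.append(mean(shuffled[mid:]))
    return spearman(half_a, half_b)
```

With the PhyGround annotations, `scores_by_model` would hold each model's retained per-annotation scores; a ρ above 0.90 across random splits indicates that the model ranking is stable rather than an artifact of which annotators were sampled.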

PhyJudge-9B

To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%), making it usable as a drop-in scorer for new video models.
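The page does not spell out the exact formula for aggregate relative bias, so the sketch below is one plausible reading: the mean, over models, of the absolute judge-versus-human score gap relative to the human score. The per-model numbers are hypothetical placeholders.

```python
from statistics import mean

def aggregate_relative_bias(judge_scores, human_scores):
    """Mean of |judge - human| / human across models (assumed definition)."""
    return mean(abs(j - h) / h for j, h in zip(judge_scores, human_scores))

# Hypothetical per-model overall scores on the 1-5 human scale
human = [3.28, 3.28, 3.10, 2.91]
judge = [3.35, 3.20, 3.05, 3.00]

bias = aggregate_relative_bias(judge, human)  # a small fraction, e.g. ~2-3%
```

Under this reading, a 3.3% aggregate relative bias means the judge's per-model scores sit, on average, within a few percent of the human scores, which is what makes it trustworthy as an automated stand-in.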

Sample videos by physical law

For each of the 13 laws in the PhyGround taxonomy, we sample one prompt and show two of the eight evaluated models side by side.

Collision

“Arrows are shot at a moving target, with some arrows hitting and some missing visibly.”

OmniWeaving
Cosmos-14B

Momentum

“A steel ball bearing rolls across a flat surface and collides with another ball bearing, both ceasing motion.”

Wan2.2-27B-A14B
Cosmos-2B

Gravity

“A player checks another player, sending them sprawling onto the ice.”

LTX-2.3-22B
Cosmos-14B

Impenetrability

“A black balloon is sitting on a wooden table next to a small rotating platform with a lit matchstick taped to it. The match rotates clockwise and touches the balloon. The balloon immediately pops upon contact with the flame. Static shot with no camera movement.”

Cosmos-14B
Wan2.2-TI2V-5B

Material

“A shuttlecock bounces on the court after being hit by a racquet.”

Cosmos-14B
Veo-3.1

Inertia

“Two soccer balls are simultaneously juggled, one with the feet, the other with the head, before both are briefly balanced on a knee.”

Cosmos-2B
LTX-2.3-22B

Boundary Interaction

“A glass bottle filled with dark amber syrup sits on a rustic wooden table. In the foreground, a wooden spoon containing the same thick liquid tilts downward. The viscous syrup flows smoothly off the edge of the spoon in a slow, continuous stream.”

LTX-2-19B
Veo-3.1

Flow Dynamics

“A stream of water from a handheld nozzle hits a small oil fire, extinguishing the blaze.”

Cosmos-14B
Wan2.2-27B-A14B

Shadow

“A person walks confidently across a rocky desert. Under the warm sunlight, the person casts a long, distinct shadow that stretches and shifts dynamically across the uneven ground as they move.”

Veo-3.1
Cosmos-2B

Fluid Continuity

“Hot chocolate is poured from a mug into a saucer, a small amount sloshing over the edge.”

Wan2.2-TI2V-5B
Veo-3.1

Displacement

“The person gently places an egg yolk in a measuring cup into a pot of boiling water. The egg yolk quickly cooks and solidifies due to the heat.”

Wan2.2-TI2V-5B
Cosmos-2B

Buoyancy

“A ski jet is being towed by a larger boat on calmer waters.”

Cosmos-14B
OmniWeaving

Reflection

“A teapot on a rotating display base that rotates clockwise in front of a mirror reflecting the teapot’s image. The teapot’s reflection in the mirror rotates in the opposite direction. Static shot with no camera movement.”

LTX-2-19B
OmniWeaving

BibTeX

@article{lin2026phyground,
  title  = {{PhyGround}: Benchmarking Physical Reasoning in Generative World Models},
  author = {Lin, Juyi and Akbari, Arash and He, Yumei and Zhao, Lin and Zhang, Haichao and
            Akbari, Arman and Xu, Xingchen and Lu, Zoe Y. and Nan, Enfu and Deng, Hokin and
            Yeh, Edmund and Ostadabbas, Sarah and Fu, Yun and Dy, Jennifer and
            Zhao, Pu and Wang, Yanzhi},
  year   = {2026}
}