Video generation models today can generate visually realistic videos, but they often fail to adhere to physical laws, which limits their ability to produce physically plausible videos and to serve as "world models". To address this issue, we propose PhysMaster, which captures physical knowledge as a representation that guides video generation models toward stronger physics-awareness. Specifically, we build on the image-to-video (I2V) task, where the model must predict physically plausible dynamics from a single input image. Since the input image is a direct source of physical priors, we devise PhysEncoder to encode physical information from it as an extra condition, injecting physical knowledge into the video generation process without explicitly simulating physical dynamics. The misalignment between high-level physical knowledge and the generation model, together with the difficulty of modeling them jointly, motivates us to optimize PhysEncoder end-to-end with generative feedback from the generation model via Direct Preference Optimization (DPO). This allows physical representations to be learned from human preferences and provides a feasible route to improving the physics-awareness of PhysEncoder, and in turn of video generation. More importantly, although PhysEncoder is trained on a simple yet fundamental task, it generalizes effectively to diverse physical scenarios governed by the relevant physical principles. This suggests that PhysMaster, which unifies different physical processes via representation learning in a reinforcement learning (RL) paradigm, can act as a generic solution for physics-aware video generation and broader applications.
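To make the conditioning mechanism concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of how a physical representation extracted from the input image could be injected as extra conditioning tokens into a DiT-style block via cross-attention. Module names such as `PhysEncoderStub` and `DiTBlockWithPhysCond`, and all shapes, are illustrative assumptions.

```python
# Minimal, hypothetical sketch of injecting a physical representation from the
# input image as extra conditioning tokens for a DiT-style video generator.
# PhysEncoderStub / DiTBlockWithPhysCond and all shapes are illustrative only.
import torch
import torch.nn as nn

class PhysEncoderStub(nn.Module):
    """Stand-in for PhysEncoder: maps the input image to a set of physical tokens."""
    def __init__(self, in_ch=3, dim=1024):
        super().__init__()
        self.patchify = nn.Conv2d(in_ch, dim, kernel_size=16, stride=16)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))       # keep a small, fixed token count

    def forward(self, image):                           # image: (B, 3, H, W)
        feat = self.pool(self.patchify(image))          # (B, dim, 4, 4)
        return feat.flatten(2).transpose(1, 2)          # (B, 16, dim) physical tokens

class DiTBlockWithPhysCond(nn.Module):
    """One transformer block that cross-attends to text tokens plus physical tokens."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, text_tokens, phys_tokens):      # x: (B, N_video_tokens, dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        ctx = torch.cat([text_tokens, phys_tokens], dim=1)   # physical tokens as extra condition
        x = x + self.cross_attn(self.norm2(x), ctx, ctx)[0]
        return x + self.mlp(self.norm3(x))
```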
We propose a three-stage training pipeline for PhysMaster that enables physical representation learning for PhysEncoder by leveraging generative feedback from the video generation model. The core idea is to formulate DPO for PhysEncoder Ep with a reward signal derived from videos generated by the pretrained DiT model vθ, thereby driving the learning of physical knowledge.
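As a rough, hypothetical sketch of how such a preference-based objective could be wired up, the snippet below follows a Diffusion-DPO-style loss in which denoising errors on preferred ("winner") and rejected ("loser") videos, measured under the trainable DiT and a frozen reference copy, provide the reward signal that backpropagates into PhysEncoder. All function names, signatures, and hyperparameters are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch (not the paper's code) of DPO-style optimization of the
# physical encoder Ep using generative feedback from a DiT denoiser vθ,
# following the Diffusion-DPO objective; all names and shapes are illustrative.
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, alphas_cumprod):
    """Standard DDPM forward process q(x_t | x_0) for video latents (B, C, T, H, W)."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def dpo_loss_for_phys_encoder(phys_encoder, v_theta, v_ref, alphas_cumprod,
                              image, x_w, x_l, t, noise, beta=5000.0):
    """x_w / x_l: latents of the preferred / rejected video for the same input image."""
    phys_cond = phys_encoder(image)                      # gradients flow back into Ep

    def eps_error(model, x0, requires_grad):
        with torch.set_grad_enabled(requires_grad):
            x_t = add_noise(x0, noise, t, alphas_cumprod)
            pred = model(x_t, t, cond=phys_cond)          # assumed conditional denoiser signature
            return F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3, 4))

    err_w = eps_error(v_theta, x_w, True)
    err_l = eps_error(v_theta, x_l, True)
    err_w_ref = eps_error(v_ref, x_w, False)              # reference model stays frozen
    err_l_ref = eps_error(v_ref, x_l, False)

    # Prefer the winner's denoising error to shrink (and the loser's to grow)
    # relative to the reference; minimizing this updates Ep (and optionally vθ).
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    return -F.logsigmoid(-beta * margin).mean()
```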
We post-train the DiT video generation model conditioned on the physical representations produced by PhysEncoder at different training stages, to validate its ability to improve the physical realism of generated videos in simulated scenes. PhysEncoder in Stage I can be viewed as a Depth Anything model adapted to our setting via SFT, and thus serves as a baseline against PhysEncoder in Stage III, which is further optimized by the subsequent DPO process. The results show a clear advantage of the Stage III PhysEncoder over the Stage I one in learning physical priors across various simulated scenarios and in guiding the model to generate physically accurate videos. The model using the initial physical representation from Stage I tends to generate physically implausible videos, especially during object interactions: rigid bodies fuse, penetrate each other, or deform in "Collision Dynamics", and the number of objects changes or force responses are incorrect in "Collapse Dynamics". In contrast, the model assisted by the Stage III PhysEncoder exhibits improved physical realism in object rigidity, motion consistency, and interaction plausibility across diverse scenarios. We also conduct a user study comparing the physical plausibility of video pairs generated with the Stage I and Stage III representations; the results consistently favor the latter and convincingly verify the efficacy of our pipeline.
| PhysEncoder Used in Post-training (user preference, %) | Support | Roll | Link | Dominoes | Contain | Collide |
|---|---|---|---|---|---|---|
| PhysEncoder in Stage I (Depth Baseline) | 45.1 | 27.7 | 24.7 | 35.5 | 15.0 | 15.9 |
| PhysEncoder in Stage III | 54.9 | 72.3 | 75.3 | 64.5 | 85.0 | 84.1 |
Our PhysMaster demonstrates its physics-awareness by enhancing the physical realism of videos generated on synthetic data across a series of physical scenes. This suggests its potential to generalize to real-world scenarios and a broader range of physical laws. To substantiate this claim, we extend beyond the proxy task of dropping and collision in simulated scenes by incorporating a real-world proxy task of liquid motion. We then finetune our model with the three-stage training pipeline on a combined dataset covering both the simulated and the real-world proxy tasks, which are governed by different physical principles, and assess the physics-awareness of the resulting video generation model at each stage. This allows us to demonstrate the generalizability of our approach in two key aspects: (1) Physical Attributes: handling different object materials and physical laws; and (2) Data Domain: adapting to both synthetic and real-world data. To evaluate the training results, we conduct comparisons along two aspects.
Object dropping and collision. We evaluate performance on the proxy task using both the real-world and the simulated test datasets from PisaBench. "Sim" denotes the test split of the simulated scenario, and "Real" denotes the real-world test set. The comparison shows that, within our training pipeline, SFT endows the model with a preliminary ability to predict object motion in the dropping and collision scenario, and the optimization of PhysEncoder in the last stage further improves its ability to guide the model toward a higher level of physics-awareness. Notably, through joint training on the combined data, we also obtain a significant performance gain on the out-of-domain real-world test data: performance on the real-world domain does not degrade even though no real-world data for the dropping and collision task is included in the training set. We attribute this to the strong generalization capability of PhysEncoder itself.
| Training Stage | Real L2 ↓ | Real CD ↓ | Real IoU ↑ | Sim L2 ↓ | Sim CD ↓ | Sim IoU ↑ |
|---|---|---|---|---|---|---|
| Base | 0.1600 | 0.459 | 0.104 | 0.1066 | 0.331 | 0.115 |
| SFT for vθ & Ep (Stage I) | 0.0762 | 0.179 | 0.158 | 0.0533 | 0.134 | 0.135 |
| SFT + DPO for both vθ and Ep (Stage III) | 0.0748 | 0.176 | 0.163 | 0.0471 | 0.118 | 0.143 |
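For reference, the L2, CD, and IoU columns above measure how closely the generated object motion matches the ground truth. Below is an illustrative sketch of how such metrics are commonly computed (mean L2 error on object-center trajectories, symmetric Chamfer Distance between object point sets, and mask IoU); the exact PisaBench protocol may differ, and the helper names are assumptions.

```python
# Illustrative metric computations; the exact PisaBench evaluation protocol may differ.
import numpy as np

def l2_trajectory_error(pred_traj, gt_traj):
    """Mean L2 distance between predicted and ground-truth object centers.
    pred_traj / gt_traj: (T, 2) arrays of normalized image coordinates."""
    return float(np.mean(np.linalg.norm(pred_traj - gt_traj, axis=-1)))

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two 2D point sets of shape (N, 2) and (M, 2)."""
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def mask_iou(pred_mask, gt_mask):
    """Intersection-over-Union between two boolean segmentation masks of shape (H, W)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 1.0
```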
Liquid motion. Figure 4 provides a qualitative comparison of the models from Stage I and Stage III of our training pipeline on the "liquid motion" task. The video generated by the Stage III model exhibits significantly more plausible physical behavior: as the liquid is poured, the level in the glass bottle gradually rises, and the transparent water shows realistic refraction of the ball as well as believable interaction with the human hand. These results consistently demonstrate the generalizability and effectiveness of PhysMaster in injecting physical information and enhancing the physical plausibility of generation across different physical phenomena and data domains. PhysMaster thus provides a generalizable solution for bringing physical comprehension to diverse physical phenomena, highlighting its potential to serve as a foundational approach to physics-aware video generation and to enable more sophisticated applications.
We propose PhysMaster, which learns a physical representation from the input image to guide an I2V model toward generating physically plausible videos. We optimize the physical encoder, PhysEncoder, with generative feedback from a pretrained video generation model via DPO on a simple but fundamental proxy task. This optimization enhances the model's physical accuracy and generalizes across related physical scenarios by injecting physical knowledge into generation, demonstrating PhysMaster's potential to act as a generic solution for physics-aware video generation and broader applications.