PhysMaster

Mastering Physical Representation for Video Generation via Reinforcement Learning
We propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness.

Physical Representation Injection: Building on the image-to-video task, we devise PhysEncoder to encode physical knowledge from the input image as an extra condition injected into the video generation process.
Representation Learning by RLHF: PhysEncoder leverages generative feedback from the video generation model to optimize its physical representation with Direct Preference Optimization (DPO) in an end-to-end manner.
Training Paradigm: We improve the physics-awareness of PhysEncoder, and thus of the video generation model, through a three-stage training pipeline on a simple yet fundamental task, which proves to generalize effectively to diverse physical scenarios guided by relevant fundamental physical principles.
Generic Solution: Our PhysMaster, which learns physical knowledge via representation learning, has the potential to act as a generic solution for physics-aware video generation and broader applications.

Video generation models nowadays can produce visually realistic videos, but they often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and to serve as ''world models''. To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, we build on the image-to-video (I2V) task, where the model is expected to predict physically plausible dynamics from the input image. Since the input image serves as a direct source of physical priors, we devise PhysEncoder to encode physical information from it as an extra condition that injects physical knowledge into the video generation process, avoiding explicit simulation of physical processes. The misalignment between high-level physical knowledge and the generation model, together with the difficulty of modeling them jointly, motivates PhysEncoder to leverage generative feedback from the generation model and optimize its physical representations end-to-end with Direct Preference Optimization (DPO). This enables physical representation learning from human preferences and provides a feasible route to improving the physics-awareness of PhysEncoder, and thus of video generation. More importantly, PhysEncoder is trained on a simple yet fundamental task, but proves to generalize effectively to diverse physical scenarios guided by relevant fundamental physical principles. This implies that our PhysMaster, which unifies different physical processes via representation learning in a reinforcement learning (RL) paradigm, has the potential to act as a generic solution for physics-aware video generation and broader applications.
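To make the injection concrete, below is a minimal PyTorch sketch of how a PhysEncoder-style module could condition a DiT. It assumes a generic patch-feature backbone and an append-to-the-conditioning-sequence injection point; the class and function names, signatures, and the concatenation scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PhysEncoder(nn.Module):
    """Sketch of a physical-representation encoder Ep.

    `backbone` is any vision model returning (B, N, D) patch features;
    the paper starts from Depth Anything, but that choice is orthogonal
    to this sketch.
    """
    def __init__(self, backbone: nn.Module, backbone_dim: int, dit_dim: int):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Linear(backbone_dim, dit_dim)  # map features to DiT token width

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)   # (B, N, backbone_dim) patch features
        return self.proj(feats)        # (B, N, dit_dim) physical tokens


def inject_physical_condition(video_tokens, text_tokens, phys_tokens):
    # Assumed injection scheme: append the physical tokens to the DiT's
    # conditioning sequence so attention layers can read physical priors.
    return torch.cat([video_tokens, text_tokens, phys_tokens], dim=1)
```

Because the physical tokens enter through an ordinary conditioning path, gradients from any loss on the generated video can flow back into PhysEncoder, which is what the DPO stage described below exploits.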



Three-stage Training Pipeline

We propose a three-stage training pipeline for PhysMaster that enables physical representation learning in PhysEncoder by leveraging generative feedback from the video generation model. The core idea is to formulate DPO for PhysEncoder Ep with reward signals derived from videos generated by the pretrained DiT model vθ, thereby driving the learning of physical knowledge.
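The paper defines this objective precisely; purely as a hedged illustration, the sketch below adapts a Diffusion-DPO-style loss to this setting. The implicit reward of a video is how much better the conditioned model vθ denoises it than a frozen reference copy, and the preference loss pushes Ep toward representations under which physically preferred videos are denoised better than rejected ones. The call signature v_theta(x, t, c), the 5-D latent shape (B, C, T, H, W), and the variable names are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss_for_physencoder(v_theta, v_ref, phys_enc, image,
                             x_w_t, x_l_t, t, target_w, target_l, beta=5.0):
    """Hedged Diffusion-DPO-style sketch for optimizing Ep.

    x_w_t / x_l_t: noised latents of the preferred (winner) and rejected
    (loser) videos, shape (B, C, T, H, W); target_w / target_l are the
    corresponding denoising targets. v_ref is a frozen reference copy.
    """
    c = phys_enc(image)          # physical representation (trainable path)
    c_ref = c.detach()           # reference model sees a detached condition

    # Per-sample denoising errors under the trainable model.
    err_w = F.mse_loss(v_theta(x_w_t, t, c), target_w,
                       reduction='none').mean(dim=(1, 2, 3, 4))
    err_l = F.mse_loss(v_theta(x_l_t, t, c), target_l,
                       reduction='none').mean(dim=(1, 2, 3, 4))
    # Per-sample denoising errors under the frozen reference model.
    with torch.no_grad():
        ref_w = F.mse_loss(v_ref(x_w_t, t, c_ref), target_w,
                           reduction='none').mean(dim=(1, 2, 3, 4))
        ref_l = F.mse_loss(v_ref(x_l_t, t, c_ref), target_l,
                           reduction='none').mean(dim=(1, 2, 3, 4))

    # Implicit reward difference: improving on the reference for the winner
    # more than for the loser increases the logit.
    logits = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(logits).mean()
```

Depending on the stage, such a loss can update vθ, Ep, or both; preference pairs could come from ranking generated videos by physical plausibility.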

Figure 1: Training pipeline of PhysMaster.

Evaluation in Simulation Environment

We post-train the DiT video generation model conditioned on physical representations from PhysEncoder at different training stages to validate its ability to enhance the physical realism of generated videos in simulated scenes. PhysEncoder in Stage I can be viewed as a Depth Anything model adapted to our setting via SFT, and thus serves as a baseline against PhysEncoder in Stage III, which is further optimized through the subsequent DPO processes. The results show a clear advantage of the Stage III PhysEncoder over its Stage I counterpart in learning physical priors across various simulated scenarios and in guiding the model to generate physically accurate videos. The model using the initial physical representation from Stage I tends to generate physically implausible videos, especially during object interactions: rigid bodies undergo fusion, penetration, or deformation in ''Collision Dynamics'', and objects appear in inconsistent numbers or respond to forces incorrectly in ''Collapse Dynamics''. In contrast, the model assisted by the Stage III PhysEncoder exhibits improved physical realism in object rigidity, motion consistency, and interaction rationality across diverse scenarios. A user study comparing the physical plausibility of video pairs, generated with physical representations from PhysEncoder of Stage I and Stage III respectively, consistently confirms the superiority of the latter and verifies our pipeline's efficacy.

PhysEncoder Used in Post-training | Support | Roll | Link | Dominoes | Contain | Collide
PhysEncoder in Stage I (Depth Baseline) | 45.1 | 27.7 | 24.7 | 35.5 | 15.0 | 15.9
PhysEncoder in Stage III | 54.9 | 72.3 | 75.3 | 64.5 | 85.0 | 84.1
Table 1: User study preference rates (%) showing that using PhysEncoder from Stage III during post-training yields better physics-awareness than Stage I across all simulated tasks.

Figure 2: Qualitative comparisons for models using PhysEncoder from different stages for post-training on relevant scenarios. Artifacts violating rigid-body physics (e.g., fusion, penetration) appear with PhysEncoder from Stage I (right), while PhysEncoder from Stage III (left) helps preserve object shape and produce physically consistent reactions.

Evaluation in Real-world Scenarios

Our PhysMaster demonstrates its physics-awareness by enhancing the physical realism of videos generated on synthetic data across a series of physical scenes. This suggests its potential to generalize to real-world scenarios and a broader range of physical laws. To substantiate this claim, we extend beyond the proxy task of dropping and collision in simulated scenes by incorporating a real-world proxy task of liquid motion. We then finetune our model with the three-stage training pipeline on a combined dataset covering both the simulated and real-world proxy tasks, which are governed by different physical principles, and assess the physics-awareness of the resulting video generation model at each stage. This demonstrates the generalizability of our approach in two key aspects: (1) Physical Attributes: handling different object materials and physical laws; and (2) Data Domain: adapting to both synthetic and real-world data. To evaluate the training results, we conduct comparisons in two aspects.
Object dropping and collision. We evaluate performance on the proxy task using both real-world and simulated test sets from PisaBench, where ''Sim'' denotes the simulated test split and ''Real'' the real-world test set (a hedged sketch of the metrics follows Table 2). The comparison shows that, within our training pipeline, SFT endows the model with a preliminary ability to predict object motion in the dropping-and-collision scenario, and optimizing PhysEncoder in the last stage further improves its capacity to guide the model toward a higher level of physics-awareness. Notably, through joint training on the combined data, we also achieve a significant performance gain on the out-of-domain real-world test data: performance in the real-world domain does not degrade even though no real-world data for the dropping-and-collision task is included in the training set. We attribute this to the strong generalization capability of PhysEncoder itself.

Training Stages | Real L2 ↓ | Real CD ↓ | Real IoU ↑ | Sim L2 ↓ | Sim CD ↓ | Sim IoU ↑
Base | 0.1600 | 0.459 | 0.104 | 0.1066 | 0.331 | 0.115
SFT for vθ & Ep (Stage I) | 0.0762 | 0.179 | 0.158 | 0.0533 | 0.134 | 0.135
SFT + DPO for both vθ and Ep (Stage III) | 0.0748 | 0.176 | 0.163 | 0.0471 | 0.118 | 0.143
Table 2: Quantitative results for models from different training stages on the ''object dropping and collision'' task, evaluated on the ''Sim'' (simulated) and ''Real'' (real-world) test sets. vθ is the DiT model and Ep is PhysEncoder.
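For reference, below is a hedged sketch of how these three metrics are commonly computed. PisaBench's exact protocol (what the point sets and masks are extracted from, normalization, and per-frame averaging) may differ, so the input formats here are assumptions.

```python
import numpy as np

def l2_traj(pred_centers: np.ndarray, gt_centers: np.ndarray) -> float:
    """Mean L2 distance between predicted and ground-truth object centers
    over frames; both arrays have shape (T, 2) in normalized coordinates."""
    return float(np.linalg.norm(pred_centers - gt_centers, axis=1).mean())

def chamfer(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Symmetric Chamfer distance (CD) between 2-D point sets of shapes
    (N, 2) and (M, 2), e.g., points sampled on object contours."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two boolean object masks of equal shape."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 1.0
```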
   
Figure 3: Qualitative comparisons for models at different training stages on the real-world test set of object dropping and collision. The model exhibits a preliminary capability for predicting object motion trends after SFT in Stage I (right); two-stage DPO further improves performance in preserving object rigidity and complying with physical laws (e.g., gravitational acceleration and collision) in Stage III (left).

Liquid motion. Figure 4 provides a qualitative comparison of the models from Stage I and Stage III of our training pipeline on the ''liquid motion'' task. The video generated by the latter model exhibits significantly more plausible physical behavior: as the liquid is poured, the level in the glass bottle gradually rises, and the transparent water shows realistic refraction of the ball along with believable interaction with the human hand. These results demonstrate the generalizability and effectiveness of PhysMaster in injecting physical information and enhancing the physical plausibility of generation across different physical phenomena and data domains. PhysMaster thus provides a generalizable route to unlocking physical comprehension across diverse physical phenomena, highlighting its potential as a foundational solution for physics-aware video generation and more sophisticated applications.

Figure 4: Qualitative comparisons for models using PhysEncoder from different training stages on the real-world test set of liquid motion. Our model using PhysEncoder in Stage III (left) generates more realistic liquid motion in real-world scenarios than the baseline using PhysEncoder in Stage I (right).

Conclusion

We propose PhysMaster, which learns a physical representation from the input image to guide an I2V model toward generating physically plausible videos. We optimize the physical encoder, PhysEncoder, via DPO with generative feedback from a pretrained video generation model on a simple but fundamental proxy task. This proves to enhance the model's physical accuracy and to generalize across related physical scenarios by injecting physical knowledge into generation, demonstrating PhysMaster's potential as a generic solution for physics-aware video generation and broader applications.
