We propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness.
Video generation models nowadays can generate visually realistic videos, but they often fail to adhere to physical laws, limiting their ability to produce physically plausible videos and to serve as "world models". To address this issue, we propose PhysMaster, which captures physical knowledge as a representation that guides video generation models toward stronger physics-awareness. Specifically, PhysMaster is built on the image-to-video task, where the model is expected to predict physically plausible dynamics from an input image. Since the input image provides physical priors such as the relative positions and potential interactions of objects in the scene, we devise PhysEncoder to encode this physical information as an extra condition that injects physical knowledge into the video generation process. Because there is no direct supervision on the model's physical behavior beyond mere appearance, we apply reinforcement learning from human feedback to physical representation learning: feedback from the generation model is used to optimize PhysEncoder's representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster thus provides a feasible solution for improving the physics-awareness of PhysEncoder, and hence of video generation, demonstrating its ability on a simple proxy task and its generalizability to wide-ranging physical scenarios. This implies that PhysMaster, which unifies solutions to various physical processes via representation learning in a reinforcement-learning paradigm, can act as a generic, plug-in solution for physics-aware video generation and broader applications.
We propose a three-stage training pipeline for PhysMaster that enables physical representation learning in PhysEncoder by leveraging generative feedback from the video generation model. The core idea is to formulate DPO for the PhysEncoder Ep with a reward signal derived from videos generated by the pretrained DiT model vθ, thereby guiding the learning of physical knowledge.
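To make the DPO formulation concrete, the following is a minimal numerical sketch of a Diffusion-DPO-style objective on a preference pair of generated videos. It assumes, as a simplification, that each video's quality under the trainable model and under a frozen reference model is summarized by a scalar denoising error; the function name, arguments, and the choice of beta are illustrative, not the paper's actual implementation.

```python
import numpy as np

def diffusion_dpo_loss(err_w, err_l, err_ref_w, err_ref_l, beta=10.0):
    """Simplified Diffusion-DPO objective on one preference pair (sketch).

    err_w, err_l         : denoising errors of the trainable model vθ,
                           conditioned on the PhysEncoder representation,
                           on the preferred (w) and dispreferred (l) videos.
    err_ref_w, err_ref_l : the same errors under the frozen reference model.
    beta                 : preference-strength hyperparameter (illustrative).

    In an end-to-end setup this loss would be backpropagated through the
    conditioning pathway into PhysEncoder Ep.
    """
    # Relative improvement on the winner minus improvement on the loser.
    inside = -beta * ((err_w - err_ref_w) - (err_l - err_ref_l))
    # -log(sigmoid(inside)) written as a numerically stable softplus(-inside).
    return np.logaddexp(0.0, -inside)
```

When the trainable model denoises the preferred video better than the reference (relative to the dispreferred one), `inside` is positive and the loss drops below log 2; swapping the pair raises it, which is the gradient signal driving the representation toward the preferred physics.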
Our work aims to provide a scalable and generalizable methodology for learning physics from targeted data. To demonstrate the effectiveness of PhysMaster, we start by defining a proxy task ("free-fall") governed by simple physical principles and construct domain-specific data for preliminary validation.
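As a rough illustration of what domain-specific data for a free-fall proxy task could look like, the sketch below samples ground-truth vertical positions of an object dropped from rest, clipped at the ground. All names and parameter values (frame count, frame rate, gravity) are hypothetical and not the paper's actual data pipeline.

```python
import numpy as np

def free_fall_trajectory(y0, n_frames=16, fps=8, g=9.8):
    """Ground-truth heights (metres) for a 'free-fall' proxy clip (sketch).

    An object is released from rest at height y0 and sampled at n_frames
    video frames spaced 1/fps seconds apart: y(t) = y0 - 0.5 * g * t^2,
    clipped at the ground plane y = 0.
    """
    t = np.arange(n_frames) / fps          # frame timestamps in seconds
    y = y0 - 0.5 * g * t**2                # kinematics of free fall
    return np.clip(y, 0.0, None)           # object stops at the ground
```

Trajectories like these give per-frame targets against which a generated video's physical accuracy can be scored.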
We compare the physical accuracy of our model with that of existing works and ablate different training techniques for PhysEncoder.
We propose PhysMaster, which learns a physical representation from the input image to guide an I2V model toward generating physically plausible videos. We optimize PhysEncoder with generative feedback from a pretrained video generation model via DPO, on both the proxy task and general open-world scenarios. By injecting physical knowledge into generation, this enhances the model's physical accuracy and demonstrates generalizability across various physical processes, showing PhysMaster's potential as a generic solution for physics-aware video generation and broader applications.
@article{Ji2025physmaster,
title={{PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning}},
author={Ji, Sihui and Chen, Xi and Tao, Xin and Wan, Pengfei and Zhao, Hengshuang},
journal={arXiv preprint arXiv:2510.13809},
year={2025}
}