Pandora: Towards General World Model
We introduce Pandora, a step towards a General World Model (GWM).
Pandora accepts free-text actions as inputs during video generation to steer the video on the fly. This differs crucially from previous text-to-video models, which only accept a text prompt at the beginning of the video. On-the-fly control fulfills the promise of a world model: it supports interactive content generation and enables more robust reasoning and planning.
World models simulate alternative futures of the world. Pandora allows you to control the future. Here we show some counterfactual futures – different videos generated from the same initial state but different actions.
Initial state
Future 1
Future 2
Action: Magma erupts from the crater
Action: The sky gets dark
Action: Turn left. There is a white van
Action: Turn right. There is a red car
Action: The man waves his hand
Action: Two men walks towards the microphone
Action: Use spoon to scoop some broccoli
Action: Use spoon to stir the rice
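The counterfactual setup above (one initial state, different free-text actions, different futures) can be sketched with a toy stand-in. This is not Pandora's code or API: `toy_step` and `rollout` are hypothetical names, frames are short lists of integers rather than images, and the "model" is a deterministic arithmetic rule chosen only so the example runs.

```python
from typing import List

Frame = List[int]  # toy stand-in for an image: a short list of pixel values


def toy_step(history: List[Frame], action: str, n_frames: int = 3) -> List[Frame]:
    """Hypothetical world-model step: (past frames, free-text action) -> new frames.

    A real model would run a learned video generator here. This toy shifts
    the last frame by an amount derived from the action text, so the output
    is deterministic and depends on both the state and the action.
    """
    shift = sum(action.encode("utf-8")) % 7 + 1  # action-dependent, never zero
    last = history[-1]
    new_frames = []
    for _ in range(n_frames):
        last = [(p + shift) % 256 for p in last]
        new_frames.append(last)
    return new_frames


def rollout(initial: List[Frame], actions: List[str]) -> List[Frame]:
    """Apply actions in order; each step is conditioned on all frames so far."""
    video = list(initial)
    for action in actions:
        video.extend(toy_step(video, action))
    return video


initial_state = [[0, 10, 20]]

# Two counterfactual futures from the same initial state.
future_1 = rollout(initial_state, ["Turn left. There is a white van"])
future_2 = rollout(initial_state, ["Turn right. There is a red car"])

# On-the-fly control: a second action arrives partway through the video.
steered = rollout(initial_state, ["The man waves his hand", "The sky gets dark"])
```

Both futures start from the same first frame and diverge afterwards, and the `steered` rollout shows an action injected after generation has already begun rather than only as an opening prompt.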
Pandora is capable of generating videos across a wide range of general domains, such as indoor/outdoor, natural/urban, human/robot, 2D/3D, and other scenarios. You may find more videos in the Pandora’s Box gallery.
Pouring milk into the glass cup from a milk bottle
Flame ignites woods emitting some smoke
Wind blows the leaves
Let the traffic go
The man drops the bag
Fold the towel
Jump right
Look around
Set fire on the river
Fireworks bloom in the night sky
Instruction tuning with high-quality data allows the model to learn effective action control and transfer it to unseen domains. For example, Pandora saw only one 2D game, Coinrun, during training, but can seamlessly apply the learned actions to other 2D games.
Source Domain
Target Domain
Existing diffusion video models typically produce videos of a fixed length. By integrating the video model with the Pandora autoregressive backbone, longer videos with potentially unlimited duration can be generated. We show 8-second videos generated by Pandora, even though the training videos are at most 5 seconds long.
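One common way to get past a fixed output length is a chunked autoregressive rollout: generate a fixed-size chunk, feed its last frames back in as context, and repeat. The sketch below illustrates that general idea only. It is an assumption about the overall shape, not Pandora's architecture or API: `generate_chunk`, `CHUNK_FRAMES`, and `CONTEXT_FRAMES` are hypothetical, and frames are plain integers.

```python
from typing import List

CHUNK_FRAMES = 8    # assumed fixed output length of the toy generator
CONTEXT_FRAMES = 2  # assumed number of trailing frames fed back as conditioning


def generate_chunk(context: List[int], action: str) -> List[int]:
    """Hypothetical fixed-length generator.

    Frames are integers for illustration. The toy ignores the action text
    and simply continues counting from the last context frame.
    """
    start = context[-1] + 1
    return list(range(start, start + CHUNK_FRAMES))


def generate_long_video(first_frame: int, action: str, total_frames: int) -> List[int]:
    """Extend a video chunk by chunk until it reaches the requested length."""
    video = [first_frame]
    while len(video) < total_frames:
        context = video[-CONTEXT_FRAMES:]  # condition only on the most recent frames
        video.extend(generate_chunk(context, action))
    return video[:total_frames]  # trim the overshoot from the final chunk


long_video = generate_long_video(0, "The car moves forward", total_frames=20)
```

Because each chunk depends only on a short trailing context, the loop can in principle run for any number of iterations, which is what makes the output length independent of the length of the training clips.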
The car moves forward
The man is flying
Walk into the theatre
The car moves forward
Let the train move
The plane is flying
As a preliminary step towards a GWM, Pandora is still limited. It can fail to generate consistent videos, simulate complex scenarios, understand commonsense and physical laws, and follow instructions/actions.
Pick up the wallet
Take out the nozzle
The man is dancing
The train door is open
We processed the videos on this website with FLAVR for frame interpolation to make them smoother. No other post-processing was used.
If you have concerns about the copyright of any image or video on this website, please contact us.
@article{xiang2024pandora,
title={Pandora: Towards General World Model with Natural Language Actions and Video States},
author={Jiannan Xiang and Guangyi Liu and Yi Gu and Qiyue Gao and Yuting Ning and Yuheng Zha and Zeyu Feng and Tianhua Tao and Shibo Hao and Yemin Shi and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
journal={arXiv preprint arXiv:2406.09455},
year={2024}
}