Animating still images into realistic videos is a long-standing challenge in computer vision and graphics. Existing methods such as video GANs can generate relatively short clips from static images, but they often suffer from temporal incoherence over longer durations. In the paper "Generative Image Dynamics", published at CVPR 2024, researchers from Google propose a new approach to modeling scene dynamics that enables generating coherent, indefinitely long videos from single images.
The key idea is to model a generative prior over motion trajectories rather than directly generating RGB pixels. The model predicts a per-pixel representation called a "neural stochastic motion texture", which captures the distribution of possible long-term dense motion for each pixel in the input image. This motion texture is represented efficiently in the frequency domain as coefficients of a Fourier series basis, making it well suited to the natural oscillatory motions common in real scenes, such as trees and plants swaying or candles flickering.
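To make the frequency-domain representation concrete, here is a minimal sketch of how a set of per-pixel Fourier coefficients could be turned back into dense motion trajectories via an inverse real FFT. The function name, the `(H, W, K, 2)` coefficient layout, and the choice to skip the DC bin are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def motion_texture_to_trajectories(coeffs, num_frames):
    """Reconstruct per-pixel displacement trajectories from Fourier coefficients.

    coeffs: complex array of shape (H, W, K, 2) -- K low-frequency terms per
            pixel, one set each for the x and y displacement components
            (hypothetical layout; requires K <= num_frames // 2).
    Returns a real array of shape (num_frames, H, W, 2) of displacements.
    """
    H, W, K, _ = coeffs.shape
    # Place the K predicted terms in the lowest non-DC frequency bins of a
    # full spectrum, then invert along the frequency axis.
    spectrum = np.zeros((H, W, num_frames // 2 + 1, 2), dtype=complex)
    spectrum[:, :, 1:K + 1, :] = coeffs
    traj = np.fft.irfft(spectrum, n=num_frames, axis=2)  # (H, W, T, 2)
    return np.moveaxis(traj, 2, 0)                       # (T, H, W, 2)
```

Because only a handful of low-frequency coefficients are stored per pixel, the representation is compact, and the reconstructed trajectories are smooth and periodic by construction.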
The model is trained on a dataset of videos depicting such natural motions, from which dense optical flow is extracted to create ground truth motion textures. At inference time, a conditional latent diffusion model predicts a neural stochastic motion texture from a single input image. This texture can then be transformed into a sequence of time-varying motion fields that drive an image-based rendering module to generate future video frames.
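The last step above, applying a motion field to the input image, can be illustrated with a plain bilinear backward warp. The paper's rendering module is learned and operates in feature space (with splatting), so this numpy stand-in only shows the data flow from a dense flow field to a warped frame:

```python
import numpy as np

def warp_frame(image, flow):
    """Backward-warp `image` (H, W, C) by a dense flow field (H, W, 2).

    flow[..., 0] is the x-displacement, flow[..., 1] the y-displacement.
    Simplified stand-in for the paper's learned image-based renderer.
    """
    H, W, C = image.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sample each output pixel from its displaced source location.
    sy = np.clip(ys - flow[..., 1], 0, H - 1)
    sx = np.clip(xs - flow[..., 0], 0, W - 1)
    y0 = np.floor(sy).astype(int); x0 = np.floor(sx).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = (sy - y0)[..., None]; wx = (sx - x0)[..., None]
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bot = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Generating a video then amounts to evaluating the motion texture at each timestep and warping (in practice, rendering) the single input image with the resulting flow field.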
Compared to models that directly generate pixels, this motion-based approach:
- Produces more coherent videos that don't drift or diverge, since it captures the underlying motion structure
- Allows fine-grained control over the generated motions, like adjusting speed or magnitude
- Enables downstream applications like creating seamless loops or interactive animations
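The control and looping properties above follow naturally from the frequency-domain representation: simple edits to the Fourier coefficients map to intuitive motion edits. A small self-contained sketch (hypothetical single-pixel example, not the paper's interface):

```python
import numpy as np

# One pixel's x-displacement, represented by K low-frequency Fourier terms.
K, T = 4, 32
rng = np.random.default_rng(0)
coeffs = rng.normal(size=K) + 1j * rng.normal(size=K)

def trajectory(c, scale=1.0):
    """Reconstruct a T-frame displacement signal from scaled coefficients."""
    spectrum = np.zeros(T // 2 + 1, dtype=complex)
    spectrum[1:K + 1] = scale * c  # skip the DC bin so motion is zero-mean
    return np.fft.irfft(spectrum, n=T)

base = trajectory(coeffs)
double = trajectory(coeffs, scale=2.0)
# Scaling the coefficients scales the displacement at every frame, giving a
# direct magnitude control; and because the reconstruction is a Fourier
# series, the T-frame trajectory is periodic, i.e. it loops seamlessly.
```

This is why magnitude adjustment and seamless looping come essentially for free once motion, rather than pixels, is the thing being generated.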
Experiments demonstrate that the method generates high-quality videos of effectively unbounded length from static images, significantly outperforming recent baselines such as video GANs. The motion-centric modeling strategy looks promising for controllable and robust video generation. Next steps could involve extending the motion representation to handle more complex, non-periodic motions.