Training-Free High-Resolution Video Generation via Tiled Diffusion with Latent Prior Regularization
Anonymous ECCV 2025 Submission
We propose Fresco2Video, a training-free method for generating a high-resolution video from a single large, complex image containing multiple scenes, using a pre-trained video diffusion model. Our approach augments tiled denoising with a latent prior regularizer that enforces global coherence.
The prior is obtained by resizing the input image to the model's native resolution, generating a small video, and upsampling its latents. To generate the full-resolution video, at each denoising timestep we fuse the per-tile noise predictions with this prior by minimizing a single weighted least-squares energy in model-output space.
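The weighted least-squares fusion described above admits a pixel-wise closed form: each latent position is a weighted average of the tile predictions covering it and the prior. The sketch below illustrates this under stated assumptions; the function name, the slice-based tile layout, and the scalar prior weight `lam` are illustrative choices, not details from the paper.

```python
import numpy as np

def fuse_tiles_with_prior(tile_preds, tile_slices, prior, tile_weights=None, lam=0.5):
    """Fuse overlapping per-tile noise predictions with an upsampled
    latent prior by minimizing, pixel-wise,

        E(eps) = sum_i w_i ||eps[slice_i] - eps_i||^2 + lam * ||eps - prior||^2,

    whose closed-form solution is a weighted average of the scattered
    tile predictions and the prior. (Illustrative sketch, not the
    authors' implementation.)
    """
    if tile_weights is None:
        tile_weights = [1.0] * len(tile_preds)
    # Accumulate numerator (weighted predictions) and denominator (weights).
    num = lam * prior.astype(np.float64)
    den = np.full(prior.shape, lam, dtype=np.float64)
    for eps_i, sl, w in zip(tile_preds, tile_slices, tile_weights):
        num[sl] += w * eps_i
        den[sl] += w
    return num / den
```

Because the energy decouples per latent position, overlapping tile regions are blended smoothly (suppressing seams) while the `lam` term pulls every position toward the globally coherent prior.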
Without any training, Fresco2Video scales image-to-video diffusion models to multi-kilopixel images with many scenes, such as frescoes. Our method outperforms tiled-denoising and super-resolution baselines on quantitative video metrics and in user studies.
We compare Fresco2Video against baseline methods on the same input images. Our method achieves both global coherence (no seams, consistent motion) and local detail preservation at high resolution.
Output: 2182 × 2770 pixels
Output: 892 × 3000 pixels
Output: 1517 × 2300 pixels
Output: 2048 × 2048 pixels