Fresco2Video

Training-Free High-Resolution Video Generation via Tiled Diffusion with Latent Prior Regularization

Anonymous ECCV 2025 Submission

Abstract

We propose Fresco2Video, a training-free method that uses a pre-trained video diffusion model to generate a spatially large (high-resolution) video from a single large, complex image containing multiple scenes. Our approach augments tiled denoising with a latent prior regularizer that enforces global coherence.

The prior is obtained by resizing the input image to the model's native scale, generating a small video, and upsampling its latents. Then, to generate the large video, at each timestep we fuse per-tile noise predictions with this prior by minimizing a single weighted least-squares energy in model-output space.
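The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the exact energy, the nearest-neighbor upsampling, and every name here (`upsample_latents`, `fuse_tiles_with_prior`, `prior_pred`, `gate`, `lam`) are assumptions made for illustration. The sketch assumes a diagonal energy of the form E(x) = sum_i ||sqrt(w_i) (crop_i(x) - pred_i)||^2 + lam ||sqrt(gate) (x - prior_pred)||^2, whose minimizer is a per-pixel weighted average of the tile predictions and the prior.

```python
import numpy as np


def upsample_latents(latents, factor):
    """Nearest-neighbor upsampling of the last two (spatial) axes.

    Assumption: the interpolation mode is not stated in the text;
    nearest-neighbor with an integer factor is used for simplicity.
    """
    return np.repeat(np.repeat(latents, factor, axis=-2), factor, axis=-1)


def fuse_tiles_with_prior(tile_preds, tile_origins, tile_weights,
                          prior_pred, gate, lam):
    """Closed-form minimizer of the assumed weighted least-squares energy.

    tile_preds   : list of arrays (..., h, w), per-tile model outputs
    tile_origins : list of (top, left) positions of each tile on the canvas
    tile_weights : list of arrays (h, w), per-pixel tile weights
    prior_pred   : array (..., H, W), the prior in model-output space
    gate         : array (H, W), spatial gating of the prior term
    lam          : scalar prior strength for the current timestep
    """
    numer = lam * gate * prior_pred      # prior contribution to the numerator
    denom = lam * gate                   # new array; safe to modify in place
    for pred, (top, left), w in zip(tile_preds, tile_origins, tile_weights):
        h, wd = pred.shape[-2:]
        numer[..., top:top + h, left:left + wd] += w * pred
        denom[..., top:top + h, left:left + wd] += w
    # Guard against pixels that neither a tile nor the prior covers.
    return numer / np.maximum(denom, 1e-8)


# Toy example: two 6x6 tiles that overlap on columns 4 and 5 of a 6x10 canvas.
rng = np.random.default_rng(0)
small_prior = rng.normal(size=(4, 3, 5))          # low-resolution prior
prior_pred = upsample_latents(small_prior, 2)     # shape (4, 6, 10)
origins = [(0, 0), (0, 4)]
preds = [rng.normal(size=(4, 6, 6)) for _ in origins]
weights = [np.ones((6, 6)) for _ in origins]
gate = np.ones((6, 10))

fused = fuse_tiles_with_prior(preds, origins, weights, prior_pred, gate, lam=0.5)
```

With `lam = 0` the update reduces to plain weighted averaging of overlapping tiles, and as `lam` grows the result approaches the prior, so a scalar schedule over timesteps would trade global coherence against local detail.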

Without any training, Fresco2Video scales image-to-video diffusion models to multi-kilopixel images with many scenes, such as frescoes. Our method quantitatively outperforms tiled-denoising and super-resolution baselines on video evaluation metrics and in user studies.

Key Contributions

  • A training-free formulation that couples tiled denoising with a latent prior via a single objective
  • A closed-form update that balances global coherence and local detail through a scalar schedule and spatial gating
  • Empirical scaling to multi-kilopixel videos with superior metrics and user preference over baselines

Method Comparisons

Compare Fresco2Video against baseline methods on the same input images. Our method achieves both global coherence (no seams, consistent motion) and local detail preservation at high resolutions.

Sample #1 - Classical Fresco

Output: 2182 x 2770 pixels

MultiDiffusion
DynamicScaler
DemoFusion
Wan Solo (Native Res)
Wan + VSR (STAR)

Sample #2 - Panoramic Composition

Output: 892 x 3000 pixels

MultiDiffusion
DynamicScaler
DemoFusion
Wan Solo (Native Res)
Wan + VSR (STAR)

Sample #3 - Historical Painting

Output: 1517 x 2300 pixels

MultiDiffusion
DynamicScaler
DemoFusion
Wan Solo (Native Res)
Wan + VSR (STAR)

Sample #4 - Complex Square Composition

Output: 2048 x 2048 pixels

MultiDiffusion
DynamicScaler
DemoFusion
Wan Solo (Native Res)
Wan + VSR (STAR)