FreeCam: Camera-Controlled Video Generation for High Pose Fidelity without Depth Priors

KAIST
FreeCam teaser image


FreeCam Results. Given a source video and a target camera trajectory, our method generates a video that faithfully follows the specified camera path. The reference coordinate system is defined by the initial frame of the source video (highlighted in red), and FreeCam synthesizes novel-view videos along arbitrary trajectories. The examples show generated videos following a backward (first) and an arc (second) camera trajectory.

Abstract

Novel-view video generation for dynamic scenes has emerged as a prominent direction alongside recent advances in video diffusion models. Nevertheless, existing approaches exhibit limitations that reduce their flexibility. Methods built on Image-to-Video models inherit biases from the base model, constraining the target camera pose of the first frame to remain near the source. The limited diversity of camera trajectories in available datasets further restricts learned models to a narrow range of motions. While projection-based methods that rely on depth estimation impose no explicit camera-pose constraints, they are susceptible to projection errors arising from depth warping.
To address these limitations, we present FreeCam, a depth-free, camera-controlled video-to-video generation framework supporting unconstrained camera paths. Our framework combines two key components: infinite homography warping, which encodes 3D camera rotations directly in a 2D latent space to achieve high camera-pose fidelity; and a data-augmentation pipeline that converts existing multi-view datasets into sequences with unbiased arbitrary trajectories and heterogeneous focal lengths, enabling training across diverse camera motions and focal settings.
When evaluated on an unbiased test set with arbitrary camera poses, FreeCam achieves high camera-pose accuracy while maintaining high visual fidelity, without depth priors. Moreover, despite being trained exclusively on synthetic data, FreeCam generalizes well to real-world videos. Ablation studies show that combining the proposed data-processing pipeline and infinite homography warping yields +5.46 dB PSNR on average. Comparative studies further indicate that FreeCam outperforms existing methods in trajectory accuracy (rotation accuracy +20.9% and translation accuracy +38.0%) while maintaining high visual fidelity.
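The key property behind depth-free warping is that a pure camera rotation induces the infinite homography H = K_tgt · R · K_src⁻¹, which maps source pixels to target pixels independently of scene depth. As a minimal NumPy sketch (not the paper's implementation; function names and the latent-space details are our own illustration, shown here on pixel coordinates):

```python
import numpy as np

def infinite_homography(K_src, K_tgt, R):
    # H = K_tgt @ R @ inv(K_src): maps source pixels to target pixels
    # under a pure rotation R. Depth cancels out (plane at infinity),
    # so no depth estimate is required.
    return K_tgt @ R @ np.linalg.inv(K_src)

def warp_points(H, pts):
    # Apply a 3x3 homography to (N, 2) pixel coordinates
    # via homogeneous coordinates.
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Illustration: a 10-degree yaw with a shared pinhole intrinsic matrix.
f, cx, cy = 500.0, 320.0, 240.0
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(10.0)
R_yaw = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
H = infinite_homography(K, K, R_yaw)

# The principal point shifts horizontally by f * tan(theta).
shifted = warp_points(H, np.array([[cx, cy]]))
```

Because depth never enters H, this is exact for the rotational component of any trajectory; translation, which does depend on depth, is what the learned generator must account for. Heterogeneous focal lengths (as in the proposed data-augmentation pipeline) correspond to differing K_src and K_tgt.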

Qualitative Results (In-the-Wild)

Qualitative Results (WebVid Dataset)

Comparison with Other Methods (Synthetic Dataset)

Comparison with Other Methods (WebVid Dataset)