FreeCam: Camera-Controlled Video Generation for High Pose Fidelity without Depth Priors
Abstract
Novel-view video generation for dynamic scenes has emerged as a prominent direction alongside recent advances in video diffusion models. Nevertheless, existing approaches exhibit limitations that reduce their flexibility. Methods that build on Image-to-Video models inherit biases from the base model, constraining the target camera pose of the first frame to remain close to that of the source view. The limited diversity of camera trajectories in available datasets further restricts learned models to a narrow range of motions. Projection-based methods that rely on depth estimation impose no explicit camera-pose constraints, but they are susceptible to projection errors arising from depth warping. To address these limitations, we present FreeCam, a depth-free, camera-controlled video-to-video generation framework that supports unconstrained camera paths. Our framework combines two key components: infinite homography warping, which encodes 3D camera rotations directly in the 2D latent space to achieve high camera-pose fidelity; and a data-augmentation pipeline that converts existing multi-view datasets into sequences with unbiased, arbitrary trajectories and heterogeneous focal lengths, enabling training across diverse camera motions and focal settings. Evaluated on an unbiased test set with arbitrary camera poses, FreeCam achieves high camera-pose accuracy while maintaining high visual fidelity, without any depth prior. Moreover, despite being trained exclusively on synthetic data, FreeCam generalizes well to real-world videos. Ablation studies show that combining the proposed data-processing pipeline with infinite homography warping yields an average improvement of +5.46 dB in PSNR. Comparative studies further indicate that FreeCam outperforms existing methods in trajectory accuracy (+20.9% in rotation accuracy and +38.0% in translation accuracy) while preserving visual fidelity.
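As a point of reference for the infinite homography warping mentioned above (this is the standard multi-view-geometry identity, not a statement of FreeCam's exact latent-space formulation): for a pure rotation $R$ between a source view with intrinsic matrix $K$ and a target view with intrinsics $K'$, corresponding image points are related by a depth-independent homography,
$$H_\infty = K' \, R \, K^{-1}, \qquad \tilde{x}' \sim H_\infty \, \tilde{x},$$
where $\tilde{x}$ and $\tilde{x}'$ are homogeneous pixel coordinates in the source and target views. Because $H_\infty$ involves no scene depth, a rotation-induced warp of this form can be applied directly to a 2D (latent) feature grid, which is consistent with the depth-free encoding the abstract describes; the paper's precise latent-space formulation may differ.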