Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation

KAIST
*Indicates Equal Contribution
InfCam teaser image


InfCam Results. Given a video and a target camera trajectory, InfCam generates a video that faithfully follows the specified camera path. The world coordinate origin is defined by the first frame’s camera pose (highlighted in red). The leftmost column visualizes the backward, arc, and rotational camera trajectories, and the right side shows the input–generated video pairs corresponding to each trajectory. The rotational trajectory is generated with a shorter focal length to illustrate wide field-of-view generation. The black dashed box in the last row indicates the original field of view of the input video.

Abstract

Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to give creators cinematic camera control in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on trajectory–video pair datasets, or estimate depth from the input video, reproject it along a target trajectory, and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the motions that learned models can faithfully follow.
To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly in the 2D latent space of a video diffusion model; conditioned on this noise-free rotational information, the model predicts the residual parallax term through end-to-end training, achieving high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, and generalizes well from synthetic to real-world data.

Motivation


(a) Proposed infinite homography-based approach. The model is conditioned on images warped by H∞, so it focuses on learning the parallax relative to the plane at infinity. This parallax is confined to the region between the epipole e′ and the point x∞ on the epipolar line l′ (shown as the yellow segment), which reduces the search space and leads to higher camera pose fidelity; the underlying two-view relation is given below the caption. End-to-end training further enables the network to implicitly refine the underlying 3D geometry, correcting inaccuracies in the unprojected 3D point X.
(b) Existing reprojection-based approach. Errors in depth estimation produce unreliable conditioning and cause artifacts in the generated image. Because no gradients flow back into the depth estimation network, the incorrect reprojection position x′ remains fixed during training, preventing these errors from being corrected.
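For reference, the standard two-view relation behind panel (a) makes the decomposition explicit. With the source camera at the world origin, source/target intrinsics K and K′, relative rotation R, and translation t, a homogeneous source pixel x with depth Z maps into the target view as follows (the symbols H∞, e′, x∞, and l′ match the figure):

```latex
% Mapping of a homogeneous source pixel x with depth Z into the target view:
\[
  x' \;\simeq\; \underbrace{K' R K^{-1}}_{H_\infty}\, x \;+\; \frac{1}{Z}\,\underbrace{K' t}_{e'} .
\]
% As Z \to \infty the parallax term vanishes and x' \to H_\infty x = x_\infty,
% so conditioning on the H_\infty-warped image removes the rotation component
% and leaves only the depth-dependent parallax along the epipolar line l'.
```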

Method

Model Architecture

Infinite Homography vs. Reprojection-based Conditioning

(a) DiT block with homography-guided self-attention layer. The homography-guided self-attention layer takes source, target, and warped latents, combined with camera embeddings, as input and performs per-frame attention, ensuring temporal alignment. By conditioning on warped latents, the model enables rotation-aware reasoning and constrained parallax estimation. Only the source and target latents proceed to the subsequent Wan2.1 layers.
(b) Warping module. This module warps the input latent with infinite homography to handle rotation, then adds camera embeddings to account for translation. This decomposition simplifies reprojection to parallax estimation relative to the plane at infinity, enabling higher camera trajectory fidelity.
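To make the warping step concrete, the sketch below shows one way an infinite-homography warp of a latent could be implemented in PyTorch, assuming per-frame intrinsics and the relative rotation are known. The function names, tensor shapes, and the bilinear backward-warping choice are illustrative assumptions, not the released InfCam code.

```python
# Minimal sketch of the warping step in (b); illustrative only, not the
# released InfCam implementation.
import torch
import torch.nn.functional as F

def infinite_homography(K_src: torch.Tensor, K_tgt: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """H_inf = K_tgt R K_src^{-1}: maps source pixels to target pixels for the
    plane at infinity, i.e. the pure-rotation component of the camera motion."""
    return K_tgt @ R @ torch.linalg.inv(K_src)

def warp_latent(latent: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Backward-warp a latent feature map (B, C, H, W) with homography H:
    every target pixel looks up its source location through H^{-1}."""
    B, C, Hh, Ww = latent.shape
    ys, xs = torch.meshgrid(
        torch.arange(Hh, dtype=latent.dtype, device=latent.device),
        torch.arange(Ww, dtype=latent.dtype, device=latent.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    tgt = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)   # (H*W, 3) homogeneous pixels
    src = (torch.linalg.inv(H) @ tgt.T).T                      # source coords, homogeneous
    src = src[:, :2] / src[:, 2:3]                             # dehomogenize
    # Normalize to [-1, 1] for grid_sample; out-of-range samples are zero-padded.
    grid_x = 2.0 * src[:, 0] / (Ww - 1) - 1.0
    grid_y = 2.0 * src[:, 1] / (Hh - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(1, Hh, Ww, 2).expand(B, -1, -1, -1)
    return F.grid_sample(latent, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Per panel (b), such a warp accounts only for rotation; translation is injected separately through the camera embeddings added to the warped latent.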

Augmented MultiCamVideo (AugMCV)


(a) SynCamVideo. Captured with stationary cameras placed at distinct positions.
(b) MultiCamVideo. Captured with dynamic cameras following diverse trajectories, all sharing the same initial frame.
(c) Augmented MultiCamVideo. An augmented version of MultiCamVideo with varied starting poses and different focal lengths.
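A hedged sketch of one way the augmentation in (c) could be realized: re-anchor each MultiCamVideo trajectory at a sampled starting pose and rescale its focal length. The helper names and the NumPy formulation are assumptions for illustration, not the exact AugMCV recipe.

```python
# Illustrative trajectory-level augmentation; not the exact AugMCV pipeline.
import numpy as np

def reanchor_trajectory(c2w_traj: np.ndarray, new_start_c2w: np.ndarray) -> np.ndarray:
    """c2w_traj: (T, 4, 4) camera-to-world poses of one trajectory.
    Express the trajectory relative to its first camera, then transplant it
    onto a new starting pose, preserving the relative motion while varying
    the absolute start."""
    rel = np.linalg.inv(c2w_traj[0]) @ c2w_traj   # (T, 4, 4); rel[0] is the identity
    return new_start_c2w @ rel

def scale_focal(K: np.ndarray, scale: float) -> np.ndarray:
    """Return a copy of the 3x3 intrinsics K with fx and fy scaled by `scale`,
    keeping the principal point fixed."""
    K = K.copy()
    K[0, 0] *= scale
    K[1, 1] *= scale
    return K
```

Varying the starting pose and the focal scale per sequence yields the broader range of trajectories and fields of view that AugMCV is intended to provide.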

Experimental Results

Qualitative Results (In-the-Wild)

Qualitative Results (WebVid Dataset)

Quantitative Results

AugMCV results
WebVid results

AugMCV dataset. We evaluate our method under two scenarios: (1) source and target videos with identical camera intrinsics, and (2) source and target videos with different camera intrinsics. Across both settings and all metrics, our approach consistently outperforms the baselines, producing videos that are clearly closer to the ground truth.
WebVid dataset. We further validate our method on the WebVid dataset, where it again consistently outperforms baseline approaches in terms of both camera pose accuracy and visual fidelity, with particularly pronounced gains in camera pose accuracy.