Instant4D: 4D Gaussian Splatting in Minutes

Teaser

Instant4D reconstructs a casual video in minutes.

Highlights

  • We propose Instant4D, a modern and fully automated pipeline that reconstructs casual monocular videos in a few minutes, achieving a 30x speed-up.
  • We introduce a grid pruning strategy that reduces the number of Gaussians by 92%, preserving occlusion structure and enabling scalability to long video sequences (a minimal pruning sketch follows this list).
  • We present a novel design for 4DGS in the monocular setup, which outperforms current state-of-the-art methods by 29% on the DyCheck dataset.
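
For intuition, here is a minimal sketch of what a voxel-grid pruning step like the one above could look like; the function name, voxel size, and keep-one-point-per-voxel policy are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def grid_prune(points, colors, voxel_size=0.05):
    """Keep roughly one point per occupied voxel of the given size.

    `points` is an (N, 3) array of 3D positions and `colors` an (N, 3) array
    of per-point colors; the names and the voxel size are illustrative.
    """
    # Quantize each point into an integer voxel index.
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Keep the first point encountered in each occupied voxel.
    _, keep = np.unique(voxel_idx, axis=0, return_index=True)
    return points[keep], colors[keep]

# Example: prune a dense back-projected cloud before Gaussian initialization.
dense_pts = np.random.rand(1_000_000, 3)
dense_rgb = np.random.rand(1_000_000, 3)
sparse_pts, sparse_rgb = grid_prune(dense_pts, dense_rgb)
print(f"kept {len(sparse_pts)} of {len(dense_pts)} points")
```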
Videos: The videos above show reconstruction results on Sora-generated videos. Each video has a 9:16 aspect ratio and a duration of 5 seconds. Reconstruction takes around 5 minutes per video, including preprocessing.

Abstract

In this work, we present Instant4D, a monocular reconstruction system that leverages a native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors.

Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30x speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. We further apply our model to in-the-wild videos, showcasing its generalizability.
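
A common way to realize such a time-aware Gaussian set is to give each Gaussian a temporal center and scale and attenuate its opacity with a 1D Gaussian in time; the sketch below illustrates that general idea only and is not claimed to be Instant4D's exact parameterization.

```python
import torch

def temporal_opacity(base_opacity, t_center, t_scale, t):
    """Modulate per-Gaussian opacity by a 1D Gaussian in time.

    base_opacity: (N,) spatial opacity of each Gaussian
    t_center:     (N,) temporal mean of each Gaussian
    t_scale:      (N,) temporal standard deviation
    t:            scalar query timestamp
    All names are illustrative; the paper's parameterization may differ.
    """
    w = torch.exp(-0.5 * ((t - t_center) / t_scale) ** 2)
    return base_opacity * w
```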

Method Pipeline

Model Pipeline
Figure 1: Pipeline of Instant4D. We use a deep visual SLAM model and UniDepth to obtain camera parameters and metric depth. The metric depth is further optimized into consistent video depth. We then back-project the consistent depth to obtain a dense point cloud, which is voxel-filtered into a sparse point cloud, as discussed in Section 3.2. Based on this 4D Gaussian initialization, we can reconstruct a scene in two minutes. More details about the optimization are described in Section 3.3.
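
As a rough illustration of the back-projection step described above, the snippet below lifts a metric depth map into world-space points using pinhole intrinsics; the variable names, intrinsics layout, and pose convention are assumptions rather than the project's actual interface.

```python
import numpy as np

def backproject_depth(depth, K, c2w):
    """Lift an (H, W) metric depth map into world-space 3D points.

    K is the 3x3 pinhole intrinsics matrix and c2w the 4x4 camera-to-world
    pose (e.g. from the visual SLAM stage); both are assumed inputs.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    # Unproject pixels to camera coordinates with the pinhole model.
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (H*W, 4)
    # Transform to world coordinates with the camera-to-world pose.
    pts_world = (c2w @ pts_cam.T).T[:, :3]                    # (H*W, 3)
    return pts_world
```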
DyCheck Results
Figure 2: Qualitative results on the DyCheck dataset.
DyCheck Results
Table 2: DyCheck iPhone benchmark. Methods above the mid-rule are trained with ground-truth cameras; those below operate without calibrated poses. Runtime denotes the mean training time per scene and Mem the peak GPU memory during optimization. Runtimes for RoDyGS, RoDynRF, and D-NeRF are provided by the authors of RoDyGS.