Instant4D: 4D Gaussian Splatting in Minutes

Teaser

Instant4D reconstructs a casual video in minutes.

Highlights

  • We propose Instant4D, a modern and fully automated pipeline that reconstructs casual monocular videos in a few minutes, achieving a 30x speed-up.
  • We introduce a grid pruning strategy that reduces the number of Gaussians by 92%, preserving occlusion structure and enabling scalability to long video sequences (a minimal pruning sketch follows this list).
  • We present a novel design for 4DGS in the monocular setup, which outperforms current state-of-the-art methods by 29% on the DyCheck dataset.
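
For intuition, here is a minimal sketch of what a voxel-grid pruning step like the one above could look like; the function name, voxel size, and keep-one-point-per-voxel policy are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def grid_prune(points, colors, voxel_size=0.05):
    """Keep roughly one point per occupied voxel of the given size.

    `points` is an (N, 3) array of 3D positions and `colors` an (N, 3) array
    of per-point colors; the names and the voxel size are illustrative.
    """
    # Quantize each point into an integer voxel index.
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Keep the first point encountered in each occupied voxel.
    _, keep = np.unique(voxel_idx, axis=0, return_index=True)
    return points[keep], colors[keep]

# Example: prune a dense back-projected cloud before Gaussian initialization.
dense_pts = np.random.rand(1_000_000, 3)
dense_rgb = np.random.rand(1_000_000, 3)
sparse_pts, sparse_rgb = grid_prune(dense_pts, dense_rgb)
print(f"kept {len(sparse_pts)} of {len(dense_pts)} points")
```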
Videos: The videos above show reconstruction results on Sora-generated videos. Each video has a 9:16 aspect ratio and a duration of 5 seconds. Reconstruction takes around 5 minutes per video, including preprocessing.

Abstract

In this work, we present Instant4D, a monocular reconstruction system that leverages a native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors.

Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30x speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. We further apply our model to in-the-wild videos, showcasing its generalizability.
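
A common way to realize such a time-aware Gaussian set is to give each Gaussian a temporal center and scale and attenuate its opacity with a 1D Gaussian in time; the sketch below illustrates that general idea only and is not claimed to be Instant4D's exact parameterization.

```python
import torch

def temporal_opacity(base_opacity, t_center, t_scale, t):
    """Modulate per-Gaussian opacity by a 1D Gaussian in time.

    base_opacity: (N,) spatial opacity of each Gaussian
    t_center:     (N,) temporal mean of each Gaussian
    t_scale:      (N,) temporal standard deviation
    t:            scalar query timestamp
    All names are illustrative; the paper's parameterization may differ.
    """
    w = torch.exp(-0.5 * ((t - t_center) / t_scale) ** 2)
    return base_opacity * w
```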

Method Pipeline

Model Pipeline
Figure 1: Pipeline of Instant4D. We use a deep visual SLAM model and UniDepth to obtain camera parameters and metric depth. The metric depth is further optimized into consistent video depth. We then back-project the consistent depth to obtain a dense point cloud, which is voxel-filtered into a sparse point cloud, as discussed in Section 3.2. Based on this 4D Gaussian initialization, we can reconstruct a scene in two minutes. More details about the optimization are described in Section 3.3.
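
As a rough illustration of the back-projection step described above, the snippet below lifts a metric depth map into world-space points using pinhole intrinsics; the variable names, intrinsics layout, and pose convention are assumptions rather than the project's actual interface.

```python
import numpy as np

def backproject_depth(depth, K, c2w):
    """Lift an (H, W) metric depth map into world-space 3D points.

    K is the 3x3 pinhole intrinsics matrix and c2w the 4x4 camera-to-world
    pose (e.g. from the visual SLAM stage); both are assumed inputs.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    # Unproject pixels to camera coordinates with the pinhole model.
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (H*W, 4)
    # Transform to world coordinates with the camera-to-world pose.
    pts_world = (c2w @ pts_cam.T).T[:, :3]                    # (H*W, 3)
    return pts_world
```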
DyCheck Results
Figure 2: Qualitative results on the DyCheck dataset.
DyCheck Results
Table 2: DyCheck iPhone benchmark. Methods above the mid-rule are trained with ground-truth cameras; those below operate without calibrated poses. Runtime denotes the mean training time per scene and Mem the peak GPU memory during optimization. Runtimes for RoDyGS, RoDynRF, and D-NeRF are provided by the authors of RoDyGS.