MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction

CVPR 2025

Wenyuan Zhang1, Yixiao Yang1, Han Huang1, Liang Han1, Kanle Shi2, Yu-Shen Liu1, Zhizhong Han3
1School of Software, Tsinghua University, 2Kuaishou Technology, 3Wayne State University

Abstract

Monocular depth priors have been widely adopted by neural rendering in multi-view tasks such as 3D reconstruction and novel view synthesis. However, due to inconsistent predictions across views, how to effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately and use it as ground-truth supervision, ignoring the inherent inaccuracy and cross-view inconsistency of monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning the depths of each segmented instance from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with the corresponding instance masks on nearby views. MonoInstance is a versatile strategy that can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves performance in both reconstruction and novel view synthesis on various benchmarks.

Method

In this paper, we introduce MonoInstance, which detects uncertainty in 3D according to inconsistent cues from monocular priors across multiple views. Our method is a general strategy that enhances monocular priors for various multi-view neural rendering and reconstruction frameworks. Based on the uncertainty maps, we introduce novel strategies to reduce the negative impact of inconsistent monocular cues and to mine more reliable supervision through photometric consistency, as sketched below.
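To make this concrete, here is a minimal PyTorch-style sketch of how such strategies could look: a per-pixel uncertainty map (estimated as in the overview that follows) downweights the monocular depth loss, and photometric consistency with a nearby view takes over where the prior is unreliable. The function names, the threshold, and the exact loss forms are illustrative assumptions rather than the paper's exact formulation.

import torch

def adaptive_depth_loss(rendered_depth, mono_depth, uncertainty):
    # Downweight depth supervision where the monocular prior is uncertain
    # (linear weighting is an assumption for illustration).
    w = (1.0 - uncertainty).clamp(min=0.0)
    return (w * (rendered_depth - mono_depth).abs()).mean()

def photometric_consistency_loss(rendered_rgb, warped_rgb, uncertainty, thresh=0.5):
    # In high-uncertainty regions, compare rendered colors with colors warped
    # from a nearby view (e.g., via the rendered depth) instead of trusting
    # the depth prior. The hard threshold is an assumption.
    unreliable = (uncertainty > thresh).float()
    return (unreliable * (rendered_rgb - warped_rgb).abs().mean(dim=-1)).mean()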

Here is an overview of our method, taking multi-view 3D reconstruction through NeRF-based rendering as an example. (a) Starting from multi-view consistent instance segmentation and estimated monocular depths, we align the same instance across viewpoints by back-projecting its per-view depths into a point cloud. Inconsistent monocular cues across views then translate into a density measure in the neighborhood of each point, which yields per-view uncertainty maps (Section 3.2). The estimated uncertainty maps are further utilized in (b) the neural rendering pipeline to guide an adaptive depth loss and ray sampling (Section 3.4), and (c) an instance mask constraint (Section 3.3).
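As a rough illustration of step (a), the sketch below back-projects the per-view depths of one instance into a shared point cloud and scores each pixel by the local point density, using the mean distance to the k nearest points contributed by the other views as the density proxy. The kNN measure, the per-instance normalization, and all function names are our assumptions, and the monocular depths are assumed to already be scale-and-shift aligned to the scene.

import numpy as np
from scipy.spatial import cKDTree

def backproject(depth, mask, K, c2w):
    # Lift the masked pixels of one view to world-space points.
    # depth: (H, W) monocular depth, mask: (H, W) bool instance mask,
    # K: (3, 3) intrinsics, c2w: (4, 4) camera-to-world pose.
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (N, 4)
    pts_world = (c2w @ pts_cam.T).T[:, :3]                    # (N, 3)
    return pts_world, (v, u)

def instance_uncertainty(depths, masks, Ks, c2ws, k=8):
    # Fuse one instance from all views and score each pixel by how isolated its
    # back-projected point is among points from the other views: a sparse
    # neighborhood (low density) indicates high uncertainty.
    per_view = [backproject(d, m, K, c2w)
                for d, m, K, c2w in zip(depths, masks, Ks, c2ws)]
    all_pts = np.concatenate([p for p, _ in per_view], axis=0)
    view_id = np.concatenate([np.full(len(p), i) for i, (p, _) in enumerate(per_view)])

    unc_maps = []
    for i, (pts, (v, u)) in enumerate(per_view):
        others = all_pts[view_id != i]
        tree = cKDTree(others)
        dist, _ = tree.query(pts, k=k)            # distances to k nearest points from other views
        score = dist.mean(axis=1)                 # large mean distance -> sparse neighborhood
        score = score / (score.max() + 1e-8)      # normalize to [0, 1] per instance (assumption)
        unc = np.zeros(depths[i].shape, dtype=np.float32)
        unc[v, u] = score
        unc_maps.append(unc)
    return unc_maps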

Visualization Results

Experiments on Dense-view Reconstruction Task

We compare our method with the latest indoor scene reconstruction methods under dense viewpoints on the ScanNet and Replica datasets.

Experiments on Sparse-view Reconstruction Task

We also evaluate our method on reconstructing 3D shapes from sparse observations on the DTU dataset. Each scene shows a single object with background, captured from 3 viewpoints with small overlap.

Experiments on Sparse Novel View Synthesis Task

We further evaluate our method on the 3DGS-based sparse-input novel view synthesis (NVS) task on the LLFF dataset. Each of the forward-facing real-world scenes contains 3 input views. In the uncertainty maps, whiter areas indicate higher uncertainty.

Scene Display and More Visualization Results

BibTeX

@inproceedings{zhang2025monoinstance,
  title={MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction},
  author={Wenyuan Zhang and Yixiao Yang and Han Huang and Liang Han and Kanle Shi and Yu-Shen Liu and Zhizhong Han},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}