NVPanoptix-3D (Matterport3D checkpoint)

🤗 Hugging Face   |   🚀 TAO Toolkit (coming soon)  

NVPanoptix-3D Demo

Description

NVPanoptix-3D is a 3D panoptic reconstruction model that reconstructs complete 3D indoor scenes from single RGB images, simultaneously performing 2D panoptic segmentation, depth estimation, 3D scene reconstruction, and 3D panoptic segmentation. Built on the Uni-3D (ICCV 2023) baseline architecture, it enhances 3D understanding by replacing the backbone with VGGT (Visual Geometry Grounded Transformer) and integrating multi-plane occupancy-aware lifting from BUOL (CVPR 2023) for improved 3D scene re-projection. The 3D stage uses WarpConvNet sparse convolutions, providing native support for modern NVIDIA GPU architectures and recent CUDA releases (12.x and 13.x). The model reconstructs complete 3D scenes with both object instances (things) and scene layout (stuff) in a unified framework. This model was trained on the Matterport3D dataset.

This model is ready for non-commercial use.

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA License. Additional information: Apache-2.0 License for https://github.com/mlpc-ucsd/Uni-3D?tab=readme-ov-file; https://github.com/facebookresearch/vggt/blob/main/LICENSE.txt for https://github.com/facebookresearch/vggt; Apache-2.0 License for https://github.com/NVlabs/WarpConvNet.

Deployment Geography

Global

Use Case

This model is intended for researchers and developers building 3D scene understanding applications for indoor environments, including robotics navigation, augmented reality, virtual reality, and architectural visualization.

How to use

Setup environment

# Setup NVPanoptix-3D env (CUDA 13.0):
conda create -n nvpanoptix python=3.10 -y

# Activate environment
source activate nvpanoptix 
# or
# conda activate nvpanoptix

# Install system dependencies (git-lfs must be available before cloning)
apt-get update && apt-get install -y git git-lfs ninja-build cmake libopenblas-dev

# Clone repo
git clone https://huggingface.co/nvidia/nvpanoptix-3d-v1.1-matterport3d
cd nvpanoptix-3d-v1.1-matterport3d

# Download the large checkpoints via Git LFS
git lfs install
git lfs pull

# Install Python dependencies
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu130
pip install -r requirements.txt

# Install WarpConvNet (sparse 3D convolutions)
# Set CUDA version (if using in Dockerfile, use ENV; in shell, use export)
export CUDA=cu130

# Install core dependencies
pip install --no-deps cupy-cuda13x==13.6.0  # use cupy-cuda13x for CUDA 13.x

# Build torch-scatter from source against the current torch/cuda stack
FORCE_CUDA=1 pip install --no-build-isolation --no-cache-dir --no-binary=torch-scatter torch-scatter

# Install patched WarpConvNet for CUDA 13.x
git clone https://github.com/daocongtuyen2x/WarpConvNet.git
cd WarpConvNet
if [ -d .git ]; then git submodule sync --recursive && git submodule update --init --recursive; fi
pip install --no-build-isolation .
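
After the install steps above, a quick sanity check can catch a broken environment before running inference. The sketch below only reports which of the expected packages are discoverable, without importing them; the import names (`warpconvnet` in particular) are our assumptions, not confirmed by this repo.

```python
import importlib.util

def check_env(required=("torch", "torchvision", "cupy", "warpconvnet")):
    """Map each expected package name to whether it is discoverable on this system."""
    return {name: importlib.util.find_spec(name) is not None for name in required}

for name, ok in check_env().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

If any package reports MISSING, re-run the corresponding install step before proceeding to the Quick Start.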

Quick Start

from model import NVPanoptix3DModel
from preprocessing import load_image
from visualization import save_outputs
from PIL import Image
import numpy as np

# Load model from local directory
model = NVPanoptix3DModel.from_pretrained("path/to/local/repo/directory")

# Or load from HF repo
# model = NVPanoptix3DModel.from_pretrained("nvidia/nvpanoptix-3d-v1.1-matterport3d")

# Load and preprocess image
image_path = "path/to/your/image.png"

# keep original image for visualization
orig_image = Image.open(image_path).convert("RGB")
orig_image = np.array(orig_image)

# load processed image for inference
image = load_image(image_path, target_size=(320, 240))

# Run inference
outputs = model.predict(image)

# Save results (2D segmentation, depth map, 3D mesh)
save_outputs(outputs, "output_dir/", original_image=orig_image)

# Access individual outputs
print(f"2D Panoptic: {outputs.panoptic_seg_2d.shape}")   # (120, 160)
print(f"2D Depth: {outputs.depth_2d.shape}")             # (120, 160)
print(f"3D Geometry: {outputs.geometry_3d.shape}")       # (256, 256, 256)
print(f"3D Semantic: {outputs.semantic_seg_3d.shape}")   # (256, 256, 256)
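
Panoptic outputs typically pack class and instance information into a single integer ID per pixel or voxel. As an illustration only (the actual encoding used by NVPanoptix-3D is not documented here, and both the convention and the divisor below are assumptions), a common scheme packs them as `class_id * divisor + instance_idx`:

```python
INSTANCE_DIVISOR = 1000  # assumed packing divisor; the model's real encoding may differ

def decode_panoptic_id(panoptic_id: int) -> tuple[int, int]:
    """Split a packed panoptic ID into (semantic_class, instance_index)."""
    return panoptic_id // INSTANCE_DIVISOR, panoptic_id % INSTANCE_DIVISOR

print(decode_panoptic_id(5003))  # -> (5, 3) under this assumed encoding
```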

Release Date

Hugging Face: 03/26/2026 via https://huggingface.co/nvidia/nvpanoptix-3d-v1.1-matterport3d

Model Architecture

Architecture Type: Two-Stage Architecture (Transformer + Sparse Convolutional Network)

Network Architecture:

  • 2D Stage: Transformer-based (VGGT Backbone + Mask2Former-style Decoder)

  • 3D Stage: WarpConvNet Sparse 3D CNN Frustum Decoder

  • Number of parameters: 1.4 × 10⁹ (1.4 billion)

  • This model was developed based on: Uni-3D (ICCV 2023) with VGGT backbone replacement, BUOL occupancy-aware lifting integration, and WarpConvNet sparse 3D convolutions replacing MinkowskiEngine.

Input

Input Type: Image

Input Format:

  • Image: Red, Green, Blue (RGB)

Input Parameters:

  • Image: Two-Dimensional (2D)

Other Properties Related to Input:

  • RGB Image:
    • Standard size 240 x 320 (H x W), uint8 [0, 255]
    • Processed internally to ~N x 448 (height adjusted to be divisible by 14) for VGGT backbone
    • Minimum resolution: 240 x 320
    • Padded to ensure dimensions divisible by 32 for multi-scale processing
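
The two divisibility constraints above (height a multiple of 14 for the VGGT backbone, padded dimensions multiples of 32 for multi-scale processing) can be sketched with a simple round-up helper. This is illustrative only; the model's actual preprocessing lives in `preprocessing.load_image`.

```python
def round_up(size: int, multiple: int) -> int:
    """Round size up to the nearest multiple."""
    return -(-size // multiple) * multiple

h, w = 240, 320                          # standard input (H x W)
print(round_up(h, 32), round_up(w, 32))  # padded for multi-scale processing: 256 320
print(round_up(448, 14))                 # 448 is already divisible by 14: 448
```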

Outputs

Output Types: Mask, Depth Map, 3D Geometry/Segmentation

Output Formats:

  • 2D Segmentation: Binary masks with integer instance IDs
  • Depth Map: Floating point depth values in meters
  • 3D Geometry: Truncated Signed Distance Field (TSDF)
  • 3D Segmentation: Integer labels (instance IDs and semantic classes)

Output Parameters:

  • 2D Masks: Two-Dimensional (2D)
  • Depth Map: Two-Dimensional (2D)
  • 3D Geometry/Segmentation: Three-Dimensional (3D)

Other Properties Related to Output:

2D Outputs:

  • pred_logits: [Batch, 100, 14] - Classification scores for 100 queries across 13 semantic classes + background
  • pred_masks: [Batch, 100, H/2, W/2] - Binary segmentation masks for each query
  • pred_depths: [Batch, 1, H/2, W/2] - Per-pixel depth in meters, range [0.4, 6.0]
  • panoptic_seg: [H/2, W/2] - 2D panoptic segmentation with instance IDs
  • pose_enc: [Batch, 9] - Camera pose encoding from VGGT
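
Since `pred_depths` is specified as metric depth in the range [0.4, 6.0], downstream code often clamps raw values to that range before use. A minimal per-value sketch (the constant names are ours, not the model's):

```python
DEPTH_MIN, DEPTH_MAX = 0.4, 6.0  # valid metric range for pred_depths, in meters

def clamp_depth(d: float) -> float:
    """Clamp a raw depth prediction to the model's stated valid range."""
    return min(DEPTH_MAX, max(DEPTH_MIN, d))

print(clamp_depth(0.1), clamp_depth(2.5), clamp_depth(9.0))  # 0.4 2.5 6.0
```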

3D Outputs:

  • geometry: [Batch, 256, 256, 256] - TSDF representing reconstructed 3D geometry
  • panoptic_seg_3d: [Batch, 256, 256, 256] - 3D panoptic segmentation with instance IDs
  • semantic_seg_3d: [Batch, 256, 256, 256] - 3D semantic segmentation with class labels
  • instance_info: List of dictionaries containing per-instance 3D meshes and metadata
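
The TSDF in `geometry` stores, per voxel, a signed distance to the nearest surface, truncated and normalized; the surface itself lies at the zero crossing, from which meshes can be extracted (e.g., via marching cubes). A toy per-value sketch of the truncate-and-normalize step — the truncation distance of 0.12 m is an assumed example, not the model's actual setting:

```python
def tsdf_truncate(signed_distance: float, trunc: float = 0.12) -> float:
    """Clamp a signed distance (meters) to [-trunc, trunc] and normalize to [-1, 1]."""
    d = max(-trunc, min(trunc, signed_distance))
    return d / trunc

# values near 0 are near the surface; +/-1 means at least trunc meters away
print(tsdf_truncate(0.03), tsdf_truncate(-0.5))  # 0.25 -1.0
```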

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s):

  • TAO Toolkit Triton Apps

Supported Hardware Microarchitecture Compatibility:

  • Optimized for NVIDIA A100 80GB GPUs (Ampere architecture).
  • Requires GPU with high memory capacity (≥40GB recommended).
  • Compatible with NVIDIA Ampere (A100), Hopper (H100), and Blackwell (B200) GPU architectures via WarpConvNet sparse convolution support.

Preferred/Supported Operating System(s):

  • Preferred: Ubuntu 22.04.5 LTS (Jammy Jellyfish), tested with CUDA 13.0.

  • Supported: Other Ubuntu versions (20.04+, 22.04+) and Linux distributions with compatible CUDA 12.x & 13.x drivers.

The model requires NVIDIA GPU with ≥40GB memory for training and ≥30GB for inference. By leveraging NVIDIA hardware (GPU cores) and software frameworks (CUDA libraries), the model achieves efficient training and inference. The integration of this model into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment.

Model Version(s)

1.1

(Pre-trained NVPanoptix3D model with WarpConvNet backend, deployable to Triton Inference Server for inference)

Training, Testing, and Evaluation Datasets

Dataset Overview

Total Number of Datasets: 1 (Matterport3D)

Data Modality: Image, 3D Geometry

Matterport3D

Link: https://niessner.github.io/Matterport/
Data Modality: Image, 3D Geometry
Image Training Data Size: Less than a Million Images
Data Collection Method: Automatic/Sensors - Real-world 3D scans using Matterport Pro camera
Labeling Method: Hybrid: Automatic/Sensors, Human - Semi-automatic with human verification
Properties: Matterport3D is a real-world dataset comprising 3D reconstructions of 90 indoor scenes. It provides RGB images, depth maps, camera poses, and semantic annotations across diverse environments such as homes, offices, and other building types. Each scene includes dense 3D point clouds and surface reconstructions annotated with category-level semantic labels. The dataset is divided into 34,737, 4,898, and 8,631 images for training, validation, and testing, corresponding to 61, 11, and 18 scenes, respectively.

Inference

Acceleration Engine: Triton
Test Hardware:

  • 1x NVIDIA A100 80GB
  • 1x NVIDIA H100 80GB

Configuration:

  • Precision: FP32

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
