Accepted to CVPR 2026

LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

¹AIGS, UNIST   ²GSAI, POSTECH
* Work done while at UNIST. Corresponding author.
LightSplat removes iterative optimization and dense per-Gaussian language features from open-vocabulary 3D scene understanding. It injects compact 2-byte semantic indices, filters unreliable masks with 3D support, and consolidates semantics at the object cluster level. The result is a training-free pipeline that remains interpretable, memory-efficient, and fast, completing feature distillation in roughly five seconds.
Distillation Time
~5 Seconds
Single-step, training-free semantic injection.
Speedup
50-400×
Versus recent open-vocabulary 3D scene understanding baselines.
Memory Footprint
64× Lower
Cluster-level semantics replace heavy Gaussian-level features.
Gaussian Storage
2 Bytes
Compact indexing keeps the representation lightweight and scalable.
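The storage saving follows from simple arithmetic: a dense baseline keeps a full language embedding on every Gaussian, while index-based storage keeps one 2-byte index per Gaussian plus one small shared index-to-feature table. The sketch below is a back-of-envelope comparison; the scene size, feature dimension, and table size are illustrative assumptions, not the paper's exact configuration.

```python
# Back-of-envelope memory comparison: dense per-Gaussian language features
# vs. 2-byte semantic indices plus a shared index->feature table.
# All sizes here are assumed for illustration.
import numpy as np

num_gaussians = 1_000_000      # assumed scene size
feat_dim = 512                 # e.g. a CLIP-sized embedding
table_entries = 4096           # assumed number of distinct semantic entries

# Dense baseline: one float32 feature vector per Gaussian.
dense_bytes = num_gaussians * feat_dim * 4

# Index-based: one uint16 index per Gaussian + one shared feature table.
index_bytes = num_gaussians * np.dtype(np.uint16).itemsize
table_bytes = table_entries * feat_dim * 4

print(f"dense:   {dense_bytes / 2**20:.1f} MiB")
print(f"indexed: {(index_bytes + table_bytes) / 2**20:.1f} MiB")
```

Under these assumptions the indexed representation is two orders of magnitude smaller, which is the regime the 64× headline figure lives in.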

Figure 1. Comprehensive comparison of speed, performance, and memory overhead. We evaluate recent open-vocabulary 3D scene understanding models in terms of distillation time (x-axis), segmentation performance (y-axis), and memory overhead (circle size). LightSplat achieves 50× faster feature distillation, higher accuracy, and 64× lower memory usage. LUDVIG's circle is shown at half size because it is too large to display at full scale.

Abstract

Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments.

To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead.

We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes, achieving state-of-the-art performance with up to 50-400× speedup and 64× lower memory.

Why Existing Methods Do Not Scale

The bottleneck is not just accuracy. Existing open-vocabulary 3D pipelines usually pay for semantics in the most expensive possible way: repeated optimization, dense language features on every Gaussian, and supervision that remains overly dependent on 2D rendering.

Problem 1
Optimization Bottleneck

Repeated rendering and CLIP alignment turn feature distillation into the bottleneck.

Problem 2
Feature Memory Load

Dense language features inflate storage and force expensive query-time comparisons.

Problem 3
2D-to-3D Drift

Blurry rendered features weaken geometry-consistent semantics in 3D.

How LightSplat Works

LightSplat directly links 2D semantics to 3D structure through compact mask indices. Instead of optimizing dense features on every Gaussian, it injects lightweight indices, filters unreliable masks, and groups related Gaussians into coherent 3D object clusters. This design enables efficient open-vocabulary 3D understanding without iterative training.
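The injection idea above can be sketched in a few lines. Here `blend_weights` is a hypothetical stand-in for however Gaussian-to-mask influence is scored across views (the paper's exact criterion may differ); the point is that each Gaussian ends up storing a single uint16 index rather than a feature vector.

```python
# Minimal sketch of indexed semantic injection. `blend_weights` is an
# illustrative stand-in for the Gaussian-to-mask influence scores.
import numpy as np

def inject_indices(blend_weights: np.ndarray) -> np.ndarray:
    """blend_weights[g, m]: how strongly Gaussian g contributes to mask m
    across the training views. Each Gaussian keeps only the index of its
    most influential mask -- a single 2-byte uint16."""
    return blend_weights.argmax(axis=1).astype(np.uint16)

# Toy example: 4 Gaussians, 3 candidate masks.
w = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7],
              [0.6, 0.4, 0.0]])
idx = inject_indices(w)
print(idx)  # -> [0 1 2 0], 2 bytes per Gaussian
```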

Overall Framework of LightSplat

Figure 2. Overall framework of LightSplat. From multi-view images, we extract SAM masks and their corresponding CLIP features, then align them to the 3D scene through indexed feature injection. 3D-aware mask filtering removes unreliable masks, while index-feature mapping and context-aware 3D clustering build compact object-level representations for efficient, training-free open-vocabulary 3D scene understanding.

Step 1
Multi-View Semantics

Collect SAM masks and CLIP features to build a clean 2D semantic inventory.

Step 2
Indexed Feature Injection

Assign only the most influential mask index instead of full language features.

Step 3
3D-Aware Mask Filtering

Remove weak masks to suppress view-dependent artifacts and stabilize 3D semantics.

Step 4
Context-Aware 3D Clustering

Combine geometric overlap and semantic similarity to form interpretable object-level clusters.
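Step 4 can be sketched as a graph-merging pass: link two masks when they share enough supporting Gaussians (geometry) and their CLIP features agree (semantics), then take connected components. This is a minimal union-find sketch of that idea; the thresholds and the exact overlap measure are illustrative assumptions, not the paper's reported settings.

```python
# Hypothetical sketch of context-aware 3D clustering: merge masks that are
# both geometrically overlapping and semantically similar, then read off
# connected components. Thresholds are illustrative.
import numpy as np

def cluster_masks(gaussian_sets, feats, iou_thr=0.3, sim_thr=0.8):
    """gaussian_sets[i]: set of Gaussian ids supporting mask i.
    feats[i]: CLIP feature of mask i. Returns a component label per mask."""
    n = len(gaussian_sets)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    parent = list(range(n))                      # union-find over masks

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            inter = len(gaussian_sets[i] & gaussian_sets[j])
            union = len(gaussian_sets[i] | gaussian_sets[j])
            geo_ok = union > 0 and inter / union >= iou_thr
            sem_ok = float(feats[i] @ feats[j]) >= sim_thr
            if geo_ok and sem_ok:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Two overlapping masks of the same object merge; a distant one stays apart.
sets = [{1, 2, 3}, {2, 3, 4}, {9, 10}]
f = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
labels = cluster_masks(sets, f)
print(labels)  # masks 0 and 1 share a label; mask 2 is separate
```

Requiring both signals is what keeps clusters interpretable: geometry alone merges touching objects, semantics alone merges repeated instances across the scene.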

Fast Inference via Cluster-Feature Mapping

Figure 3. Fast inference via cluster-feature mapping. During inference, the text query is compared with a compact set of cluster features instead of all Gaussians or pixels, enabling fast retrieval with compact object-level representations.
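Query time then reduces to one cosine-similarity pass over a handful of cluster features, as in the sketch below. The 4-dim embeddings are toy stand-ins for CLIP text/cluster features; in practice the text embedding would come from a CLIP text encoder.

```python
# Sketch of query-time retrieval against cluster features. The embeddings
# are toy stand-ins for CLIP features; only the comparison pattern matters.
import numpy as np

def query_clusters(text_feat, cluster_feats):
    """Cosine similarity between one text embedding and K cluster features.
    K is small, so this replaces a per-Gaussian (or per-pixel) comparison."""
    t = text_feat / np.linalg.norm(text_feat)
    c = cluster_feats / np.linalg.norm(cluster_feats, axis=1, keepdims=True)
    sims = c @ t
    return int(sims.argmax()), sims

# Toy query: 3 clusters, 4-dim embeddings.
clusters = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
best, sims = query_clusters(np.array([0.1, 0.9, 0.0, 0.0]), clusters)
print(best)  # -> 1: the query matches cluster 1
```

Selecting the Gaussians of the winning cluster is then a table lookup over the stored 2-byte indices, with no per-Gaussian feature comparison.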

Benchmark Results

Across LERF-OVS, DL3DV-OVS, and ScanNet, LightSplat combines strong 3D segmentation quality with low feature distillation cost, fast inference, and low memory overhead.

Table 1. Quantitative Comparison for 3D Object Selection on LERF-OVS
Best and second-best results in each column are highlighted. FD Time denotes feature distillation time; values over 20 minutes are rounded to the nearest minute.
Method         mIoU                                          mAcc@0.25                                     FD Time
               Waldo   Ramen   Figurines  Teatime  Mean      Waldo   Ramen   Figurines  Teatime  Mean
LangSplat       9.61    3.49    9.64       7.91     7.66      9.09    5.63   14.29       8.47     9.37   100 min
LEGaussians    17.21   14.15   16.40      21.93    17.42     27.27   28.17   25.00      33.90    28.56    40 min
OpenGaussian   30.96   24.02   55.38      58.24    42.15     45.45   28.17   75.00      76.27    56.22    50 min
LUDVIG         28.08   29.33   47.43      52.28    39.28     51.33   50.36   67.02      85.72    63.61    10 min
Dr.Splat       39.07   24.70   53.36      57.20    43.58     63.64   35.21   80.36      76.27    63.87     4 min
Ours           34.95   45.07   50.63      59.65    47.58     59.09   57.75   76.79      79.66    68.32     4.2 s
Table 2. Quantitative Comparison for 3D Object Selection on DL3DV-OVS
Best and second-best results in each column are highlighted. FD Time denotes feature distillation time; values over 20 minutes are rounded to the nearest minute.
Method         mIoU                                      mAcc@0.25                                 FD Time
               Park    Shop    Road    Office   Mean     Park    Shop    Road    Office   Mean
LangSplat       8.79    9.28    7.90   15.80    10.44     0.00   17.65   16.67   27.27    15.40   220 min
LEGaussians    21.95    9.70    3.51    8.17    10.83    33.33   11.76    0.00    0.00    11.27    40 min
OpenGaussian   29.65   19.51   41.59    6.78    24.38    41.67   35.29   66.67    0.00    35.91    60 min
LUDVIG         37.03   33.56   27.68   18.58    29.21    70.00   56.75   53.57   47.22    56.89    12 min
Ours           69.78   35.59   31.43   43.13    44.98    91.67   47.06   50.00   54.55    60.82     4.8 s
Table 3. Quantitative Comparison for 3D Semantic Segmentation on ScanNet
Best and second-best results in each column are highlighted. FD Time, Runtime, and Memory denote feature distillation time, average inference time per text query, and feature size per Gaussian, respectively.
Method         19 Classes       15 Classes       10 Classes       FD Time   Runtime (s)   Memory (B)
               mIoU    mAcc     mIoU    mAcc     mIoU    mAcc
LangSplat       2.61   10.11     4.08   13.22     6.30   20.48    40 min    2.1            36
LEGaussians     1.62    7.26     5.72   14.23     9.84   19.13    50 min    1.7            32
OpenGaussian   29.43   43.62    32.61   48.26    41.29   56.42    30 min    0.003          24
LUDVIG         28.47   44.17    31.47   48.54    40.47   58.15     4 min    0.006        2048
Dr.Splat       28.00   44.60    38.20   60.40    47.20   68.90     3 min    -             128
Ours           37.11   58.66    39.78   60.91    47.78   68.21     4.1 s    0.002           2

Qualitative Results

Qualitative results show that LightSplat preserves clean object boundaries, stable object identity, and broad scene coverage across small objects, repeated instances, and large spatial regions.

LERF-OVS: Detailed 3D Object Selection

Figure 4. Qualitative comparison on LERF-OVS. Across different scenes and text queries, LightSplat produces detailed object boundaries while remaining substantially faster than prior methods.

DL3DV-OVS: Robustness in Larger Scenes

Figure 5. Qualitative comparison on DL3DV-OVS. In large and complex indoor-outdoor scenes, LightSplat provides reliable selections and clear object boundaries, even when many similar objects appear in the same scene.

ScanNet: Cleaner Semantic Segmentation

Figure 6. Qualitative comparison on ScanNet. LightSplat more effectively captures both object-level semantics and large-area regions, showing robust performance across diverse real-world scenes and text queries.

Ablation Study

Each component contributes substantially to overall performance. Removing 3D-aware mask filtering, semantic-aware clustering, or geometry-aware clustering causes a large drop in accuracy, while FD Time remains nearly unchanged across variants.

Table 4. Ablation Study on LERF-OVS
Each component improves performance and also contributes to the fast FD Time of the full model. FD Time denotes feature distillation time.
Variant mIoU Acc@0.25 FD Time
Ours Full 47.58 68.32 4.55 s
w/o 3D Mask Filtering 29.31 25.96 4.54 s
w/o Semantic-Aware 19.56 18.97 4.42 s
w/o Geometry-Aware 2.01 2.27 4.44 s

BibTeX

@misc{bang2026lightsplatfastmemoryefficientopenvocabulary,
  title={LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds},
  author={Jaehun Bang and Jinhyeok Kim and Minji Kim and Seungheon Jeong and Kyungdon Joo},
  year={2026},
  eprint={2603.24146},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.24146}
}