Abstract
We propose ESSC-RM, a plug-and-play enhancement framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing semantic scene completion (SSC) models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net–based refinement network equipped with a Progressive Neighborhood Attention Module (PNAM) and a Vision–Language Guidance Module (VLGM) under multi-scale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models. Project page: https://github.com/LuckyMax0722/ESSC-RM and https://github.com/LuckyMax0722/VLGSSC.
1 Introduction
Accurate 3D scene understanding is fundamental to autonomous driving, robotics, and embodied perception, where downstream tasks such as detection, reconstruction, mapping, and planning rely on complete geometric and semantic representations of the environment (Guo et al., 2019; Yurtsever et al., 2020; Cao et al., 2022; Ma et al., 2022; Zhao H. et al., 2024; Cao et al., 2024a). However, real-world sensors (LiDAR and RGB cameras) provide only sparse, noisy, and partial observations due to occlusions, limited resolution, restricted field of view, and missing depth information, resulting in incomplete voxelized scenes (Roldão et al., 2021; Cao et al., 2024b). To address this, 3D semantic scene completion (SSC) aims to jointly infer voxel occupancy and semantic labels, a task first formalized by SSCNet (Song et al., 2016).
Despite extensive progress in both LiDAR-based (Roldão et al., 2020; Yan et al., 2020; Xia et al., 2023; Jang et al., 2024) and vision-based SSC (Cao and de Charette, 2021; Li Y. et al., 2023; Jiang et al., 2024; Tang et al., 2023), a considerable gap remains between predictions and ground truth. LiDAR-based models suffer from sparsity; BEV-based methods (Yang et al., 2021) lose fine-grained details; RGB-based approaches degrade due to depth ambiguity and unclear 2D–3D projection (Lee et al., 2024); and distillation pipelines depend heavily on task-specific teacher designs (Xia et al., 2023). Moreover, SSC architectures differ substantially, making it difficult to develop a unified refinement strategy that generalizes across models without modifying their internal structures.
To address these limitations, this paper proposes ESSC-RM, a unified coarse-to-fine refinement framework that directly enhances the voxel predictions of arbitrary SSC models. ESSC-RM performs multi-scale geometric–semantic aggregation, integrates auxiliary priors, and introduces a model-agnostic refinement pipeline that requires no architectural modification to the baseline. It supports both end-to-end joint training and fully independent plug-and-play deployment.
The main contributions of this paper are as follows:
We introduce ESSC-RM, a general refinement framework designed to improve heterogeneous SSC baselines via coarse-to-fine multi-scale error reduction, applicable to both LiDAR-based and vision-based methods.
We develop two complementary training paradigms: a joint training mode that co-optimizes the refinement and baseline networks, and a separate training mode enabling true plug-and-play enhancement without modifying the original SSC architecture.
We propose a neighborhood-attention-based multi-scale aggregation module that adaptively fuses geometric and semantic features, improving voxel-level reasoning across scales.
We introduce a novel vision–language guidance module that injects text-derived semantic priors to compensate for missing geometric cues and ambiguous visual projections, enhancing cross-modal scene understanding.
Extensive experiments on SemanticKITTI (Behley et al., 2019) demonstrate that ESSC-RM consistently improves strong baselines such as CGFormer and MonoScene, validating its generality, flexibility, and effectiveness.
2 Related work
In this section, we review LiDAR- and camera-based 3D perception, then summarize advances in 3D SSC, and finally discuss recent progress in vision–language models (VLMs) and text-driven multimodal fusion.
2.1 LiDAR-based 3D perception
LiDAR provides accurate 3D geometry for autonomous driving perception, enabling detection, tracking, and mapping, and has become a core sensing modality (Guo et al., 2019; Yurtsever et al., 2020; Ma et al., 2022; Zhao H. et al., 2024; Wu et al., 2022; Lin and Wu, 2025).
Early point-based and voxel-based detectors—PointNet (Qi et al., 2016), VoxelNet (Zhou and Tuzel, 2017), SECOND (Yan et al., 2018), PointPillars (Lang et al., 2018), PointRCNN (Shi et al., 2018), PV-RCNN (Shi et al., 2019), and Voxel R-CNN (Deng et al., 2020)—established effective feature extraction paradigms. Tracking frameworks such as AB3DMOT (Weng et al., 2020; Cho and Kim, 2023) leverage motion models and geometric association. Semantic segmentation approaches including PointNet++ (Qi et al., 2017), RangeNet++ (Milioto et al., 2019), and Cylinder3D (Zhou et al., 2020) demonstrate point-based, projection-based, and cylindrical-voxel inference strategies.
2.2 Camera-based 3D perception
Camera-based perception offers a cost-efficient alternative with rich semantic cues. Monocular approaches extend 2D detectors (Brazil and Liu, 2019; Duan et al., 2019; Manhardt et al., 2018) or rely on pseudo-depth and geometric priors (Xu and Chen, 2018; Wang et al., 2018; Zia et al., 2014; Mousavian et al., 2016; Hu et al., 2018), yet remain affected by depth ambiguity. Stereo-based methods (Chang and Chen, 2018; Li et al., 2019; You et al., 2019; Chen et al., 2020) mitigate this by enforcing geometric consistency (Mao et al., 2023).
With multi-camera setups becoming standard, multi-view 3D detection methods have evolved rapidly. LSS-based pipelines (Philion and Fidler, 2020; Huang et al., 2021) lift image features to Bird's-Eye View (BEV), while transformer-based designs such as DETR3D (Wang et al., 2021) and BEVFormer (Li Z. et al., 2022) aggregate cross-view features using 3D object queries. Spatiotemporal attention mechanisms (Vaswani et al., 2017; Doll et al., 2022; Mao et al., 2023) further enhance robustness.
2.3 Semantic scene completion
SSC jointly predicts occupancy and voxel-level semantics. SSCNet (Song et al., 2016) established the task on indoor data (Silberman et al., 2012); outdoor datasets such as KITTI and SemanticKITTI (Geiger et al., 2012; Behley et al., 2019, 2021; Li et al., 2024) introduce sparsity and large-scale variability.
2.4 Vision–language models
Vision–language models (VLMs) provide strong semantic priors through aligned image–text representations (Liu et al., 2025). CLIP (Radford et al., 2021) and EVACLIP (Sun et al., 2023a,b) learn powerful contrastive embeddings, while LongCLIP (Zhang et al., 2024) and JinaCLIP (Xiao et al., 2024a; Koukounas et al., 2024) improve long-text modeling.
Models such as BLIP2 (Li J. et al., 2023), InstructBLIP (Dai et al., 2023), MiniGPT-4 (Zhu et al., 2024), and LLaVA (Liu H. et al., 2023; Liu et al., 2024) leverage frozen Large Language Models (LLMs) to build efficient multimodal reasoning pipelines (OpenAI et al., 2024). Text-conditioned segmentation models such as LSeg (Li B. et al., 2022) and Grounded-SAM (Ren et al., 2024) further highlight the utility of text in perception tasks (Liu S. et al., 2023; Kirillov et al., 2023).
2.5 Multimodal fusion and text modality
Multimodal fusion traditionally combines 3D geometry (LiDAR, stereo) with rich 2D semantics. With the emergence of LLMs and VLMs, text has become a scalable, low-cost semantic modality for describing road scenes (Li and Tang, 2024; Liu et al., 2025).
Attention-based fusion (Vaswani et al., 2017; Cao et al., 2021), as in Xu et al. (2020), Cao et al. (2024c), and Wang et al. (2026), captures long-range cross-modal dependencies but can be computationally heavy. Learnable fusion strategies such as Text-IF (Yi et al., 2024) and VLScene (Wang et al., 2025) use trainable coefficients to balance visual and linguistic cues.
3 Methodology
ESSC-RM refines the coarse voxel predictions produced by any SSC backbone. We now present the problem formulation and describe the architecture components of our refinement module, including the 3D U-Net backbone, the progressive neighborhood attention module (PNAM), and the vision–language guidance module (VLGM), as illustrated in Figure 1.
Figure 1
3.1 Problem statement
Given an RGB image It and a LiDAR point cloud Pt at time t, 3D SSC aims to predict a dense semantic voxel grid Yt defined in the vehicle coordinate system, where each voxel is either empty (c0) or belongs to one of the C semantic classes {c1, …, cC}, and H, W, Z denote the voxel grid dimensions. A standard SSC backbone learns a mapping fθ: (It, Pt) ↦ Ŷt, but the coarse prediction Ŷt often exhibits broken surfaces, incomplete structures, and semantic confusions. We therefore introduce a refinement module gϕ that treats Ŷt as a noisy discrete volume and outputs a refined prediction gϕ(Ŷt, aux), where aux denotes additional cues (multi-scale voxel features and text semantics) extracted within the refinement module. The objective is to bring the refined prediction closer to the ground truth Yt in both geometry and semantics while remaining compatible with heterogeneous SSC backbones.
3.2 Overall architecture
As shown in Figure 1, ESSC-RM has two decoupled parts:
SSC backbone: maps the inputs (It, Pt) to a coarse voxel grid.
Refinement module: operates purely in voxel space, refining the coarse prediction into the final voxel grid using multi-scale U-Net features, neighborhood attention, and vision–language guidance.
This separation allows us to plug in backbones of different quality while focusing the design of gϕ on correcting geometric and semantic errors at the voxel level using additional structural and semantic cues.
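The decoupling described above can be sketched as a minimal interface; everything here (grid size, function names, the trivial "refinement") is a hypothetical stand-in used only to show that the second stage consumes and produces label volumes of identical shape, independent of the backbone:

```python
import numpy as np

NUM_CLASSES = 20          # 19 semantic classes + free space (c0)
H, W, Z = 16, 16, 4       # toy grid; SemanticKITTI uses 256 x 256 x 32

def backbone(rgb=None, lidar=None):
    """Stand-in for an arbitrary SSC backbone: returns a coarse label volume."""
    rng = np.random.default_rng(0)
    return rng.integers(0, NUM_CLASSES, size=(H, W, Z))

def refine(coarse):
    """Stand-in for the refinement module: it consumes only the coarse label
    volume (auxiliary cues omitted) and returns a volume of identical shape,
    which is what makes the two stages decoupled and backbone-agnostic."""
    assert coarse.shape == (H, W, Z)
    return coarse.copy()   # a real learned module would redistribute labels

coarse = backbone()
refined = refine(coarse)
```

Because the interface is label-volume in, label-volume out, any backbone whose output is voxelized to the same grid can be plugged in without code changes on either side.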
3.3 SSC backbone
ESSC-RM is model-agnostic and can refine the output of any SSC backbone. To demonstrate generality, we instantiate two monocular SSC models with different coarse prediction qualities: CGFormer (Tang et al., 2023) and MonoScene (Cao and de Charette, 2021). CGFormer represents a strong backbone with accurate voxel lifting, while MonoScene produces notably noisier volumes, providing a more challenging setting for refinement. All architectural details follow the original papers, as our refinement module does not modify or depend on the internal design of the backbone.
3.4 3D U-Net refinement backbone
The refinement module receives the coarse discrete volume and must (i) map it into a continuous feature space; (ii) aggregate multi-scale contextual information; and (iii) reconstruct a refined voxel grid at full resolution. To accomplish these steps, we adopt a three-dimensional U-shaped neural network (3D U-Net) backbone (Çiçek et al., 2016; Ronneberger et al., 2015), whose overall encoder–bottleneck–decoder structure is illustrated in Figure 2. The specific computational blocks that constitute the encoder and decoder, namely the feature encoding block (FEB) and the feature aggregation block (FAB), are detailed in Figure 3.
Figure 2
Figure 3
3.4.1 Voxel embedding and encoder–decoder
We first embed the discrete labels of the coarse prediction into a continuous feature map via a learnable voxel embedding, and a 1 × 1 × 1 3D convolution then produces the full-resolution input feature F1:1. The encoder uses four stacked feature encoding blocks (FEBs; Figure 3) to extract multi-scale features F1:s at progressively lower resolutions: for a voxel grid of size H × W × Z and feature dimension G, the encoder outputs F1:2, F1:4, F1:8, and F1:16, where F1:s has spatial resolution (H/s) × (W/s) × (Z/s).
A bottleneck processes F1:16, and the decoder then upsamples via four stacked feature aggregation blocks (FABs), followed by a shared prediction head that produces voxel logits at each scale, where C is the number of semantic classes. At inference time, we take the class-wise argmax of the full-resolution logits as the final refined semantic voxel prediction.
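A minimal sketch of the embedding-plus-encoder pipeline, assuming a learnable label embedding, a toy grid, and channel doubling per stage (all illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

C, G = 20, 8             # classes (incl. free space) and embedding width (assumed)
H, W, Z = 32, 32, 16     # toy grid; SemanticKITTI uses 256 x 256 x 32

embed = nn.Embedding(C, G)               # discrete labels -> continuous features
stem  = nn.Conv3d(G, G, kernel_size=1)   # 1x1x1 conv producing F_{1:1}

coarse = torch.randint(0, C, (1, H, W, Z))     # coarse backbone prediction
feat = embed(coarse).permute(0, 4, 1, 2, 3)    # to (B, G, H, W, Z) layout
f = stem(feat)

# Four stride-2 stages stand in for the FEB encoder; channel doubling per
# stage is an assumption, not the paper's stated configuration.
feats = [f]
for _ in range(4):
    down = nn.Conv3d(f.shape[1], 2 * f.shape[1], 3, stride=2, padding=1)
    f = down(f)
    feats.append(f)

shapes = [tuple(x.shape) for x in feats]   # F_{1:1} ... F_{1:16}
```

Each stride-2 stage halves every spatial dimension, so the deepest feature corresponds to the 1:16 scale referenced in the text.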
3.4.2 Feature encoding block (FEB)
Each FEB refines features at a given scale and produces both a skip feature and a downsampled feature. As shown in Figure 3, an FEB applies two 3D convolutions with InstanceNorm3D (Ulyanov et al., 2016) and LeakyReLU (Xu et al., 2015), followed by a residual connection and a stride-2 downsampling convolution.
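The FEB structure described above can be sketched as follows; the channel widths and LeakyReLU slope are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FEB(nn.Module):
    """Sketch of a feature encoding block: two Conv3d + InstanceNorm3d +
    LeakyReLU layers, a residual skip, and a stride-2 downsampling conv.
    Channel widths are illustrative, not the paper's exact configuration."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(ch, ch, 3, padding=1), nn.InstanceNorm3d(ch), nn.LeakyReLU(0.2),
            nn.Conv3d(ch, ch, 3, padding=1), nn.InstanceNorm3d(ch), nn.LeakyReLU(0.2),
        )
        self.down = nn.Conv3d(ch, 2 * ch, 3, stride=2, padding=1)

    def forward(self, x):
        skip = self.body(x) + x        # residual skip feature, kept for the decoder
        return skip, self.down(skip)   # (same-resolution skip, half-resolution feature)

feb = FEB(8)
skip, down = feb(torch.randn(1, 8, 16, 16, 8))
```

Returning both outputs lets the encoder hand the full-resolution `skip` to the matching FAB while passing the downsampled feature to the next stage.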
3.4.3 Feature aggregation block (FAB) and multi-scale supervision
Each FAB upsamples low-resolution features and fuses them with the corresponding encoder skip features.
Following PaSCo (Cao A.-Q. et al., 2024), each decoder feature map is mapped to logits by a 1 × 1 × 1 3D convolution, and all scales are supervised during training. This encourages coarse-to-fine refinement and stabilizes optimization.
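The upsample–concatenate–fuse pattern with a shared 1 × 1 × 1 prediction head can be sketched as below; the fusion layout and channel widths are assumptions consistent with standard 3D U-Nets rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class FAB(nn.Module):
    """Sketch of a feature aggregation block: trilinear upsampling followed by
    concatenation with the encoder skip feature and a 3x3x3 fusion conv."""
    def __init__(self, ch_low: int, ch_skip: int, ch_out: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.fuse = nn.Conv3d(ch_low + ch_skip, ch_out, 3, padding=1)

    def forward(self, low, skip):
        return self.fuse(torch.cat([self.up(low), skip], dim=1))

num_classes = 20
fab = FAB(ch_low=16, ch_skip=8, ch_out=8)
head = nn.Conv3d(8, num_classes, kernel_size=1)   # shared 1x1x1 prediction head

low  = torch.randn(1, 16, 4, 4, 2)   # coarser-scale decoder feature
skip = torch.randn(1, 8, 8, 8, 4)    # matching encoder skip feature
logits = head(fab(low, skip))        # per-scale logits for deep supervision
```

Applying the same head at every decoder scale yields the multi-scale logits that the supervision described above operates on.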
3.5 Progressive neighborhood attention module (PNAM)
Purely convolutional decoders aggregate context only within fixed local windows, limiting their ability to capture long-range and structure-aware voxel relations. To address this, we integrate the Progressive Neighborhood Attention Module (PNAM) (Liu T. et al., 2023) into the decoder of our refinement network.
As illustrated in Figure 4, the FABs at scales 1:2, 1:4, and 1:8 are replaced with PNA-based FABs, while the finest-scale FAB remains convolutional for efficiency. PNAM enhances multi-scale voxel reasoning by combining global self-attention (Vaswani et al., 2017) with localized neighborhood aggregation (Hassani and Shi, 2022; Hassani et al., 2023, 2024).
Figure 4
3.5.1 PNA-based feature aggregation block
As illustrated in Figure 5, a PNA-based FAB consists of two branches: (1) a self-attention (SA) branch operating on Fup; and (2) a neighborhood cross-attention (NCA) branch operating between Fskip and Fup.
Figure 5
Given the upsampled feature Fup and the corresponding skip feature Fskip at scale 1:ℓ for ℓ ∈ {2, 4, 8}, the two attention responses are computed in parallel; the outputs are then fused and refined via normalization and a lightweight feed-forward network (FFN).
3.5.2 Self-attention (SA)
SA refines the upsampled voxel features by capturing long-range dependencies. Following the standard multi-head attention formulation (Vaswani et al., 2017), we use 1 × 1 × 1 and depthwise 3 × 3 × 3 convolutions to compute Q, K, V, followed by attention and a residual FFN. This propagates global geometric–semantic cues, compensating for missing structures in the coarse prediction.
3.5.3 Neighborhood cross-attention (NCA)
NCA enforces local geometric consistency. Inspired by the NATTEN family of neighborhood attention operators (Hassani and Shi, 2022; Hassani et al., 2023, 2024), it restricts attention to a 3D neighborhood window, enabling each voxel to aggregate high-confidence structural cues from spatially adjacent voxels. This makes PNAM particularly effective at restoring fine structures such as object boundaries and thin geometry.
Overall, PNAM strengthens the refinement network's ability to jointly model global context and local voxel continuity across scales.
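The two-branch PNA-based FAB can be illustrated with a toy token-space sketch. This is a didactic stand-in, not the authors' implementation: neighborhood cross-attention is emulated by a masked standard attention over a tiny grid rather than the optimized NATTEN kernels, and the fusion layout is an assumption:

```python
import torch
import torch.nn as nn

def neighborhood_mask(h, w, z, radius=1):
    """Boolean mask allowing attention only within a (2r+1)^3 voxel window."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(h), torch.arange(w), torch.arange(z), indexing="ij"
    ), dim=-1).reshape(-1, 3)                       # (N, 3) voxel coordinates
    diff = (coords[:, None] - coords[None]).abs()   # pairwise axis offsets
    return (diff <= radius).all(dim=-1)             # (N, N) allowed pairs

class PNAFab(nn.Module):
    """Toy PNA-based FAB: global self-attention on the upsampled feature plus
    neighborhood cross-attention from the skip feature, fused by a small FFN."""
    def __init__(self, dim, grid):
        super().__init__()
        self.sa  = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        self.nca = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(),
                                 nn.Linear(2 * dim, dim))
        # True entries are *disallowed* pairs, per MultiheadAttention semantics.
        self.register_buffer("mask", ~neighborhood_mask(*grid))

    def forward(self, f_up, f_skip):                # both (B, N, dim) token form
        sa_out, _  = self.sa(f_up, f_up, f_up)                       # global SA
        nca_out, _ = self.nca(f_up, f_skip, f_skip,
                              attn_mask=self.mask)                   # local NCA
        fused = self.norm(f_up + sa_out + nca_out)
        return fused + self.ffn(fused)

grid = (4, 4, 2)
n = grid[0] * grid[1] * grid[2]
pna = PNAFab(dim=16, grid=grid)
out = pna(torch.randn(1, n, 16), torch.randn(1, n, 16))
```

The mask confines the cross-attention branch to each voxel's 3 × 3 × 3 neighborhood, which is the local-consistency behavior NCA provides, while the unmasked branch carries global context.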
3.6 Vision–language guidance module (VLGM)
Even with stronger voxel–voxel reasoning, SSC remains ambiguous in occluded or sparsely observed regions. To inject high-level scene priors—such as road layout, object co-occurrence patterns, or typical urban structures—we introduce the Vision–Language Guidance Module (VLGM). As illustrated in Figure 6, the module leverages a frozen vision–language model (VLM) to produce a free-form scene description, whose textual semantics are encoded and fused into the voxel refinement pipeline.
Figure 6
3.6.1 Text acquisition and semantic encoding
Given an input image I and prompt P, a frozen VLM such as LLaVA (Liu H. et al., 2023; Liu et al., 2024) or InstructBLIP (Dai et al., 2023) generates a free-form scene description, which is precomputed offline to avoid training overhead.
To capture different levels of textual semantics, we employ two complementary encoders: (1) JinaCLIP (Xiao et al., 2024a; Koukounas et al., 2024) extracts a global sentence-level embedding that provides holistic scene cues, and (2) a Q-Former (Li J. et al., 2023) produces token-level embeddings that enable fine-grained cross-modal alignment. This design follows instruction-style prompting practices used in Dai et al. (2023) and Liu et al. (2025).
3.6.2 Text–voxel fusion modules
To integrate text cues into voxel refinement, we build a Text U-Net by inserting lightweight fusion blocks after each FEB and FAB. Each fusion block consists of two components:
3.6.2.1 Semantic interaction guidance module (SIGM)
Following Text-IF (Yi et al., 2024), global JinaCLIP features are mapped to channel-wise affine parameters (γm, βm) via MLPs. Voxel features Fm are modulated as F′m = γm ⊙ Fm + βm, injecting scene-level priors that guide early geometric reasoning.
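A minimal FiLM-style sketch of this modulation; the MLP depths, text dimension, and channel width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SIGM(nn.Module):
    """Sketch of semantic interaction guidance: a global text embedding is
    mapped by two MLPs to per-channel scale/shift (gamma, beta) that modulate
    the voxel features. Dimensions here are illustrative assumptions."""
    def __init__(self, text_dim: int, ch: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, ch)
        self.to_beta  = nn.Linear(text_dim, ch)

    def forward(self, voxels, text_emb):
        # voxels: (B, C, H, W, Z); text_emb: (B, text_dim), e.g. a CLIP vector
        gamma = self.to_gamma(text_emb)[:, :, None, None, None]
        beta  = self.to_beta(text_emb)[:, :, None, None, None]
        return gamma * voxels + beta   # F' = gamma ⊙ F + beta

sigm = SIGM(text_dim=32, ch=8)
out = sigm(torch.randn(2, 8, 4, 4, 2), torch.randn(2, 32))
```

Because the same (γ, β) pair is broadcast over all spatial positions, the text acts as a scene-level prior rather than a per-voxel signal, matching the module's role of guiding early geometric reasoning.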
3.6.2.2 Dual cross-attention module (DCAM)
Inspired by BLIP-2 (Li J. et al., 2023), SAM (Kirillov et al., 2023), and MultiRAtt-RSSC (Cai et al., 2024), DCAM alternates self- and cross-attention between Q-Former tokens and voxel features: text self-attention first refines the token embeddings, text-to-voxel cross-attention then enriches the tokens with voxel context, and voxel-to-text cross-attention propagates the refined textual cues back to the voxel features, which are finally updated through a residual connection.
SIGM injects global scene priors (e.g., “urban street with parked vehicles”), while DCAM provides fine-grained token-level alignment. As visualized in Figure 6, the two components operate synergistically to improve geometric completeness and semantic coherence, especially in occluded and ambiguous regions.
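A toy version of the alternating attention pattern in DCAM, written as a didactic stand-in (head counts, dimensions, and the absence of normalization layers are simplifying assumptions):

```python
import torch
import torch.nn as nn

class DCAM(nn.Module):
    """Toy dual cross-attention: text tokens self-attend, then attend to voxel
    tokens (text-to-voxel), and finally voxel tokens attend to the updated
    text (voxel-to-text), followed by a residual voxel update."""
    def __init__(self, dim: int, heads: int = 2):
        super().__init__()
        self.text_sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, voxels):     # (B, T, dim), (B, N, dim)
        t, _ = self.text_sa(text, text, text)
        t, _ = self.t2v(t, voxels, voxels)   # text queries voxel context
        v, _ = self.v2t(voxels, t, t)        # voxels query the refined text
        return voxels + v                    # residual voxel update

dcam = DCAM(dim=16)
out = dcam(torch.randn(1, 5, 16), torch.randn(1, 64, 16))
```

The output keeps the voxel token shape, so the block can be dropped after any FEB/FAB stage without disturbing the surrounding U-Net.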
3.7 Loss function
ESSC-RM performs coarse-to-fine refinement across multiple spatial scales. We therefore supervise both voxel-wise predictions and scene-level consistency using two complementary terms: a class-weighted cross-entropy loss and the scene–class affinity loss (SCAL) (Cao and de Charette, 2021; Tang et al., 2023). This combination stabilizes multi-scale refinement while encouraging globally coherent semantics.
3.7.1 Cross-entropy loss
At each refinement scale l, voxel predictions are supervised using a class-weighted cross-entropy loss, where the refinement logits ŷ′ are compared against the ground truth and the weights wc compensate for class imbalance (Roldão et al., 2020). The per-scale losses are summed over all scales to obtain the total cross-entropy term.
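A numpy sketch of a class-weighted cross-entropy over voxels; the weight normalization and ignore index are common SSC conventions, not necessarily the paper's exact choices:

```python
import numpy as np

def weighted_ce(logits, labels, class_weights, ignore=255):
    """Class-weighted cross-entropy. logits: (N, C), labels: (N,).
    Voxels marked with the ignore index (e.g. unknown space) are skipped."""
    keep = labels != ignore
    logits, labels = logits[keep], labels[keep]
    z = logits - logits.max(axis=1, keepdims=True)         # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[labels]                               # per-voxel weight w_c
    return -(w * logp[np.arange(len(labels)), labels]).sum() / w.sum()

rng = np.random.default_rng(0)
loss = weighted_ce(rng.normal(size=(100, 4)),
                   rng.integers(0, 4, size=100),
                   np.array([0.2, 1.0, 1.0, 2.0]))
```

In the multi-scale setting, this loss would be evaluated on the logits of every decoder scale (against suitably downsampled labels) and the results summed.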
3.7.2 Scene–class affinity loss (SCAL)
To promote globally consistent refinement—particularly under sparsity or ambiguous projections—we adopt SCAL (Cao and de Charette, 2021), which optimizes class-wise precision (Pc), recall (Rc), and specificity (Sc). Let pi denote the ground-truth class for voxel i and p̂i(c) the predicted probability for class c; Pc, Rc, and Sc are then computed from these quantities using Iverson brackets ⟦·⟧ to indicate class membership. The per-scale affinity loss aggregates the logarithms of Pc, Rc, and Sc over all classes, and SCAL is applied to both semantic and geometric predictions across all refinement scales.
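A numpy sketch of the affinity terms, following the MonoScene-style precision/recall/specificity formulation; the class averaging and the handling of absent classes are assumptions:

```python
import numpy as np

def scal(probs, labels, eps=1e-8):
    """Scene-class affinity loss sketch. probs: (N, C) softmax probabilities,
    labels: (N,) ground-truth classes. For each class present in the scene,
    soft precision, recall, and specificity are computed and their logs summed."""
    n, c = probs.shape
    loss, counted = 0.0, 0
    for k in range(c):
        hit = (labels == k).astype(float)          # Iverson bracket [y_i = k]
        if hit.sum() == 0:
            continue                               # skip absent classes (assumed)
        p = probs[:, k]
        precision   = (p * hit).sum() / (p.sum() + eps)
        recall      = (p * hit).sum() / hit.sum()
        specificity = ((1 - p) * (1 - hit)).sum() / ((1 - hit).sum() + eps)
        loss -= (np.log(precision + eps) + np.log(recall + eps)
                 + np.log(specificity + eps))
        counted += 1
    return loss / max(counted, 1)

rng = np.random.default_rng(1)
raw = rng.random((50, 3))
loss = scal(raw / raw.sum(axis=1, keepdims=True), rng.integers(0, 3, size=50))
```

Unlike per-voxel cross-entropy, these terms are computed over the whole scene, which is what gives SCAL its global-consistency effect.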
3.7.3 Overall objective
The total training loss combines the multi-scale cross-entropy term with the semantic and geometric SCAL terms, with all coefficients set to 1 in our experiments, providing balanced supervision over voxel-wise accuracy, geometric completion, and scene-level semantic consistency.
4 Experiment
This section evaluates ESSC-RM on the SemanticKITTI benchmark (Behley et al., 2019, 2021). We first describe the experimental setup (datasets, metrics, and implementation), then report quantitative and qualitative results on strong and weak semantic scene completion baselines (CGFormer and MonoScene). Comprehensive ablation studies that analyze the refinement framework, the neighborhood-attention-based aggregation module, and the vision–language guidance module are provided in the Supplementary material.
4.1 Experimental setup
4.1.1 Datasets
We adopt the SemanticKITTI semantic scene completion benchmark (Behley et al., 2019, 2021), which extends the KITTI odometry dataset (Geiger et al., 2012) with dense semantic labels for each LiDAR scan. The dataset contains 22 outdoor sequences; following the official split, sequences 00–07 and 09–10 are used for training, 08 for validation, and 11–21 as a hidden test set.
For semantic scene completion, a 3D volume around the ego-vehicle is considered: 51.2 m in front, 25.6 m to each side (total width 51.2 m), and 6.4 m in height (Behley et al., 2019). This volume is voxelized into a 256 × 256 × 32 grid with a voxel size of 0.2 m. Each voxel is assigned one of 20 classes (19 semantic classes and one free-space class), obtained by voxelizing aggregated, registered semantic point clouds (Li Y. et al., 2023).
We conduct all experiments on SemanticKITTI, following its established voxelization protocol and official evaluation scripts, which provides a standardized testbed for semantic scene completion.
4.1.2 Evaluation metrics
We follow standard practice (Cao and de Charette, 2021; Li Y. et al., 2023; Tang et al., 2023) and report intersection-over-union (IoU) for 3D scene completion (SC) and mean intersection-over-union (mIoU) for semantic scene completion (SSC).
For SC, evaluation is binary (occupied vs. free) and uses the intersection-over-union IoU = TP / (TP + FP + FN) over the occupancy grid, where TP, FP, and FN denote true positives, false positives, and false negatives.
For SSC, we evaluate per-class IoU over C = 19 semantic classes and report the mean, mIoU = (1/C) Σc TPc / (TPc + FPc + FNc), where TPc, FPc, and FNc are computed for class c and evaluation is carried out in known space as in Roldão et al. (2021). IoU primarily reflects geometric completion quality, whereas mIoU captures voxel-wise semantic accuracy; both are reported to assess overall scene understanding.
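Both metrics can be computed directly from the predicted and ground-truth label volumes; the known-space masking applied by the official evaluation scripts is omitted in this sketch:

```python
import numpy as np

def sc_iou(pred_occ, gt_occ):
    """Binary scene-completion IoU = TP / (TP + FP + FN) on occupancy grids."""
    tp = np.logical_and(pred_occ, gt_occ).sum()
    fp = np.logical_and(pred_occ, ~gt_occ).sum()
    fn = np.logical_and(~pred_occ, gt_occ).sum()
    return tp / (tp + fp + fn)

def ssc_miou(pred, gt, num_classes=19):
    """Mean IoU over semantic classes 1..num_classes (0 = free space,
    excluded from the mean, as is standard on SemanticKITTI)."""
    ious = []
    for c in range(1, num_classes + 1):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return float(np.mean(ious))

gt   = np.array([[0, 1], [2, 2]])
pred = np.array([[0, 1], [2, 1]])
occ_iou = sc_iou(pred > 0, gt > 0)          # occupancy agrees everywhere -> 1.0
miou = ssc_miou(pred, gt, num_classes=2)    # one confused voxel -> 0.5 per class
```

On this toy grid the occupancy prediction is perfect (SC-IoU = 1.0) while one voxel is semantically confused, which is exactly the dissociation between the two metrics discussed above.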
4.1.3 Implementation details
We consider two training paradigms for ESSC-RM: (1) joint training, where the semantic scene completion backbone is switched to inference mode while the refinement module is trained on-the-fly from its predictions; and (2) separate training, where semantic scene completion predictions are pre-computed and stored, and the refinement module is trained purely as a plug-and-play post-processor without modifying the original semantic scene completion architecture.
Unless otherwise stated, experiments are conducted on two NVIDIA RTX A5000 GPUs, with 10 epochs and a batch size of 1 per GPU. We use AdamW (Loshchilov and Hutter, 2017) with β1 = 0.9, β2 = 0.99, and a peak learning rate of 5 × 10−5. A cosine schedule (Smith and Topin, 2017) with 5% warm-up is applied. The refinement module follows a 3D U-Net (Çiçek et al., 2016) backbone; the encoder and decoder feature encoding and feature aggregation blocks (FEB/FAB) are adapted from SemCity (Lee et al., 2024), and neighborhood-attention-based variants from NATTEN (Hassani and Shi, 2022; Hassani et al., 2023, 2024) and PNA (Liu T. et al., 2023). The vision–language guidance module (VLGM) uses frozen vision–language models [InstructBLIP (Li J. et al., 2023; Dai et al., 2023) and LLaVA (Liu H. et al., 2023; Liu et al., 2024)] together with text–voxel fusion modules inspired by Text-IF (Yi et al., 2024) and MultiRAtt-RSSC (Cai et al., 2024). Following PaSCo (Cao A.-Q. et al., 2024) and HybridOcc (Zhao X. et al., 2024), we apply coarse-to-fine multi-level supervision in the decoder. Training losses are described in Section 3.7.
4.2 Evaluation results
We evaluate ESSC-RM as a refinement module on strong and weak SSC baselines and analyze its efficiency and qualitative behavior.
4.2.1 Quantitative results
ESSC-RM is designed to prioritize voxel-wise semantic correctness (mIoU) over boundary-sensitive binary occupancy smoothness (IoU); therefore, minor IoU drops may accompany consistent mIoU gains.
4.2.1.1 3D SSC performance
Table 1 reports SSC performance on SemanticKITTI, including representative image-based SSC baselines and our ESSC-RM variants. Among the listed baselines without ESSC-RM (upper block), DepthSSC (Yao et al., 2024) achieves the best SC-IoU (45.84%), while Symphonize (Jiang et al., 2024) attains the highest mIoU (14.89%). In addition to these method-level comparisons, we evaluate ESSC-RM as a plug-and-play refinement module on top of two representative SSC backbones: CGFormer (Tang et al., 2023) as a strong baseline (45.99% IoU, 16.87% mIoU) and MonoScene (Cao and de Charette, 2021) as a widely used weaker baseline (36.86% IoU, 11.08% mIoU). Due to training and storage overhead of voxel-level refinement, we instantiate ESSC-RM on these two backbones to demonstrate generality across different performance regimes; extending the plug-in evaluation to additional backbones is left for future work (see Section 5).
Table 1
| Methods | IoU | mIoU | Car (3.92%) | Bicycle (0.03%) | Motorcycle (0.03%) | Truck (0.16%) | Other-vehicle (0.20%) | Person (0.07%) | Bicyclist (0.07%) | Motorcyclist (0.05%) | Road (15.30%) | Parking (1.12%) | Sidewalk (11.13%) | Other-ground (0.56%) | Building (14.10%) | Fence (3.90%) | Vegetation (39.3%) | Trunk (0.51%) | Terrain (9.17%) | Pole (0.29%) | Traffic-sign (0.08%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines (without ESSC-RM) | |||||||||||||||||||||
| TPVFormer (Huang et al., 2023) | 35.61 | 11.36 | 23.81 | 0.36 | 0.05 | 8.08 | 4.35 | 0.51 | 0.89 | 0.00 | 56.50 | 20.60 | 25.87 | 0.85 | 13.88 | 5.94 | 16.92 | 2.26 | 30.38 | 3.14 | 1.52 |
| OccFormer (Zhang et al., 2023) | 36.50 | 13.46 | 25.09 | 0.81 | 1.19 | 25.53 | 8.52 | 2.78 | 2.82 | 0.00 | 58.85 | 19.61 | 26.88 | 0.31 | 14.40 | 5.61 | 19.63 | 3.93 | 32.62 | 4.26 | 2.86 |
| IAMSSC (Xiao et al., 2024b) | 44.29 | 12.45 | 26.26 | 0.60 | 0.15 | 8.74 | 5.06 | 1.32 | 3.46 | 0.01 | 54.55 | 16.02 | 25.85 | 0.70 | 17.38 | 6.86 | 24.63 | 4.95 | 30.13 | 6.35 | 3.56 |
| VoxFormer-S (Li Y. et al., 2023) | 44.02 | 12.35 | 25.79 | 0.59 | 0.51 | 5.63 | 3.77 | 1.78 | 3.32 | 0.00 | 54.76 | 15.50 | 26.35 | 0.70 | 17.65 | 7.64 | 24.39 | 5.08 | 29.96 | 7.11 | 4.18 |
| DepthSSC (Yao et al., 2024) | 45.84 | 13.28 | 25.94 | 0.35 | 1.16 | 6.02 | 7.50 | 2.58 | 6.32 | 0.00 | 55.38 | 18.76 | 27.04 | 0.92 | 19.23 | 8.46 | 26.37 | 4.52 | 30.19 | 7.42 | 4.09 |
| Symphonize (Jiang et al., 2024) | 41.92 | 14.89 | 28.68 | 2.54 | 2.82 | 20.44 | 13.89 | 3.52 | 2.24 | 0.00 | 56.37 | 15.28 | 27.58 | 0.95 | 21.64 | 8.40 | 25.72 | 6.60 | 30.87 | 9.57 | 5.76 |
| HASSC-S (Wang et al., 2024) | 44.82 | 13.48 | 27.23 | 0.92 | 0.86 | 9.91 | 5.61 | 2.80 | 4.71 | 0.00 | 57.05 | 15.90 | 28.25 | 1.04 | 19.05 | 6.58 | 25.48 | 6.15 | 32.94 | 7.68 | 4.05 |
| H2GFormer-S (Wang and Tong, 2024) | 44.57 | 13.73 | 28.21 | 0.50 | 0.47 | 10.00 | 7.39 | 1.54 | 2.88 | 0.00 | 56.08 | 17.83 | 29.12 | 0.45 | 19.74 | 7.24 | 26.25 | 6.80 | 34.42 | 7.88 | 4.68 |
| MonoScene and ESSC-RM variants | |||||||||||||||||||||
| MonoScene (Cao and de Charette, 2021) | 36.86 | 11.08 | 23.26 | 0.61 | 0.45 | 6.98 | 1.48 | 1.86 | 1.20 | 0.00 | 56.52 | 14.27 | 26.72 | 0.46 | 14.09 | 5.84 | 17.89 | 2.81 | 29.64 | 4.14 | 2.25 |
| MonoScene + 3D U-Net | 35.70 | 11.47 | 23.46 | 0.41 | 0.87 | 10.95 | 3.69 | 2.98 | 1.64 | 0.00 | 56.24 | 14.95 | 26.63 | 1.42 | 13.11 | 6.19 | 16.75 | 2.73 | 29.57 | 3.77 | 2.62 |
| MonoScene + VLGM | 35.62 | 11.49 | 22.76 | 0.44 | 0.71 | 12.45 | 3.12 | 3.04 | 1.64 | 0.00 | 56.48 | 14.35 | 26.64 | 1.42 | 13.55 | 6.28 | 16.44 | 2.97 | 29.50 | 3.85 | 2.65 |
| MonoScene + PNAM | 36.44 | 11.51 | 23.11 | 0.40 | 0.73 | 11.38 | 3.59 | 2.95 | 1.69 | 0.00 | 56.27 | 14.65 | 26.71 | 1.45 | 13.48 | 6.20 | 17.08 | 2.96 | 29.45 | 3.84 | 2.69 |
| CGFormer and ESSC-RM variants | |||||||||||||||||||||
| CGFormer (Tang et al., 2023) | 45.99 | 16.87 | 34.32 | 4.61 | 2.71 | 19.44 | 7.67 | 2.38 | 4.08 | 0.00 | 65.51 | 20.82 | 32.31 | 0.16 | 23.52 | 9.20 | 26.93 | 8.83 | 39.54 | 10.67 | 7.84 |
| CGFormer + 3D U-Net | 43.53 | 17.17 | 33.99 | 5.28 | 3.11 | 22.39 | 8.22 | 2.65 | 4.05 | 0.00 | 65.29 | 20.26 | 32.14 | 0.13 | 23.11 | 8.93 | 26.84 | 11.17 | 38.99 | 11.93 | 7.84 |
| CGFormer + VLGM | 43.20 | 17.21 | 34.33 | 5.24 | 3.01 | 22.33 | 7.81 | 2.70 | 4.12 | 0.00 | 65.52 | 20.79 | 32.31 | 0.13 | 23.27 | 8.95 | 26.69 | 10.73 | 39.29 | 11.93 | 7.82 |
| CGFormer + PNAM | 44.33 | 17.27 | 34.11 | 5.69 | 2.94 | 23.71 | 8.36 | 2.64 | 4.37 | 0.00 | 65.27 | 20.87 | 31.90 | 0.16 | 22.70 | 9.08 | 26.63 | 11.42 | 38.91 | 11.78 | 7.66 |
Quantitative results on the SemanticKITTI validation set.
The upper block lists baseline camera-based SSC methods without ESSC-RM. The middle and lower blocks show MonoScene- and CGFormer-based ESSC-RM variants, respectively. Within each block, the best and second-best results are shown in bold and underlined, respectively.
To assess the generality of ESSC-RM, we plug it on top of both CGFormer and MonoScene, progressively adding (i) a plain 3D U-Net refinement head; (ii) the proposed neighborhood-attention-based refinement module (PNAM); and (iii) the vision–language guidance module (VLGM). The MonoScene and CGFormer blocks in Table 1 summarize these ablation results.
4.2.1.2 ESSC-RM on CGFormer
As shown in the CGFormer block of Table 1, adding a 3D U-Net refinement head improves mIoU from 16.87 to 17.17%. Equipping the refinement with VLGM further increases mIoU to 17.21%, while PNAM achieves the best mIoU of 17.27% with only a modest IoU drop. The gains are more apparent on small and medium-scale categories (e.g., truck, bicycle, trunk, pole), suggesting that coarse-to-fine decoding and neighborhood-aware aggregation help correct local ambiguities and recover thin structures that are challenging for the backbone alone.
Despite consistent improvements, the absolute mIoU gain on CGFormer remains moderate (from 16.87 to 17.27%, +0.40, i.e., ~2% relative). This is mainly because ESSC-RM performs refinement in the voxel-prediction space by design: it takes the discrete semantic occupancy predicted by the backbone, embeds it into a continuous feature map, and refines it via a 3D U-Net style encoder–decoder (with PNAM/VLGM as optional enhancements). Consequently, the global occupancy layout and object extents remain largely inherited from the backbone prediction, while ESSC-RM mainly improves local semantic consistency and boundary delineation (e.g., thin objects and class-confusing regions), which naturally limits the headroom when the backbone output is already geometrically plausible.
4.2.1.3 ESSC-RM on MonoScene
The MonoScene block of Table 1 shows that ESSC-RM also improves the weaker MonoScene baseline. VLGM increases mIoU from 11.08 to 11.49%, and PNAM further pushes it to 11.51% with comparable IoU. These consistent gains across CGFormer and MonoScene support the plug-and-play nature of ESSC-RM, indicating that the refinement is not tied to a specific SSC backbone.
On MonoScene, ESSC-RM improves mIoU from 11.08% to 11.51% (+0.43, ~4% relative). Since the refinement module does not introduce additional sensor-level geometric observations beyond the backbone output, its improvement is mainly achieved by enforcing multi-scale voxel consistency and reducing local misclassifications. When large structures are missing entirely from the coarse prediction, post-hoc voxel refinement cannot fully recover them, whereas it remains effective at sharpening boundaries and improving local semantic coherence.
4.2.1.4 SC-IoU trade-off
Although refinement improves mIoU, SC-IoU (occupied vs. free) can slightly decrease in some cases; this is an expected design trade-off rather than a flaw. For example, on CGFormer, ESSC-RM increases mIoU by +0.40 (16.87% → 17.27%) while SC-IoU decreases by 1.66 (45.99% → 44.33%). This behavior is expected because SC-IoU is a binary occupancy metric that is particularly sensitive to boundary voxels: semantic refinement around thin structures and object borders may flip a small fraction of occupied/free decisions, increasing FP/FN near boundaries even when per-class semantics improve. As SC-IoU aggregates over the entire occupancy grid, such boundary perturbations can lead to a measurable IoU change, reflecting a mild trade-off between semantic correction (mIoU) and boundary-sensitive binary occupancy under discrete voxel predictions.
4.2.1.5 Refinement module efficiency
We further analyze the computational overhead of ESSC-RM on top of CGFormer (Table 2). CGFormer itself has 122.42 M parameters, requires about 19.3 GB of memory during training and 6.55 GB at inference, and runs at approximately 205 ms per frame. The 3D U-Net refinement head adds only 13.36 M parameters and can be trained jointly with CGFormer on a 24 GB GPU when the backbone is set to inference mode. VLGM adds more parameters, and both VLGM and PNAM increase inference time more noticeably, but they remain practical for offline refinement or two-stage pipelines.
Table 2
| Model | IoU | mIoU | Params (M) | Train memory (M) | Infer. memory (M) | Infer. time (ms) |
|---|---|---|---|---|---|---|
| CGFormer | 45.99 | 16.87 | 122.42 | 19,330 | 6,550 | 205 |
| +3D U-Net | 43.53 | 17.17 | 13.36 | 12,726 | 4,904 | 215 |
| +VLGM | 43.20 | 17.21 | 43.96 | 18,942 | 5,382 | 340 |
| +PNAM | 44.33 | 17.27 | 9.59 | 20,664 | 5,042 | 265 |
Ablation study on the efficiency of the refinement module with CGFormer as the backbone. Rows prefixed with "+" report the parameter count of the added refinement head only, not the combined model.
Bold values indicate the best performance (highest is best) within each comparison group.
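The small footprint of the refinement heads in Table 2 follows directly from the parameter count of a dense 3D convolution, c_out · (c_in · k³) + c_out. The sketch below tallies a hypothetical three-level head with two convolutions per level; the channel widths are illustrative assumptions, not the actual ESSC-RM configuration:

```python
def conv3d_params(c_in, c_out, k=3, bias=True):
    """Weights of one dense 3D convolution: c_out * c_in * k^3, plus c_out biases."""
    return c_out * c_in * k**3 + (c_out if bias else 0)

# hypothetical encoder widths of a small 3D U-Net-style head (illustrative only)
widths = [(20, 32), (32, 64), (64, 128)]
total = 0
for c_in, c_out in widths:
    total += conv3d_params(c_in, c_out)   # level entry convolution
    total += conv3d_params(c_out, c_out)  # second convolution at the same width
print(f"{total / 1e6:.2f} M parameters")  # ≈ 0.87 M under these illustrative widths
```

Even with a decoder of comparable size, such a head stays one to two orders of magnitude below the 122.42 M parameters of the CGFormer backbone, which is why joint training fits on a 24 GB GPU once the backbone runs in inference mode.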
4.2.2 Qualitative results
Figures 7, 8 present qualitative results of ESSC-RM applied to CGFormer and MonoScene, respectively, on the SemanticKITTI validation set. Each row shows the input RGB image, the ground truth, the baseline prediction, and the refined outputs after integrating the 3D U-Net, PNAM, and VLGM modules.
Figure 7
Figure 8
Across both baselines, the refinement module consistently reduces holes and misclassifications in occluded or boundary regions, restores missing vegetation and structures at scene edges, and produces smoother and more coherent semantic layouts. On large-scale structures such as roads and buildings, PNAM and VLGM further improve geometric regularity, yielding cleaner contours and more stable surface predictions. For small-scale objects like traffic signs and poles, text-derived priors in VLGM highlight distinctive semantic regions, while PNAM enhances local aggregation and sharpens object boundaries.
These results demonstrate that ESSC-RM provides robust and generalizable refinement across different SSC backbones.
5 Conclusion
In summary, ESSC-RM improves semantic scene completion by refining voxel predictions with PNAM and VLGM, but several challenges remain. First, the refinement module still relies on 3D convolutions and attention, incurring non-negligible latency and memory overhead. Second, our evaluation is centered on SemanticKITTI; broader generalization is constrained by differences in voxel resolution, label taxonomy, and scene layout, and may require re-training or lightweight structural adaptation. Likewise, although we validate the plug-and-play behavior on two representative SSC backbones (CGFormer and MonoScene), extending the plug-in evaluation to additional backbones is limited by the training and storage overhead of voxel-level refinement. Third, PNAM and VLGM are incorporated as largely independent components without a unified fusion mechanism, and emphasizing semantic correction can slightly compromise geometric completeness, leading to minor degradations in SC-IoU.
Future work will therefore explore lightweight and efficient representations (e.g., sparse convolution, tri-plane features, and Gaussian voxelization), knowledge distillation for compact deployment, as well as structured pruning and quantization-aware optimization to further reduce latency and memory footprint. We will also investigate adapter-based transfer across datasets, and broaden plug-in evaluation across diverse SSC backbones to further substantiate generality. In addition, we will study adaptive fusion layers that more tightly couple local geometric attention with textual priors. Finally, integrating generative priors (e.g., CVAE- or diffusion-based models) to pre-complete sparse voxels, together with extensive evaluation on diverse real-world benchmarks, may further improve the robustness, practicality, and scalability of ESSC-RM.
Statements
Data availability statement
Publicly available datasets were analyzed in this study. This data can be found at: https://semantic-kitti.org/.
Author contributions
DZ: Data curation, Writing – original draft, Writing – review & editing. JL: Methodology, Writing – original draft, Writing – review & editing. HY: Funding acquisition, Methodology, Supervision, Writing – review & editing. LB: Funding acquisition, Methodology, Supervision, Writing – review & editing. BS: Funding acquisition, Methodology, Supervision, Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the Natural Science Foundation of Tianjin (No. 24PTLYHZ00290).
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The handling editor HC declared a shared affiliation with the authors DZ and JL at the time of review.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbot.2025.1768219/full#supplementary-material
References
1
BehleyJ.GarbadeM.MiliotoA.QuenzelJ.BehnkeS.GallJ.et al. (2021). Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: the SemanticKITTI dataset. Int. J. Rob. Res. 40, 959–967. doi: 10.1177/02783649211006735
2
BehleyJ.GarbadeM.MiliotoA.QuenzelJ.BehnkeS.StachnissC.et al. (2019). A dataset for semantic segmentation of point cloud sequences. arXiv [Preprint]. arXiv:1904.01416. doi: 10.48550/arXiv.1904.01416
3
BrazilG.LiuX. (2019). M3D-RPN: monocular 3D region proposal network for object detection. arXiv [Preprint]. arXiv:1907.06038. doi: 10.48550/arXiv.1907.06038
4
CaiJ.MengK.YangB.ShaoG. (2024). Multimodal remote sensing scene classification using VLMs and dual-cross attention networks. arXiv [Preprint]. arXiv:2412.02531. doi: 10.48550/ARXIV.2412.02531
5
CaoA.de CharetteR. (2021). MonoScene: monocular 3D semantic scene completion. arXiv [Preprint]. arXiv:2112.00726. doi: 10.48550/arXiv.2112.00726
6
CaoA.-Q.DaiA.de CharetteR. (2024). “PaSCo: urban 3D panoptic scene completion with uncertainty awareness,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
7
CaoH.ChenG.LiZ.HuY.KnollA. (2022). NeuroGrasp: multimodal neural network with euler region regression for neuromorphic vision-based grasp pose estimation. IEEE Trans. Instrum. Meas. 71, 1–11. doi: 10.1109/TIM.2022.3179469
8
CaoH.ChenG.XiaJ.ZhuangG.KnollA. (2021). Fusion-based feature attention gate component for vehicle detection based on event camera. IEEE Sens. J. 21, 24540–24548. doi: 10.1109/JSEN.2021.3115016
9
CaoH.ChenG.ZhaoH.JiangD.ZhangX.TianQ.et al. (2024a). SDPT: semantic-aware dimension-pooling transformer for image segmentation. IEEE Trans. Intell. Transp. Syst. 25, 15934–15946. doi: 10.1109/TITS.2024.3417813
10
CaoH.QuZ.ChenG.LiX.ThieleL.KnollA.et al. (2024b). GhostViT: expediting vision transformers via cheap operations. IEEE Trans. Artif. Intell. 5, 2517–2525. doi: 10.1109/TAI.2023.3326795
11
CaoH.ZhangZ.XiaY.LiX.XiaJ.ChenG.et al. (2024c). “Embracing events and frames with hierarchical feature refinement network for object detection,” in European Conference on Computer Vision, eds. A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol (Milan: Springer), 161–177. doi: 10.1007/978-3-031-72907-2_10
12
ChangJ.ChenY. (2018). Pyramid stereo matching network. arXiv [Preprint]. arXiv:1803.08669. doi: 10.48550/arXiv.1803.08669
13
ChenY.LiuS.ShenX.JiaJ. (2020). DSGN: deep stereo geometry network for 3D object detection. arXiv [Preprint]. arXiv:2001.03398. doi: 10.48550/arXiv.2001.03398
14
ChoM.KimE. (2023). 3D LiDAR multi-object tracking with short-term and long-term multi-level associations. Remote Sens. 15:5486. doi: 10.3390/rs15235486
15
ÇiçekÖ.AbdulkadirA.LienkampS. S.BroxT.RonnebergerO. (2016). 3D U-Net: learning dense volumetric segmentation from sparse annotation. arXiv [Preprint]. arXiv:1606.06650. doi: 10.48550/arXiv.1606.06650
16
DaiW.LiJ.LiD.TiongA.ZhaoJ.WangW.et al. (2023). InstructBLIP: towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500.
17
DengJ.ShiS.LiP.ZhouW.ZhangY.LiH.et al. (2020). Voxel R-CNN: towards high performance voxel-based 3D object detection. arXiv [Preprint]. arXiv:2012.15712. doi: 10.48550/arXiv.2012.15712
18
DollS.SchulzR.SchneiderL.BenzinV.MarkusE.LenschH. P.et al. (2022). “SpatialDETR: robust scalable transformer-based 3D object detection from multi-view camera images with global cross-sensor attention,” in European Conference on Computer Vision (ECCV) (Tel Aviv-Yafo: ACM). doi: 10.1007/978-3-031-19842-7_14
19
DuanK.BaiS.XieL.QiH.HuangQ.TianQ.et al. (2019). CenterNet: keypoint triplets for object detection. arXiv [Preprint]. arXiv:1904.08189. doi: 10.48550/arXiv.1904.08189
20
GeigerA.LenzP.UrtasunR. (2012). “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (Providence, RI: IEEE), 3354–3361. doi: 10.1109/CVPR.2012.6248074
21
GuoY.WangH.HuQ.LiuH.LiuL.BennamounM.et al. (2019). Deep learning for 3D point clouds: a survey. arXiv [Preprint]. arXiv:1912.12033. doi: 10.48550/arXiv.1912.12033
22
HassaniA.HwuW.-M.ShiH. (2024). Faster neighborhood attention: reducing the O(n2) cost of self-attention at the threadblock level. arXiv [Preprint]. arXiv:2403.04690.
23
HassaniA.ShiH. (2022). Dilated neighborhood attention transformer. arXiv preprint arXiv: 2209.15001. Available online at: https://arxiv.org/abs/2209.15001 (Accessed March 11, 2025).
24
HassaniA.WaltonS.LiJ.LiS.ShiH. (2023). “Neighborhood attention transformer,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Vancouver, BC: IEEE). doi: 10.1109/CVPR52729.2023.00599
25
HuH.CaiQ.WangD.LinJ.SunM.KrähenbühlP.et al. (2018). Joint monocular 3D vehicle detection and tracking. arXiv [Preprint]. arXiv:1811.10742. doi: 10.48550/arXiv.1811.10742
26
HuangJ.HuangG.ZhuZ.DuD. (2021). BEVDet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv [Preprint]. arXiv:2112.11790. doi: 10.48550/arXiv.2112.11790
27
HuangY.-K.ZhengW.ZhangY.ZhouJ.LuJ. (2023). “Tri-perspective view for vision-based 3d semantic occupancy prediction,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9223–9232. Available online at: https://api.semanticscholar.org/CorpusID:256868375 (Accessed March 11, 2025).
28
JangH.-K.KimJ.KweonH.YoonK.-J. (2024). TALoS: enhancing semantic scene completion via test-time adaptation on the line of sight. arXiv [Preprint]. arXiv:2410.15674.
29
JiangH.ChengT.GaoN.ZhangH.LinT.LiuW.et al. (2024). “Symphonize 3D semantic scene completion with contextual instance queries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20258–20267. doi: 10.1109/CVPR52733.2024.01915
30
KirillovA.MintunE.RaviN.MaoH.RollandC.GustafsonL.et al. (2023). Segment anything. arXiv [Preprint]. arXiv:2304.02643. doi: 10.48550/arXiv.2304.02643
31
KoukounasA.MastrapasG.WangB.AkramM. K.EslamiS.GüntherM.et al. (2024). jina-clip-v2: multilingual multimodal embeddings for text and images. arXiv [Preprint]. arXiv:2412.08802.
32
LangA. H.VoraS.CaesarH.ZhouL.YangJ.BeijbomO.et al. (2018). PointPillars: fast encoders for object detection from point clouds. arXiv [Preprint]. arXiv:1812.05784. doi: 10.48550/arXiv.1812.05784
33
LeeJ.LeeS.JoC.ImW.SeonJ.YoonS.-E. (2024). “SemCity: semantic scene generation with triplane diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
34
LiB.WeinbergerK. Q.BelongieS. J.KoltunV.RanftlR. (2022). Language-driven semantic segmentation. arXiv [Preprint]. arXiv:2201.03546. doi: 10.48550/arXiv.2201.03546
35
LiJ.LiD.SavareseS.HoiS. (2023). “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proceedings of the 40th International Conference on Machine Learning (ICML).
36
LiP.ChenX.ShenS. (2019). Stereo R-CNN based 3D object detection for autonomous driving. arXiv [Preprint]. arXiv:1902.09738. doi: 10.48550/arXiv.1902.09738
37
LiS.TangH. (2024). Multimodal alignment and fusion: a survey. arXiv [Preprint]. arXiv:2411.17040. doi: 10.48550/ARXIV.2411.17040
38
LiY.LiS.LiuX.GongM.LiK.ChenN.et al. (2024). SSCBench: A large-scale 3D semantic scene completion benchmark for autonomous driving. arXiv [Preprint]. arXiv:2306.09001.
39
LiY.YuZ.ChoyC. B.XiaoC.ÁlvarezJ. M.FidlerS.et al. (2023). “VoxFormer: sparse voxel transformer for camera-based 3D semantic scene completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9087–9098.
40
LiZ.WangW.LiH.XieE.SimaC.LuT.et al. (2025). BEVFormer: learning bird's-eye-view representation from LiDAR-camera via spatiotemporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 47, 2020–2036. doi: 10.1109/TPAMI.2024.3515454
41
LinS.-L.WuJ.-Y. (2025). Enhancing LiDAR-based 3D classification through an improved deep learning framework with residual connections. IEEE Access 13, 42836–42849. doi: 10.1109/ACCESS.2025.3547942
42
LiuH.LiC.LiY.LeeY. J. (2024). “Improved baselines with visual instruction tuning,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 26286–26296. doi: 10.1109/CVPR52733.2024.02484
43
LiuH.LiC.WuQ.LeeY. J. (2023). Visual instruction tuning. arXiv [Preprint]. arXiv:2304.08485.
44
LiuP.LiuH.LiuH.LiuX.NiJ.MaJ. (2025). VLM-E2E: Enhancing end-to-end autonomous driving with multimodal driver attention fusion. arXiv [Preprint]. arXiv:2502.18042.
45
LiuS.ZengZ.RenT.LiF.ZhangH.YangJ.et al. (2023). Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. arXiv [Preprint]. arXiv:2303.05499. doi: 10.48550/arXiv.2303.05499
46
LiuT.WeiY.ZhangY. (2023). “Progressive neighborhood aggregation for semantic segmentation refinement,” in Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'23/IAAI'23/EAAI'23 (Washington, DC: AAAI Press).
47
LoshchilovI.HutterF. (2017). Fixing weight decay regularization in adam. arXiv [Preprint]. arXiv:1711.05101. doi: 10.48550/arXiv.1711.05101
48
MaX.OuyangW.SimonelliA.RicciE. (2022). 3D object detection from images for autonomous driving: a survey. arXiv [Preprint]. arXiv:2202.02980. doi: 10.48550/arXiv.2202.02980
49
ManhardtF.KehlW.GaidonA. (2018). ROI-10D: monocular lifting of 2D detection to 6D pose and metric shape. arXiv [Preprint]. arXiv:1812.02781. doi: 10.48550/arXiv.1812.02781
50
MaoJ.ShiS.WangX.LiH. (2023). 3D object detection for autonomous driving: a comprehensive survey. Int. J. Comput. Vis. 131, 1909–1963. doi: 10.1007/s11263-023-01790-1
51
MiliotoA.VizzoI.BehleyJ.StachnissC. (2019). “RangeNet ++: fast and accurate LiDAR semantic segmentation,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (Macau: IEEE), 4213–4220. doi: 10.1109/IROS40897.2019.8967762
52
MousavianA.AnguelovD.FlynnJ.KoseckaJ. (2016). 3D bounding box estimation using deep learning and geometry. arXiv [Preprint]. arXiv:1612.00496. doi: 10.48550/arXiv.1612.00496
53
OpenAI, AchiamJ.AdlerS.AgarwalS.AhmadL.AkkayaI.et al. (2024). GPT-4 technical report. arXiv [Preprint]. arXiv:2303.08774.
54
PhilionJ.FidlerS. (2020). Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. arXiv [Preprint]. arXiv:2008.05711. doi: 10.48550/arXiv.2008.05711
55
QiC. R.SuH.MoK.GuibasL. J. (2016). PointNet: deep learning on point sets for 3D classification and segmentation. arXiv [Preprint]. arXiv:1612.00593. doi: 10.48550/arXiv.1612.00593
56
QiC. R.YiL.SuH.GuibasL. J. (2017). PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv [Preprint]. arXiv:1706.02413. doi: 10.48550/arXiv.1706.02413
57
RadfordA.KimJ. W.HallacyC.RameshA.GohG.AgarwalS.et al. (2021). Learning transferable visual models from natural language supervision. arXiv [Preprint]. arXiv:2103.00020. doi: 10.48550/arXiv.2103.00020
58
RenT.LiuS.ZengA.LinJ.LiK.CaoH.et al. (2024). Grounded SAM: assembling open-world models for diverse visual tasks. arXiv [Preprint]. arXiv:2401.14159.
59
RoldãoL.de CharetteR.Verroust-BlondetA. (2020). LMSCNet: lightweight multiscale 3D semantic completion. arXiv [Preprint]. arXiv:2008.10559. doi: 10.48550/arXiv.2008.10559
60
RoldãoL.de CharetteR.Verroust-BlondetA. (2021). 3D semantic scene completion: a survey. arXiv [Preprint]. arXiv:2103.07466. doi: 10.48550/arXiv.2103.07466
61
RonnebergerO.FischerP.BroxT. (2015). U-Net: convolutional networks for biomedical image segmentation. arXiv [Preprint]. arXiv:1505.04597. doi: 10.48550/arXiv.1505.04597
62
ShiS.GuoC.JiangL.WangZ.ShiJ.WangX.et al. (2019). PV-RCNN: point-voxel feature set abstraction for 3D object detection. arXiv [Preprint]. arXiv:1912.13192. doi: 10.48550/arXiv.1912.13192
63
ShiS.WangX.LiH. (2018). PointRCNN: 3D object proposal generation and detection from point cloud. arXiv [Preprint]. arXiv:1812.04244. doi: 10.48550/arXiv.1812.04244
64
SilbermanN.HoiemD.KohliP.FergusR. (2012). “Indoor segmentation and support inference from RGBD images,” in Computer Vision-ECCV 2012, eds. A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid (Berlin; Heidelberg: Springer Berlin Heidelberg), 746–760. doi: 10.1007/978-3-642-33715-4_54
65
SmithL. N.TopinN. (2017). Super-convergence: very fast training of residual networks using large learning rates. arXiv [Preprint]. arXiv:1708.07120. doi: 10.48550/arXiv.1708.07120
66
SongS.YuF.ZengA.ChangA. X.SavvaM.FunkhouserT. A.et al. (2016). Semantic scene completion from a single depth image. arXiv [Preprint]. arXiv:1611.08974. doi: 10.48550/arXiv.1611.08974
67
SunQ.FangY.WuL.WangX.CaoY. (2023a). EVA-CLIP: improved training techniques for CLIP at scale. arXiv [Preprint]. arXiv:2303.15389. doi: 10.48550/arXiv.2303.15389
68
SunQ.WangJ.YuQ.CuiY.ZhangF.ZhangX.et al. (2023b). EVA-CLIP-18B: scaling CLIP to 18 billion parameters. arXiv [Preprint]. arXiv:2402.04252.
69
TangJ.ZhengG.ShiC.YangS. (2023). “Contrastive grouping with transformer for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 23570–23580. doi: 10.1109/CVPR52729.2023.02257
70
UlyanovD.VedaldiA.LempitskyV. S. (2016). Instance normalization: the missing ingredient for fast stylization. arXiv [Preprint]. arXiv:1607.08022. doi: 10.48550/arXiv.1607.08022
71
VaswaniA.ShazeerN.ParmarN.UszkoreitJ.JonesL.GomezA. N.et al. (2017). Attention is all you need. arXiv [Preprint]. arXiv:1706.03762. doi: 10.48550/arXiv.1706.03762
72
WangM.PiH.LiR.QinY.TangZ.LiK. (2025). “VLScene: vision-language guidance distillation for camera-based 3D semantic scene completion,” in Proceedings of the AAAI Conference on Artificial Intelligence, 7808–7816.
73
WangM.WuF.QinY.LiR.TangZ.LiK. (2026). Vision-based 3D semantic scene completion via capturing dynamic representations. Knowledge Based Syst. 331:114550. doi: 10.1016/j.knosys.2025.114550
74
WangS.YuJ.LiW.LiuW.LiuX.ChenJ.et al. (2024). “Not all voxels are equal: Hardness-aware semantic scene completion with self-distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
75
WangY.ChaoW.GargD.HariharanB.CampbellM.WeinbergerK. Q.et al. (2018). Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. arXiv [Preprint]. arXiv:1812.07179. doi: 10.48550/arXiv.1812.07179
76
WangY.GuiziliniV.ZhangT.WangY.ZhaoH.SolomonJ.et al. (2021). DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. arXiv [Preprint]. arXiv:2110.06922. doi: 10.48550/arXiv.2110.06922
77
WangY.TongC. (2024). H2GFormer: horizontal-to-global voxel transformer for 3D semantic scene completion. Proc. AAAI Conf. Artif. Intell. 38, 5722–5730. doi: 10.1609/aaai.v38i6.28384
78
WengX.WangJ.HeldD.KitaniK. (2020). AB3DMOT: a baseline for 3D multi-object tracking and new evaluation metrics. arXiv [Preprint]. arXiv:2008.08063. doi: 10.48550/arXiv.2008.08063
79
WuD.LiangZ.ChenG. (2022). Deep learning for LiDAR-only and LiDAR-fusion 3D perception: a survey. Intell. Robot. 2, 105–129. doi: 10.20517/ir.2021.20
80
XiaZ.LiuY.-C.LiX.ZhuX.MaY.LiY.et al. (2023). “SCPNet: semantic scene completion on point cloud,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17642–17651.
81
XiaoH.MastrapasG.WangB. (2024a). “Jina CLIP: your CLIP model is also your text retriever,” in ICML 2024 Workshop on Multi-modal Foundation Models Meets Embodied AI.
82
XiaoH.XuH.KangW.LiY. (2024b). Instance-aware monocular 3D semantic scene completion. IEEE Trans. Intell. Transp. Syst. 25, 6543–6554. doi: 10.1109/TITS.2023.3344806
83
XuB.ChenZ. (2018). “Multi-level fusion based 3D object detection from monocular images,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (Salt Lake City, UT: IEEE), 2345–2353. doi: 10.1109/CVPR.2018.00249
84
XuB.WangN.ChenT.LiM. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv [Preprint]. arXiv:1505.00853. doi: 10.48550/arXiv.1505.00853
85
XuX.WangT.YangY.ZuoL.ShenF.ShenH. T.et al. (2020). Cross-modal attention with semantic consistence for image–text matching. IEEE Trans. Neural Netw. Learn. Syst. 31, 5412–5425. doi: 10.1109/TNNLS.2020.2967597
86
YanX.GaoJ.LiJ.ZhangR.LiZ.HuangR.et al. (2020). Sparse single sweep LiDAR point cloud segmentation via learning contextual shape priors from scene completion. arXiv [Preprint]. arXiv:2012.03762. doi: 10.48550/arXiv.2012.03762
87
YanY.MaoY.LiB. (2018). Second: sparsely embedded convolutional detection. Sensors18:3337. doi: 10.3390/s18103337
88
YangX.ZouH.KongX.HuangT.LiuY.LiW.et al. (2021). Semantic segmentation-assisted scene completion for LiDAR point clouds. arXiv [Preprint]. arXiv:2109.11453. doi: 10.48550/arXiv.2109.11453
89
YaoJ.ZhangJ.PanX.WuT.XiaoC. (2023). “DepthSSC: monocular 3D semantic scene completion via depth-spatial alignment and voxel adaptation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2154–2163.
90
YiX.XuH.ZhangH.TangL.MaJ. (2024). “Text-IF: leveraging semantic text guidance for degradation-aware and interactive image fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Available online at: https://openaccess.thecvf.com/content/CVPR2024/html/Yi_Text-IF_Leveraging_Semantic_Text_Guidance_for_Degradation-Aware_and_Interactive_Image_CVPR_2024_paper.html
91
YouY.WangY.ChaoW.GargD.PleissG.HariharanB.et al. (2019). Pseudo-LiDAR++: accurate depth for 3D object detection in autonomous driving. arXiv [Preprint]. arXiv:1906.06310. doi: 10.48550/arXiv.1906.06310
92
YurtseverE.LambertJ.CarballoA.TakedaK. (2020). A survey of autonomous driving: common practices and emerging technologies. IEEE Access 8, 58443–58469. doi: 10.1109/ACCESS.2020.2983149
93
ZhangB.ZhangP.DongX.ZangY.WangJ. (2024). “Long-CLIP: unlocking the long-text capability of CLIP,” in Computer Vision - ECCV 2024 (Lecture Notes in Computer Science, Vol. 15109) (Springer). doi: 10.1007/978-3-031-72983-6_18
94
ZhangY.ZhuZ.DuD. (2023). “OccFormer: dual-path transformer for vision-based 3D semantic occupancy prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 9399–9409.
95
ZhaoH.LiX.XuC.XuB.LiuH. (2024). “A survey of automatic driving environment perception,” in 2024 IEEE 24th International Conference on Software Quality, Reliability, and Security Companion (QRS-C) (Cambridge: IEEE), 1038–1047. doi: 10.1109/QRS-C63300.2024.00137
96
ZhaoX.ChenB.SunM.YangD.WangY.ZhangX.et al. (2024). HybridOcc: NeRF enhanced transformer-based multi-camera 3D occupancy prediction. IEEE Robot. Autom. Lett. 9, 7867–7874. doi: 10.1109/LRA.2024.3416798
97
ZhouH.ZhuX.SongX.MaY.WangZ.LiH.et al. (2020). Cylinder3D: an effective 3D framework for driving-scene LiDAR semantic segmentation. arXiv [Preprint]. arXiv:2008.01550. doi: 10.48550/arXiv.2008.01550
98
ZhouY.TuzelO. (2017). VoxelNet: end-to-end learning for point cloud based 3D object detection. arXiv [Preprint]. arXiv:1711.06396. doi: 10.48550/arXiv.1711.06396
99
ZhuD.ChenJ.ShenX.LiX.ElhoseinyM. (2024). “MiniGPT-4: enhancing vision-language understanding with advanced large language models,” in International Conference on Learning Representations (ICLR).
100
ZiaM. Z.StarkM.SchindlerK. (2014). “Are cars just 3D boxes? Jointly estimating the 3D shape of multiple objects,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (Columbus, OH: IEEE), 3678–3685. doi: 10.1109/CVPR.2014.470
Summary
Keywords
plug-and-play, PNAM, refinement, semantic scene completion, vision-language guidance
Citation
Zhang D, Lu J, Yang H, Bao L and Song B (2026) Enhancing 3D semantic scene completion with refinement module. Front. Neurorobot. 20:1768219. doi: 10.3389/fnbot.2026.1768219
Received
15 December 2025
Revised
31 December 2025
Accepted
29 January 2026
Published
06 March 2026
Volume
20 - 2026
Edited by
Hu Cao, Technical University of Munich, Germany
Reviewed by
Qi Zhang, City University of Macau, Macao SAR, China
Lei Zhu, Guangzhou University, China
Simone Mosco, University of Padua, Italy
Copyright
© 2026 Zhang, Lu, Yang, Bao and Song.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Han Yang, professorhansolo@tju.edu.cn
Figure legend (SemanticKITTI class frequencies): Car (3.92%), Bicycle (0.03%), Motorcycle (0.03%), Truck (0.16%), Other-vehicle (0.20%), Person (0.07%), Bicyclist (0.07%), Motorcyclist (0.05%), Road (15.30%), Parking (1.12%), Sidewalk (11.13%), Other-ground (0.56%), Building (14.10%), Fence (3.90%), Vegetation (39.3%), Trunk (0.51%), Terrain (9.17%), Pole (0.29%), Traffic-sign (0.08%).