3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding

Abstract

Road topology understanding requires reasoning over directed connectivity, not only detecting map elements. Mask-based pipelines provide a complementary alternative to parametric/query-only formulations, but prior versions were limited by raster discretization artifacts and weak 3D prediction.

TopoMaskV3 makes the mask pathway a standalone 3D predictor by introducing two dense heads: a dense offset field for sub-grid centerline correction and a dense height map for direct z estimation. Beyond architecture, the study introduces a stronger evaluation protocol with geographically disjoint splits to reduce overlap-driven memorization and a long-range (±100 m) benchmark to test robustness at extended distance.

Under this rigorous setting, TopoMaskV3 (Fusion) achieves state-of-the-art results on the disjoint Near split, reaching 28.5 OLSl. The findings emphasize both the effectiveness of a standalone mask-based 3D design and the necessity of benchmark protocols that measure true generalization.

Disjoint and Long-Range Benchmark

TopoMaskV3 releases benchmark resources for geographically disjoint splits and long-range (±100 m) evaluation through the OpenLane-V2 fork.

Resources

Repository: OpenLane-V2 Fork (different_splits branch)
Data Downloads: Google Drive Benchmark Folder
Key Packages: openlanev2_sA_test_anno.zip, openlanev2_sA_anno_100.zip, openlanev2_sA_pkl_files.zip

Paper Novelties

🔧 Standalone 3D Mask-Based Predictor

TopoMaskV3 extends the mask paradigm with dense offset and dense height heads, so that 3D centerlines can be reconstructed from the mask path alone, without relying on a parametric head.

🌍 Rigorous Generalization Benchmark

The study adapts geographically disjoint split methodology to road topology evaluation, reducing overlap-driven memorization and measuring true structural generalization.

📏 Long-Range Challenge

Evaluation scope is extended from ±50 m to ±100 m, providing a stricter test of robustness for long-horizon topology reasoning and early planning decisions.

📡 Key Insight on Fusion

Analysis of output fusion (Mask vs Bézier) and sensor fusion (camera-only vs camera + LiDAR) shows that both are crucial for robust long-range generalization.

Quick Recap: Quad-Direction Label

TopoMaskV3 builds on the quad-direction label representation introduced in TopoMaskV2. Each centerline instance is assigned one dominant direction class (up/down/left/right), so the mask output carries an explicit flow cue before vector reconstruction.

In the reconstruction stage, this label controls direction-aware extraction and ordered point generation, which reduces ambiguous trajectories in intersections and complex road layouts. TopoMaskV3 keeps this flow-aware design and couples it with dense offset and dense height predictions for stronger 3D centerline quality.

Figure 1: Quad-Direction Label Encoding. A centerline is assigned one of four direction labels by majority voting over consecutive points; tie cases are resolved from the start-to-end orientation. The resulting label acts as a compact flow token for direction-aware extraction, ordering, and reconstruction.
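
The labeling rule in Figure 1 can be illustrated with a short sketch. This is not the authors' implementation: the function names are hypothetical, and the axis convention (+y is "up", +x is "right") and the handling of diagonal steps are assumptions made for the example.

```python
import numpy as np

DIRECTIONS = ("up", "down", "left", "right")


def segment_direction(dx, dy):
    """Map one displacement to a direction class.

    Assumed convention: +y is "up" and +x is "right"; a step is
    classified by its larger absolute component.
    """
    if abs(dy) >= abs(dx):
        return "up" if dy >= 0 else "down"
    return "right" if dx >= 0 else "left"


def quad_direction_label(points_xy):
    """Assign one dominant direction to an ordered 2D centerline."""
    pts = np.asarray(points_xy, dtype=float)
    votes = {d: 0 for d in DIRECTIONS}
    # Majority voting over consecutive point pairs.
    for dx, dy in np.diff(pts, axis=0):
        if dx == 0 and dy == 0:
            continue  # repeated point carries no direction
        votes[segment_direction(dx, dy)] += 1
    best = max(votes.values())
    winners = [d for d in DIRECTIONS if votes[d] == best]
    if len(winners) == 1:
        return winners[0]
    # Tie: fall back to the start-to-end orientation.
    dx, dy = pts[-1] - pts[0]
    return segment_direction(dx, dy)
```

For example, a line that moves up twice and right once is labeled "up", while a line with one step right and one step up is a tie and is resolved from its overall displacement.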

Method Overview

TopoMaskV3 follows an instance-query transformer design on BEV features built from multi-camera RGB inputs. Each sparse query represents one candidate centerline instance and predicts the attributes needed for direct mask-to-3D reconstruction.

🏗️ Multi-View to BEV

Perspective-view camera features are projected into a unified BEV representation, which provides a common spatial canvas for topology prediction.

🧭 Query-Level Predictions

For each query, the decoder predicts quad-direction class, mask probabilities, dense 2D offsets, and dense height values to encode flow and geometry jointly.

📈 Direction-Aware 3D Output

Predicted fields are converted into refined points and then reconstructed into ordered 3D centerlines using direction-aware extraction and curve regularization.

Figure 2: TopoMaskV3 Architecture Overview. Multi-camera images are encoded into BEV features and decoded with sparse instance queries. Each query predicts quad-direction labels, mask probabilities, dense offsets, and dense heights; a quad-direction-aware reconstruction stage then converts these dense outputs into ordered 3D centerline instances.

Figure 3 details the decoder heads and how they are used at output level. The primary path is the standalone mask-based route (classification + mask + offset + height), while the Bézier head is optional and mainly used for BDA variants or output fusion experiments.

Classification head: predicts the quad-direction label used for flow definition, point ordering, and axis selection during curve fitting.
Mask head: predicts the rasterized centerline support region for extracting candidate points.
Offset head: predicts a dense 2D field for sub-grid correction of discretization artifacts.
Height head: predicts dense height values to lift refined 2D points into 3D.
Bézier head (optional): predicts 3D control points for a parametric path that can be fused with the mask-based output.

Figure 3: TopoMaskV3 Decoder Heads. Each sparse query is processed by five parallel heads: classification, mask, offset, height, and optional Bézier regression. The standalone inference path uses class + mask + offset + height; when enabled, the Bézier path provides an auxiliary parametric output that can be fused with the mask-based prediction.
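
To make the role of the classification head concrete, the sketch below shows one way a direction label could drive point ordering and axis selection during curve fitting. It is an illustration under assumptions: the page does not specify the curve model, so a plain polynomial fit stands in for the regularization step, and `order_and_fit` and its parameters are hypothetical names.

```python
import numpy as np


def order_and_fit(points_xy, label, degree=3, num_samples=11):
    """Order unordered 2D points by the direction label and fit a curve.

    Vertical labels (up/down) use y as the independent axis, horizontal
    labels (left/right) use x. The polynomial fit is an assumed stand-in
    for the curve regularization step.
    """
    pts = np.asarray(points_xy, dtype=float)
    axis = 1 if label in ("up", "down") else 0   # independent axis
    other = 1 - axis                             # dependent axis
    order = np.argsort(pts[:, axis])
    if label in ("down", "left"):
        order = order[::-1]                      # flow runs toward smaller values
    pts = pts[order]
    # Keep the degree below the number of distinct abscissae.
    deg = max(0, min(degree, len(np.unique(pts[:, axis])) - 1))
    coeffs = np.polyfit(pts[:, axis], pts[:, other], deg)
    t = np.linspace(pts[0, axis], pts[-1, axis], num_samples)
    fitted = np.empty((num_samples, 2))
    fitted[:, axis] = t
    fitted[:, other] = np.polyval(coeffs, t)
    return fitted
```

Choosing the independent axis from the label keeps the fit single-valued for mostly vertical or mostly horizontal lines, which is why the label is needed before the curve is fitted.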

Key Contribution: Offset and Height Mechanisms

TopoMaskV3 makes the mask pathway a standalone 3D predictor by learning two dense geometric fields: a 2D offset field for sub-grid centerline correction and a height map for direct z estimation.

Both mechanisms use closest-point supervision to connect raster pixels with continuous 3D centerlines, then feed a direction-aware reconstruction pipeline for smooth and ordered outputs.

Figure 4: Offset and Height Refinement Mechanism. (a-b) Standard mask extraction produces quantization artifacts because true centerlines rarely align with grid centers. (c) Multi-point proposal learns dense offsets from each foreground pixel to its closest point on the continuous centerline (one-to-many supervision). (d) Single-point proposal refines extracted centerpoints with predicted offsets (one-to-one refinement). The height head follows the same closest-point association to assign z values and lift refined points into 3D.
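
A minimal sketch of closest-point supervision follows, assuming a BEV grid where column index maps to x, row index maps to y, and cell centers sit at half-integer positions. The function names, the grid convention, and expressing offsets in cell units are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np


def closest_point_on_polyline(p_xy, polyline_xyz):
    """Closest point (in the xy plane) on a 3D polyline to a 2D query.

    Returns the 3D point, so z is interpolated along the segment.
    """
    p = np.asarray(p_xy, dtype=float)
    poly = np.asarray(polyline_xyz, dtype=float)
    a, b = poly[:-1], poly[1:]
    ab = b[:, :2] - a[:, :2]
    denom = np.maximum((ab ** 2).sum(axis=1), 1e-12)
    # Projection parameter per segment, clamped to the segment ends.
    t = np.clip(((p - a[:, :2]) * ab).sum(axis=1) / denom, 0.0, 1.0)
    proj = a + t[:, None] * (b - a)
    dist2 = ((proj[:, :2] - p) ** 2).sum(axis=1)
    return proj[np.argmin(dist2)]


def dense_targets(mask, polyline_xyz, cell_size, origin_xy):
    """Offset (in cell units) and height targets for foreground pixels."""
    origin = np.asarray(origin_xy, dtype=float)
    offset = np.zeros(mask.shape + (2,))
    height = np.zeros(mask.shape)
    for r, c in zip(*np.nonzero(mask)):
        center = origin + (np.array([c, r]) + 0.5) * cell_size
        q = closest_point_on_polyline(center, polyline_xyz)
        offset[r, c] = (q[:2] - center) / cell_size  # sub-grid correction
        height[r, c] = q[2]                          # z at the same closest point
    return offset, height
```

Both targets come from the same association, which matches the caption's statement that the height head follows the closest-point rule used for offsets.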

🎯 Offset Prediction

A dense 2D offset vector is predicted per BEV cell. This corrects discretization error from raster masks and recovers sub-grid centerline geometry before curve fitting.

📏 Height Prediction

A dense height map predicts normalized z for the same BEV support region. Sampling height together with refined (x, y) points directly yields 3D centerline candidates.
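
The step from dense fields to 3D candidates can be sketched as follows. The normalization scheme is not given on this page, so the example assumes min-max normalization of z to [0, 1] over a hypothetical `z_range`; the grid convention (column is x, row is y, offsets in cell units) and the function name are likewise assumptions.

```python
import numpy as np


def lift_to_3d(mask_prob, offset, height_norm, cell_size, origin_xy,
               z_range, thresh=0.5):
    """Turn per-query dense outputs into unordered 3D point candidates.

    mask_prob:   (H, W) foreground probabilities
    offset:      (H, W, 2) predicted (dx, dy) in cell units
    height_norm: (H, W) predicted height, assumed normalized to [0, 1]
    """
    rows, cols = np.nonzero(mask_prob > thresh)
    centers = np.stack([cols, rows], axis=1) + 0.5
    refined = centers + offset[rows, cols]           # sub-grid correction
    xy = np.asarray(origin_xy, dtype=float) + refined * cell_size
    z_min, z_max = z_range
    # Height is read at the source pixel of each refined point.
    z = z_min + height_norm[rows, cols] * (z_max - z_min)
    return np.column_stack([xy, z])
```

The output is an unordered point set; ordering and smoothing are left to the direction-aware reconstruction stage.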

📈 Combined Effect

Offset and height are complementary: one improves lateral localization, the other restores vertical structure. Their combination provides the strongest gains in the ablations below.

Experimental Results

Ablation Study

We first isolate the contribution of the dense offset and dense height heads. The ablation shows that both components are individually helpful and jointly strongest, confirming their complementary roles in lateral correction and 3D lifting.

Offset and Height Ablation

Configuration	DETl	DETl_ch	TOPll	OLSl
No Prediction	31.1	31.7	22.5	36.8
Only Offset	32.5	33.1	23.8	38.2
Only Height	32.6	37.2	23.4	39.4
Offset + Height	33.1	37.9	25.0	40.3

The combined setting gives the best result and improves over no prediction by +2.0 DETl, +6.2 DETl_ch, +2.5 TOPll, and +3.5 OLSl.
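
The gains quoted above can be reproduced directly from the ablation table. The snippet below is only an arithmetic check; the dictionary layout and function name are illustrative.

```python
# Values copied from the Offset and Height Ablation table.
ABLATION = {
    "No Prediction":   {"DETl": 31.1, "DETl_ch": 31.7, "TOPll": 22.5, "OLSl": 36.8},
    "Only Offset":     {"DETl": 32.5, "DETl_ch": 33.1, "TOPll": 23.8, "OLSl": 38.2},
    "Only Height":     {"DETl": 32.6, "DETl_ch": 37.2, "TOPll": 23.4, "OLSl": 39.4},
    "Offset + Height": {"DETl": 33.1, "DETl_ch": 37.9, "TOPll": 25.0, "OLSl": 40.3},
}


def gains_over_baseline(config, baseline="No Prediction"):
    """Per-metric improvement of one configuration over the baseline row."""
    base = ABLATION[baseline]
    return {m: round(ABLATION[config][m] - base[m], 1) for m in base}
```

Applied to the single-head rows, the same computation shows that height alone accounts for +5.5 of the +6.2 DETl_ch gain, while offset alone adds +1.4 on that metric.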

Benchmark Performance

Under the geographically disjoint Near split, TopoMaskV3 (Fusion) reaches state-of-the-art OLSl and the best DETl_ch, while staying within 0.4 points of the best prior DETl and TOPll.

SOTA on Geographically Disjoint Near Split

Method	DETl	DETl_ch	TOPll	OLSl
TopoNet	18.9	23.5	12.7	26.0
TopoMLP	15.6	22.4	14.5	25.3
TopoLogic	16.9	22.7	15.5	26.3
TopoMaskV2 (M)	16.4	20.1	10.9	23.2
TopoMaskV2 (F)	18.5	23.8	11.7	25.5
TopoBDA	20.8	24.9	13.0	27.3
TopoMaskV3 (M)	19.3	25.6	13.6	27.3
TopoMaskV3 (F)	20.5	26.2	15.1	28.5

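TopoMaskV3 (Fusion) achieves the best OLSl (28.5) and DETl_ch (26.2) on the disjoint Near split; TopoBDA has the best DETl (20.8) and TopoLogic the best TOPll (15.5). The mask-only TopoMaskV3 (M) ties TopoBDA for second-best OLSl at 27.3.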

Fusion and Sensor Study

We further analyze output-level fusion (Mask/Bézier/Fusion) and sensor setup (camera-only vs camera + LiDAR) across split conditions and range settings (±50 m and ±100 m).

Camera vs Camera+LiDAR Across Range and Split

Output	Original Split (Overlap)				Near Split (Disjoint)
	Camera		Camera + LiDAR		Camera		Camera + LiDAR
	±50 m	±100 m	±50 m	±100 m	±50 m	±100 m	±50 m	±100 m
Bézier	43.5	32.5	51.6	50.0	27.8	16.5	32.0	24.2
Mask	40.8	31.0	47.7	46.9	26.4	16.7	31.0	23.7
Fusion	42.5	32.9	50.0	48.6	27.9	17.4	32.0	24.5

Adding LiDAR improves every output in every setting. Fusion is best or tied for best in all four Near-split columns and in the Original camera ±100 m column; Bézier is best in the remaining three Original columns.

Key Findings from This Analysis

Memorization effect: Moving from Original (overlap) to Near (disjoint) produces large drops across outputs, indicating overlap-driven inflation in the original split (e.g., Bézier, camera, ±100 m: 32.5 → 16.5).
Fusion vs Bézier across split types: In disjoint conditions (Near), Fusion matches or exceeds Bézier in all four sensor-range settings (a tie at 32.0 for camera + LiDAR at ±50 m, and a lead in the other three), while in the overlap-heavy Original split Bézier leads in three of four settings. This pattern indicates stronger memorization sensitivity in the Bézier output type.
Mask vs Bézier robustness (Original → Near): Mask shows consistently smaller reductions across all settings. Camera ±50 m: Bézier -36.1% vs Mask -35.3%; Camera ±100 m: -49.2% vs -46.1%; Camera+LiDAR ±50 m: -38.0% vs -35.0%; Camera+LiDAR ±100 m: -51.6% vs -49.5%.
LiDAR overlap sensitivity (Fusion): LiDAR gains are larger on the overlap-heavy Original split than on the disjoint Near split: +47.7% vs +40.8% at ±100 m, and +17.6% vs +14.7% at ±50 m. This indicates LiDAR-equipped models also benefit from overlap-driven memorization and therefore show an overfitting tendency on the Original split.
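
The percentages in these findings are relative changes computed from the table above. The following sketch recomputes them; the data layout and helper names are illustrative, and the scores are copied from the table as-is.

```python
# Key: (split, sensors, range in metres) -> score per output type.
SCORES = {
    ("Original", "Camera", 50):        {"Bezier": 43.5, "Mask": 40.8, "Fusion": 42.5},
    ("Original", "Camera", 100):       {"Bezier": 32.5, "Mask": 31.0, "Fusion": 32.9},
    ("Original", "Camera+LiDAR", 50):  {"Bezier": 51.6, "Mask": 47.7, "Fusion": 50.0},
    ("Original", "Camera+LiDAR", 100): {"Bezier": 50.0, "Mask": 46.9, "Fusion": 48.6},
    ("Near", "Camera", 50):            {"Bezier": 27.8, "Mask": 26.4, "Fusion": 27.9},
    ("Near", "Camera", 100):           {"Bezier": 16.5, "Mask": 16.7, "Fusion": 17.4},
    ("Near", "Camera+LiDAR", 50):      {"Bezier": 32.0, "Mask": 31.0, "Fusion": 32.0},
    ("Near", "Camera+LiDAR", 100):     {"Bezier": 24.2, "Mask": 23.7, "Fusion": 24.5},
}


def pct_change(old, new):
    """Relative change from old to new, in percent, to one decimal."""
    return round(100.0 * (new - old) / old, 1)


def split_drop(output, sensors, rng):
    """Relative change when moving from the Original to the Near split."""
    return pct_change(SCORES[("Original", sensors, rng)][output],
                      SCORES[("Near", sensors, rng)][output])


def lidar_gain(output, split, rng):
    """Relative change when adding LiDAR to the camera-only setup."""
    return pct_change(SCORES[(split, "Camera", rng)][output],
                      SCORES[(split, "Camera+LiDAR", rng)][output])
```

For instance, `split_drop("Bezier", "Camera", 100)` gives -49.2 and `lidar_gain("Fusion", "Original", 100)` gives 47.7, matching the figures quoted in the list.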

Conclusion

TopoMaskV3 matures the mask-based road-topology paradigm into a practical standalone 3D predictor by introducing dense offset and dense height prediction heads for sub-grid localization and direct elevation estimation.

State-of-the-art under strict evaluation: On the geographically disjoint Near split, TopoMaskV3 (Fusion) reaches 28.5 OLSl, while the standalone mask-only variant reaches 27.3 OLSl.
Benchmark contribution: The proposed geographically disjoint splits and long-range (±100 m) benchmark provide a stricter and more realistic testbed by reducing overlap-driven score inflation.
Generalization insight: Mask and Fusion are more stable across disjoint conditions, whereas Bézier shows stronger dependence on overlap-heavy settings, consistent with output-type memorization behavior.
Sensor-fusion insight: LiDAR is especially valuable at long range, but larger relative gains on the overlap-heavy Original split indicate that LiDAR-equipped models can also benefit from memorization.
Current limitation and direction: TopoMaskV3 still uses a post-processing stage (not fully end-to-end), but introduces a representation that is fundamentally complementary to parametric curve-based formulations.

Citations

@article{kalfaoglu_topomaskv3_2026,
  title={TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding},
  journal={arXiv preprint arXiv:2603.01558},
  author={Kalfaoglu, Muhammet Esat and Ozturk, Halil Ibrahim and Kilinc, Ozsel and Temizel, Alptekin},
  year={2026}
}

@article{kalfaoglu_topomaskv2_2024,
  title={TopoMaskV2: Enhanced Instance-Mask-Based Formulation for the Road Topology Problem},
  journal={arXiv preprint arXiv:2409.11325},
  author={Kalfaoglu, Muhammet Esat and Ozturk, Halil Ibrahim and Kilinc, Ozsel and Temizel, Alptekin},
  year={2024}
}