TopoBDA

Towards Bezier Deformable Attention for Road Topology Understanding

Accepted at Neurocomputing (Elsevier)
Muhammet Esat Kalfaoglu*, Halil Ibrahim Ozturk, Ozsel Kilinc, Alptekin Temizel
Middle East Technical University β€’ Togg/Trutek AI Team

Abstract

Understanding road topology is crucial for autonomous driving. This paper introduces TopoBDA (Topology with Bezier Deformable Attention), a novel approach that enhances road topology comprehension by leveraging Bezier Deformable Attention (BDA). TopoBDA processes multi-camera 360-degree imagery to generate Bird's Eye View (BEV) features, which are refined through a transformer decoder employing BDA. BDA utilizes Bezier control points to drive the deformable attention mechanism, improving the detection and representation of elongated and thin polyline structures, such as lane centerlines.

Additionally, TopoBDA integrates two auxiliary components: an instance mask formulation loss and a one-to-many set prediction loss strategy, to further refine centerline detection and enhance road topology understanding. Experimental evaluations on the OpenLane-V2 dataset demonstrate that TopoBDA outperforms existing methods, achieving state-of-the-art results in centerline detection and topology reasoning. TopoBDA also achieves the best results on the OpenLane-V1 dataset in 3D lane detection. Further experiments on integrating multi-modal dataβ€”such as LiDAR, radar, and SDMapβ€”show that multimodal inputs can further enhance performance in road topology understanding.

πŸ”¬ LiDAR Data Integration

For LiDAR integration experiments with OpenLane-V2, we have created a specialized fork with LiDAR point cloud data integration and visualization capabilities.

πŸ“š Resources

Research Highlights

πŸ”—

Novel MPDA Integration to Bezier Structures

First-time integration of Multi-Point Deformable Attention (MPDA) into Bezier keypoint-dependent transformer decoders, enhancing centerline detection for elongated polyline structures.

🎯 Bezier Deformable Attention (BDA)

Novel attention mechanism utilizing Bezier control points as reference points, achieving superior performance with reduced computational complexity compared to traditional MPDA approaches.

πŸ”§

Instance Mask Formulation

Indirect auxiliary supervision through instance mask prediction and Mask-L1 mix matcher, improving centerline detection without inference overhead.

🌐 Multi-modal Fusion Analysis

First comprehensive evaluation of camera, LiDAR, radar, and SDMap integration for road topology understanding, achieving state-of-the-art multi-modal performance.

⚑

Comprehensive Attention Mechanism Analysis

Systematic comparative analysis of attention mechanisms for road topology understanding, evaluating computational complexity, runtime efficiency, and performance across Standard, Masked, Single-Point, Multi-Point, and Bezier Deformable Attention with detailed FLOPS and parameter analysis.

Method Overview

Figure 1: TopoBDA Architecture. TopoBDA is built on the instance query concept: BEV features extracted from the multi-camera images are fed into a transformer decoder, which outputs Bezier control points for each query; these are converted into centerline instances via matrix multiplication. Each centerline query also predicts an instance mask, but only during training.

Key Innovation: Bezier Deformable Attention

Attention Evolution

Figure 2: Attention Mechanism Evolution. Comparison of Single-Point Deformable Attention (SPDA), Multi-Point Deformable Attention (MPDA), and Bezier Deformable Attention (BDA). Points denote the reference positions (anchors) for each attention head, while arrows indicate the learned offsets that shift attention from these anchors to the actual sampling locations where features are aggregated. SPDA uses identical reference positions across all heads, whereas MPDA and BDA employ distinct reference positions per head, improving attention efficiency for polyline structures. Although MPDA and BDA share the same underlying mechanism, they differ in how multiple reference points $(p_x, p_y)$ are selected. BDA directly utilizes Bezier points as reference positions, while MPDA requires conversion of Bezier points into polyline points and utilizes polyline points as reference positions.
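The reference-point distinction can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the control-point coordinates and offsets below are made-up stand-ins for learned quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, n_samples = 4, 4  # attention heads, sampling points per head

# SPDA: every head shares the same reference point.
spda_ref = np.tile([[0.5, 0.5]], (n_heads, 1))               # (heads, 2)

# BDA: each head is anchored at a distinct Bezier control point.
bezier_ctrl = np.array([[0.1, 0.5], [0.4, 0.6],
                        [0.7, 0.4], [0.9, 0.5]])             # (heads, 2)

# Learned offsets (random stand-ins here) shift attention from the
# anchors to the actual sampling locations where features are gathered.
offsets = 0.05 * rng.standard_normal((n_heads, n_samples, 2))
spda_samples = spda_ref[:, None, :] + offsets                # (heads, n, 2)
bda_samples = bezier_ctrl[:, None, :] + offsets              # (heads, n, 2)
```

MPDA would look identical to the BDA lines except that the anchors are polyline points obtained by first converting the Bezier control points, which is the extra step BDA avoids.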

Implementation Efficiency

Figure 3: BDA vs MPDA Implementation. Comparison of Multi-Point Deformable Attention (MPDA) and Bezier Deformable Attention (BDA). MPDA requires an additional matrix multiplication block within each transformer decoder layer to convert the predicted Bezier control points into polyline points, which then serve as reference points. Despite these different inputs, the MPDA and BDA blocks are mechanistically identical: each attention head operates on a distinct reference point (polyline points in MPDA, Bezier control points in BDA).

Methodology

πŸ—οΈ Instance Query Architecture

Multi-camera 360-degree imagery processed through BEV feature extraction, refined via transformer decoder with sparse query approach where each query represents a centerline instance rather than individual points.

πŸ“ Bezier Control Point Regression

Compact polyline representation through Bezier control points, enabling efficient centerline modeling via matrix multiplication operations while reducing computational complexity at regression heads.
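The control-point-to-polyline conversion is a fixed matrix multiplication with the Bernstein basis. A minimal sketch follows; the numbers of control points and sample points are illustrative, not necessarily the paper's settings.

```python
import numpy as np
from math import comb

def bernstein_matrix(n_ctrl: int, n_pts: int) -> np.ndarray:
    """Bernstein basis matrix B of shape (n_pts, n_ctrl), so that
    polyline = B @ control_points samples the Bezier curve uniformly in t."""
    t = np.linspace(0.0, 1.0, n_pts)[:, None]             # (n_pts, 1)
    k = np.arange(n_ctrl)[None, :]                        # (1, n_ctrl)
    coeff = np.array([comb(n_ctrl - 1, i) for i in range(n_ctrl)])
    return coeff * t**k * (1.0 - t)**(n_ctrl - 1 - k)

# Example: 4 control points in 3D -> an 11-point centerline polyline.
ctrl = np.array([[0.0, 0.0, 0.0], [10.0, 1.0, 0.0],
                 [20.0, -1.0, 0.1], [30.0, 0.0, 0.2]])
B = bernstein_matrix(n_ctrl=4, n_pts=11)
polyline = B @ ctrl                                       # (11, 3)
```

Because B depends only on the sampling resolution, it can be precomputed once; each row sums to 1, so the polyline endpoints coincide with the first and last control points.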

🎯 Auxiliary Training Components

Instance mask prediction and Mask-L1 mix matcher during training enhance centerline detection accuracy. One-to-many set prediction loss improves training convergence without inference overhead.
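The exact Mask-L1 mix matcher is specified in the paper; as a hedged sketch, a mixed matching cost can combine an L1 point-regression cost with a dice-style mask cost before the assignment step. The weights `w_l1` and `w_mask` below are illustrative placeholders, and the brute-force assignment stands in for the usual Hungarian solver.

```python
import numpy as np
from itertools import permutations

def mixed_cost(pred_pts, gt_pts, pred_masks, gt_masks,
               w_l1=1.0, w_mask=1.0):
    """Pairwise cost mixing L1 point distance with a dice-style mask cost."""
    cost = np.zeros((len(pred_pts), len(gt_pts)))
    for i in range(len(pred_pts)):
        for j in range(len(gt_pts)):
            l1 = np.abs(pred_pts[i] - gt_pts[j]).mean()
            inter = (pred_masks[i] * gt_masks[j]).sum()
            dice = 1.0 - 2.0 * inter / (pred_masks[i].sum()
                                        + gt_masks[j].sum() + 1e-6)
            cost[i, j] = w_l1 * l1 + w_mask * dice
    return cost

def assign(cost):
    """Exact min-cost one-to-one assignment by enumeration (tiny sets only);
    a Hungarian solver would be used in practice."""
    n_pred, n_gt = cost.shape
    best = min(permutations(range(n_pred), n_gt),
               key=lambda p: sum(cost[p[j], j] for j in range(n_gt)))
    return best  # best[j] = index of the prediction matched to GT j

# Tiny example: two GT centerlines; predictions arrive in swapped order.
gt_pts = np.array([[[0, 0], [1, 0], [2, 0]],
                   [[0, 5], [1, 5], [2, 5]]], dtype=float)
pred_pts = gt_pts[::-1] + 0.1
gt_masks = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
pred_masks = gt_masks[::-1]
match = assign(mixed_cost(pred_pts, gt_pts, pred_masks, gt_masks))
```

In a one-to-many training round, each ground-truth row would simply be repeated before this matching; at inference no matching or mask head is needed, so neither component adds overhead.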

Figure 6: BDA Layers Visualization. This figure visualizes the layers of TopoBDA, each driven by Bezier Deformable Attention (BDA) using control points predicted through iterative refinement. Note that iterative refinement is not applicable to the first layer, which uses direct prediction.

Experimental Results

πŸ† State-of-the-Art Performance on OpenLane-V2 (Subset-A)

| Method | Sensor | DETl | DETt | TOPll | TOPlt | OLS |
|---|---|---|---|---|---|---|
| TopoNet | C | 28.6 | 48.6 | 10.9 | 23.9 | 39.8 |
| TopoMLP | C | 28.5 | 49.5 | 21.7 | 26.9 | 44.1 |
| TopoFormer | C | 34.7 | 48.2 | 24.1 | 29.5 | 46.3 |
| TopoMaskV2 | C | 34.5 | 53.8 | 24.5 | 35.6 | 49.4 |
| TopoBDA (Ours) | C | 38.9 | 54.3 | 27.6 | 37.3 | 51.7 |
| TopoBDA (Ours) | C + L | 47.3 | 54.0 | 35.5 | 41.9 | 56.4 |
| TopoBDA (Ours) | C + SD | 42.7 | 52.4 | 34.3 | 41.7 | 54.6 |
| TopoBDA (Ours) | C + L + SD | 52.0 | 52.4 | 38.5 | 45.3 | 58.4 |

C: camera | L: LiDAR | SD: SDMap

πŸ† State-of-the-Art Performance on OpenLane-V1 (3D Lane Detection)

| Distance | Method | Backbone | F1-Score ↑ | X-error near (m) ↓ | X-error far (m) ↓ | Z-error near (m) ↓ | Z-error far (m) ↓ |
|---|---|---|---|---|---|---|---|
| 1.5m | PersFormer | ResNet-50 | 52.7 | 0.307 | 0.319 | 0.083 | 0.117 |
| 1.5m | Anchor3DLane | ResNet-50 | 57.5 | 0.229 | 0.243 | 0.079 | 0.106 |
| 1.5m | GroupLane | ResNet-50 | 60.2 | 0.371 | 0.476 | 0.220 | 0.357 |
| 1.5m | LaneCPP | EffNet-B7 | 60.3 | 0.264 | 0.310 | 0.077 | 0.117 |
| 1.5m | LATR | ResNet-50 | 61.9 | 0.219 | 0.259 | 0.075 | 0.104 |
| 1.5m | PVALane | ResNet-50 | 62.7 | 0.232 | 0.259 | 0.092 | 0.118 |
| 1.5m | TopoBDA (Ours) | ResNet-50 | 63.9 | 0.224 | 0.243 | 0.069 | 0.101 |
| 0.5m | PersFormer | ResNet-50 | 43.2 | 0.229 | 0.245 | 0.078 | 0.106 |
| 0.5m | DV-3DLane | ResNet-34 | 52.9 | 0.173 | 0.212 | 0.069 | 0.098 |
| 0.5m | LATR | ResNet-50 | 54.0 | 0.171 | 0.201 | 0.072 | 0.099 |
| 0.5m | TopoBDA (Ours) | ResNet-50 | 57.9 | 0.157 | 0.179 | 0.067 | 0.087 |

Instance Mask Formulation Ablation

| IMAL | ML1M | DETl | DETl_ch | TOPll | OLSl |
|---|---|---|---|---|---|
| ✗ | ✗ | 37.0 | 39.8 | 29.0 | 43.6 |
| ✓ | ✗ | 40.7 | 42.1 | 32.4 | 46.6 |
| ✓ | ✓ | 40.8 | 45.8 | 32.9 | 48.0 |

IMAL: Instance Mask Auxiliary Loss | ML1M: Mask-L1 Mix Matcher
Together, the instance mask auxiliary loss and the Mask-L1 mix matcher yield significant improvements: +3.8 points in DETl and +4.4 points in OLSl.

Attention Mechanism Ablation

| Attention Type | DETl | DETl_ch | TOPll | OLSl |
|---|---|---|---|---|
| SA (Standard Attention) | 34.5 | 38.4 | 25.1 | 41.0 |
| MA (Masked Attention) | 35.8 | 40.2 | 26.9 | 42.6 |
| SPDA (Single-Point Deformable Attention) | 38.3 | 39.8 | 29.5 | 44.1 |
| MPDA4 (Multi-Point Deformable Attention) | 40.2 | 45.0 | 32.6 | 47.4 |
| MPDA16 (16-Point Deformable Attention) | 40.3 | 45.1 | 32.7 | 47.5 |
| BDA (Bezier Deformable Attention) | 40.8 | 45.8 | 32.9 | 48.0 |

Visual Results

Figure 5: Multi-modal Fusion Analysis. Visual demonstration in the BEV domain of the impact of adding LiDAR and SDMap on Subset-A of the OpenLane-V2 dataset. C, SD, and L denote camera, SDMap, and LiDAR, respectively. Green polylines indicate the ground truth and red polylines the predictions. Circled regions highlight inaccuracies relative to the other reference BEV images.
Figure 7: Dataset Ground Truth Visualization. Perspective-view (PV) and bird's-eye-view (BEV) samples from the OpenLane-V2 dataset. (a) and (d) show centerline instances in PV and BEV domains, respectively, with each color representing a distinct instance. (b) and (c) illustrate centerlines with colors indicating topological relationships between centerlines and traffic elements in PV and BEV. (e) visualizes the topological relationships among different centerlines, where directed arrows indicate connectivity between centerlines.

πŸŽ₯ TopoBDA Video Demonstrations

πŸ“Ή Front View Perspective - Ground Truth vs Predictions

Front View Analysis: Demonstrates TopoBDA's centerline detection performance in the front camera perspective view, comparing ground truth and model predictions. Centerlines are shown in blue colors, while topological relationships between centerlines and traffic elements are indicated by arrows. The arrow colors represent different traffic element types, highlighting the spatial connectivity and semantic understanding capabilities of the model.

πŸ“Ή Multi-View 360Β° Coverage - Centerline Instance Predictions

Multi-View Analysis: Showcases TopoBDA's performance across all camera views with comprehensive 360-degree coverage. The visualization focuses on centerline instance predictions, where different instances are distinguished by distinct colors. This demonstration highlights the model's capability to accurately detect and differentiate multiple centerline instances across various viewing perspectives.

πŸ“Ή Bird's Eye View Analysis - Multi-Modal Fusion

BEV Analysis: Comprehensive Bird's Eye View visualization showcasing nine distinct perspectives with detailed ground truth-prediction correspondences: (1) Ground truth traffic element-centerline topology with highlighted relationships, (2) Ground truth instances with color-coded differentiation, (3) Predicted traffic element-centerline topology with highlighted relationships (correspondence to 1), (4) Predicted instances with color-coded differentiation (correspondence to 2), (5) Ground truth centerline-to-centerline topology relationships, (6) Predicted centerline-to-centerline topology relationships (correspondence to 5), (7) Overlaid prediction comparison analysis, (8) SDMap input visualization, and (9) LiDAR point cloud input visualization. This sequence demonstrates TopoBDA's multi-modal fusion capabilities integrating camera, SDMap, and LiDAR data for robust road topology understanding.

Conclusion

Experimental evaluations demonstrate that TopoBDA achieves state-of-the-art performance across both subsets of the OpenLane-V2 dataset. Specifically, TopoBDA surpasses existing methods with a DETl score of 38.9 and an OLS score of 51.7 in Subset-A, and a DETl score of 45.1 and an OLS score of 54.3 in Subset-B.

The integration of multi-modal data significantly boosts performance: fusing camera and LiDAR data increases the OLS score in Subset-A from 51.7 to 56.4, and in Subset-B from 54.3 to 61.7. Further incorporating SDMap alongside camera and LiDAR sensors raises the OLS score in Subset-A to 58.4. These results underscore the effectiveness of TopoBDA in road topology comprehension and highlight the substantial benefits of multi-modal fusion.

Additionally, TopoBDA achieves superior results on the OpenLane-V1 benchmark for 3D lane detection, with F1-scores of 63.9 at a 1.5m distance and 57.9 at a 0.5m distance. This work contributes toward closing existing gaps in HDMap element prediction, offering a unified framework for road topology understanding and 3D lane detection in autonomous driving.

Citation

@article{kalfaoglu2026topobda,
  title={TopoBDA: Towards Bezier Deformable Attention for Road Topology Understanding},
  author={Kalfaoglu, Muhammet Esat and Ozturk, Halil Ibrahim and Kilinc, Ozsel and Temizel, Alptekin},
  journal={Neurocomputing},
  volume={670},
  pages={132360},
  year={2026},
  publisher={Elsevier},
  doi={10.1016/j.neucom.2025.132360}
}