Open access
Date: 2023
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
In this thesis, online understanding of the traffic scene using onboard sensors is addressed. To this end, several approaches for the direct extraction of centerline-based lane graphs are discussed. Throughout this thesis, the lane graphs are constructed as directed graphs: the vertices represent the centerline segments, and the directed edges indicate the connectivity of those segments. The centerlines are modeled with Bézier curves; using a fixed number of control points, this representation integrates straightforwardly with neural networks. To produce the lane graph estimate, a transformer-based model is presented that takes an onboard camera image and outputs the Bird's-Eye-View (BEV) lane graph of the local road network. The model produces a complete graph that can easily be used by downstream tasks such as planning and prediction. To enable the evaluation of lane graph estimates, several metrics are also proposed; they aim to measure the accuracy of the directed graph estimate in terms of both the vertices and the edges.
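The representation above can be sketched in a few lines: each vertex of the directed lane graph holds a fixed-size array of Bézier control points, and edges record which segment flows into which. The helper below is a minimal, hypothetical illustration of that idea (the thesis's actual data structures are not specified here); it samples a Bézier curve from its control points using the Bernstein basis.

```python
import numpy as np
from math import comb

def bezier_points(control_points, n_samples=20):
    """Sample a Bezier curve defined by a fixed set of control points."""
    P = np.asarray(control_points, dtype=float)   # (k+1, 2) control points
    k = len(P) - 1                                # curve degree
    t = np.linspace(0.0, 1.0, n_samples)
    # Bernstein basis: B_i(t) = C(k, i) * t^i * (1 - t)^(k - i)
    basis = np.stack([comb(k, i) * t**i * (1 - t)**(k - i)
                      for i in range(k + 1)], axis=1)  # (n_samples, k+1)
    return basis @ P                              # (n_samples, 2)

# Toy lane graph: vertices are centerline segments (control-point arrays),
# directed edges encode which segment continues into which.
centerlines = {
    0: [(0, 0), (1, 0), (2, 0)],      # straight segment
    1: [(2, 0), (3, 0.5), (3, 1.5)],  # right turn continuing from segment 0
}
edges = [(0, 1)]  # segment 0 connects into segment 1

pts = bezier_points(centerlines[0])
print(pts[0], pts[-1])  # a Bezier curve interpolates its end control points
```

Because the number of control points is fixed, every vertex has the same tensor shape, which is what makes the representation convenient as a neural network output.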
In Chapter 3, the temporal extension of the lane graph extraction task is presented. The proposed model can use any number of frames, including a single frame, to produce the lane graph representation of a target region. The core of the method is the temporal aggregation module, where the image feature maps are projected onto the target BEV grid. The projection is achieved using a flat-ground assumption. We show that this assumption does not significantly impact performance, since the model learns to correct for the projection errors in the downstream layers. To validate the performance of this method, we also propose two baselines. The first is a transformer-based model that extends the spatial positional embedding into the temporal dimension by concatenating, in the channel dimension, the sinusoidal signals from the BEV grid location and the temporal ordering of a frame. The other baseline is a post-processing procedure that combines lane graph estimates obtained independently from each input frame. This procedure explicitly transforms the estimated centerlines into a common reference frame and applies a heuristic matching to cluster and refine them. After the centerlines are refined, the connectivity estimates of the individual lane graph estimates are filtered to create the final lane graph. The experiments show that the proposed method outperforms the baselines as well as the previous single-frame state-of-the-art methods, even when using a single frame.
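The flat-ground projection mentioned above can be sketched with a pinhole camera model: a BEV grid cell is treated as a world point on the plane z = 0 and projected into the image to fetch the corresponding feature. The snippet below is a minimal sketch under that assumption; the intrinsics `K`, rotation `R`, and translation `t` are hypothetical toy values, not the thesis's calibration.

```python
import numpy as np

def ground_to_pixel(ground_xy, K, R, t):
    """Project ground-plane points (z = 0) into the image under a pinhole
    model -- the flat-ground assumption used for BEV feature projection."""
    X = np.concatenate([ground_xy, np.zeros((len(ground_xy), 1))], axis=1)
    cam = R @ X.T + t[:, None]          # world -> camera coordinates
    uv = K @ cam                        # camera -> homogeneous pixel coords
    return (uv[:2] / uv[2]).T           # perspective divide

# Toy setup (hypothetical values): camera 1.5 m above ground, looking forward.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.array([[1.0, 0.0,  0.0],         # world x -> camera x
              [0.0, 0.0, -1.0],         # world z (up) -> camera -y
              [0.0, 1.0,  0.0]])        # world y (forward) -> camera z
t = np.array([0.0, 1.5, 0.0])           # camera height folded into translation

px = ground_to_pixel(np.array([[0.0, 10.0]]), K, R, t)
print(px)  # pixel for a ground point 10 m ahead of the camera
```

Any point that is not actually on the ground (e.g. the top of a vehicle) lands at the wrong BEV cell under this projection; per the abstract, the model learns to absorb such errors in later layers.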
In Chapter 4, the topology of a lane graph is defined, and a training framework is proposed in which the method is supervised in terms of both the lane graph and its topological structure. The topology of a lane graph is defined as the intersection order of the centerlines, which is shown to be equivalent to the list of minimal covers arising from the geometric structure of the lane graph. This framework enhances the method's ability to learn the underlying geometric structure of the lane graph, and providing this additional representation improves the accuracy of the estimates in all metrics. Moreover, in order to empirically validate the proposed theoretical framework and measure the performance of the baselines as well as the proposed method, two new metrics are introduced that specifically focus on the topological structure of the road network. The results indicate that the order of intersections can indeed be equivalently represented by the minimal covers, and that the proposed method achieves higher scores in the topological understanding task.
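To make the notion of "intersection order" concrete, the toy function below computes, for each centerline, the sequence of other centerlines it meets, ordered by where along the curve the meeting occurs. This is only an illustrative sketch on sampled polylines with a hypothetical distance tolerance; it is not the thesis's minimal-cover construction.

```python
import numpy as np

def intersection_order(lines, tol=0.1):
    """For each polyline, list the other polylines it meets, ordered by the
    sample index (a proxy for arc length) at which the meeting occurs."""
    order = {}
    for i, a in enumerate(lines):
        hits = []
        for j, b in enumerate(lines):
            if i == j:
                continue
            # pairwise distances between every sample of a and every sample of b
            d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
            if d.min() < tol:
                hits.append((d.min(axis=1).argmin(), j))  # position along a
        order[i] = [j for _, j in sorted(hits)]
    return order

# Three toy centerlines: line 0 runs left-to-right, crossing line 1 then line 2.
s = np.linspace(0, 1, 101)
line0 = np.stack([10 * s, np.zeros_like(s)], axis=1)       # horizontal at y = 0
line1 = np.stack([np.full_like(s, 3.0), 10 * s - 5], axis=1)  # vertical at x = 3
line2 = np.stack([np.full_like(s, 7.0), 10 * s - 5], axis=1)  # vertical at x = 7
print(intersection_order([line0, line1, line2])[0])  # -> [1, 2]
```

The ordered lists produced this way are the kind of topological target that can supervise a model alongside the geometric lane graph loss.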
Autonomous agents frequently utilize online object detection algorithms. In Chapter 5, we discuss how to use the detections of these algorithms to improve the static lane graph of a scene. The proposed method takes 3D object detection bounding boxes and an image to produce a Bird's-Eye-View lane graph. The spatial accuracy of the 3D detections can help with the accurate localization of the centerlines. The key idea is to cluster the objects around the centerlines. To this end, the method learns to distinguish between objects that occupy a centerline and objects that are in the background (parked, not in traffic). For the objects in traffic, the method assigns each input bounding box to one of its estimated centerlines. Through supervised training, this framework encourages the method to utilize the object detections for lane graph extraction and improves the performance in all metrics.
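The clustering idea above can be sketched with a simple nearest-centerline assignment: each detected object is attached to the closest estimated centerline, or labeled background when it lies too far from every lane. The coordinates and distance threshold below are hypothetical; the thesis learns this assignment rather than applying a fixed rule.

```python
import numpy as np

def assign_boxes(box_centers, centerlines, max_dist=2.0):
    """Assign each detected object to its nearest centerline, or to the
    background (-1) when it lies far from every lane, e.g. parked cars."""
    assignments = []
    for c in box_centers:
        dists = [np.linalg.norm(line - c, axis=1).min() for line in centerlines]
        best = int(np.argmin(dists))
        assignments.append(best if dists[best] < max_dist else -1)
    return assignments

# Toy example (hypothetical coordinates in metres, BEV frame).
s = np.linspace(0, 20, 50)
lane0 = np.stack([s, np.zeros_like(s)], axis=1)       # straight lane at y = 0
lane1 = np.stack([s, np.full_like(s, 3.5)], axis=1)   # parallel lane at y = 3.5
boxes = np.array([[ 5.0, 0.3],    # car driving in lane 0
                  [12.0, 3.2],    # car driving in lane 1
                  [ 8.0, 9.0]])   # parked vehicle far off the road
print(assign_boxes(boxes, [lane0, lane1]))  # -> [0, 1, -1]
```

Inverting this relation is what makes detections useful for the lane graph itself: vehicles that cluster tightly along a line are evidence for a centerline at that location.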
Permanent link: https://doi.org/10.3929/ethz-b-000662203
Publication status: published
External links: Search print copy at ETH Library
Contributors
Examiner: Van Gool, Luc
Examiner: Caesar, Holger
Examiner: Krähenbühl, Philipp
Examiner: Paudel, Danda Pani
Publisher: ETH Zurich
Organisational unit: 03514 - Van Gool, Luc / Van Gool, Luc