Abstract


Localization is paramount for autonomous robots. While camera and LiDAR-based approaches have been extensively investigated, they are affected by adverse illumination and weather conditions. Therefore, radar sensors have recently gained attention due to their intrinsic robustness to such conditions. Most existing radar-based localization methods compare the onboard radar measurement with a pre-built radar map of the environment. As radar maps are not readily available today, these methods first require a mapping phase of each environment in which the robot has to be deployed. LiDAR maps, on the other hand, are becoming increasingly available due to the growing demand for high-definition maps required by the next generation of Advanced Driver-Assistance Systems (ADAS) and self-driving cars.

In this work, we propose RaLF, a novel deep neural network-based approach for localizing radar scans in a LiDAR map of the environment by jointly learning to address both place recognition and metric localization. RaLF is composed of radar and LiDAR feature encoders, a place recognition head that generates global descriptors, and a metric localization head that predicts the 3-DoF transformation between the radar scan and the map. We tackle the place recognition task by learning a shared embedding space between the two modalities via cross-modal metric learning. Additionally, we perform metric localization by predicting pixel-level flow vectors that align the query radar scan with the LiDAR map. We extensively evaluate our approach on multiple real-world driving datasets and show that RaLF achieves state-of-the-art performance for both place recognition and metric localization. Moreover, we demonstrate that our approach can effectively generalize to cities and sensor setups different from those used during training.

Technical Approach

Figure: Overview of our proposed RaLF architecture for joint place recognition and metric localization of radar scans in a LiDAR map. It consists of feature encoders, a place recognition head to extract global descriptors, and a metric localization head to estimate the 3-DoF pose of the query radar scan within the LiDAR map.

Unlike existing radar-LiDAR localization methods, RaLF is the first method to jointly address both place recognition and metric localization. We reformulate the metric localization task as a flow estimation problem, in which we predict pixel-level correspondences between the radar and LiDAR samples that are subsequently used to estimate a 3-DoF transformation. For place recognition, we leverage a combination of same-modal and cross-modal metric learning to learn a shared embedding space in which features from both modalities can be compared against each other.


The two encoders, namely the radar encoder and the LiDAR encoder, are based on the feature encoder of RAFT, which consists of a convolutional layer with a stride of two, followed by six residual layers with downsampling after the second and fourth layers. Unlike the original RAFT feature encoder, which shares weights between the two input images, RaLF employs a separate feature extractor for each modality due to the distinct nature of radar and LiDAR data.
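To make this structure concrete, below is a minimal PyTorch sketch of such a RAFT-style feature encoder. The channel widths, kernel sizes, and single-channel BEV input are illustrative assumptions; only the overall layout (one strided convolution followed by six residual layers, with downsampling after the second and fourth layers) follows the description above.

```python
# Minimal sketch of a RAFT-style feature encoder; widths are assumptions.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection on the skip path when the shape changes, as in standard ResNets
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                     if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + self.skip(x))

class FeatureEncoder(nn.Module):
    """Strided stem conv followed by six residual layers;
    layers 3 and 5 downsample (i.e., after the second and fourth layers)."""
    def __init__(self, in_ch=1, dims=(64, 96, 128)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dims[0], 7, stride=2, padding=3),
            nn.BatchNorm2d(dims[0]), nn.ReLU(inplace=True))
        self.layers = nn.Sequential(
            ResidualBlock(dims[0], dims[0]),
            ResidualBlock(dims[0], dims[0]),
            ResidualBlock(dims[0], dims[1], stride=2),  # downsample after layer 2
            ResidualBlock(dims[1], dims[1]),
            ResidualBlock(dims[1], dims[2], stride=2),  # downsample after layer 4
            ResidualBlock(dims[2], dims[2]))

    def forward(self, x):
        return self.layers(self.stem(x))

# Separate encoders per modality, since weights are not shared:
radar_encoder, lidar_encoder = FeatureEncoder(), FeatureEncoder()
```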


The place recognition head serves a twofold purpose: first, it aggregates the feature maps from the feature extractors into a global descriptor; second, it maps features from radar and LiDAR data, which naturally lie in different embedding spaces, into a shared embedding space in which global descriptors of radar scans and LiDAR submaps can be compared against each other. The head is a shallow CNN composed of four convolutional layers with feature sizes (256, 128, 128, 128), respectively, each followed by batch normalization and ReLU activation. Unlike the feature encoders, the place recognition head is shared between the radar and LiDAR modalities. To train it, we use triplet-based metric learning, where triplets of (anchor, positive, negative) samples are selected to compute the triplet loss. The positive sample is a BEV image depicting the same place as the anchor, while the negative sample is a BEV image of a different place. While this technique is typically employed on triplets of the same modality, in our case the samples can come from different modalities.
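As an illustration, here is a minimal PyTorch sketch of the place recognition head and the triplet loss. The kernel size, the use of global average pooling for descriptor aggregation, and the margin value are assumptions; the description above fixes only the four convolutional layers with (256, 128, 128, 128) channels, each followed by batch normalization and ReLU, shared across both modalities.

```python
# Minimal sketch of the shared place recognition head and a cross-modal triplet loss.
import torch.nn as nn
import torch.nn.functional as F

class PlaceRecognitionHead(nn.Module):
    """Shared head mapping radar and LiDAR feature maps into one embedding space."""
    def __init__(self, in_ch=128, dims=(256, 128, 128, 128)):
        super().__init__()
        layers, prev = [], in_ch
        for d in dims:
            layers += [nn.Conv2d(prev, d, kernel_size=3, padding=1),
                       nn.BatchNorm2d(d), nn.ReLU(inplace=True)]
            prev = d
        self.cnn = nn.Sequential(*layers)

    def forward(self, feat):
        # Aggregate the feature map into a single global descriptor
        # (global average pooling is an assumption here).
        desc = self.cnn(feat).mean(dim=(2, 3))
        return F.normalize(desc, dim=1)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet margin loss. Because the head embeds both modalities
    into the same space, (anchor, positive, negative) may mix modalities,
    e.g., a radar anchor with LiDAR positive and negative samples."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```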


For metric localization of radar scans against a LiDAR map, we propose to learn pixel-wise matches in the form of flow vectors. The intuition behind this design is that a radar BEV image and a LiDAR BEV image taken at the same position should be well aligned. Therefore, for every pixel in the LiDAR BEV image, our metric localization head predicts the corresponding pixel in the radar BEV image. The head is based on RAFT: it first computes a 4-D correlation volume between the features extracted by the two encoders, which is then fed into a GRU that iteratively refines the estimated flow map.
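The core of this head is the all-pairs correlation volume; a minimal sketch is shown below, with the GRU refinement loop omitted. The scaling by the square root of the feature dimension follows RAFT, while the function name and tensor layout are our own.

```python
# Minimal sketch of the 4-D all-pairs correlation volume between modalities.
import torch

def correlation_volume(f_lidar, f_radar):
    """All-pairs dot products between LiDAR and radar feature maps.
    Returns a (B, H, W, H, W) volume: one 2-D similarity map over the
    radar features for every LiDAR feature location."""
    b, c, h, w = f_lidar.shape
    f1 = f_lidar.reshape(b, c, h * w)
    f2 = f_radar.reshape(b, c, h * w)
    corr = torch.einsum('bci,bcj->bij', f1, f2) / c ** 0.5  # scale as in RAFT
    return corr.reshape(b, h, w, h, w)
```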


During inference, given a query radar scan, we first compute its global descriptor and compare it against the descriptors of all LiDAR submaps, selecting the submap with the highest similarity. We then feed the radar scan and the retrieved submap to the metric localization head, which outputs a flow map from which we estimate the 3-DoF transformation between the radar scan and the LiDAR map using RANSAC.
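The sketch below illustrates this pipeline under stated assumptions: descriptors are L2-normalized so retrieval reduces to a dot product, and the RANSAC step is realized with OpenCV's estimateAffinePartial2D as a stand-in for any 2-D rigid-fit RANSAC. The flow channel order (dx, dy) and the inlier threshold are likewise assumptions.

```python
# Minimal sketch of the inference pipeline: retrieval, then RANSAC pose fitting.
import cv2
import numpy as np

def retrieve_submap(query_desc, submap_descs):
    """Cosine-similarity retrieval; descriptors are assumed L2-normalized,
    so a matrix-vector product gives the similarities directly."""
    sims = submap_descs @ query_desc          # (num_submaps,)
    return int(np.argmax(sims))

def pose_from_flow(flow, valid_mask):
    """Fit a 3-DoF (x, y, yaw) transform to the predicted correspondences.
    `flow` is (H, W, 2) with (dx, dy) per LiDAR pixel; `valid_mask` marks
    pixels with reliable matches (e.g., occupied LiDAR cells)."""
    ys, xs = np.nonzero(valid_mask)
    src = np.stack([xs, ys], axis=1).astype(np.float32)        # LiDAR pixel coords
    dst = (src + flow[ys, xs]).astype(np.float32)              # matched radar pixels
    # estimateAffinePartial2D fits rotation + translation (+ scale, which
    # should be ~1 for metric BEV grids) with RANSAC outlier rejection.
    M, inliers = cv2.estimateAffinePartial2D(
        src, dst, method=cv2.RANSAC, ransacReprojThreshold=2.0)
    if M is None:
        raise RuntimeError("RANSAC failed to estimate a transform")
    yaw = np.arctan2(M[1, 0], M[0, 0])
    return M[0, 2], M[1, 2], yaw                               # (tx, ty) in pixels, yaw in rad
```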


Video

Code

A PyTorch-based implementation of this project is available in our GitHub repository for academic use and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Abhijeet Nayak, Daniele Cattaneo, Abhinav Valada
RaLF: Flow-based Global and Metric Radar Localization in LiDAR Maps
IEEE International Conference on Robotics and Automation (ICRA), 2024.

(PDF) (BibTeX)

Authors

Abhijeet Nayak
University of Freiburg

Daniele Cattaneo
University of Freiburg

Abhinav Valada
University of Freiburg

Acknowledgment

This work was funded by the German Research Foundation (DFG) Emmy Noether Program under grant number 468878300.