LF²SLAM: Learning-based Features For Visual SLAM

Marco Legittimo1, Francesco Crocetti1, Mario Luca Fravolini1, Giuseppe Mollica2, Gabriele Costante1

1Department of Engineering, University of Perugia, Italy & 2ART Spa

Paper · Video · Code

Abstract

Autonomous robot navigation relies on a robot's ability to understand its environment for localization, typically through a Visual Simultaneous Localization And Mapping (VSLAM) algorithm that processes image sequences. While state-of-the-art methods show remarkable performance, they face complementary limitations: geometric VSLAM algorithms depend on carefully tuned hand-crafted feature extractors, whereas data-driven approaches generalize poorly and require extensive training data. LF²SLAM is a hybrid solution that combines the strengths of both paradigms by integrating a deep neural network for feature extraction into a traditional VSLAM pipeline. A novel loss function, built on the Monodepth2 framework, trains a sparse keypoint detector optimized specifically for pose estimation. Experimental evaluations demonstrate the system's robustness in challenging scenarios, such as low-light and low-texture environments, where it significantly outperforms ORB-SLAM3 in accuracy and generalization.

Overview

Overview of LF²SLAM compared to ORB-SLAM3

LF²SLAM addresses the challenges of traditional Visual SLAM algorithms by integrating data-driven feature extraction into a geometric pipeline. The system combines the SuperPoint network for sparse keypoint detection with the Monodepth2 training framework, enabling robust performance in environments with poor lighting and low texture. This hybrid approach reduces the reliance on extensive parameter tuning, pairing the adaptability of deep learning with the reliability of geometric methods, and making it well suited for real-world robotics applications.

Proposed Method

SLAM-specific feature extractor

The LF²SLAM training pipeline

LF²SLAM employs a novel training procedure that integrates the SuperPoint feature extractor into the Monodepth2 framework, enabling the detection of sparse keypoints optimized for visual odometry. This is achieved by modifying the standard photometric loss to focus on the most salient keypoints, using a binary heatmap produced by the SuperPoint network. The heatmap identifies pixels with a high probability of being keypoints, and only these pixels contribute to a reprojection loss that minimizes pose estimation error. Two regularization terms keep training robust and rule out trivial solutions such as suppressing keypoint detection entirely: one enforces a target number of detected keypoints, while the other enforces consistency with a pre-trained SuperPoint model. A rough sketch of this loss is given below.
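
The following PyTorch sketch illustrates how such a keypoint-masked training loss could be assembled. It is a minimal sketch under stated assumptions: the tensor shapes, the threshold tau, the keypoint budget, the straight-through masking trick, and the loss weights are all illustrative choices, not the paper's actual implementation.

    # Hypothetical sketch of the keypoint-focused loss described above.
    import torch
    import torch.nn.functional as F

    def lf2slam_loss(photometric_error, heatmap, teacher_heatmap,
                     target_num_kpts=512, tau=0.5, w_count=0.1, w_distill=0.1):
        """photometric_error: (B, 1, H, W) per-pixel reprojection error from
                              Monodepth2-style view synthesis.
           heatmap:           (B, 1, H, W) keypoint probabilities from the
                              trainable SuperPoint-style detector.
           teacher_heatmap:   (B, 1, H, W) probabilities from the frozen,
                              pre-trained SuperPoint model."""
        # Hard keypoint mask with a straight-through estimator, so gradients
        # still reach the detector through the soft probabilities.
        hard = (heatmap > tau).float()
        kpt_mask = hard + heatmap - heatmap.detach()

        # Reprojection loss evaluated only at detected keypoints, so the
        # detector is optimized for pose estimation, not dense synthesis.
        reproj = (photometric_error * kpt_mask).sum() / kpt_mask.sum().clamp(min=1.0)

        # Regularizer 1: keep the detection count near a target budget,
        # preventing the trivial solution of suppressing all keypoints.
        count = kpt_mask.sum(dim=(1, 2, 3))
        count_reg = ((count - target_num_kpts) / target_num_kpts).pow(2).mean()

        # Regularizer 2: stay consistent with the pre-trained SuperPoint.
        distill_reg = F.binary_cross_entropy(heatmap, teacher_heatmap)

        return reproj + w_count * count_reg + w_distill * distill_reg

    # Toy usage with random tensors standing in for real network outputs.
    B, H, W = 2, 192, 640
    err = torch.rand(B, 1, H, W)
    student = torch.rand(B, 1, H, W, requires_grad=True)
    teacher = torch.rand(B, 1, H, W)
    lf2slam_loss(err, student, teacher).backward()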

Integration within ORB-SLAM3

Integration of LF²SLAM within ORB-SLAM3

Once trained, the network is embedded into ORB-SLAM3, replacing the standard ORB feature extractor. LF²SLAM can then leverage its learned features for accurate pose estimation, improving performance in scenarios where hand-crafted features often fail, such as low-light or low-texture environments.
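
Conceptually, the learned network only has to honor the same contract as the ORB extractor it replaces: an image goes in, keypoints and descriptors come out, and the rest of the tracking pipeline is untouched. The Python sketch below illustrates that contract; the actual system embeds the network in ORB-SLAM3's C++ frontend, and every name here is a hypothetical stand-in rather than the authors' code.

    import numpy as np
    import torch
    import torch.nn.functional as F

    class LearnedFeatureExtractor:
        """Drop-in analogue of ORB's detectAndCompute interface."""

        def __init__(self, model, score_threshold=0.5, max_keypoints=1000):
            self.model = model.eval()
            self.score_threshold = score_threshold
            self.max_keypoints = max_keypoints

        @torch.no_grad()
        def detect_and_compute(self, gray: np.ndarray):
            # Image in, (keypoints, descriptors) out -- the same contract the
            # ORB extractor fulfills inside the tracking frontend.
            img = torch.from_numpy(gray).float()[None, None] / 255.0
            # Assumed SuperPoint-style heads: a keypoint heatmap and a dense
            # descriptor map (taken here at full resolution for simplicity).
            heatmap, descriptors = self.model(img)
            scores = heatmap[0, 0]
            ys, xs = torch.nonzero(scores > self.score_threshold, as_tuple=True)
            keep = scores[ys, xs].argsort(descending=True)[: self.max_keypoints]
            ys, xs = ys[keep], xs[keep]
            kpts = torch.stack([xs, ys], dim=1).float()             # (N, 2) as (u, v)
            desc = F.normalize(descriptors[0, :, ys, xs].T, dim=1)  # (N, D)
            return kpts.numpy(), desc.numpy()

Matching in the tracking thread would then use L2 or cosine distance on the float descriptors in place of the Hamming distance used for binary ORB descriptors.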

Qualitative Results

Trajectory comparison between LF²SLAM and ORB-SLAM3

LF²SLAM was evaluated on the KITTI, EuRoC, and UNILAB datasets, demonstrating significant improvements over ORB-SLAM3. The system maintained trajectory accuracy in challenging scenarios, including low-light conditions. On KITTI/09 (a), ORB-SLAM3 (green) partially drifts, while LF²SLAM (red) tracks the ground truth more closely. Both algorithms perform well on EuRoC/MH_05 (b). On UNILAB/C1 (c), ORB-SLAM3 loses tracking along part of the ground-truth trajectory, while LF²SLAM remains accurate. Similarly, on UNILAB/X2 (d), LF²SLAM achieves a far better trajectory estimate where ORB-SLAM3 almost completely loses tracking.

BibTeX

    @INPROCEEDINGS{10801935,
      author={Legittimo, Marco and Crocetti, Francesco and Fravolini, Mario Luca and Mollica, Giuseppe and Costante, Gabriele},
      booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
      title={LF2SLAM: Learning-based Features For visual SLAM}, 
      year={2024},
      pages={5648-5655},
      keywords={Training;Visualization;Simultaneous localization and mapping;Pipelines;Pose estimation;Feature extraction;Robustness;Tuning;Standards;Optimization},
      doi={10.1109/IROS58592.2024.10801935}}