Abstract
Autonomous robot navigation relies on a robot's ability to understand its environment for localization, typically using a Visual Simultaneous Localization and Mapping (VSLAM) algorithm that processes image sequences. While state-of-the-art methods have shown remarkable performance, they face limitations: geometric VSLAM algorithms depend on carefully tuned hand-crafted feature extractors, whereas data-driven approaches suffer from limited generalization and require extensive training datasets. LF²SLAM is a hybrid solution that combines the strengths of both paradigms by integrating a deep neural network for feature extraction into a traditional VSLAM pipeline. The system uses a novel loss function, built on the Monodepth2 framework, to train a sparse keypoint detector optimized for pose estimation. Experimental evaluations demonstrate the system's robustness in challenging scenarios, such as low-light and low-texture environments, where it significantly outperforms ORB-SLAM3 in accuracy and generalization.
Overview

LF²SLAM addresses the challenges of traditional Visual SLAM algorithms by integrating data-driven feature extraction with geometric pipelines. The system leverages the Superpoint network for sparse keypoint detection and the Monodepth2 training framework, enabling robust performance in environments with poor lighting and low texture. This hybrid approach minimizes reliance on extensive parameter tuning, offering a balanced solution that combines the adaptability of deep learning with the reliability of geometric methods, making it well-suited for real-world robotics applications.
Proposed Method
SLAM-specific feature extractor

LF²SLAM employs a novel training procedure that integrates the Superpoint feature extractor into the Monodepth2 framework, enabling the detection of sparse keypoints optimized for visual odometry. This is achieved by modifying the standard photometric loss function to focus on the most salient keypoints, using a binary heatmap generated by the Superpoint network. The heatmap identifies pixels with a high probability of being keypoints, and only these contribute to a reprojection loss that minimizes pose estimation errors. The training process incorporates regularization terms to ensure robustness and prevent trivial solutions, such as suppressing keypoint detection entirely: one term enforces the selection of a target number of keypoints, while another keeps the detector consistent with a pre-trained Superpoint model.
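Below is a minimal PyTorch-style sketch of how such a keypoint-masked reprojection loss and the two regularizers might be assembled. All function names, thresholds, the soft keypoint count, and the loss weights are illustrative assumptions rather than the authors' implementation, and the photometric term here is a plain L1 error instead of Monodepth2's full SSIM + L1 combination.

```python
# Illustrative sketch only; names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def masked_photometric_loss(target, reconstructed, heatmap, thresh=0.5):
    """Photometric error evaluated only at pixels the detector marks as
    likely keypoints (L1 only here, for brevity)."""
    mask = (heatmap > thresh).float()                        # binary keypoint mask
    per_pixel = (target - reconstructed).abs().mean(1, keepdim=True)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

def keypoint_count_loss(heatmap, target_num_kpts=500, thresh=0.5):
    """Regularizer discouraging the trivial solution of detecting nothing
    (or everything): penalize deviation of a soft detection count from a
    target budget."""
    soft_count = torch.sigmoid((heatmap - thresh) * 50.0).sum(dim=(1, 2, 3))
    return ((soft_count - target_num_kpts) ** 2).mean() / target_num_kpts ** 2

def distillation_loss(heatmap, teacher_heatmap):
    """Regularizer keeping the learned detector consistent with a frozen,
    pre-trained Superpoint teacher."""
    return F.binary_cross_entropy(heatmap.clamp(1e-6, 1 - 1e-6),
                                  (teacher_heatmap > 0.5).float())

def total_loss(target, reconstructed, heatmap, teacher_heatmap,
               w_count=0.1, w_distill=0.1):
    # Weighted sum of the three terms; weights are illustrative.
    return (masked_photometric_loss(target, reconstructed, heatmap)
            + w_count * keypoint_count_loss(heatmap)
            + w_distill * distillation_loss(heatmap, teacher_heatmap))
```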
Integration within ORB-SLAM3

Once optimized, the trained network is embedded into ORB-SLAM3, replacing the traditional ORB feature extractor. This allows LF²SLAM to leverage its learned features for accurate pose estimation, improving performance in scenarios where hand-crafted features often fail, such as low-light or low-texture environments.
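Conceptually, the swap amounts to replacing the detector/descriptor front end while leaving the rest of the geometric pipeline untouched. The Python sketch below illustrates the idea with an OpenCV-compatible wrapper; every class and method name is hypothetical, the assumed network outputs (a keypoint heatmap plus a dense descriptor map) follow a Superpoint-style head, and the real LF²SLAM integration is done inside ORB-SLAM3's C++ feature-extraction module rather than through a wrapper like this.

```python
# Hypothetical wrapper; not the actual ORB-SLAM3 integration.
import cv2
import numpy as np
import torch

class LearnedFeatureExtractor:
    """Front end that mimics the interface of a hand-crafted
    detector/descriptor so downstream tracking, mapping, and loop closing
    stay unchanged."""

    def __init__(self, model, score_thresh=0.5, max_kpts=1000, device="cpu"):
        self.model = model.eval().to(device)
        self.score_thresh = score_thresh
        self.max_kpts = max_kpts
        self.device = device

    @torch.no_grad()
    def detect_and_compute(self, gray_image):
        # Assumed model outputs: heatmap (B,1,H,W) and descriptor map (B,D,H,W).
        img = torch.from_numpy(gray_image).float()[None, None] / 255.0
        heatmap, desc_map = self.model(img.to(self.device))

        scores = heatmap[0, 0].cpu().numpy()
        ys, xs = np.where(scores > self.score_thresh)
        keep = np.argsort(scores[ys, xs])[::-1][: self.max_kpts]
        ys, xs = ys[keep], xs[keep]

        # Package detections as OpenCV keypoints plus an N x D descriptor
        # matrix, the same format a hand-crafted extractor would provide.
        kpts = [cv2.KeyPoint(float(x), float(y), 8.0, -1.0, float(scores[y, x]))
                for x, y in zip(xs, ys)]
        desc = desc_map[0].cpu().numpy()[:, ys, xs].T
        return kpts, desc
```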
Qualitative Results

LF²SLAM was evaluated on the KITTI, EuRoC, and UNILAB datasets, demonstrating significant improvements over ORB-SLAM3. The system maintained trajectory accuracy in challenging scenarios, including low-light conditions. On KITTI/09 (a), ORB-SLAM3 (green) partially drifts, while LF²SLAM (red) tracks the ground truth more closely. Both algorithms perform well on EuRoC/MH_05 (b). On UNILAB/C1 (c), ORB-SLAM3 fails to track part of the ground truth, while LF²SLAM shows superior accuracy. Similarly, on UNILAB/X2 (d), LF²SLAM achieves a much better trajectory estimate, whereas ORB-SLAM3 almost completely loses tracking.