
Visual-Sonar SLAM: Monocular ORB-SLAM2 + Sonar in HoloOcean simulation

Correcting the scale ambiguity of monocular SLAM with absolute sonar range measurements, using data captured from a HoloOcean simulation (Unreal Engine).


Goals

In this post, I want to walk you through some curiosities:

  • understand the scale ambiguity problem in monocular SLAM
  • understand multi-beam sonar basics
  • understand how absolute sonar range measurements can correct the scale of monocular SLAM, and how to obtain the camera-sonar correspondences this requires
  • understand how to obtain camera and sonar data from a HoloOcean simulation
  • implement an ORB-SLAM2 + ROS2 package: scale estimation/correction of monocular ORB-SLAM2 from HoloOcean data

Monocular SLAM and Scale Ambiguity problem

Monocular SLAM

For a robot to navigate in an unknown environment, it basically needs to answer two questions: “What is around me?” and “Where am I?” (relative to its surroundings and its past positions). SLAM (simultaneous localization and mapping) helps the robot answer those questions using its sensor data.

SLAM concept

If a single camera is the only sensor that SLAM uses, then it is called Monocular SLAM. So how do the Localization and Mapping tasks work using only a sequence of camera images?

Monocular Localization and Mapping

For every two adjacent frames, Localization performs three main steps: 1) feature extraction (on each frame), 2) feature matching (between the two frames), and 3) estimation of the essential matrix (representing the relative camera motion). The Mapping task then takes those matched features and triangulates the 2D points into 3D points. As more frames are processed, the accumulated 3D points represent the mapped objects and environment.
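To make these three steps concrete, here is a minimal two-view sketch using OpenCV (a simplification, not the actual ORB-SLAM2 pipeline); the intrinsic matrix K and the two frames img1, img2 are placeholders you would supply:

import cv2
import numpy as np

def two_view_vo(img1, img2, K):
    # 1) feature extraction (each frame)
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # 2) feature matching (between the 2 frames)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # 3) essential matrix -> relative camera motion (t is only known up to scale)
    E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask)

    # Mapping: triangulate the matched 2D points into 3D points (in the arbitrary scale of t)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    pts3d = (pts4d[:3] / pts4d[3]).T
    return R, t, pts3d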

Scale Ambiguity problem

For every matched pair of features P1 & P2, the third step estimates the essential matrix (E) by solving the epipolar constraint equation.

Epipolar Constraint Equation

However, P1 and P2 are normalized coordinates in the image plane, so they no longer carry the true depth information z. Therefore, the solution to this equation has two types of ambiguity: ambiguity along the epipolar line, and ambiguity along the baseline.

Ambiguity along the epipolar line: any point P2', P2'', ... along the epipolar line satisfies the equation just like P2 does

Ambiguity along the baseline: camera frames placed along the baseline also satisfy the equation

Matching features P1 and P2 may avoid the ambiguity along the epipolar line, but it still suffers from the ambiguity along the baseline. In this case, the translation part of the solution for E can only be recovered up to a scale (i.e. \(t_{true} = \lambda t_{est}\): we can solve for \(t_{est}\) but not \(\lambda\)). This is also known as scale ambiguity.
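Written out with \(E = [t]_\times R\), the ambiguity is easy to see: scaling the translation by any \(\lambda > 0\) still satisfies the constraint,

\[
P_2^\top [t]_\times R \, P_1 = 0
\;\;\Longrightarrow\;\;
P_2^\top [\lambda t]_\times R \, P_1 = \lambda \left( P_2^\top [t]_\times R \, P_1 \right) = 0,
\]

so only the direction of \(t\) is observable from the images, not its magnitude.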

How this affects Localization and Mapping

Multi-beam Sonar

Multi-beam Sonar concept

Basically, for a target 3D point underwater, a multi-beam sonar outputs the range R (absolute distance from the sonar sensor to the target) and the azimuth \(\theta\) (angle between the center-beam axis and the beam hitting the target). However, the elevation \(\phi\) (vertical position of the 3D target relative to the sonar's 0-elevation plane) is lost.
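Here is a quick sketch of how a bin of the polar sonar image maps to \((R, \theta)\), assuming range bins spaced linearly between RangeMin and RangeMax and beams spread evenly across the azimuth FoV (the default values mirror the ImagingSonar configuration used later in this post); the elevation \(\phi\) simply cannot be recovered:

def bin_to_range_azimuth(range_bin, azimuth_bin,
                         min_r=0.01, max_r=3.5, bins_r=512,
                         azi_fov_deg=130.0, bins_a=512):
    # R: absolute range (meters); theta: azimuth (degrees), 0 at the center beam.
    # The elevation phi of the echo inside the vertical fan is lost.
    r = min_r + (max_r - min_r) * range_bin / (bins_r - 1)
    theta = (azimuth_bin - bins_a / 2) * azi_fov_deg / bins_a
    return r, theta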

How Sonar range measurements can correct scale of Monocular SLAM

Camera and sonar with overlapping FoVs, observing the same 3D point P

Monocular SLAM frameworks (e.g. ORB-SLAM2) can estimate the visual distance d from the camera to P, whereas the sonar captures the absolute distance r from the sonar to P. How do we use r to correct d?

-> We first need to match camera features with sonar features (camera-sonar correspondences) to make sure they are observing the same P, then use their relative geometry to estimate the scale \(d_{true} / d_{est}\).

Camera-Sonar Correspondences

Each sonar 2D feature (in the sonar image) corresponds to a pixel column (in the camera image). Each pixel column may contain 2D features (e.g. detected by ORB-SLAM2), and each of those 2D features corresponds to a 3D triangulated point.

-> Each sonar 2D feature (sonar image) is matched with the 2D pixel feature (camera image) whose 3D triangulated point is the closest to the sensor among the features in that pixel column.
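Here is a minimal sketch of that matching rule (my own simplification: the service node later compares distances along the SLAM frame's z-axis, while this sketch uses the full Euclidean distance; fx, cx, and the keypoint/map-point arrays are placeholders):

import numpy as np

def match_beam_to_keypoint(theta_deg, keypoints, map_points, cam_center,
                           fx=256.0, cx=256.0):
    # Project the beam azimuth to its pixel column (pinhole model).
    u = int(round(fx * np.tan(np.deg2rad(theta_deg)) + cx))

    best_idx, best_d = None, np.inf
    for i, (kp, mp) in enumerate(zip(keypoints, map_points)):
        if int(round(kp[0])) == u:                    # keypoint lies in this pixel column
            d_est = np.linalg.norm(np.asarray(mp) - np.asarray(cam_center))
            if d_est < best_d:                        # keep the closest triangulated point
                best_idx, best_d = i, d_est
    return best_idx, best_d                           # (None, inf) if nothing matched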

Scale Estimation / Correction

For stand-off distances between the AUV and the target object that are not too small, the scale \(d_{true} / d_{est}\) can be approximated by \(r / d_{est}\).
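A rough justification (my own sketch, not part of the original derivation): if \(b\) is the camera-sonar baseline, the true camera-to-point distance differs from \(r\) by at most \(b\) (triangle inequality), so

\[
s = \frac{d_{true}}{d_{est}} \approx \frac{r}{d_{est}},
\qquad
\left| \frac{d_{true} - r}{d_{est}} \right| \le \frac{b}{d_{est}},
\]

which works well when the stand-off distance is large compared to \(b\) and degrades as the AUV gets close to the target.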

Implementation: Monocular ORB-SLAM2 + ROS2

ROS2 topics, nodes, and service. Note: IMU data and ground-truth odometry are optional, used only for EKF fusion and evaluation

Capture data

Task: use the HoloOcean simulation to generate synced camera and sonar images

I assume you've already followed this doc and installed HoloOcean (I used version 1.0.0 with Unreal Engine 4.27).

In this blog, we'll use a sample scenario ExampleLevel that I previously built, in which the HoveringAUV moves parallel to the surface of a large cube (similar to an inspection mission).

HoveringAUV schematic

We first need to set up the configuration for the AUV and its sensors. You can set the sampling rate to 30 Hz so the output sensor data are synced.

{
    "name": "HoveringCamera",
    "world": "ExampleLevel",
    "main_agent": "auv0",
    "ticks_per_sec": 180,
    "frames_per_sec": true,
    "octree_min": 0.02,
    "octree_max": 30.0,
    "scene_id": 1,
    "agents":[
        {
            "agent_name": "auv0",
            "agent_type": "HoveringAUV",
            "sensors": [
            	{
                    "sensor_type": "DVLSensor",
                    "socket": "DVLSocket",
                    "Hz": 30,
                    "configuration": {
                        "Elevation": 22.5,
                        "VelSigma": 0.0,
                        "ReturnRange": false,
                        "MaxRange": 2,
                        "RangeSigma": 0.0,
                        "DebugLines": false
                    }
                },
            	{
                    "sensor_type": "IMUSensor",
                    "socket": "COM",
                    "Hz": 30,	
                    "configuration": {
                        "AccelSigma": 0.0,
                        "AngVelSigma": 0.0,
                        "AccelBiasSigma": 0.0,
                        "AngVelBiasSigma": 0.0,
                        "ReturnBias": false
                    }
                },
                {
                    "sensor_type": "RotationSensor",
                    "socket": "COM",	
                    "Hz": 30
                },
            	{
                    "sensor_type": "PoseSensor",
                    "sensor_name": "PoseSensorCOM",
                    "socket": "COM",
                    "Hz": 30,
                    "location": [0.07325, -0.02, -0.3179],
                    "rotation": [0.0, 0.0, 0.0]
                },
            	{
            	    "sensor_type": "PoseSensor",
                    "sensor_name": "PoseSensorRGB",
                    "socket": "CameraRightSocket",
                    "Hz": 30,
                    "location": [0.1, -0.095, -0.205],
                    "rotation": [0.0, 0.0, 0.0]
                },
                {
                    "sensor_type": "PoseSensor",
                    "sensor_name": "PoseSensorSonar",
                    "socket": "SonarSocket",
                    "Hz": 30,
                    "location": [0.0, 0.0, 0.205],
                    "rotation": [0.0, 0.0, 0.0]
                },
                {
                    "sensor_type": "RGBCamera",
                    "sensor_name": "RightCamera",
                    "socket": "CameraRightSocket",
                    "Hz": 30,
                    "configuration": {
                        "CaptureWidth": 512,
                        "CaptureHeight": 512
                    },
                    "location": [0.1, -0.095, -0.205],
                    "rotation": [0.0, 0.0, 0.0]
                },
                {
                    "sensor_type": "ImagingSonar",
                    "socket": "SonarSocket",	
                    "Hz": 30,
                    "configuration": {
                        "RangeBins": 512,
                        "AzimuthBins": 512,
                        "RangeMin": 0.01,
                        "RangeMax": 3.5,
                        "InitOctreeRange": 10,
                        "Elevation": 10,
                        "Azimuth": 130,
                        "ShowWarning": false,
                        "ViewRegion": false,
                        "ViewOctree": -10
                    },
                    "location": [0.0, 0.0, 0.205],
                    "rotation": [0.0, 0.0, 0.0]
                }
            ],
            "control_scheme": 0,
            "location": [-5.597, 7.103, -6.15],
            "rotation": [0.0, 0.0, 0.0]
        }
    ],

    "window_width":  1024,
    "window_height": 640
}

You can save this config as test-scene.json.

Next, we write a script that spawns a HoloOcean simulation and loads the scene and AUV with this config file:

import holoocean

with holoocean.make('test-scene', show_viewport=True) as env:
    ### do further stuff with the AUV ...

Then, inside the with block, we enter a loop that ticks the environment state and extracts the sensor data:

while True:
    state = env.tick()
    if 'RightCamera' in state and 'ImagingSonar' in state and 'PoseSensorRGB' in state and 'PoseSensorSonar' in state:
        ### extract rgb data, sonar data (in polar coordinates), and their ground-truth poses (useful for evaluation)

Finally, we save the data to disk:

cv2.imwrite(f"./out/rgb_{SCENE_ID}/{ts:.6f}.png", cam_img)
np.save(f"./out/sonar_{SCENE_ID}_polar/{ts:.6f}", polar_img)

Note: You should use a manually updated timestamp ts if you plan to use other sensors such as the IMU for EKF fusion, since time.time() depends on the host computer's load and does not reflect the true simulation time.
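Putting these snippets together, a minimal capture script might look like the sketch below (SCENE_ID, the output directories, and the manual timestamp update are my own placeholders; the sensor keys come from the config above, and test-scene.json must be placed where HoloOcean looks for local scenarios):

import os
import cv2
import numpy as np
import holoocean

SCENE_ID = 1
DT = 1.0 / 30.0                        # camera/sonar are sampled at 30 Hz
os.makedirs(f"./out/rgb_{SCENE_ID}", exist_ok=True)
os.makedirs(f"./out/sonar_{SCENE_ID}_polar", exist_ok=True)

ts = 0.0                               # manually updated simulation timestamp
with holoocean.make('test-scene', show_viewport=True) as env:
    while True:
        state = env.tick()
        if all(k in state for k in ('RightCamera', 'ImagingSonar',
                                    'PoseSensorRGB', 'PoseSensorSonar')):
            cam_img = state['RightCamera'][:, :, :3]   # keep 3 channels (cameras may return RGBA)
            polar_img = state['ImagingSonar']
            cv2.imwrite(f"./out/rgb_{SCENE_ID}/{ts:.6f}.png", cam_img)
            np.save(f"./out/sonar_{SCENE_ID}_polar/{ts:.6f}", polar_img)
            ts += DT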

The full script can be found in my repo.

Once you run the script, the simulation window should open.

HoloOcean simulation window - AUV observing cube scene

Dataset node (Python)

Task: read the data saved in the output directories, then publish them as ROS2 messages

Initialize the publishers, timers, and file paths/names:

import glob

import cv2
import numpy as np
from scipy.spatial.transform import Rotation as R

from rclpy.node import Node
from cv_bridge import CvBridge
from sensor_msgs.msg import Imu, Image
from nav_msgs.msg import Odometry

class DatasetNode(Node):
    def __init__(self):
        super().__init__("dataset_node_py")
        self.publisher_rgb_ = self.create_publisher(Image, 'image/data', 10)
        self.publisher_sonar_ = self.create_publisher(Image, 'sonar/data', 10)
        self.publisher_imu_ = self.create_publisher(Imu, '/imu/data', 10) ## (optional) for EKF fusion
        self.publisher_gt_ = self.create_publisher(Odometry, '/gt/odometry', 10) ## (optional) for comparison

        self.timer_rgbsonar_ = self.create_timer((1/30)*10, self.timer_rgbsonar_callback)
        self.timer_imu_ = self.create_timer((1/180)*10, self.timer_imu_callback) ## (optional) for EKF fusion
        self.timer_gt_ = self.create_timer((1/30)*10, self.timer_gt_callback) ## (optional) for comparison

        self.rgb_path = f"{self.get_parameter('base_path').value}/rgb_{self.get_parameter('num').value}/"
        self.sonar_path = f"{self.get_parameter('base_path').value}/sonar_{self.get_parameter('num').value}/"
        self.imu_path = f"{self.get_parameter('base_path').value}/imu_{self.get_parameter('num').value}/"
        self.gt_path = f"{self.get_parameter('base_path').value}/poses_{self.get_parameter('num').value}/poses_rgb.txt"
        
        self.rgb_fnames = sorted(glob.glob(f"{self.rgb_path}/*"),  key=lambda x: int(x.split('/')[-1][:-4].replace('.','')))
        self.sonar_fnames = sorted(glob.glob(f"{self.sonar_path}/*"),  key=lambda x: int(x.split('/')[-1][:-4].replace('.','')))
        self.imu_fnames = sorted(glob.glob(f"{self.imu_path}/*"),  key=lambda x: int(x.split('/')[-1][:-4].replace('.','')))

For camera and sonar images, we use sensor_msgs.msg.Image. For ground truth odometry, we use nav_msgs.msg.Odometry.

Publishing camera and sonar images:

def timer_rgbsonar_callback(self):
        if self.idx_rgbsonar >= 0 and self.idx_rgbsonar < len(self.rgb_fnames):
            if self.idx_rgbsonar==0: self.forward = True
            
            ts = float(self.imu_fnames[self.idx_rgbsonar].split('/')[-1][:-4])
            
            rgb_im = cv2.imread(self.rgb_fnames[self.idx_rgbsonar])
            sonar_im = np.load(self.sonar_fnames[self.idx_rgbsonar])
            timestamp = float(self.rgb_fnames[self.idx_rgbsonar].split("/")[-1][:-4])
                       
            msg_sonar = self.bridge.cv2_to_imgmsg((sonar_im*255).astype(np.uint8), encoding="mono8")
            msg_sonar.header.stamp.sec = int(ts)
            msg_sonar.header.stamp.nanosec = int((ts - int(ts)) * (10**9))
            msg_sonar.header.frame_id = str(self.idx_rgbsonar)
            
            msg_rgb = self.bridge.cv2_to_imgmsg(rgb_im.astype(np.uint8), encoding="bgr8")
            msg_rgb.header.stamp.sec = int(ts)
            msg_rgb.header.stamp.nanosec = int((ts - int(ts)) * (10**9))
            msg_rgb.header.frame_id = str(self.idx_rgbsonar)
            
            self.publisher_sonar_.publish(msg_sonar)
            self.publisher_rgb_.publish(msg_rgb)
            
            if self.forward: self.idx_rgbsonar+=1 
            else: self.idx_rgbsonar-=1   
        
        
        if self.idx_rgbsonar == len(self.rgb_fnames):
            self.done = f"{(int(self.done,base=2) | int('1000',base=2)):03b}"

Publishing ground truth odometry:

def timer_gt_callback(self):
        if self.idx_gt >= 0 and self.idx_gt < len(self.gt_poses):
            ts, tx, ty, tz, qx, qy, qz, qw = [float(_) for _ in self.gt_poses[self.idx_gt].split()]
            
            ### align gt with odometry
            if self.gt_pose_first is None:
                self.gt_pose_first = [tx, ty, tz, qx, qy, qz, qw]
            
            tx = -(tx - self.gt_pose_first[0])
            ty = -(ty - self.gt_pose_first[1])
            tz = tz - self.gt_pose_first[2]
            ######
            
            ### rotate orientation based on rotated origin 180
            rot_mat = R.from_quat([qx, qy, qz, qw]).as_matrix()
            rot_fix = np.array([
                [-1, 0, 0],
                [0, -1, 0],
                [0, 0, 1]
            ])
            rot_quat = R.from_matrix(rot_fix @ rot_mat).as_quat()
            
            ######
            
            odom_msg = Odometry()

            odom_msg.header.stamp.sec = int(ts)
            odom_msg.header.stamp.nanosec = int((ts - int(ts)) * (10**9))
            odom_msg.header.frame_id = "odom"
            odom_msg.child_frame_id = "base_link"
            
            odom_msg.pose.pose.position.x = tx
            odom_msg.pose.pose.position.y = ty
            odom_msg.pose.pose.position.z = tz

            odom_msg.pose.pose.orientation.x = rot_quat[0]
            odom_msg.pose.pose.orientation.y = rot_quat[1]
            odom_msg.pose.pose.orientation.z = rot_quat[2]
            odom_msg.pose.pose.orientation.w = rot_quat[3]
            
            self.publisher_gt_.publish(odom_msg)
            self.publisher_gt2_.publish(odom_msg)
            self.idx_gt += 1
            
        if self.idx_gt == len(self.gt_poses):
            self.done = f"{(int(self.done,base=2) | int('0001',base=2)):03b}"

Monocular ORB-SLAM2 node (C++)

Task: subscribe to the data topics, sync the data with the keypoints and 3D point cloud, and send the packed data as a request for scale correction

Initialize the subscriber and service client:

MonocularSlamNode::MonocularSlamNode(ORB_SLAM2::System* pSLAM)
:   Node("orbslam"), 
    m_SLAM(pSLAM)
{
    client_cb_group_ = this->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
    timer_cb_group_ = this->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);

    rclcpp::SubscriptionOptions options;
    options.callback_group = timer_cb_group_;
    m_image_subscriber = this->create_subscription<RgbSonar>(
        "dataset_rgb_sonar",
        10,
        std::bind(&MonocularSlamNode::GrabImage, this, std::placeholders::_1),
        options);
    
    client_ = this->create_client<custom_interfaces::srv::FrameSlam>("scale_estimation", rmw_qos_profile_services_default, client_cb_group_);

    /* initialize other stuffs .. */
}

The GrabImage function handles the callback for m_image_subscriber:

void MonocularSlamNode::GrabImage(const RgbSonar::SharedPtr msg)
{
    // Copy the ros image message to cv::Mat.
    try
    {
        m_cvImPtr = cv_bridge::toCvCopy(msg->rgb_im);
    }
    catch (cv_bridge::Exception& e)
    {
        RCLCPP_ERROR(this->get_logger(), "cv_bridge exception: %s", e.what());
        return;
    }

    /* scale estimation & correction */
    Tcw = m_SLAM->TrackMonocular(m_cvImPtr->image, msg->timestamp);
    
    if (m_SLAM->mpTracker->newKeyFrameFlag){ // you need to modify ORB-SLAM2 source code to add newKeyFrameFlag checking if this current frame is a keyframe, then rebuild ORB-SLAM2
        /* scale estimation (request) */

        float scale = getScale(msg, Tcw);

        if (scale != -1 && scale != 0 && scale > 0.01) {
            RCLCPP_INFO(this->get_logger(), "[scale_estimation] scale = %f", scale);  

            if ( scaled==false || (scaled==true && (abs(scale-1.0) <= 0.05 ) ) ){ // for stable rescaling ..

                cv::Mat cur_pose = m_SLAM->mpTracker->mCurrentFrame.mpReferenceKF->GetPose();
                cur_pose.rowRange(0,3).col(3) = cur_pose.rowRange(0,3).col(3) * scale;
                m_SLAM->mpTracker->mCurrentFrame.mpReferenceKF->SetPose(cur_pose);           



                /* rescale current frame's map points */
                for (auto &mp : m_SLAM->mpTracker->mCurrentFrame.mpReferenceKF->mvpMapPoints){
                    if (mp != NULL){
                        mp->SetWorldPos(mp->GetWorldPos() * scale);
                    }
                }
                std::cout << "[!] scaled" << std::endl;

                if (abs(1.0 - scale) <= 0.1 && rescaleCount >= 5)
                    scaled = true;

            }



            rescaleCount += 1;
            

        }
        else RCLCPP_ERROR(this->get_logger(), "[scale_estimation] failed.");

    }
    else {
        m_SLAM->mpMapDrawer->curPointIndex.clear();
    }
    /****************/


    /* publish odom message */
    if (!Tcw.empty()){
        

        cv::Mat Rwc = Tcw.rowRange(0,3).colRange(0,3).t();
        cv::Mat twc = -Rwc * Tcw.rowRange(0,3).col(3); // camera center in world coordinates

        // convert to Right-hand coordinate system

        cv::Mat R_fix = (cv::Mat_<float>(3,3) << 0, 0, 1,
                                                -1, 0, 0,
                                                0, -1, 0);
        cv::Mat t_fix = (cv::Mat_<float>(3,1) << 0, 0, 0);

        std::vector<float> q = ORB_SLAM2::Converter::toQuaternion(Rwc);
          
        auto odom_msg = nav_msgs::msg::Odometry();

        odom_msg.header.stamp.sec = floor(msg->timestamp);
        odom_msg.header.stamp.nanosec = (msg->timestamp - floor(msg->timestamp)) * pow(10,9);
        odom_msg.header.frame_id = "odom";
        odom_msg.child_frame_id = "base_link";

        // remap axes from the ORB-SLAM2 camera frame into the ROS odom frame
        odom_msg.pose.pose.position.x = twc.at<float>(2);
        odom_msg.pose.pose.position.y = -twc.at<float>(0);
        odom_msg.pose.pose.position.z = twc.at<float>(1);

        odom_msg.pose.pose.orientation.x = q[2];
        odom_msg.pose.pose.orientation.y = -q[0];
        odom_msg.pose.pose.orientation.z = -q[1];
        odom_msg.pose.pose.orientation.w = q[3];

        odom_msg.pose.covariance = {
            3000.1, 0.0, 0.0, 0.0, 0.0, 0.0,
            0.0, 3000.1, 0.0, 0.0, 0.0, 0.0,
            0.0, 0.0, 1000000000000.001, 0.0, 0.0, 0.0,
            0.0, 0.0, 0.0, 1000000000000.001, 0.0, 0.0,
            0.0, 0.0, 0.0, 0.0, 1000000000000.001, 0.0,
            0.0, 0.0, 0.0, 0.0, 0.0, 1000000000000.001
        };

        m_odom_publisher_->publish(odom_msg);

    }

}

The code above calls our getScale function for every keyframe:

float MonocularSlamNode::getScale(const RgbSonar::SharedPtr msg, cv::Mat Tcw){
    /* create request */
    auto request = std::make_shared<custom_interfaces::srv::FrameSlam::Request>();
    request->timestamp = msg->timestamp;

    /* parse msg */

    request->rgb_im = msg->rgb_im;
    request->sonar_im = msg->sonar_im;
    request->kps.clear();
    request->mps.clear();

    /* obtain current camera pose (to compute visual distance d) */
    if (!Tcw.empty()){
        
        cv::Mat Rwc = Tcw.rowRange(0,3).colRange(0,3).t();
        cv::Mat twc = -Rwc * Tcw.rowRange(0,3).col(3); // camera center in world coordinates
        std::vector<float> q = ORB_SLAM2::Converter::toQuaternion(Rwc);

        tf2::Transform tf2_transform;
        tf2_transform.setOrigin(tf2::Vector3(twc.at<float>(0), twc.at<float>(1), twc.at<float>(2)));
        tf2_transform.setRotation(tf2::Quaternion(q[0], q[1], q[2], q[3]));

        
        geometry_msgs::msg::Pose pose_msg;
        tf2::toMsg(tf2_transform, pose_msg);
        request->rgb_pose = pose_msg;
    }

    /* obtain 2D keypoints and 3D map points (to compute visual distance d) */
    for (int i=0; i<m_SLAM->mpTracker->mCurrentKeyPointsMapPoints.size(); i++){
        cv::KeyPoint kpCur = m_SLAM->mpTracker->mCurrentKeyPointsMapPoints[i].first;
        ORB_SLAM2::MapPoint *mpCur = m_SLAM->mpTracker->mCurrentKeyPointsMapPoints[i].second;
        auto mpCurPos = mpCur->GetWorldPos();
        
        KeyPoint kpVar;
        kpVar.x = kpCur.pt.x;
        kpVar.y = kpCur.pt.y;
        
        MapPoint mpVar;
        mpVar.x = mpCurPos.at<float>(0);
        mpVar.y = mpCurPos.at<float>(1);
        mpVar.z = mpCurPos.at<float>(2);

        request->kps.push_back(kpVar);
        request->mps.push_back(mpVar);
    }

    /* send request */

    float scale = -1;
    auto future = client_->async_send_request(request);


    std::future_status status = future.wait_for(10s);

    if (status == std::future_status::ready){
        auto response = future.get();
        if (response->timestamp == msg->timestamp){
            scale = response->scale;

            if (response->indexes.size() > 0 ){
                m_SLAM->mpMapDrawer->curPointIndex.clear();
                for (int i=0; i < response->indexes.size(); i++){
                    m_SLAM->mpMapDrawer->curPointIndex.push_back(response->indexes[i]);
                }
            }
            else {
                m_SLAM->mpMapDrawer->curPointIndex.clear();
            }
        }
        else RCLCPP_ERROR(this->get_logger(), "[scale estimation] timestamp not matched.");
        
    }
    else if (status == std::future_status::timeout){
        RCLCPP_ERROR(this->get_logger(), "[scale_estimation] timeout.");
    }
    else if (status == std::future_status::deferred){
        RCLCPP_ERROR(this->get_logger(), "[scale_estimation] deferred.");
    }

    return scale;
}
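For reference, the custom_interfaces/srv/FrameSlam service has to carry the frame data in the request and the estimated scale in the response. The actual definition lives in my repo; a plausible sketch consistent with the fields used above (KeyPoint and MapPoint being simple custom messages with x/y and x/y/z fields) is:

# FrameSlam.srv (sketch -- see the repo for the actual definition)
float64 timestamp
sensor_msgs/Image rgb_im
sensor_msgs/Image sonar_im
KeyPoint[] kps
MapPoint[] mps
geometry_msgs/Pose rgb_pose
---
float64 timestamp
float32 scale
int32[] indexes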

Scale Estimation / Correction service node (Python)

Task: receive the data request, estimate and correct the scale, then respond with the corrected scale

Initialize some configuration values:

### sonar config
azi = 120 
phi = 10 
minR = 0.01 
maxR = 2 
binsR = 512 
binsA = 512

### camera config
W = 512
H = 512
fx = W/2 # fov = 90
fy = H/2
cx = W/2 
cy = H/2

### camera-sonar config
baseline_y = 0.205 

sonar_idx_arr = np.arange(-binsA/2, binsA/2+1, 1)
sonar_idx_arr = sonar_idx_arr[sonar_idx_arr != 0]
theta_i_arr = sonar_idx_arr * (azi/binsA)
u_i_arr = np.round(fx * np.tan((theta_i_arr/180) * np.pi) + cx).astype(int) # TODO: add +- theta_h

polar_mask = np.where((u_i_arr >= 0) & (u_i_arr <= W))[0]

## multi-beam in center (most accurate)
polar_mask = polar_mask[int(len(polar_mask)/2)-70 : int(len(polar_mask)/2)+70]


u_i_arr_masked = u_i_arr[polar_mask]


ranges = np.linspace(minR, maxR, binsR)
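The pixel columns u_i_arr come from the pinhole projection of each beam's azimuth, and v_range_arr (computed in the service callback below) bounds the image rows covered by the \(\pm\phi\) elevation fan at range \(r\), with \(b_y\) = baseline_y the vertical camera-sonar offset:

\[
u_i = f_x \tan\theta_i + c_x,
\qquad
v_i^{\pm} = \frac{f_y}{\left|\cos\theta_i\right|} \cdot \frac{-r\sin(\pm\phi) - b_y}{r\cos(\pm\phi)} + c_y .
\]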

Define some utility functions:

def extract_ranges(polar_img, idx_arr, maxR, binsR):
    r_arr = np.ones(polar_img.shape[1]) * -1
    intensity_min_idx_arr = np.ones(polar_img.shape[1]) * -1
    for idx in idx_arr:
        # index of the first range bin with a non-zero return (closest echo) in this beam
        intensity_min_idx = np.argmax(polar_img[:, idx] > 0)
        if intensity_min_idx > 0:
            r_arr[idx] = ranges[intensity_min_idx]
            
            intensity_min_idx_arr[idx] = intensity_min_idx
            
    return r_arr, intensity_min_idx_arr.astype(int)


def compute_track_beam_idx(track_beam_idx_ref, azi, binsA, r_ref, relative_dist):    
    theta_ref = (track_beam_idx_ref - binsA/2)* azi / binsA
    z_ref = r_ref * np.cos(theta_ref * np.pi/180)
    x_ref = r_ref * np.sin(theta_ref * np.pi/180)
    
    z_cur = z_ref + np.abs(relative_dist)
    theta_cur = np.arctan(x_ref / z_cur) * 180 / np.pi
    track_beam_idx_cur = int(theta_cur * binsA / azi + binsA/2) 
    
    return track_beam_idx_cur

Our main ScaleEstimationNode:

class ScaleEstimationNode(Node):
    def __init__(self):
        super().__init__("scale_est_node")
        
        self.bridge_ = CvBridge()
        self.rgb_trans_prev = None
        self.declare_parameter('scene_id', rclpy.Parameter.Type.INTEGER)
        self.scene_id = self.get_parameter('scene_id').value 
        self.server_ = self.create_service(srv.FrameSlam, "scale_estimation", self.callback_scale_estimation)


    def callback_scale_estimation(self, request, response):
        timestamp = request.timestamp
        print(f"\n[scale_estimation] timestamp = {timestamp}")
        
        rgb_im = self.bridge_.imgmsg_to_cv2(request.rgb_im)
        sonar_im = self.bridge_.imgmsg_to_cv2(request.sonar_im)
        kps = self.kpsmsg_to_kps(request.kps)
        mps = self.mpsmsg_to_mps(request.mps)
        
        rgb_trans = np.array([request.rgb_pose.position.x, request.rgb_pose.position.y, request.rgb_pose.position.z]) ### ORB_SLAM coord system
        
        print(f"rgb_trans = {rgb_trans}")
        rgb_im_kps = rgb_im.copy()
        rgb_im_lines_min = rgb_im.copy()
        
        sonar_im_flip = cv2.flip(sonar_im, 1)
        r_arr, intensity_min_idx_arr = extract_ranges(sonar_im_flip, polar_mask, maxR, binsR)
        
        ### rgb_im_lines    
        v_range_arr = np.repeat(np.array([[-1,-1]]), len(u_i_arr_masked), axis=0 )        
        for i, idx in enumerate(polar_mask):
            r = r_arr[idx]
            if r != -1 and r != 0:
                v_range_arr[i][0] = np.abs(fy/np.cos(theta_i_arr[idx]*np.pi/180)) * ((-r*np.sin(phi*np.pi/180) -  baseline_y)/ (r*np.cos(phi*np.pi/180)) ) + cy
                v_range_arr[i][1] = np.abs(fy/np.cos(theta_i_arr[idx]*np.pi/180)) * ((-r*np.sin(-phi*np.pi/180) -  baseline_y)/ (r*np.cos(-phi*np.pi/180)) ) + cy

        ### rgb_im_lines_min
        for i, u_i in enumerate(u_i_arr_masked):
            v_range = v_range_arr[i]
            cv2.line(rgb_im_lines_min, (u_i, v_range[0]), (u_i, v_range[1]), color=(0,255,0), thickness=1)
            # print(f"heyy {v_range}")

        ### sonar_im_flip
        sonar_im_flip = cv2.cvtColor(sonar_im_flip, cv2.COLOR_GRAY2BGR)
        for r_idx in polar_mask:
            if r_arr[r_idx] != -1:
                cv2.circle(sonar_im_flip, (r_idx, intensity_min_idx_arr[r_idx]), radius=2, color=(0,255,0), thickness=1)

        ### select closest 3D points (to sonar) in each beam
        mp_min_arr = []
        kp_min_arr = []
        kp_min_idx_arr = []
        for i, u_i in enumerate(u_i_arr_masked):
            d_min = 999999
            mp_min = None
            kp_min = None
            kp_min_idx = None
            
            for j, kp in enumerate(kps):
                if kp[0] == u_i:
                    v_range = v_range_arr[i]
                    if kp[1] >= v_range[0] and kp[1] <= v_range[1]:
                        if d_min >= np.abs(rgb_trans[2] - mps[j][2]):
                            d_min = np.abs(rgb_trans[2] - mps[j][2])
                            mp_min = mps[j]
                            kp_min = kp
                            kp_min_idx = j
                            
            mp_min_arr.append(mp_min)
            kp_min_arr.append(kp_min)
            kp_min_idx_arr.append(kp_min_idx)
        
        scales = []
        d_est_arr = []
        err_arr = []
        r_arr_ = []
        for i, mp_min in enumerate(mp_min_arr):  # same length as u_i_arr_masked
            r_i = r_arr[polar_mask][i]
            
            if mp_min is not None and r_i != -1:
                
                err_i = np.abs(r_i - np.sqrt(r_i**2 + baseline_y**2))
                err_arr.append(err_i)
                r_arr_.append(r_i)
                
                ### [elevation=10, PoI] d_true = dist(rgb_trans, mps[i])
                d_est_i = np.linalg.norm(rgb_trans - mp_min)
                
                ## baseline
                scale_i = r_i / d_est_i 
                
                d_est_arr.append(d_est_i)
                
                scales.append(scale_i)
                
                cv2.circle(rgb_im_kps, kp_min_arr[i], color=(0,255,0), radius=3, thickness=2)
                
        
        scale = np.mean(scales) if len(scales)>0 else -1.0   
        print(f"scale = {scale} - len(kp_min_arr != None) = {len([_ for _ in kp_min_arr if _ is not None])}")
        print(f"kp_min_idx_arr = {[_ for _ in kp_min_idx_arr if _ is not None]}\n")
        
        if self.rgb_trans_prev is None:
            self.rgb_trans_prev = rgb_trans
        
        
        
        out_im = np.hstack((rgb_im_kps.astype(float)/255., rgb_im_lines_min.astype(float)/255., sonar_im_flip))
        cv2.imshow("", out_im)
        cv2.waitKey(1)
        
        self.rgb_trans_prev = rgb_trans
        
        ### response
        response.timestamp = timestamp
        response.scale = scale
        response.indexes = [_ for _ in kp_min_idx_arr if _ is not None]
        return response
    
    
    def kpsmsg_to_kps(self, kpsmsg):
        kps = []
        for kp_msg in kpsmsg:
            kps.append(np.round((kp_msg.x, kp_msg.y)).astype(int))
            # kps.append((kp_msg.x, kp_msg.y))
        return kps
    
    def mpsmsg_to_mps(self, mpsmsg):
        mps = []
        for mp_msg in mpsmsg:
            mps.append([mp_msg.x, mp_msg.y, mp_msg.z])
        return mps

The full script can be found in my repo.

Results

Once you colcon build the packages, you can run the nodes with these commands in three separate terminals:

# terminal 1: monocular ORB-SLAM2 node
ros2 run ros2_orbslam mono /home/bachnguyen/Desktop/SLAM/slam/ORB_SLAM2/Vocabulary/ORBvoc.txt /home/bachnguyen/Desktop/SLAM/active-slam/pyslam/settings/HOLO.yaml
# terminal 2: scale_recover node
ros2 run scale_recover scale_est_node --ros-args -p scene_id:=[YOUR_SCENE_ID]
# terminal 3: dataset node
ros2 run dataset_reader dataset_node --ros-args -p base_path:=/home/bachnguyen/Desktop/SLAM/slam/holoocean/HoloOcean/tests/out/ -p num:=[YOUR_SCENE_ID]

You should start dataset_node last.
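If you prefer a single command, the three nodes can also be wrapped in a ROS2 launch file. Below is a sketch reusing the package/executable names and arguments from the commands above (the vocabulary/settings/base paths and the delay are my own placeholders); the TimerAction keeps dataset_node starting last:

# vs_slam.launch.py (sketch): wraps the three commands above
from launch import LaunchDescription
from launch.actions import TimerAction
from launch_ros.actions import Node

def generate_launch_description():
    scene_id = 1  # placeholder: your scene id
    return LaunchDescription([
        Node(package='ros2_orbslam', executable='mono',
             arguments=['/path/to/ORBvoc.txt', '/path/to/HOLO.yaml']),
        Node(package='scale_recover', executable='scale_est_node',
             parameters=[{'scene_id': scene_id}]),
        # start the dataset node a few seconds later so the other nodes are ready
        TimerAction(period=5.0, actions=[
            Node(package='dataset_reader', executable='dataset_node',
                 parameters=[{'base_path': '/path/to/out', 'num': scene_id}]),
        ]),
    ])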

Here are some final results:

Localization & Mapping (without scale correction)

Localization & Mapping (with scale correction)

You can see the mapping has been rescaled using the scale_recover node.

Additionally, here is a further evaluation of Localization across the different methods.

The re-scaling code above produces the final trajectory denoted VS_SLAM [8]. The approximation \(d_{true} / d_{est} \approx r / d_{est}\) is not accurate at close distances, such as in Close Visual Inspection (CVI) missions. In the next post, I will show an improved rescaling ratio (VS_SLAM (proposal) in the figure above) that works at both close and far stand-off distances between the AUV and the target object.

Thank you for reading ^^

This post is licensed under CC BY 4.0 by the author.