Collaborative Visual-Inertial State and Scene Estimation

Karrer, Marco

doi:10.3929/ethz-b-000465334

Download

Full text (PDF, 19.22Mb)

Open access

Author

Karrer, Marco

Date

2020-09

Type

Doctoral Thesis

ETH Bibliography

yes

Altmetrics

Download

Full text (PDF, 19.22Mb)

Rights / license

In Copyright - Non-Commercial Use Permitted

Abstract

The capability of a robot to create a map of its workspace on the fly, while constantly updating it and continuously estimating its motion in it, constitutes one of the central research problems in mobile robotics and is referred to as Simultaneous Localization And Mapping (SLAM) in the literature. Relying solely on the sensor-suite onboard the robot, SLAM is a core building block in enabling the navigational autonomy necessary to facilitate the general use of mobile robots and has been the subject of booming research interest spanning over three decades. With the largest body of related literature addressing the challenge of single-agent SLAM, it is only very recently, with the relative maturity of this field that approaches tackling collaborative SLAM with multiple agents have started appearing. The potential of collaborative multi-agent SLAM is great; not only promising to boost the efficiency of robotic missions by splitting the task at hand to more agents but also to improve the overall robustness and accuracy by boosting the amount of data that each agent’s estimation process has access to. While SLAM can be performed using a variety of different sensors, this thesis is focused on the fusion of visual and inertial cues, as one of the most common combinations of sensing modalities in robotics today. The information richness captured by cameras, along with the high-frequency and metric information provided by Inertial Measurement Units (IMUs) in combination with the low weight and power consumption offered by a visual-inertial sensor suite render this setup ideal for a wide variety of applications and robotic platforms, in particular to resource-constrained platforms such as Unmanned Aerial Vehicles (UAVs). The majority of the state-of-the-art visual-inertial estimators are designed as odometry algorithms, providing only estimates consistent within a limited time-horizon. This lack in global consistency of estimates, however, poses a major hurdle in an effective fusion of data from multiple agents and the practi- cal definition of a common reference frame, which is imperative before collaborative effort can be coordinated. In this spirit, this thesis investigates the potential of global optimization, based on a central access point (server) as a first approach, demonstrating global consistency using only monocular-inertial data. Fusing data from multiple agents, not only consistency can be maintained, but also the accuracy is shown to improve at times, revealing the great potential of collaborative SLAM. Aiming at improving the computational efficiency, in a second approach a more efficient system architecture is employed, allowing a more suitable distribution of the computational load amongst the agents and the server. Furthermore, the architecture implements a two-way communication enabling a tighter collaboration between the agents as they become capable of re-using information captured by other agents through communication with the server, enabling improvements of their onboard pose tracking online, during the mission. In addition to general collaborative SLAM without specific assumptions on the agents’ relative pose configuration, we investigate the potential of a configuration with two agents, carrying one camera each with overlapping fields of view, essentially forming a virtual stereo camera. With the ability of each robotic agent to move independently, the potential to control the stereo baseline according to the scene depth is very promising, for example at high altitudes where all scene points are far away and, therefore, only provide weak constraints on the metric scale in a standard single-agent system. To this end, an approach to estimate the time-varying stereo transformation formed between two agents is proposed, by fusing the egomotion estimates of the individual agents along with the image measurements extracted from the view-overlap in a tightly coupled fashion. Taking this virtual stereo camera idea a step further, a novel collaboration framework is presented, utilizing the view-overlap along with relative distance measurements across the two agents (e.g. obtained via Ultra-Wide Band (UWB) modules), in order to successfully perform state estimation at high altitudes where state-of-the-art single-agent methods fail. In the interest of low-latency pose estimation, each agent holds its own estimate of the map, while consistency between the agents is achieved using a novel consensus-based sliding window bundle adjustment. Despite that in this work, experiments are shown in a two-agent setup, the proposed distributed bundle adjustment scheme holds great potential for scaling up to larger problems with multiple agents, due to the asynchronicity of the proposed estimation process and the high level of parallelism it permits. The majority of the developed approaches in this thesis rely on sparse feature maps in order to allow for efficient and timely pose estimation, however, this translates to reduced awareness of the spatial structure of a robot’s workspace, which can be insufficient for tasks requiring careful scene interaction and manipulation of objects. Equipping a typical visual-inertial sensor suite with an RGB-D camera, an add-on framework is presented that enables the efficient fusion of naturally noisy depth information into an accurate, local, dense map of the scene, providing sufficient information for an agent to plan contact with a surface. With the focus on collaborative SLAM using visual-inertial data, the approaches and systems presented in this thesis contribute towards achieving collaborative Visual-Inertial SLAM (VI-SLAM) deployable in challenging real-world scenarios, where the participating agents’ experiences get fused and processed at a central access point. On the other side, it is shown that taking advantage of specific configurations can push the collaboration amongst the agents towards achieving greater general robustness and accuracy of scene and egomotion estimates in scenarios, where state-of-the-art single-agent systems are otherwise unsuccessful, paving the way towards intelligent robot collaboration. Show more

Permanent link

https://doi.org/10.3929/ethz-b-000465334

Publication status

published

External links

Search print copy at ETH Library

Contributors

Examiner: Chli, Margarita