Q
ingqing Li
F 40
A
N
N
A
LES U
N
IV
ERSITATIS TU
RK
U
EN
SIS
TURUN YLIOPISTON JULKAISUJA – ANNALES UNIVERSITATIS TURKUENSIS
SARJA – SER. F OSA – TOM. 40 | TECHNICA – INFORMATICA | TURKU 2024
LIDAR BASED 
MULTI-SENSOR FUSION 
FOR LOCALIZATION, 
MAPPING, AND TRACKING
Qingqing Li
LIDAR BASED MULTI-SENSOR
FUSION FOR LOCALIZATION,
MAPPING, AND TRACKING
Qingqing Li
TURUN YLIOPISTON JULKAISUJA – ANNALES UNIVERSITATIS TURKUENSIS 
SARJA – SER. F OSA – TOM. 40 | TECHNICA – INFORMATICA| TURKU 2024 
University of Turku
Faculty of Technology
Department of Computing
Information and Communication Technology
Doctoral Programme in Technology
Supervised by
Professor, Tomi Westerlund
University of Turku
Professor, Zhuo Zou
Fudan University
Reviewed by
Professor, George Nikolakopoulos
Lulea˚ University of Technology
Research Professor, Antero Kukko
Finnish Geospatial Research Institute
Opponent
Associate Professor, Evangelos Boukas
Technical University of Denmark
The originality of this publication has been checked in accordance with the University
of Turku quality assurance system using the Turnitin OriginalityCheck service.
Serial F 40
ISBN 978-951-29-9735-0 (PRINT)
ISBN 978-951-29-9736-7 (PDF)
ISSN 2736-9390 (PRINT)
ISSN 2736-9684 (ONLINE)
Painosalama Oy, Turku, Finland, 2024
To my wife Qianqian, my parents, and my brother
...and to my nephew Yiyi
UNIVERSITY OF TURKU
Faculty of Technology
Department of Computing
Information and Communication Technology
LI, QINGQING: LiDAR Based Multi-Sensor Fusion for Localization, Mapping,
and Tracking
Doctoral dissertation, 122 pp.
Doctoral Programme in Technology
May 2024
ABSTRACT
The development of fully autonomous driving vehicles has become a key focus for
both industry and academia over the past decade, fostering significant progress in sit-
uational awareness abilities and sensor technology. Among various types of sensors,
the LiDAR sensor has emerged as a pivotal component in many perception systems
due to its long-range detection capabilities, precise 3D range information, and reli-
able performance in diverse environments. With advancements in LiDAR technol-
ogy, more reliable and cost-effective sensors have shown great potential for improv-
ing situational awareness abilities in widely used consumer products. By leveraging
these novel LiDAR sensors, researchers now have a diverse set of powerful tools
to effectively tackle the persistent challenges in localization, mapping, and tracking
within existing perception systems.
This thesis explores LiDAR-based sensor fusion algorithms to address percep-
tion challenges in autonomous systems, with a primary focus on dense mapping and
global localization using diverse LiDAR sensors. The research involves the integra-
tion of novel LiDARs, IMU, and camera sensors to create a comprehensive dataset
essential for developing advanced sensor fusion and general-purpose localization and
mapping algorithms. Innovative methodologies for global localization across varied
environments are introduced. These methodologies include a robust multi-modal Li-
DAR inertial odometry and a dense mapping framework, which enhance mapping
precision and situational awareness. The study also integrates solid-state LiDARs
with camera-based deep-learning techniques for object tracking, refining mapping
accuracy in dynamic environments. These advancements significantly enhance the
reliability and efficiency of autonomous systems in real-world scenarios.
The thesis commences with an introduction to innovative sensors and a data col-
lection platform. It proceeds by presenting an open-source dataset designed for the
evaluation of advanced SLAM algorithms, utilizing a unique ground-truth generation
method. Subsequently, the study tackles two localization challenges in forest and ur-
ban environments. Furthermore, it highlights the MM-LOAM dense mapping frame-
work. Additionally, the research explores object-tracking tasks, employing both cam-
era and LiDAR technologies for human and micro UAV tracking.
KEYWORDS : SLAM, LiDAR, sensor fusion, robotics, localization, autonomous
driving, autonomous system, harvester, object tracking.
iv
TURUN YLIOPISTO
Faculty of Technology
Department of Computing
Information and Communication Technology
LI, QINGQING: LiDAR Based Multi-Sensor Fusion for Localization, Mapping,
and Tracking
Va¨ito¨skirja, 122 s.
Doctoral Programme in Technology
May 2024
TIIVISTELMA¨
Viimeisen vuosikymmenen aikana ta¨ysin itseohjautuvien ajoneuvojen kehitys on hera¨-
tta¨nyt laajaa kiinnostusta niin teollisuudessa kuin tiedemaailmassakin, mika¨ on mer-
kitta¨va¨sti edista¨nyt tilannetietoisuuden ja anturiteknologian kehitysta¨. Erityisesti
LiDAR-anturit ovat nousseet keskeiseen rooliin monissa havainnointija¨rjestelmissa¨ ni-
iden tarjoaman pitka¨n kantaman havaintokyvyn, tarkan 3D-eta¨isyystiedon ja luotetta-
van suorituskyvyn ansiosta. LiDAR-teknologian kehittyminen on mahdollistanut en-
tista¨ luotettavampien ja kustannustehokkaampien antureiden ka¨yto¨n, mika¨ puolestaan
on osoittanut suurta potentiaalia parantaa laajasti ka¨ytettyjen kuluttajatuotteiden tilan-
netietoisuutta. Uusien LiDAR-antureiden hyo¨dynta¨minen tarjoaa tutkijoille monipu-
olisen valikoiman tehokkaita tyo¨kaluja, joiden avulla voidaan ratkaista paikannuksen,
kartoituksen ja seurannan haasteita nykyisissa¨ havaintoja¨rjestelmissa¨.
Ta¨ssa¨ va¨ito¨skirjassa tutkitaan LiDAR-pohjaisia sensorifuusioalgoritmeja. Tutki-
muksen pa¨a¨paino on tihea¨ssa¨ kartoituksessa ja globaalissa paikan-nuksessa erilaisten
LiDAR-anturien avulla. Tutkimuksessa luodaan kattava tietokanta uusien LiDAR-,
IMU- ja kamera-antureiden tuottamasta datasta. Tietokanta on va¨ltta¨ma¨to¨n kehit-
tyneiden anturifuusioalgoritmien ja yleiska¨ytto¨isten paikannus- ja kartoitusalgorit-
mien kehitta¨miseksi. Ta¨ma¨n lisa¨ksi va¨ito¨skirjassa esitella¨a¨n innovatiivisia menetelmi-
a¨ globaaliin paikannukseen erilaisissa ympa¨risto¨issa¨. Esitellyt menetelma¨t kartoituk-
sen tarkkuuden ja tilannetietoisuuden parantamiseksi ovat muun muassa modulaari-
nen monen LiDAR-anturin odometria ja kartoitus, toimintavarma multimodaalinen
LiDAR-inertiamittau-sja¨rjestelma¨ ja tihea¨ kartoituskehys. Tutkimus integroi myo¨s
kiintea¨t LiDAR -anturit kamerapohjaisiin syva¨oppimismenetelmiin kohteiden seu-
rantaa varten parantaen kartoituksen tarkkuutta dynaamisissa ympa¨risto¨issa¨. Na¨iden
edistysaskeleiden avulla autonomisten ja¨rjestelmien luotettavuutta ja tehokkuutta voi-
daan merkitta¨va¨sti parantaa todellisissa ka¨ytto¨ympa¨risto¨issa¨.
Va¨ito¨skirja alkaa esittelema¨lla¨ innovatiiviset anturit ja tiedonkeruualustan. Ta¨ma¨n
ja¨lkeen esitella¨a¨n avoin tietokanta, jonka avulla voidaan arvioida kehittyneita¨ paikann-
us- ja kartoitusalgoritmeja hyo¨dynta¨en ainutlaatuista perustotuuden kehitta¨mismenet-
elma¨a¨. Tyo¨ssa¨ ka¨sitella¨a¨n myo¨s kahta haastavaa paikannusympa¨risto¨a¨: metsa¨- ja
kaupunkiympa¨risto¨a¨. Lisa¨ksi tarkastellaan kohteen seurantatehta¨via¨ seka¨ kamera-
etta¨ LiDAR-tekniikoilla ihmisten ja pienten droonien seurannassa.
Avainsanat: SLAM, LiDAR, anturifuusio, robotiikka, lokalisointi, autonominen
ajo, autonominen ja¨rjestelma¨, harvesteri, objektien seuranta.
v
Acknowledgements
I am profoundly grateful to all colleagues and friends who have supported me through-
out this incredible journey. I still vividly remember when I first arrived in Fin-
land—the serene, welcoming atmosphere, the kindness of the people, and the world-
class research environment—all of which inspired me to explore further and pursue
my Ph.D. dream here. Today, I feel incredibly fortunate to have made such a won-
derful decision and to have met so many amazing people who have had a profound
impact on my life.
First and foremost, I would like to express my sincere appreciation to my su-
pervisor and friend, Prof. Tomi Westerlund. His unwavering support, invaluable
mentorship, and critical thinking have been indispensable to me throughout this jour-
ney. His profound expertise, insightful guidance, and genuine encouragement have
fueled my passion for research and enabled me to explore a diverse range of engag-
ing topics. I am deeply grateful for his continuous support and guidance, both in
academia and in life.
Secondly, I also want to extend my heartfelt thanks to Prof. Zhuo Zou, whose
support has been invaluable during this journey. Prof. Zhuo’s deep understanding of
the subject matter, combined with his insightful feedback and constructive criticism,
have greatly enriched my research endeavors. His continuous encouragement and
guidance have not only helped me navigate through various challenges but have also
played a significant role in shaping my academic and professional growth.
Additionally, I want to extend my deepest gratitude to my wife, Qianqian, whose
trust and constant support have been a source of strength for me during this journey.
The decision to study abroad thousands of miles away from my home country was
not an easy one, but time has proven that our relationship is strong and resilient.
Your support, optimistic attitude, and beautiful smile have encouraged me every step
along the way. I am forever grateful to my parents and my brother for their steadfast
support and encouragement. Their constant support and sacrifices throughout my
life have motivated me to keep pushing forward. Despite being physically apart in
recent years, their encouragement has always inspired me to strive for more. I am
truly thankful to have them as my family.
Moreover, I want to express my gratitude to all my colleagues and friends at
TIERS lab. Special thanks to my dear friend, Jorge Pen˜a Queralta, whose constant
motivation, and curiosity for science, along with his innovative ideas, have made
vi
this journey more exciting. I will never forget the countless days spent preparing
experiments, lectures, and papers before deadlines, as well as the movie nights and
skiing trips we shared. Without your help and wisdom, this journey would have
been far less fulfilling. I also want to thank Xianjia Yu, whose invaluable assistance
has made this journey truly remarkable. My life here would be much less colorful
without your help. I am also grateful to Tuan and Paavo for their passion for research,
stimulating discussions, and constant encouragement. I extend my thanks to Ha Sier,
Paola, Salma, Yuki, Daniel, Marius, Iacopo, and all the others who have walked this
path with me, contributing to the formation of an exceptional team at TIERS.
Furthermore, I am immensely thankful to Henri Riihima¨ki, Peter Leander, Markus
Karppinen, and my colleagues at Prefor Oy, who have provided me with excellent op-
portunities to apply my research in real-world applications. Your support has given
me a unique perspective on my research journey and has greatly contributed to this
thesis.
In the end, I would like to express my gratitude to the reviewer and pre-examiner
whose critical feedback has greatly contributed to the quality of this thesis. Addition-
ally, I want to acknowledge the support provided by the University of Turku Graduate
School, the HPY Research Foundation, and the Finnish Foundation for Technology
Promotion (TES). Their support has been instrumental in facilitating my research
endeavors and ensuring the successful completion of this thesis.
Once again, I want to express my deepest gratitude to everyone mentioned above
and to all those who have been part of my academic and personal journey. I am
forever thankful for your presence in my life.
Helsinki, May 6th,2024
Qingqing Li
QINGQING LI
Qingqing Li received a B.Sc. in Electronic Engineering in
2016 and an M.Eng. in Electronics and Communication
Engineering in 2018, both from Fudan University, China.
Since 2019, he has been a researcher at the Turku Intelli-
gent Embedded and Robotic Systems (TIERS) Lab, Fac-
ulty of Technology, University of Turku. His research in-
terests include sensor fusion, SLAM, multi-robot coopera-
tion, robotic perception,and deep learning.
vii
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Original Publications . . . . . . . . . . . . . . . . . . . . . . xiv
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Research Background . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 LiDAR-Based Odometry . . . . . . . . . . . . . . . . 2
1.1.2 Global localization . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Multi-Sensor Fusion for SLAM . . . . . . . . . . . . . 5
1.1.4 SLAM with Solid-State-LiDAR . . . . . . . . . . . . . 5
1.1.5 Multi-LiDAR-Based Sensor Fusion . . . . . . . . . . 6
1.2 Motivation and Objectives . . . . . . . . . . . . . . . . . . . . 7
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis Content Summary . . . . . . . . . . . . . . . . . . . . 11
2 Multi-modal Multi-LiDAR Sensors Dataset . . . . . . . . . . . 12
2.1 LiDAR-Based SLAM Datasets . . . . . . . . . . . . . . . . . 14
2.2 Multi-LiDAR Data Collection Platform . . . . . . . . . . . . . 16
2.2.1 Hardware System . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Software System . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Sensors Calibration . . . . . . . . . . . . . . . . . . . 18
2.3 Generating Ground Truth Data . . . . . . . . . . . . . . . . . 19
2.3.1 MOCAP Avaliable Ground Truth . . . . . . . . . . . . 19
2.3.2 SLAM Assisted Ground Truth . . . . . . . . . . . . . 20
2.3.3 Data Sequences . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Data Format . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 SOTA Algorithms Benchmarking . . . . . . . . . . . . . . . . 25
2.4.1 LiDAR Odometry . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Mapping Quality . . . . . . . . . . . . . . . . . . . . . 27
viii
2.4.3 Runtime Evaluation . . . . . . . . . . . . . . . . . . . 30
2.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . 31
3 Accurate Localization in Large-scale Forest Environments 32
3.1 Autonomous Driving Meet Forest . . . . . . . . . . . . . . . . 33
3.1.1 Autonomous Vehicles in Forest . . . . . . . . . . . . 33
3.1.2 LiDAR-Based Localization . . . . . . . . . . . . . . . 34
3.2 Delaunay Triangulation Based Localization . . . . . . . . . . 36
3.2.1 Trunk Point Cloud Segmentation . . . . . . . . . . . 38
3.2.2 Global Map DT Graph Generation . . . . . . . . . . 40
3.2.3 Local Map and Point Cloud aggregation . . . . . . . 40
3.2.4 Local and Global DT Graph Matching . . . . . . . . 42
3.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 47
3.3.1 Study Area & Hardware . . . . . . . . . . . . . . . . 47
3.3.2 Trunk Segmentation Performance . . . . . . . . . . . 49
3.3.3 Computational Time and Accuracy . . . . . . . . . . 50
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Topology Mapping . . . . . . . . . . . . . . . . . . . . 52
3.4.2 Potential Accuracy Improvements . . . . . . . . . . . 52
3.4.3 Potential Computational Improvements . . . . . . . . 53
3.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . 54
4 Multi-Sensor Fusion for Accurate Localization in Urban
Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 SLAM and Autonomous Driving Vehicle . . . . . . . . . . . . 55
4.2 JDD Dataset and Challenge . . . . . . . . . . . . . . . . . . 57
4.3 Localization Methods . . . . . . . . . . . . . . . . . . . . . . 59
4.3.1 GNSS-based localization . . . . . . . . . . . . . . . . 59
4.3.2 GNSS+IMU Localization . . . . . . . . . . . . . . . . 59
4.3.3 LiDAR Odometry . . . . . . . . . . . . . . . . . . . . 59
4.3.4 NDT-Based Localization (NDT+) . . . . . . . . . . . 60
4.3.5 NDT+IMU Localization (NDT++) . . . . . . . . . . . 61
4.4 Corrupted Map Reconstruction . . . . . . . . . . . . . . . . . 62
4.5 Experiment and Results . . . . . . . . . . . . . . . . . . . . . 62
4.6 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . 64
5 Robust Multi-Modal Multi-LiDAR-Inertial Odometry andMap-
ping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Multi modal LiDARs and IMU platform . . . . . . . . . . . . . 69
5.1.1 System pipeline . . . . . . . . . . . . . . . . . . . . . 71
5.1.2 Problem Formulation . . . . . . . . . . . . . . . . . . 71
ix
5.1.3 Spatial-temporal Calibration and Initialization . . . . 71
5.1.4 IMU Initialization and Preintegration . . . . . . . . . 72
5.2 Multi-modal LiDAR Pose Estimation . . . . . . . . . . . . . . 73
5.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . 73
5.2.2 Feature Clouds Merging . . . . . . . . . . . . . . . . 74
5.2.3 Keyframe Selection & Undistortion . . . . . . . . . . 74
5.2.4 Sliding Window Optimization . . . . . . . . . . . . . 75
5.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Sensor Configuration and Implementation . . . . . . 76
5.3.2 Qualitative Experiment . . . . . . . . . . . . . . . . . 78
5.3.3 Outdoor Mapping . . . . . . . . . . . . . . . . . . . . 80
5.3.4 Mapping Quality Comparison . . . . . . . . . . . . . 81
5.3.5 Runtime Analysis . . . . . . . . . . . . . . . . . . . . 81
5.4 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . 81
6 Dynamic Object Detection and Tracking . . . . . . . . . . . . 83
6.1 Human Detection and Tracking with UAVs . . . . . . . . . . 84
6.1.1 Human Detection for SAR Task . . . . . . . . . . . . 86
6.1.2 Deep Learning-Based Huamn Detection . . . . . . . 87
6.1.3 Data Acquisition . . . . . . . . . . . . . . . . . . . . . 87
6.1.4 Modal Training and Test Setup . . . . . . . . . . . . 88
6.1.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . 89
6.1.6 Experimental Results . . . . . . . . . . . . . . . . . . 89
6.1.7 Summary and Conclusion . . . . . . . . . . . . . . . 92
6.2 Micro UAV Detection and Tracking with Solid-state LiDAR . 92
6.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . 94
6.2.2 Tracking System Overview . . . . . . . . . . . . . . . 95
6.2.3 Adaptive Scan Integration . . . . . . . . . . . . . . . 96
6.2.4 Experimental Evaluation . . . . . . . . . . . . . . . . 99
6.2.5 Summary and Conclusion . . . . . . . . . . . . . . . 103
7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1 Summary and Contributions . . . . . . . . . . . . . . . . . . 105
7.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2.1 Adaptive Multi Sensor Online Calibration for Life Long
Autonomous System . . . . . . . . . . . . . . . . . . 107
7.2.2 Global Localization in Dynamic Environments . . . . 108
7.2.3 Enhancing Human-robot Interaction Through Multi-
Modal Multi-LiDAR Fusion . . . . . . . . . . . . . . . 108
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
x
Abbreviations
ALS Airborne Laser Scanning
CPU Central processing unit
DARPA Defense Advanced Research Projects Agency
DBH Diameter at the breast height
DL Deep learning
DOF Degree of freedom
DT Delaunay triangulation
EKF Extended Kalman filter
FN False negatives
FP False positive
FoV Field of view
GICP Generalized Iterated Closest Point
GNSS Global navigation satellite system
GPS Global Positioning system
GPU Graphics processing unit
IMU Inertial measurement unit
IoU Intersection over Union
ICP Iterative closest point
LCD Loop Closure Detection
LiDAR Light Detection and Ranging
LOAM Lidar odometry and mapping
xi
MAV Micro aerial vehicle
MINS multi-sensor aided inertial navigation system
MLS Mobile laser scanning
MM Multiple model
MOCAP Motion capture
MSE mean square distance
MSCKF Multi-State Constraint Kalman Filter
KD-tree K Dimensional tree
KITTI Karlsruhe Institute of Technology and Toyota Technological Institute
NDT Normal distribution transforms
PCL Point cloud library
PTP Precise timestamp protocol
RAM Random-access memory
RL Reinforcement learning
ROS Robot Operating System
RTK Real-Time Kinematic position
SAR search and rescue
SDP Sequential Decision Process
SEIFs Sparse extended information filters
SLAM Simultaneous localization and mapping
SOTA State of the art
TLS Terrestrial laser scan
TP True positive
UAV Unmanned aerial vehicle
UGV Unmanned ground vehicle
xii
UWB Ultra-wideband
VINS Visual-inertial navigation systems
VIO Visual-inertial odometry
VRPN Virtual-Reality Peripheral Network
WGS World Geodetic System
YOLO You only look once
xiii
List of Original Publications
This dissertation is based on the following original publications:
I Qingqing, L., Queralta, J.P., Gia, T.N., Zou, Z. and Westerlund, T., 2020.
Multi-sensor fusion for navigation and mapping in autonomous vehicles:
Accurate localization in urban environments. Unmanned Systems, 8(03),
pp.229-237.
II Qingqing, L., Nevalainen, P., Pen˜a Queralta, J., Heikkonen, J. and West-
erlund, T., 2020. Localization in unstructured environments: Towards au-
tonomous robots in forests with delaunay triangulation. Remote Sensing,
12(11), p.1870.
III Qingqing, L., Xianjia, Y., Queralta, J.P. and Westerlund, T., 2021, Decem-
ber. Adaptive lidar scan frame integration: Tracking known mavs in 3d
point clouds. In 2021 20th International Conference on Advanced Robotics
(ICAR) (pp. 1079-1086). IEEE.
IV Qingqing, L., Xianjia, Y., Queralta, J.P. and Westerlund, T., 2022, October.
Multi-modal lidar dataset for benchmarking general-purpose localization
and mapping algorithms. In 2022 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS) (pp. 3837-3844). IEEE.
V Qingqing, L., Taipalmaa, J., Queralta, J.P., Gia, T.N., Gabbouj, M., Ten-
hunen, H., Raitoharju, J. and Westerlund, T., 2020, November. Towards
active vision with UAVs in marine search and rescue: Analyzing human
detection at variable altitudes. In 2020 IEEE International Symposium on
Safety, Security, and Rescue Robotics (SSRR) (pp. 65-70). IEEE.
VI Qingqing, L., Queralta, J.P., Gia, T.N., Zou, Z., Tenhunen, H. and Wester-
lund, T., 2019, September. Detecting water reflection symmetries in point
clouds for camera position calibration in unmanned surface vehicles. In
2019 19th International Symposium on Communications and Information
Technologies (ISCIT) (pp. 507-512). IEEE.
VII Sier, H.,Qingqing, L., Yu, X., Pen˜a Queralta, J., Zou, Z. and Westerlund,
T., 2023. A benchmark for multi-modal lidar slam with ground truth in
gnss-denied environments. Remote Sensing, 15(13), p.3314.
xiv
VIII Nevalainen, P.,Qingqing, L., Melkas, T., Riekki, K., Westerlund, T. and
Heikkonen, J., 2020. Navigation and mapping in forest environment using
sparse point clouds. Remote Sensing, 12(24), p.4088.
The following related publications are not directly included in this thesis:
I Qingqing, L., Queralta, J.P., Gia, T.N. and Westerlund, T., 2019, December.
Offloading monocular visual odometry with edge computing: Optimizing
image quality in multi-robot systems. In Proceedings of the 2019 5th Inter-
national Conference on Systems, Control and Communications (pp. 22-26).
II Queralta, J.P., Qingqing, L., Ferrer, E.C. and Westerlund, T., 2022. Secure
encoded instruction graphs for end-to-end data validation in autonomous
robots. IEEE Internet of Things Journal, 9(18), pp.18028-18040.
III Gia, T.N.,Qingqing, L., Queralta, J.P., Zou, Z., Tenhunen, H. and Wester-
lund, T., 2019, September. Edge AI in smart farming IoT: CNNs at the edge
and fog computing with LoRa. In 2019 IEEE AFRICON (pp. 1-6). IEEE.
IV Queralta, J.P.,Qingqing, L., Zou, Z. and Westerlund, T., 2020, April. En-
hancing autonomy with blockchain and multi-access edge computing in dis-
tributed robotic systems. In 2020 Fifth International Conference on Fog and
Mobile Edge Computing (FMEC) (pp. 180-187). IEEE.
V Queralta, J.P.,Qingqing, L., Schiano, F. and Westerlund, T., 2022, July.
VIO-UWB-based collaborative localization and dense scene reconstruction
within heterogeneous multi-robot systems. In 2022 International Confer-
ence on Advanced Robotics and Mechatronics (ICARM) (pp. 87-94). IEEE.
VI Queralta, J.P., Qingqing, L., Gia, T.N., Truong, H.L. and Westerlund, T.,
2020, May. End-to-end design for self-reconfigurable heterogeneous robotic
swarms. In 2020 16th International Conference on Distributed Computing
in Sensor Systems (DCOSS) (pp. 281-287). IEEE.
VII Sarker, V.K., Qingqing, L. and Westerlund, T., 2020, June. 3D Perception
with Low-cost 2D LIDAR and Edge Computing for Enhanced Obstacle
Detection. In 2020 IEEE Conference on Industrial Cyberphysical Systems
(ICPS) (Vol. 1, pp. 49-54). IEEE.
VIII Xianjia, Y., Qingqing, L., Queralta, J.P., Heikkonen, J. and Westerlund,
T., 2021, June. Applications of uwb networks and positioning to autonomous
robots and industrial systems. In 2021 10th Mediterranean Conference on
Embedded Computing (MECO) (pp. 1-6). IEEE.
xv
IX Xianjia, Y., Qingqing, L., Queralta, J.P., Heikkonen, J. and Westerlund, T.,
2021, August. Cooperative uwb-based localization for outdoors positioning
and navigation of uavs aided by ground robots. In 2021 IEEE International
Conference on Autonomous Systems (ICAS) (pp. 1-5). IEEE.
X Zhao, W., Queralta, J.P.,Qingqing, L. and Westerlund, T., 2020, November.
Towards closing the sim-to-real gap in collaborative multi-robot deep rein-
forcement learning. In 2020 5th International conference on robotics and
automation engineering (ICRAE) (pp. 7-12). IEEE.
The original publications have been reproduced with the permission of the copyright
holders.
xvi
1 Introduction
Over the past few decades, the landscape of autonomous systems has undergone a
profound evolution, propelled by significant advancements in sensor technologies.
Among these, LiDAR sensors have emerged as crucial components in the realm of
autonomous systems, with applications spanning from household vacuum cleaning
robots [1] to state-of-the-art self-driving platforms [2]. Their ubiquitous integration
in these systems can be credited to several pivotal factors, notably their exceptional
long-range sensing abilities, precise object detection accuracy, and reliable perfor-
mance across diverse environmental conditions [3].
LiDAR technology has seen significant advancements over the last decade, lead-
ing to improvements in density, accuracy, and higher frequency of measurements.
The concept of integrating multiple 2D scanners along the Z-axis to attain 3D per-
ception has inspired the development and widespread adoption of multiple channels
spinning LiDAR [4]. During the 2007 DARPA Urban Challenge, five out of the
six automated vehicles that successfully completed the race were equipped with the
Velodyne 64 channels spinning LiDAR [5]. In the context of 3D LiDAR systems,
the term channel refers to the individual laser beams or detection elements used to
capture data points in the environment. Each channel corresponds to a specific laser
beam or detector that emits or receives light pulses [6]. Nowadays, the 3D spinning
LiDAR systems offer a range of channels, typically ranging from 16 to 32, and even
up to 256 channels. For example, the 128 channels spinning LiDAR OS-0 in our
data acquisition system shows in Figure 1 can generate 2.6 million points per second,
with a maximum range of 50 meters. These advancements in multi-channel spin-
ning LiDAR have enabled high degrees of situational awareness in mobile robotic
applications. This innovative technology has garnered considerable interest across
diverse fields, proving valuable in practical applications such as autonomous driving
vehicles [7], unmanned aerial vehicles [8; 9], and forest inventory [10].
1.1 Research Background
The advancements in 3D LiDAR technology have become essential tool for funda-
mental perception tasks such as odometry, global localization, mapping, and tracking.
In this section, we introduce the primary tasks addressed in this thesis.
1
Qingqing Li
VLP-16
HorizonL515
OS-0 OS-1
AVIA
OptiTrack 
Markers
X
Y
Z
T265
GPS RTK
Figure 1. Front view of the proposed multi-modal multi-LiDAR data acquisition system.
The setup comprises three spinning LiDARs: a 16-channel Velodyne LiDAR (VLP-16), a
64-channel Ouster LiDAR (OS1), and a 128-channel Ouster LiDAR (OS0), along with two
solid-state LiDARs, Horizon and Avia.
1.1.1 LiDAR-Based Odometry
LiDAR odometry is a technique that utilizes LiDAR sensor data to estimate the move-
ment and position of a mobile robot or vehicle in real-time.LiDAR odometry gener-
ally employs scan-matching techniques including ICP, KISS-ICP [11], GICP [12],
and others to determine the relative transformation between two successive frames.
Feature-based matching approaches have gained popularity as a computationally ef-
ficient alternative to full point cloud matching. For example, in [13], Zhang et al.
propose the registration of edge and plane features for real-time LiDAR odometry.
This type of operation assumes that the LiDAR moves within a structured environ-
ment, with edge and plane points clearly identifiable from the point clouds. The
matching of consecutive scans is then performed by solving a least-squares optimiza-
tion problem. Due to the high performance achieved by the proposed method, sev-
eral iterations of LOAM have emerged. For instance, LeGO-LOAM [14] optimizes
ground points separately, resulting in better odometry accuracy and computational
efficiency.
Low-cost and high-performance solid-state LiDAR have attracted significant in-
terest among researchers in both academia and industry in recent years. Compared
to spinning LiDAR, the different sensing characteristics of solid-state LiDAR, such
as non-repetitive scanning patterns, introduce new challenges in point cloud registra-
tion and mapping. In [15], Lin et al. address several fundamental challenges with
2
Introduction
Fi
gu
re
2.
L
iD
A
R
po
in
tc
lo
ud
an
d
ca
m
er
a
da
ta
sa
m
pl
es
fr
om
th
e
pr
op
os
ed
da
ta
se
ti
n
th
e
In
do
or
04
se
qu
en
ce
ar
e
vi
su
al
iz
ed
he
re
.T
he
le
ft
m
os
t
co
lu
m
n
sh
ow
s
th
e
L
iD
A
R
da
ta
fr
om
A
vi
a
an
d
H
or
iz
on
;t
he
se
co
nd
co
lu
m
n
sh
ow
s
th
e
L
iD
A
R
da
ta
fr
om
O
S1
an
d
O
S0
;t
he
th
ir
d
co
lu
m
n
sh
ow
s
th
e
da
ta
fr
om
th
e
V
L
P-
16
an
d
th
e
de
pt
h
im
ag
e
fr
om
L
51
5.
T
he
ri
gh
tm
os
tc
ol
um
n
sh
ow
s
th
e
R
G
B
im
ag
e
fr
om
L
51
5
an
d
th
e
ra
ng
e
im
ag
es
fr
om
O
S1
an
d
O
S0
.
3
Qingqing Li
Table 1. Sensor specification for the data collection platform. Angular resolution is
configurable in the OS1-64 (varying the vertical FoV). Livox LiDARs have a non-repetitive
scan pattern that delivers higher angular resolution with longer integration times. For
LiDARs, range is based on manufacturer information, with values corresponding to 80%
Lambertian reflectivity and 100 klx sunlight, except for the L515 LiDAR camera.
Sensor IMU Type Channels FoV Resolution Range Freq. Points (pts/s)
VLP-16 N/A spinning 16 360°×30° V:2.0°, H:0.4° 100m 10Hz 300,000
OS1-64 ICM-20948 spinning 64 360°×45° V:0.7°, H:0.18° 120m 10Hz 1,310,720
OS0-128 ICM-20948 spinning 128 360°×90° V:0.7°, H:0.18° 50m 10Hz 2,621,440
Horizon BS-BMI088 solid-state N/A 81.7°×25.1° N/A 260m 10Hz 240,000
Avia BS-BMI088 solid-state N/A 70.4°×77.2° N/A 450m 10Hz 240,000
L515 BS-BMI085 LiDAR camera N/A 70°×43°(±3°) N/A 9m 30Hz -
T265 BS-BMI055 fisheye cameras N/A 163±5° N/A N/A 30Hz -
a robust, real-time LiDAR odometry and mapping algorithm for solid-state LiDAR
(LOAM Livox). The proposed method addresses the challenges of reduced field of
view (FoV) and non-repetitive sampling patterns by focusing on feature extraction
and selection, robust outlier rejection, filtering of moving objects, and compensa-
tion for motion distortion. Other recent results have also presented tightly-coupled
LiDAR-inertial odometry and mapping schemes for both solid-state and mechanical
LiDARs [16; 17]. Regarding the inherently limited FoV of a single solid-state Li-
DAR, a decentralized approach for simultaneous calibration, localization, and map-
ping utilizing multiple solid-state LiDARs was introduced in [18] to enhance system
resilience. However, increasing the number of LiDARs can expand the field of view
(FoV), but it also leads to higher costs and increased system complexity. The solid-
state-based odometry and mapping algorithm has made significant progress. How-
ever, its limited field of view (FoV) hinders its application in cluttered environments
where the entire FoV might be blocked. Therefore, more robust sensor fusion meth-
ods need to be investigated to ensure its robust performance in diverse environments.
1.1.2 Global localization
Global localization in robotics refers to the process of accurately estimating a robot’s
position within a global coordinate system or a given map. This determination is vital
for enabling effective mapping, precise navigation, and facilitating cooperative tasks
among multiple robots in diverse robotic applications. Accurate global localization
also plays a pivotal role in Loop Closure Detection (LCD) within the SLAM frame-
work, crucial for recognizing revisited locations and mitigating pose estimation drift
4
Introduction
through graph optimization [19; 20; 21]. Extensive studies have addressed global
localization and Loop Closure Detection in structured environments [22; 23; 24; 25].
However, research focusing on structure-poor environments like forests or orchards
is limited [26; 27], mainly due to the complexity, scarcity of landmarks, and the large
scale of these environments. Robust and efficient global localization methodologies
tailored to unstructured environments remain an under-explored area of study.
1.1.3 Multi-Sensor Fusion for SLAM
Relying solely on LiDAR for pose estimation often results in substantial drift due to
registering skewed point clouds. Modern LiDAR-based SLAM systems frequently
integrate data from multiple sensors, such as IMU, GNSS, and cameras, to bolster
accuracy, robustness, and situational awareness.
IMU sensors play a crucial role in modern SLAM framework by providing es-
sential motion and orientation information at high frequency, leading to the fusion
of LiDAR and IMU, which has received a lot attention among researchers and is
widely employed in various SLAM algorithms [21; 17]. LIO-SAM was introduced
as a solution to mitigate accumulated drift in LiDAR-inertial odometry using a graph
optimization strategy. This method tightly couples LiDAR and the inertial measure-
ment unit (IMU), optionally incorporating GNSS sensors, and utilizes smoothing
and mapping techniques [21]. Other studies, such as LIO-Mapping [28] and Fast-
LIO implementing an iterated Kalman filter [17], also leverage LiDAR and IMU
fusion for SLAM. Furthermore, integrating LiDAR-inertial odometry with visual-
inertial odometry (VIO) significantly enhances accuracy and robustness, especially
in texture-less or feature-less environments [29; 30].
Integrating more sensors into the system provides a wealth of environmental in-
formation. However, this expansion increases system complexity, necessitating ro-
bust calibration and maintenance, as well as more powerful computers to process the
combined data. These approach may not be viable for computing resource-limited
platforms, such as micro UAVs.
1.1.4 SLAM with Solid-State-LiDAR
One of the key limitations of LiDAR technology preventing more widespread adop-
tion for localization and mapping in mobile robots is the high cost of the sensors,
specially compared to vision sensors. However, with lower-cost models becoming
available, mainly solid-state LiDARs, multiple research efforts have been directed
towards optimizing existing algorithms for the new sensing modalities and scanning
patterns.
A robust, real-time LOAM algorithm for solid-state-LiDAR with small FoV and
irregular samplings has been presented in [31] to address several fundamental chal-
5
Qingqing Li
lenges arising from solid-state-LiDARs. In another work, Lin et al. proposed a
decentralized framework for SLAM tasks with multiple solid-state-LiDARs to in-
crease the FoV and improve overall system robustness [18]. Inspired by local Bundle
Adjustment (BA) techniques utilized in visual SLAM, a BA approach with an adap-
tive voxelization method to search feature correspondence and solve the problem of
sparse features points in three-dimensional LiDAR data was presented in [32]. More
recent works have also worked towards improving LiDAR-based SLAM system ro-
bustness, with tightly-coupled LiDAR-inertial odometry and mapping schemes for
both solid-state and mechanical LiDARs presented in [16; 33]. Additionally, a cam-
era and solid-state LiDAR fusion SLAM framework have also been proposed in [34].
1.1.5 Multi-LiDAR-Based Sensor Fusion
More recently, the research focus has also shifted towards the fusion of data from
multiple LiDARs. For example, [35] addresses multi-LiDAR online extrinsic cali-
bration, odometry, and mapping, where extracted edge and planar features are uti-
lized and data uncertainty is modeled with Gaussian distribution. A tightly coupled
LiDAR-inertial odometry and mapping approach with low drift is proposed in [36],
which utilizes features extracted from multiple time-synchronized LiDARs with com-
plementary FoV. Rather than utilizing spinning LiDARs, a decentralized extended
Kalman filter (EKF) approach for simultaneous calibration, localization, and map-
ping for multiple solid-state LiDARs was introduced to improve upon the limited
FoV of a single solid-state LiDAR [18].
Achieving denser 3D geometry measurements of the surrounding environment
is crucial for enhancing 3D environment understanding. Unfortunately, spinning
LiDAR with higher resolution can be costly due to its more complicated architec-
ture. Low-resolution spinning LiDAR, while more affordable, produces sparse point
clouds with limited features, making the problem difficult to tackle [37] and leading
to inevitable inherent alignment errors [38]. Despite the clear advantages of solid-
state LiDARs, the naturally narrow horizontal FoV leads, in a similar way to monoc-
ular pinhole camera systems, to the sensing volume being blocked by objects, or
even entirely occupied by a near wall, resulting in an insufficient number of feature
points to estimate a 6-degree-of-freedom (6-DoF) pose. This limitation has made
current solid-state-based SLAM methods challenging to apply in outdoor or large
indoor scenarios [16; 39] as shown in Figure 3. However, we find a lack of work in
the literature in terms of fusing multiple LiDARs with different scanning modalities,
which has potential to improve map quality and odometry accuracy by leveraging the
advantages of different model of LiDAR sensors.
Multiple studies have focused on improving LiDAR maps by integrating point
clouds from multiple LiDAR sensors [35; 40]. However, the low frame publish-
ing frequency of typical LiDARs (e.g., 10 Hz) can hinder an accurate 6-DOF pose
6
Introduction
Figure 3. The dense mapping results with solid-state LiDAR by LIO-Mapping are
showcased from different environments, positioned from top to bottom, left to right,
depicting urban road, forest, hall, and office room environments, respectively.
estimation in multi-LiDAR systems. In contrast, IMUs have been widely used in
state-of-the-art SLAM systems [41; 16], due to their ability to measure accelera-
tion and angular velocity at a high frequency (e.g., 200 Hz) in three-dimensional
space. Nonetheless, there remains a lack of methods that can effectively exploit
multi-LiDAR inertial systems for odometry estimations.
1.2 Motivation and Objectives
LiDAR technology has witnessed significant advancements, especially with the emer-
gence of higher-resolution spinning LiDAR and low-cost solid-state LiDAR. How-
ever, achieving a fully affordable, reliable, and precise localization, mapping, and
object tracking system using LiDAR technology still presents challenges. This thesis
aims to address these issues and unlock the technology’s full potential for practical
applications across various industries.
The following are the key research questions that this thesis aims to answer
throughout the different chapters, ranging from more general to more specific con-
cept:
1. Datasets are crucial for SLAM algorithm development as they provide real-
world sensor measurements and ground truth information, enabling the train-
7
Qingqing Li
ing, validation, and improvement of SLAM algorithms, which are essential for
accurate and reliable spatial mapping and navigation of autonomous systems.
Among existed SLAM dataset, the KITTI benchmark [42] is the most signif-
icant dataset with capabilities of evaluating several tasks including odometry,
SLAM, objects detection, tracking with spinning LiDAR. Solid-state LiDAR
overcome some of the challenges of spinning LiDAR in terms of cost and res-
olution, but introduce some new limitations in terms of a relatively small field
of view (FoV) [31; 16]. Despite solid-state LiDAR’s increasing popularity,
few works have benchmarked the performance of both spinning LiDAR and
solid-state LiDAR in diverse environments, which limits the development of
more general-purpose LiDAR-based SLAM algorithms [43]. In this thesis, we
explored the limitations and advantages of each LiDAR sensor in various en-
vironmental conditions. We aim to answer the following questions: How does
the mapping ability between different LiDAR sensors at the same time in di-
verse environments? What kind of sensor combination of these sensors is able
to achieve a more robust and accurate mapping system?
2. In the context of autonomous driving vehicles, accurate mapping and localiza-
tion are paramount, particularly in dense urban environments where precise
localization is indispensable [44]. The dynamic surroundings and narrower
pathways in such areas pose significant technical challenges [45]. In this thesis,
we aim to address the following questions: Which LiDAR-based localization
method is optimal for enhancing the navigation accuracy of delivery robots in
dense urban environments? Additionally, what strategies can be employed to
effectively update or repair maps in response to changes in the environment or
map degradation in dense urban settings?
3. Most SLAM algorithms are primarily optimized for structured urban environ-
ments, leading to reduced performance in unstructured environments such as
forests. Firstly, in forests, GNSS signals often suffer from the multipath effect,
resulting in poor location accuracy, particularly under the canopy [46]. Sec-
ondly, due to environmental similarities, general point cloud ICP matching
algorithms may encounter issues with local minima during loop closure detec-
tion tasks, leading to an inability to correct odometry drift [13]. Moreover, in
forest harvesting operations, the movement of harvesters and forwarders along
undefined paths, coupled with potential changes in the environment due to log-
ging activities, further complicates the localization process. In this thesis, we
aim to address the localization challenges in GNSS-denied environments and
develop solutions for accurate localization in forest environment.
4. Multiple studies have focused on improving LiDAR maps in terms of accuracy
and density by integrating point clouds from multiple LiDAR sensors [35; 40].
8
Introduction
However, the low frame publishing frequency of typical spinning LiDARs
(e.g., 10 Hz) can hinder an accurate 6-DOF pose estimation in multi-LiDAR
systems. In contrast, IMUs have been widely used in state-of-the-art SLAM
systems [41; 16], due to their ability to measure acceleration and angular veloc-
ity at a high frequency (e.g., 200 Hz) in three-dimensional space. Nonetheless,
there remains a lack of methods that can effectively exploit multi-LiDAR in-
ertial systems for odometry estimations. In this thesis, after studying multiple
LiDAR systems and their limitations and advantages, our aim is to answer the
following question: What is the most affordable, robust, and accurate mapping
system with different sensor systems? Additionally, how can sensor fusion
with different modalities of LiDARs and IMUs systems be implemented to
achieve state-of-the-art mapping performance?
5. Most SLAM methods operate under the assumption that the environment is
static. However, the presence of dynamic objects in the real world can signifi-
cantly impact the accuracy of scan matching [47]. To counteract the influence
of such dynamic objects, many researchers have concentrated on identifying
and eliminating feature points originating from moving objects, particularly in
autonomous driving systems involving humans and cars [48]. While consider-
able attention has been devoted to detecting large moving objects, such as vehi-
cles and pedestrians, there has been relatively little emphasis on micro object
detection. One primary reason for this neglect is the limited capability of tradi-
tional spinning LiDARs to capture sufficient points from small objects, posing
challenges in tracking them effectively. However, the utilization of solid-state
LiDARs with non-repetitive scanning patterns enables the generation of dense
point clouds within the field of view (FoV), facilitating the tracking of smaller
objects like micro objects. In this thesis, our goal is to further explore the
tracking capabilities of solid-state LiDAR, specifically focusing on its ability
to track smaller objects.At last, we aim to answer the following questions: How
can small objects be effectively detected from the environment? What is the
detection ability using a camera with deep learning methods from a UAV view
at different attitudes? How can challenging objects be tracked using LiDAR
sensors with small field of view (FoV) and non-repetitive scanning features?
Answering these questions can contribute to the development of a more robust
SLAM system.
1.3 Main Contributions
The thesis concentrates on the development of a cost-effective, robust 3D LiDAR-
based localization, mapping, and tracking system. This system leverages multimodal
sensors, encompassing various types of LiDARs, IMUs, and GNSS, particularly in
9
Qingqing Li
challenging environments.
The contribution can be categorized as following:
• An open-source multi-modal LiDAR dataset for general-purpose localization
and mapping. The dataset includes a benchmark that compares different modal-
ity LiDARs (spinning, solid-state) in diverse environments, like indoor offices,
long corridors, halls, forests, and open roads. To allow for more accurate and
fair comparison, we introduce a new method for ground truth generation in
larger indoor spaces. This enhanced ground truth enables significantly higher
degree of quantitative benchmarking and comparison with respect to our pre-
vious work [43]. The dataset and ground truth labels, as well as more detailed
data, provides a performance reference for multi-modal LiDAR sensors in both
structured and unstructured environments to both academia and industry. The
datasets are fully open sourced at [49] and [50].
• A segmentation and graph search-based global localization method for forest
environments. Forest enviornments are characterized by limited distinctive
features and visual similarity across the surroundings, specifically targeting
autonomous harvesters. The method achieved a robust, efficient, and accurate
localization ability in forest environment.
• LiDAR-based sensor fusion algorithms for autonomous delivery vehicles in
GNSS-denied and corrupted map environments. By integrating data from Li-
DAR sensors with other complementary sensors such as cameras, IMUs, and
odometry, the proposed method enhance localization accuracy, perception ca-
pabilities, and dynamic path planning. The LiDAR-based sensor fusion algo-
rithms, which won the first award in the US area and achieved the second posi-
tion globally in the JDD competition, effectively address the last mile delivery
problem in GNSS-denied and corrupted map environments for autonomous
vehicles.
• A cost-effective and robust simultaneous localization and dense mapping sys-
tem that take advantage of different modalities of LiDAR sensors. The ob-
jective was to achieve enhanced situational awareness and mapping accuracy
by effectively fusing data from low-resolution spinning LiDAR and low-cost
solid-state LiDAR with limited field of view (FoV). The research successfully
tackled calibration, synchronization, data processing, and feature extraction
challenges to seamlessly integrate diverse data sources.
• A novel method for object detection and tracking. The method leverages the
dense measurement capability of solid-state LiDAR sensors. Additionally, the
research delved into camera-based human detection and tracking using data-
10
Introduction
driven based method at variable altitudes, specifically for search and rescue
operations.
1.4 Thesis Content Summary
The thesis is organized as follows: The introduction chapter offers an overview of the
research background and the addressed problem. Chapter 2 analyzes various sensor
modalities and datasets while benchmarking SLAM algorithms across diverse envi-
ronments. In Chapter 3, a graph search-based localization strategy for autonomous
harvesters in dense forests is proposed. Chapter 4 focuses on accurate localization
for autonomous delivery vehicles in urban environments with scenarios involving
GNSS-denied and map-corrupted situations. Chapter 5 introduces a multi-modal
multi-LiDAR inertial dense mapping framework that harnesses the benefits of wide
FoV spinning LiDAR and high-resolution solid-state LiDAR. In Chapter 6, the re-
search delves into dynamic object detection and tracking, specifically focusing on its
application in search and rescue missions and multi-robot cooperation tasks. Finally,
Chapter 7 summarizes research findings, presents conclusions, and outlines future
research directions.
11
2 Multi-modal Multi-LiDAR Sensors
Dataset
Datasets have had a significant impact on research efforts in the development of
robust LiDAR odometry, localization, and mapping algorithms. By facilitating col-
laboration and comparison between different research groups, open-access datasets
have accelerated algorithm development. Researchers can evaluate their algorithms
on standardized datasets, enabling fair comparisons and benchmarking of different
approaches. This common ground for testing and validation has led to the adoption
of best practices and the sharing of knowledge within the research community, ulti-
mately resulting in faster progress and innovation in the field.
One of the pioneers and perhaps the most significant dataset to date is arguably
the KITTI benchmark[42]. The KITTI dataset includes a 64-beam 3D laser scan-
ner, four gray-scale and color cameras, and a GNSS/IMU navigation system within
a single data-gathering platform. The KITTI benchmark has become an essential
tool to evaluate the performance of algorithms in multiple tasks such as odometry,
SLAM, object detection, or tracking, among others, in both academia and industry.
Several similar datasets have also been published with a system composed of mul-
tiple cameras and spinning LiDARs, providing images and the corresponding point
clouds in urban environment. Some relevant examples include the Oxford Robocar
dataset [51], nuScences [52], or the EU long-term dataset [53]. Multiple spinning
LiDARs are often employed in these data collecting platform, albeit mostly sharing
the same sensing modality or technology.
The aim of this chapter is to investigate state of the art LiDAR based odometry
and mapping algorithms’ performance in diverse environment on proposed datasets.
One of the main limitations of existing datasets that we aim to overcome is the lack
of availability of data from solid-state LiDARs. Solid-state LiDARs often present
limited FoV, owing to the lack of mechanical rotation. The limited FoV together
with the different scan patterns pose significant differences for many of the popular
LiDAR odometry, localization and mapping algorithms [33]. As a result, we believe
that more data is necessary to advance research towards general-purpose and sensor-
agnostic LiDAR data processing algorithms. To bridge this gap, we present a novel
multi-LiDAR dataset with spinning LiDARs of three different resolutions, two solid-
state LiDARs with different FoV and scan pattern, and a LiDAR-camera.
The primary characteristics of the presented dataset are the following:
12
Multi-modal Multi-LiDAR Sensors Dataset
1. A dataset with data from five different LiDAR sensors and one LiDAR cam-
era in a variety of environments. This is, to our knowledge, the most diverse
dataset in terms of LiDAR sensors for these environments. The dataset in-
cludes spinning LiDARs with 16 (Velodyne VLP-16), 64 (Ouster OS1-64) and
128 (Ouster OS1-128) channels and different vertical FoVs. Two different
solid-state LiDARs (Livox Horizon and Livox Avia) with different scanning
patterns and FoVs are also included. A LiDAR camera (RealSense L515) pro-
vides RGB images and LiDAR-aided depth images. Low-resolution images
with depth, near-infrared and laser reflectivity data from the Ouster sensors
complete the dataset. These are illustrated in Figure 4.
2. The dataset includes sequences with MOCAP (motion capture) based ground
truth in both indoors and outdoors environments. This is, to the best of our
knowledge, the first LiDAR dataset to provide such accurate ground truth in
forest environments in addition to indoor areas, albeit the limited trajectory
length (see for samples Figure 5).
3. In addition to the MOCAP labeled data, the dataset includes other sequences in
large indoor halls, roads, and forest paths. The wide variety of sensors enables
comparison between LiDAR odometry and mapping algorithms to an extent
that was not possible before, with both general-purpose and sensor-specific
approaches.
4. Proposed a new multi-modal multi-LiDAR SLAM-assisted and ICP-based sen-
sor fusion method for generating ground truth maps. With these maps, we then
match real-time point cloud data using a normal distributions transform (NDT)
method to obtain the ground truth with full six degrees of freedom (DOF) pose
estimation. This novel ground truth data leverages high-resolution spinning
and solid-state LiDARs.
5. Based on the presented dataset, we provide a baseline comparison of the state-
of-the-art in LiDAR odometry, localization and mapping. We compare the
odometry drift as well as the quality of the maps obtained with different sen-
sors and different algorithms.
Based on the above characteristics of the presented dataset, this work provides
a timely and complimentary addition to existing datasets which are mostly focused
towards mobile robots indoors or autonomous cars outdoors. This chapter also show
how the performance of the state-of-the-art LiDAR SLAM algorithms varies signif-
icantly based on the environment and the type of LiDAR sensor. The high degree
of multi-modality in the sensor suite opens the door to research a variety of new
challenges for the research community. This dataset can aid in the design and de-
13
Qingqing Li
velopment of future algorithms that better adapt to both structured and unstructured
environments, to limited FoV of the sensors, and to different scanning modalities.
Figure 4. Front view of the multi-modal data acquisition system. Next to each sensor, we
show the individual coordinate frames for the generated point clouds.
2.1 LiDAR-Based SLAM Datasets
There are a number of datasets available that are relevant to this work, mostly gath-
ered within the autonomous driving community. In Table 3, we compare the most
relevant related datasets from the literature with ours. In the rest of this section, we
address the key differences between the proposed dataset and existing ones.
In addition to the myriad of datasets captured in urban road environments and fo-
cused towards research in autonomous driving, the literature also showcases efforts
in off-road environments. For example, the NCLT dataset provides a large-scale
indoor and outdoor dataset with multi-modal sensors, including spinning LiDARs,
cameras and IMU attached on a wheeled robot [54]. In another work, a handheld
device comprised of one spinning LiDAR and depth camera was utilized to collect
data from urban outdoor and vegetated environments [55]. A multi-sensor SLAM
benchmark, encompassing diverse indoor and outdoor environments, has been intro-
duced in [56]. In relation to these works, we provide a wider variety of sensor data
as well as more accurate ground truth in a selection of sequences.
There is also a number of datasets available in unstructured environments. For
instance, the Robot Unstructured Ground Driving (RUGD) dataset captured from a
small, unmanned mobile robot traversing in unstructured environments has been in-
troduced in [57]. The RUGD dataset contains different terrain types focusing on vi-
sual perception tasks like semantic segmentation. Several similar datasets in unstruc-
tured environments have been presented for tasks such as scene depth prediction [58],
14
Multi-modal Multi-LiDAR Sensors Dataset
Figure 5. Samples of map data form different dataset sequences. From left to right and top
to down, we display maps generated from a forest, an urban area, an open road and a large
indoors hall, respectively.
terrain roughness understanding [59], off-road pedestrian detection [60]. Compared
to these datasets, we provide MOCAP-based ground truth in a forest environment,
while also including a wider variety of sensors.
In general, the number of publicly available datasets with solid-state LiDAR data
is scarce. Among them, the PandaSet collects driving scenarios in urban environ-
ments with data from a forward-facing solid-state LiDAR and a 64-channels spinning
LiDAR [61]. Additionally, Lin et al. presented an outdoor and indoor dataset with a
solid-state LiDAR in college environment to test a novel LiDAR odometry and map-
ping (LOAM) algorithm tailored to solid-state LiDAR sensors [31]. In the present
dataset, we provide a significantly higher number of sensors as well as ground truth
both indoors and outdoors.
The research in the adaptation and tuning of algorithms for new LiDAR sen-
sors lacks the support of a dataset for benchmarking and comparing the different
approaches. Moreover, the lack of a truly heterogeneous and multi-modal dataset
with various types of LiDAR sensors is preventing further comparisons between the
methods to advance towards general-purpose LiDAR-based SLAM algorithms. To
bridge these gaps, this chapter focus on providing a dataset that can serve as an initial
benchmark for odometry, localization and mapping in diverse environments and with
different types of LiDAR sensors.
15
Qingqing Li
Figure 6. Our data collecting platform, top view (left) and front view (right)
2.2 Multi-LiDAR Data Collection Platform
The sensor configuration of our data collection platform is shown in Figure 4, and
more specific information of each sensor is available in Table 1. Owing to the va-
riety of environments where the platform has been used, it has been mounted on
different types of mobile platforms. In road-like environments and large indoor halls,
a Clearpath Husky mobile robot has been used. In forests outdoors with snow, it
has been handheld. In small indoor spaces, it has been mounted on a mobile wheeled
platform, manually pushed. In order to increase the usability of the dataset for bench-
marking general-purpose algorithms, pitch and roll rotations have been applied in
different configurations when handheld, in addition to standard horizontal settings
where only the yaw angles varies if the surface where it is moving is horizontal.
2.2.1 Hardware System
The core objective of the sensor system is to provide data from various LiDAR sen-
sors with different characteristics, from novel low-cost solid-state LiDAR to 3D spin-
ning LiDARs with different resolutions and vertical FoV, and LiDAR cameras as
well. To this end, our data collecting platform includes three spinning LiDARs: 16-
channels Velodyne LiDAR (VLP-16), 64-channels Ouster LiDAR (OS1), and 128-
channels Ouster LiDAR (OS0). On the side of solid-state LiDARs, two units from
Livox are installed: Horizon, with a FoV close to a rectangle, and Avia, with an
almost-circular FoV. An Intel RealSense L515 LiDAR camera completes the setup.
Regarding the physical configuration, the Horizon and Avia LiDARs were installed
in the center of the frame facing forward. The L515 camera was attached to the
front left of the platform. On the sides, the OS0 and OS1 sensors were mounted at
a bit higher level, where the OS1 is turned 45 degrees clockwise, and the OS0 is
16
Multi-modal Multi-LiDAR Sensors Dataset
Figure 7. Top view of point cloud data generated for the calibration process with multiple
LiDARs. The red and green point clouds represent data obtained from the Livox Horizon
and Avia, respectively. The purple, yellow, and blue clouds are from the VLP-16, OS1, and
OS0 sensors.
turned 45 degrees anticlockwise. The Velodyne LiDAR is at the top-most position
with the x-axis facing forward as well. Please refer to the top view of Figure 6 for
the detailed distances, positions and orientations. The Optitrack marker set for the
MOCAP-based ground truth are fixed on the top of the aluminum stick to maximize
its visibility and detection range.
To ensure a low-latency and high-speed transmission of all data, the LiDARs
are connected to a Gigabit Ethernet router and a computer onboard the platform
featuring an Intel i7-10750h processor, 64 GB of DDR4 RAM memory and 1 TB
SSD storage. The Optitrack system is also physically connected via Ethernet to the
onboard computer on a separate interface to the LiDARs. Finally, the RealSense
L515 camera is connected using a USB 3.0 port.
2.2.2 Software System
Our software system is based entirely on ROS Melodic under Ubuntu 18.04. The
set of ROS drivers and the publishing frequency of the different sensors is shown
visually in Figure 8. Owing to the lack of hardware signals to synchronize the sensor
data, as in other datasets in the literature [53], we approach the minimization of the
data synchronization problem by running all the sensor drivers, data recording pro-
grams locally on a high performance computer. This, together with the networking
equipment, aid in reducing the latency of data transmission at the hardware and soft-
ware level (timestamped at the ROS drivers). In order to support a potentially wider
use of the data, the dataset also includes the time stamp from built-in internal oscil-
lators for both Livox and Ouster LiDARs, and for both point cloud and IMU data,
in addition to the timestamp included in the header of all ROS messages. We have
17
Qingqing Li
Table 2. List of data sequences in proposed dataset (V: Velodyne VLP-16, H:Livox
Horizon, A:Livox Avia, 𝑂0: Ouster OS0, 𝑂1: Ouster OS1, L:Realsense L515, T: Realsense
T265, G: RTK/GNSS)
Sequence Description Ground Truth Sensors
Forest01 Forest(Winter,Square) Mocap V,H,A,𝑂0,𝑂1
Forest02 Forest(Winter,Straight) Mocap V,H,A,𝑂0,𝑂1
Forest03 Forest (long path) SLAM V,H
Indoor01 Office room(easy) Mocap V,H,A,𝑂0,𝑂1,L
Indoor02 Office room(middle) Mocap V,H,A,𝑂0,𝑂1,L
Indoor03 Office room(hard) Mocap V,H,A,𝑂0,𝑂1,L
Indoor04 Large Hall SLAM V,H,A,𝑂0,𝑂1,L
Indoor05 Long Corridor SLAM V,H,A,𝑂0,𝑂1,L
Indoor06 Lab space (easy) MOCAP V,H,A,𝑂0,𝑂1,L
Indoor07 Lab space (hard) MOCAP V,H,A,𝑂0,𝑂1,L,T
Indoor08 Classroom space SLAM+ICP V,H,A,𝑂0,𝑂1,L,T
Indoor09 Corridor (short) SLAM+ICP V,H,A,𝑂0,𝑂1,L,T
Indoor10 Corridor (long) SLAM+ICP V,H,A,𝑂0,𝑂1,L,T
Indoor11 Hall (large) SLAM+ICP V,H,A,𝑂0,𝑂1,L,T
Road01 Open road(short) SLAM V,H,A,𝑂0,𝑂1
Road02 Open road(long) SLAM V,H,A,𝑂0,𝑂1
Road03 Open road GNSS RTK V,H,A,𝑂0,𝑂1,L,T,G
also compared the angular velocities of IMUs together with data from the MOCAP
system to conclude that the latency of our system is less than 10 ms.
2.2.3 Sensors Calibration
The extrinsic parameters of the LiDARs are calculated based on optimization meth-
ods similar to those presented in [62]. We calculate the extrinsic parameters in an
indoor office environment, while the sensor platform was stationary. The coordinate
system of the Horizon LiDAR sensor is treated as the reference frame during the
calibration process. Ten consecutive frames of point cloud data are integrated from
the solid-state LiDARs to accumulate a higher degree of detail from the environment.
The point cloud data from each different LiDAR is then transformed to the reference
frame based on manual measurements of a set of features in the environment. Then,
a Generalized Iterative Closest Point (GICP) method is employed to optimize the rel-
ative transformation between the reference frame and LiDARs iteratively [12]. For
reference, in Figure 7 we show sample sensor data from one of the indoor envi-
ronments after calibration. The ROS package 𝑙𝑖𝑣𝑜𝑥 𝑐𝑎𝑚𝑒𝑟𝑎 𝐿𝑖𝐷𝐴𝑅 𝑐𝑎𝑙𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛
was utilized to calibrate the extrinsic parameter between the Horizon sensor and the
18
Multi-modal Multi-LiDAR Sensors Dataset
Table 3. Comparison of related open-access LiDAR-based datasets with the proposed
dataset.
Dataset Year Environment Ground Truth LiDARs Other
KITTI[42] 2013 Urban road RTK GPS/INS 3D-Velodyne HDL-64E @10Hz 4× cameras , accel/gyro
NCLT[54] 2017
Urban Indoor
Outdoor GPS/INS
3D-Velodyne HDL-64E@10Hz
2× 2D-Hokuyo @10/40Hz camera
Oxford RobotCar[51] 2017 Urban Road GPS/INS
2× 2D-SICK @50Hz
3D-SICK @12.5Hz 4 Camera; accel/gyro
RUGB Dataset[57] 2019
Unstructured
outdoor - 3D-Velodyne HDL-32E @10Hz GPS&IMU ; 3× cameras
nuScences[52] 2020 Urban Road - 3D-32-Beams LiDAR @20Hz
6x Camera (RGB);GPS&IMU;
5x Radar@13Hz
Newer College[55] 2020
Urban outdoor
Vegetated 6DOF ICP 3D-Ouster-64 @10Hz D435i (Infrared); accel/gyro
PandaSet[61] 2021 Urban road -
3D-Hesai-Pandar64 @10Hz
3D solid-state LiDAR@10Hz 6x Cameras. GNSS&IMU
M2DGR [56] 2022 Urban In/Outdoors
Laser 3D tracker
RTK GPS/INS 3D VLP-32C @10Hz 3 Cameras. GNSS&IMU
Our Dataset 2022
Urban indoor
Urban road
Forest
6DOF MoCAP
SLAM
3x 3D-Spinning LiDAR(16,64,128) @10Hz
2x 3D-Solid-State-LiDAR @10Hz
LiDAR-Camera @30Hz
2x accel/gyro @200Hz
2x accel/gyro @100Hz
L515 LiDAR camera. The intrinsic parameters of LiDARs and the LiDAR camera
are given based on factory settings and manufacturer information. A specific rosbag
containing raw data recorded in stationary settings at the room shown in Figure 7 is
provided for end-user re-calibration and potential application of different methods.
2.3 Generating Ground Truth Data
Generating accurate ground truth data in complex environments is a challenging task,
as has been identified in multiple existing datasets. Many vehicular benchmarks uti-
lize the pose generated from GNSS/INS fusion method as ground truth. However,
multi-path effect can affect the accuracy of the pose estimated by GNSS sensors in
forest and urban environments. For indoor environments, GNSS signals are unavail-
able.
2.3.1 MOCAP Avaliable Ground Truth
MOCAP systems have been widely adopted in indoor environments owing to their
ability to provide millimeter-level accuracy in positioning data. However, the utiliza-
tion of MOCAP systems is limited mainly by the range of the cameras, usually in
the 10 to 20 m range. The need for relatively complex setup of the system has also
prevented the adoption of such systems for outdoors environments, and specially in
unstructured environment such as forests.
To meet the demands of reliable ground-truth data for diverse environments,
19
Qingqing Li
Master
OptiTrack Driver
Avia Driver
Velodyne Driver
L515 Driver
Horizon Driver
OS0 Driver OS1 Driver
rosbag
10 Hz
10 Hz
10 Hz
120 Hz
30 Hz
10 Hz
10 Hz
Figure 8. ROS drivers and data gathering frequency for the different LiDAR sensors used in
data collection platform.
the present dataset includes MOCAP-based ground-truth data in both a subset of
indoor and forest environments. This enables millimeter level pose estimation as
ground truth for odometry algorithms in both structured and unstructured environ-
ments, which can aid in researching low-drift odometry algorithms, accurate feature
tracking, and reduction of motion-induced distortions in the data. For large-scale
environment, where the MOCAP system is unavailable, we also provide location
information as a reference from SLAM methods [33]. In these settings, the higher-
resolution LiDAR OS0-128 can be used as a baseline for the other sensors. We
evaluate the SLAM algorithms in diverse environments, with a sample of environ-
ments shown in Figure 11. From the different SLAM methods further characterized
in the next section, those that use data from the OS0 sensor showcase the most robust
performance in a series of sampled sequences.
2.3.2 SLAM Assisted Ground Truth
To provide accurate ground truth for large-scale indoor and outdoor environments,
where the MOCAP system is unavailable or GNSS/RTK positioning result becomes
unreliable due to the multi-path effect, we propose a SLAM-assisted solid-state
LiDAR-based ground map generation framework.
Inspired by the prior map generation methods in [55], where a survey-grade 3D
imaging laser scanner Leica BLK360 scanner is utilized to obtain static pointclouds
of the target environment, we employed a low-cost solid-state LiDAR Livox Avia
and high resolution spinning LiDAR to collect undistorted pointclouds from environ-
ments. According to the Livox Avia datasheet, the range accuracy of the Avia sensor
is 2 cm with a maximum detection range of 480 m. Due to the non-repetitive scanning
pattern, the environment coverage of the pointcloud within the FoV increases with
20
Multi-modal Multi-LiDAR Sensors Dataset
time. Therefore, we integrated multiple frames when the platform was stationary to
get more detailed undistorted environmental sampling. Each integrated pointcloud
contains more than 240,000 points. The Livox built-in IMU is used to detect the
stationary state of the platform when the acceleration values are smaller than 0.01
𝑚/𝑠2 along all axes.
After gathering multiple undistorted pointcloud submaps from the target environ-
ment, the next step is to match and merge all submap into a global map by ICP. As the
ICP process requires a good initial guess, we employ a high resolution spinning Li-
DAR OS0 with a 360-degree horizontal FOV to provide raw position by performing
real-time SLAM algorithms. This process is outlined in Algorithm 1. A dense and
high-definition ground truth map can be obtained by denoising the map generated by
the algorithm described above to remove noise.
Let 𝒫𝑠𝑘 be the pointcloud produced by the spinning LiDAR, 𝒫𝑑𝑘 be the point-
cloud generated by solid-state LiDAR, and ℐ𝑘 be the IMU data from the built-in
IMU. Our previous work has shown high resolution spinning LiDAR has the most
robust performance in diverse environments. Therefore, LeGo-LOAM [14] is per-
formed with a high resolution spinning LiDAR (OS0-128) and outputs the estimated
pose for each submap.
The cached data 𝒮𝑐𝑎𝑐ℎ𝑒 stores submaps and the related poses. Let 𝒫𝑖 be the
pointcloud and related pose p𝑖 in 𝒮𝑐𝑎𝑐ℎ𝑒[𝑖]. The submap𝒫𝑖 will be first transformed to
map coordinate as 𝒫𝑚𝑖 based on estimated pose p𝑖; then GICP methods are employed
on 𝒫𝑚𝑖 to minimize the Euclidean distance between closest points against pointcloud
ℳ𝑎𝑝 iteratively; 𝒫𝑚𝑖 will be transformed by the transformation matrix generated
from GICP process, then merged to the map ℳ𝑎𝑝. The result map ℳ𝑎𝑝 is treated as
ground truth map.
After the ground truth map generated, we employ normal NDT method in [63]
to match the real-time pointcloud data from spinning LiDAR against the HD (High-
definition) map as the Figure 9 shows to get the platform position in ground truth
map. The matching result from the NDT localizer is treated as the ground truth.
The evaluation of the accuracy of the proposed ground truth prior map method
is challenging for some scenes in the dataset, as both GNSS and MOCAP systems
are not available in indoor environments such as long corridors. To evaluate the
generated ground truth, we adhere to the methodological approach delineated in the
referenced study [55]. We evaluated the standard deviations of the ground truth
generated by the proposed method during the first ten seconds when the device is
stationary from sequence Indoors09. The standard deviations along the 𝑋 , 𝑌 , and 𝑍
axes are 2.2 cm, 4.1 cm, and 2.5 cm, respectively, or about 4.8 cm overall. However,
evaluating localization performance when the device is in motion is more difficult.
To better understand the order of magnitude of the accuracy, we compare the NDT-
based ground truth 𝑍 values with the MOCAP-based ground truth 𝑍 values in the
sequence Indoor06 environment.
21
Qingqing Li
Algorithm 1: SLAM-assisted ICP-based prior map generation for ground
truth data.
Input:
Spinning LiDAR pointcloud: 𝒫𝑠𝑘
Solid-state LiDAR pointcloud: 𝒫𝑠𝑘
IMU data: ℐ𝑘
Output:
Platform state: p𝑘
Prior map: ℳ𝑎𝑝
while new 𝒫𝑠𝑘 do
p𝑘 ← 𝑆𝐿𝐴𝑀(𝒫𝑠𝑘);
// Cached still clouds and raw pose
𝒮𝑐𝑎𝑐ℎ𝑒 = {};
// Cached still cloud
𝒫𝑐𝑎𝑐ℎ𝑒 = [];
while new 𝒫𝑑𝑘 do
if ℐ𝑘.𝑉𝑎𝑛𝑔𝑢𝑙𝑎𝑟 < 𝑡ℎ𝑎, p𝑘.𝑉𝑙𝑖𝑛𝑒𝑎𝑟 < 𝑡ℎ𝑣 then
𝑠 = 𝑇𝑟𝑢𝑒;
𝒫𝑚 = 𝒫𝑚 + 𝒫𝑑𝑘;
else
𝑠 = 𝐹𝑎𝑙𝑠𝑒;
𝒫𝑐𝑎𝑐ℎ𝑒.𝑐𝑙𝑒𝑎𝑟();
𝒮𝑐𝑎𝑐ℎ𝑒 ← (𝒫𝑚,p𝑘);
while 𝒮𝑐𝑎𝑐ℎ𝑒.𝑠𝑖𝑧𝑒() > 0 do
ℳ𝑎𝑝 ← 𝐼𝐶𝑃 (𝒮𝑐𝑎𝑐ℎ𝑒,p𝑘,ℳ𝑎𝑝);
𝒮𝑐𝑎𝑐ℎ𝑒.𝑐𝑙𝑒𝑎𝑟();
22
Multi-modal Multi-LiDAR Sensors Dataset
2.3.3 Data Sequences
The different subsets of proposed dataset are divided into three categories based on
the environment: forest, indoor, urban outdoor. Table 2 lists all the sequences in our
dataset.
Three sequences are provided for the forest environment. The forest data is col-
lected at a forest in Turku, Finland (60°28′14.3”𝑁22°19′54.8”𝐸). The sequences
Forest01 and Forest02 are collected in winter time with snow-covered ground. For-
est01 includes a square-shaped trajectory, while in Forest02 the system is moved in
a straight trajectory. Both of these sets include MOCAP data. A larger-scale forest
recording is also provided in the Forest03 sequence, with Horizon and VLP-16 Li-
DARs mounted on a smaller, handheld device. These sequences can support research
in areas from tree-counting to tree stem diameter estimation. The vast difference in
environment structure from urban settings to forest settings can also support LiDAR-
based general-purpose odometry, localization and mapping algorithms.
The indoor environment then adds another dimension to the dataset with five data
sequences. The data is collected in rooms and open halls of ICT-City in Turku, Fin-
land. Three sequences are collected in a large experiment room where data from the
MOCAP system is available. From these, Indoor03 contains faster rotations and sud-
den movements, while positioning the sensors closer to objects in front and around.
In consequence, most of the solid-state LiDAR view is blocked by objects or walls,
presenting a significantly more challenging situation for odometry estimation algo-
rithms based primarily on scan matching methods. The data in Indoor01 is recorded
while maintaining a longer distance (≈ 50 𝑐𝑚) with objects and following a square-
shaped trajectory with a reduced number of rotations. The Indoor02 sequence then
features a circular trajectory with more rotation but again maintaining an even larger
distance to objects than in Indoor03. Sequences Indoor04 and Indoor05 correspond
to recordings in a large hall and long corridor environment, respectively.
Finally, two sequences of open-road environment around the ICT-City building
in Turku, Finland, are also included in this dataset. The length of Road01 is over
50 m, while the traversed length of the trajectory in Road02 is about 500 m.
2.3.4 Data Format
The data is collected in ROS and saved with the rosbag format, which has become
a standard in the robotics research community. Sampled data frames from a subset
of the sensors is shown in 2. The detailed data format for each type included in the
dataset is listed as follow:
1. Point cloud from spinning LiDARs from the three spinning LiDARs, namely
VLP-16, OS0-128 and OS1-64. The sensor message type from spinning Li-
DARs is recorded as 𝑠𝑒𝑛𝑠𝑜𝑟 𝑚𝑠𝑔𝑠 :: 𝑃𝑜𝑖𝑛𝑡𝐶𝑙𝑜𝑢𝑑. Each point in the point
23
Qingqing Li
Figure 9. NDT localization with ground truth map. External view and Internal view when
the current laser scan (orange) is aligned with the Ground truth map (blue).
cloud holds four values (𝑥, 𝑦, 𝑧, 𝐼), where 𝑥, 𝑦, 𝑧 represent the local Cartesian
coordinates, and 𝐼 is the laser reflectance of the point measured.
2. Point cloud from solid-state LiDAR from the two solid-state LiDARs, namely
Avia and Horizon. The message type of these solid-state LiDARs in the ros-
bags is Livox’s custom data format named 𝑙𝑖𝑣𝑜𝑥 𝑟𝑜𝑠 𝑑𝑟𝑖𝑣𝑒𝑟/𝐶𝑢𝑠𝑡𝑜𝑚𝑀𝑠𝑔.
The customized message keeps the first point’s timestamp of each frame as the
base time and then provides an offset time relative to the base time for each
point. This is needed as the non-repetitive pattern does not allowed for a pos-
teriori estimation, unlike the spinning LiDARs, in which we can estimate the
time difference between points based on the settings of the mechanical parts.
With this information, the de-skew process can then be conducted on the data
to compensate for the distortion in the point cloud data caused by the sensor’s
egomotion [31]. We have maintained this message type that contains time in-
formation for each point for algorithms that include in the processing flow the
de-skew of point cloud data and other related research. However, standard
ROS messages simplify the visualization of the point cloud with tools such
24
Multi-modal Multi-LiDAR Sensors Dataset
as Rviz, and provide a format that many other LiDAR processing algorithms
relying on standard ROS messages use [16]. Therefore, we provide format
conversion tools to transform the Livox custom message data to the ROS stan-
dard message type 𝑠𝑒𝑛𝑠𝑜𝑟 𝑚𝑠𝑔𝑠 :: 𝑃𝑜𝑖𝑛𝑡𝐶𝑙𝑜𝑢𝑑. Each point is then converted
to a new one that holds five values (𝑥, 𝑦, 𝑧, 𝐼, 𝐶), where 𝑥, 𝑦, 𝑧 is the local
Cartesian coordinate set, 𝐼 is the intensity of the point, and where the integer
part of 𝐶 represents the line number and the decimal part the point timestamp.
3. Images from LiDAR camera. The RealSense L515 LiDAR camera is config-
ured to publish RGB images with a size of 1920×1080, and depth images with
a size of 1024×768. The message type is 𝑠𝑒𝑛𝑠𝑜𝑟 𝑚𝑠𝑔𝑠 :: 𝐼𝑚𝑎𝑔𝑒 at frequency
10 Hz. The depth estimations are aided by the built-in LiDAR sensor.
4. Images from high-resolution spinning LiDAR. The two high resolution Li-
DAR from Ouster, OS0-128 and OS1-64, can output fixed-resolution range
images, near-infrared images captured by the laser sensor, and signal images.
In these, each pixel represents the distance from the sensor origin to the point,
the strength of the light captured, and the object’s reflectivity, respectively. The
images are published at frequency of 10 Hz. The image data is spatially corre-
lated, with 16 bits per pixel and a linear photo response. The message type in
the rosbags is the standard 𝑠𝑒𝑛𝑠𝑜𝑟 𝑚𝑠𝑔𝑠/𝐼𝑚𝑎𝑔𝑒.
5. Inertial data from spinning and solid-state LiDARs. There are in total four
built-in 6-axis IMU sensors with 3-axis gyroscope and a 3-axis accelerometer,
one in each of the Ouster and Livox LiDARs. They publish data at a frequency
of 100 Hz in the former and 200 Hz in the latter. The data type of IMU data in
the rosbags is again ROS’ standard 𝑠𝑒𝑛𝑠𝑜𝑟 𝑚𝑠𝑔𝑠 :: 𝐼𝑚𝑢.
6. Ground truth data. The ground truth data from the MOCAP system is in-
cluded in rosbags as 𝑔𝑒𝑜𝑚𝑒𝑡𝑟𝑦 𝑚𝑠𝑔𝑠 :: 𝑃𝑜𝑠𝑒𝑆𝑡𝑎𝑚𝑝𝑒𝑑 messages. They are
obtained from the computer driving the set of OptiTrack cameras through a
VRPN (Virtual-Reality Peripheral Network) connection.
2.4 SOTA Algorithms Benchmarking
As a part of the dataset, we have evaluated several state-of-the-art SLAM algorithms
on the different sequences. Through the rest of this section we discuss the best meth-
ods for different types of LiDAR data and environments.
25
Qingqing Li
Table 4. ATE (𝜇/𝜎) of selected SLAM methods (N/A when odometry estimations diverge).
Best results for each sequence are in bold.
Sequence FLIO OS0 FLIO OS1 FLIO Avia FLIO Hori LeGo Velo LIOL Hori
Indoor01 0.11 / 0.07 0.12 / 0.04 0.58 / 0.3 0.65 / 0.24 0.22 / 0.19 N/A
Indoor02 0.17 / 0.12 0.34 / 0.21 0.70 / 0.20 N/A 0.48 / 0.17 N/A
Indoor03 0.16 / 0.09 0.21 / 0.08 N/A N/A 0.38 / 0.23 N/A
Forest01 0.14 / 0.05 0.13 / 0.04 0.10 / 0.03 0.09 / 0.03 0.12 / 0.05 0.04/ 0.01
Forest02 0.13 / 0.07 0.12 / 0.06 0.09 / 0.03 0.11 / 0.05 0.31 / 0.05 0.07/ 0.04
Figure 10. Qualitative comparison of the mapping quality using different LiDAR and
SLAM algorithms. Bottom row shows in (e) to (j) the Horizon-based LIOL, Horizon, Avia,
OS0, and OS1-based FLIO, and Velodyne’s LeGo-LOAM maps, respectively.
2.4.1 LiDAR Odometry
Different LiDAR SLAM methods have been employed in our experiments. The
FAST-LIO (FLIO) algorithm [33], a LiDAR-inertial odometry system that works for
both spinning LiDAR and solid-state LiDAR, has been applied on Ouster LiDARs
and Livox LiDARs leveraging the built-in IMUs. The objective here is to compare
the performance of the same SLAM method applied to data from different types of
LiDARs. In addition to FLIO, LeGo-LOAM1 has been applied to data from the
Velodyne LiDAR [14]. Another SLAM system, LIO-LIVOX (LIOL)2, a tightly cou-
pled SLAM that specifically developed for Horizon LiDAR, has also been tested on
Horizon LiDAR data.
The estimated trajectories are visualized in Figure 11. The plots are two-dimensional
1https://github.com/RobustFieldAutonomyLab/LeGO-LOAM
2https://github.com/Livox-SDK/LIO-Livox
26
Multi-modal Multi-LiDAR Sensors Dataset
to improve readability as the changes in the vertical coordinate are minimal. Full data
of the reconstructed paths is available in the dataset repository. From the results, one
of the first conclusions is that the solid-state LiDAR-based SLAM system performs
as well as or even better than the spinning LiDARs with the appropriate algorithms
in the outdoors environments, but perform significantly more poorly in the indoor
environments.
For the outdoor sequences, Forest01, Forest02, and Road02, all SLAM methods
perform well, and the trajectories are completed without major disruptions. For the
indoor sequence Indoor01, Avia- and Horizon-based FLIO are able to reconstruct the
sensor trajectory but show that significant drift accumulates. In the same sequence,
Horizon-based LIOL fails to reconstruct even the first loop in the trajectory. Similar
behaviour is observed in the Indoor02 sequence, with all the solid-state LiDARs
failing completely in Indoor03. In all of these sequences, all the methods applied
to spinning LiDARs perform satisfactorily. This result can be expected as they have
full view of the environment, which has a clear geometry.
For the sequence Indoor04 showcasing a long corridor, all the spinning LiDARs
can again reconstruct a complete trajectory. The best performance is obtained from
OS0-based FLIO and Velodyne-based LeGo-LOAM, with correct alignment between
the first and last location. However, angular drifts accumulates with OS1-based
FLIO, while Horizon- and Avia-based algorithms result in diverging odometry es-
timations.
In addition to the qualitative trajectory analysis, we also provide a quantitative
analysis of the odometry error based on the MOCAP-based ground truth data in
Table 4. Absolute trajectory errors (ATE) [64] are employed as an evaluation met-
ric. All trajectories are transformed to the local coordinate reference of the MOCAP
markers, and aligned with global ground truth data reference. Then, we calculate
ATE using the EVO toolset 3. Methods based on spinning LiDAR data clearly show
performance indoors, with naturally lower error as the vertical resolution increases.
However, in the forest environment solid-state LiDARs demonstrate superior perfor-
mance, with LIOL featuring an ATE error as low as 4 cm mean error.
In summary, the above results show that spinning LiDARs are more stable across
different environments, while the solid-state LiDARs show significantly better cost-
performance ratio in some outdoors environments.
2.4.2 Mapping Quality
In the last part of our analysis, we compare the mapping quality generated from dif-
ferent LiDARs in urban open road environments as shown in Figure 10. From the
figure, we can observe that the LIOL method applied to solid-state LiDAR presents
3https://github.com/MichaelGrupp/evo.git
27
Qingqing Li
Figure 11. Estimated trajectory results using different LiDAR and SLAM algorithms. First
row: Indoor01, Indoor02. Second row: Indoor03, Forest02. Third row: Forest01, Road02.
Bottom row: Indoor04, Indoor05.
28
Multi-modal Multi-LiDAR Sensors Dataset
the most detailed and clear map. It is worth noting that these maps have been gen-
erated with default configuration of the methods and without changing parameters
such as the map update frequency. This result matches the quantitative results ob-
tained with the same sensors and algorihtms in the forest environment. More results
are available in our project page.
Figure 12. Qualitative comparison of the mapping quality using different LiDAR and
SLAM algorithms. The first row from left to right shows RGB full view image, full view
Horizon-based LIOL and close view RGB image. The second row from left to right shows
OS0, OS1, Velodyne, Avia and Horizon-based FLIO. The bottom row from left to right
shows the Horizon-based LIOL, Horizon, OS1-based LLOM and LLOMR, Velodyne’s
LeGo-LOAM maps and Horizon-based LVXM, respectively.
From Figure 10,we can observe that the LIOL method applied to solid-state Li-
DAR presents the most detailed and clear map. It is worth noting that these maps
have been generated with the default configuration of the methods and without chang-
ing parameters such as the map update frequency. This result matches the quantita-
tive results obtained with the same sensors and algorithms in the forest environment.
As shown in Figure 12, Horizon-based LIOL has the best mapping ability, but if
the environment (such as sequence indoors 06-09) is complex, LIOL will fail to map
due to drift. In addition, OS0 and OS1-Based FLIO also have good mapping ability,
thanks to the wide FOV and excellent resolution of OS0 and OS1. Compared to OS0
and OS1, Velodyne has poorer mapping ability due to its lower vertical resolution,
and it has almost failed to reconstruct the letter B sign in Figure 10. LVMX, LLOM,
and LLOMR focus on calculating the mobile platform’s pose estimation rather than
point cloud mapping ability, so the point cloud maps they reconstructed are relatively
poor.
29
Qingqing Li
2.4.3 Runtime Evaluation
We conducted this experiment on 4 different platforms. The first platform (1) was
a Lenovo Legion Y7000P with 16 GB RAM, a 6-core Intel i5-9300H (2.40 GHz)
and an Nvidia GTX 1660Ti (1536 CUDA cores, 6 GB VRAM). The second (2) plat-
form was the Jetson Xavier AGX, a popular computing platform for mobile robots,
which has an 8-core ARMv8.2 64-bit CPU (2.25 GHz), 16 GB RAM and 512-core
Volta GPU. From its 7 power modes, we chose MAX and 30 W (6 core only) modes.
The third (3) platform was the Nvidia Xavier NX which is a common embedded
computing platform with a 6-core ARM v8.2 64-bit CPU, 8 GB RAM, and 384-core
Volta GPU with 48 Tensor cores. We chose the 15 W power mode (all 6 cores) for
the NX. The fourth (4) platform was the UP Xtreme board featuring an 8-core Intel
i7-8665UE (1.70 GHz) and 16 GB RAM.
Table 5. Average run-time resource (CPU/RAM) utilization and performance (pose
calculation speed) comparison of selected SLAM methods across multiple platforms. The
data is played at 15 times the real speed for the pose publishing frequency. CPU utilization
of 100% equals one full processor core.
( CPU utilization (%), RAM utilization (MB), Pose publication rate (Hz) )
Intel PC AGX MAX AGX 30W UP Xtreme NX 15W
FLIO OS0 (79.4, 384.5, 74.0) (40.9, 385.3, 13.6) (55.1, 398.8, 13.2) (90.9, 401.8, 47.3) (53.7, 371.1, 14.3)
FLIO OS1 (73.7, 437.4, 67.5) (54.5, 397.5, 21.2) (73.9, 409.2, 15.4) (125.9, 416.2, 58.0) (73.3, 360.4, 14.2)
FLIO Velo (69.9, 385.2, 98.6) (44.4, 369.7, 29.1) (58.3, 367.6, 21.4) (110.5, 380.5, 89.6) (57, 331.5, 19.5)
FLIO Avia (65.0, 423.8, 98.3) (40.8, 391.5, 32.3) (47.4, 413.4, 24.5) (113.2, 401.2, 90.7) (51.2, 344.8, 21.9)
FLIO Hori (65.7, 423.8, 103.7) (37.6, 408.4, 34.7) (50.5, 387.9, 26.8) (109.7, 422.8, 91.0) (47.5, 370.7, 23.4)
LLOM Hori (126.2, 461.6, 14.5) (128.5, 545.4, 9.1) (168.5, 658.5, 1.5) (130.1, 461.1, 12.8) (N / A)
LLOMR OS1 (112.3, 281.5, 25.8) (70.8, 282.3, 9.6) (107.1, 272.2, 6.5) (109.0, 253.5, 13.6) (N / A)
LIOL Hori (186.1, 508.7, 19.1) (247.2, 590.3, 9.6) (188.1, 846.0, 4.1) (298.2, 571.8, 14.0) (239.0, 750.5, 4.54)
LVXM Hori (135.4, 713.7, 14.7) (162.3, 619.0, 10.5) (185.86, 555.81, 5.0) (189.6, 610.4, 7.9) (198.0, 456.7, 5.5)
LEGO Velo (28.7, 455.4, 9.8) (42.4, 227.8, 7.0) (62.8, 233.4, 3.5) (39.7, 256.6, 9.1) (36.9, 331.4, 3.7)
These platforms all run ROS Melodic on Ubuntu 18.04. The CPU and mem-
ory utilization is measured with a ROS resource monitor tool 4. Additionally, for
minimizing the difference of the operating environment, we unified the dependen-
cies used in each SLAM system into same version, and each hyperparameter in the
SLAM system is configured with the default values. The results are shown in Table 5.
The memory utilization of each selected SLAM approach among the two proces-
4https://github.com/alspitz/cpu monitor
30
Multi-modal Multi-LiDAR Sensors Dataset
sor architectures platforms is roughly equivalent. However, the CPU utilization of the
same SLAM algorithm running on Intel processors is generally higher than the other
algorithms, and also the highest publishing frequency is obtained. LeGO LOAM has
the lowest CPU utilization but its accuracy is towards the low end (see Table 5), and
has a very low pose publishing frequency. Fast-LIO performs well, especially on em-
bedded computing platforms, with good accuracy, low resource utilization, and high
pose publishing frequency. In contrast, LIO LIVOX has the highest CPU utiliza-
tion due to the computational complexity of the frame-to-model registration method
applied to estimate the pose.
A final takeaway is in the generalization of the studied methods. Many state-of-
the-art methods are only applicable to a single LiDAR modality. In addition, those
that have higher flexibility (e.g., FLIO) still lack the ability to support a point cloud
resulting from the fusion of both types of LiDARs.
2.5 Summary and Conclusions
This chapter presented a novel dataset collected with a multi-modal LiDAR sensor
system in diverse environments. The dataset includes data from LiDARs of differ-
ent types (spinning and solid-state), resolution (16, 64 and 128 channels for spinning
LiDARs) and scan patterns (for two different solid-state LiDARs), in addition to a Li-
DAR camera. This opens the door to future research in general-purpose algorithms,
as our analysis shows that different algorithms clearly perform better in one or an-
other type of LiDAR, if they are able to process the data at all. There is therefore a
significant gap to be filled in more robust LiDAR odometry, localization and mapping
algorithms. To aid in analyzing the drift of the algorithms, we have provided ground
truth data both indoors and in a forest environment. For comparison of the mapping
quality mainly, we also provide data from larger indoor halls and urban roads. The
experiments have covered nine sequences across four computing platforms.
Overall, we found that in both indoor and outdoor environments, the spinning
LiDAR-based FLIO exhibited good performance with low power consumption, which
we believe is due to the ability of the spinning LiDAR to obtain a full view of the
environment. However, in the forest environment, the LIOL algorithm based on solid-
state LiDAR It has the best accuracy and mapping quality performance, although it
has the highest power consumption due to sliding window optimization. These find-
ings led to the proposal of a novel multi-modal multi-LiDAR-Inertial SLAM frame-
work in Chapter 5, leveraging the strengths of various LiDAR sensor modalities.
31
3 Accurate Localization in Large-scale
Forest Environments
Accurate global localization on a given map is critical for autonomous systems as it
ensures safe navigation and robust performance in various conditions by taking ad-
vantage of prior knowledge from the map. While LiDAR-based SLAM algorithms
excel in simultaneously creating maps and estimating poses relative to their original
position, there is an additional need to accurately locate the robot globally on an exist-
ing map for long-term navigation, mapping, and multi-robot cooperation tasks [65].
In structured environments like urban roads, where perception systems can detect
abundant identifiable edge and plane features from ground structures, buildings, and
traffic signs, global localization becomes more feasible. However, unstructured envi-
ronments such as forests pose unique challenges, where irregular terrain and visually
similar surroundings render conventional ICP-based algorithms ineffective in such
settings [13; 12]. Moreover, GNSS accuracy is significantly hindered by the multi-
path effect occurring under forest canopies.
In this chapter, our objective is to enable fast, real-time global localization on
prior map in forest environment for autonomous driving vehicle in forest, such as
forest harvesters and forwarders. In a forest harvesting operation, harvesters and for-
warders travel through undefined paths. Moreover, the detection of potential features
(ground and tree stems) is considerately affected by irregularities (branches and fo-
liage in the trees, and small plants, rocks or fallen trees on the ground). In this work,
we pursue the design and development of a localization method which is reliable,
accurate and of low computational cost. The computation is aimed to occur in real
time on-board of harvesters and forwarders in the forest with canopy cover. The fi-
nal hardware to be utilized in a real application would have similar capabilities to
the hardware we have used in the experiments reported in this work. Relatively ran-
domly located but ubiquitous tree stems are a natural basis for localization efforts.
We also assume that a prior map of the forest in the form of a pointcloud is available
before the mission starts.
In this chapter, we proposed a lightweight and graph-based localization method
for pose estimation within a global map of a harvester in dense and unstructured
forest environments. The proposed method is lightweight and matches a triangular-
ization of a single LiDAR scan with a subgraph of the triangularization of the global
forest map. Compared to matching raw LiDAR data point-by-point, we reduce the
32
Accurate Localization in Large-scale Forest Environments
problem to a triangularization created based on the positions of tree stems. In our
experiment setting where a 16-channel 3D LiDAR with a field of view of about
210°and situated at a height of 1.5 m., individual trees get an average of 100 laser
hits. The proposed triangularization process is able to reduce the number of points
to be matched to approximately 1% of the size of a single LiDAR scan, significantly
improving computation efficiency.To the best of our knowledge, this is the first work
to utilize Delaunay triangulation for the purpose of localizing a vehicle in a forest
environment. Moreover, we provide a fully experimental approach with a realistic
use case in forest harvesting.
3.1 Autonomous Driving Meet Forest
3.1.1 Autonomous Vehicles in Forest
Over the last ten years, autonomous harvesting and transportation have become the
long-term goal of the future development of the forest industry and attracted the in-
terest of the research community. Naturally, there are several intermediate goals such
as the organization of the public data [66], and the gradual increase of autonomy in
varying degrees, starting with short-range transport in forests [67]. Incremental ad-
vances in forest autonomy include driver assistance platforms and function-specific
automation. For example, these include the automated selection of tree stems to
be processed, micro-tasks such as sequencing the processing of individual trees, lo-
cal route planning, or semi-autonomous transportation and quality assurance. In all
these tasks, one key element is the availability of tree maps, together with methods
enabling the identification and selection of individual trees. In addition, local route
planning requires accurate updates on the position of the harvester inside the forest in
real-time, which can not be relied on global navigation satellite system (GNSS) sen-
sors [68; 69]. An autonomous path tracking based on GNSS only was demonstrated
in [70] using an industrial vehicle in a scene with clear-cut forest. To actually operate
autonomously, and to locate and collect logs, more complex environments need to be
analyzed. A summary of autonomous robots in cross-country terrain is [71], which
includes also military applications. There seems to be not many attempts to have
autonomous forwarders under full canopy.
There are multiple well-established frameworks and algorithms for autonomous
driving in urban environments [72], as well as localization and mapping in roads
and buildings [73]. Many of the solutions proposed in these areas have a strong
dependency on LiDAR scanners [74], among other sensors. In the field of forest
mapping and navigation, several researchers have utilized terrestrial laser scan (TLS)
to build point-cloud maps [75; 76; 77; 78; 79]. In some of these works [75; 77], data
are collected from fixed-positioned tripods to gain an understanding of how well
individual trees can be detected and their diameter at the breast height (DBH) can
33
Qingqing Li
be registered. Other studies rely on mobile laser scanning (MLS) [76; 78; 79] to
simulate the operation of the harvester.
Navigating in a forest presents several inherent challenges owing to the complex-
ity and lack of structure in the environment [80]. A realistic forest harvesting process
involves constant obstructions from cut or fell branches and trunks and irregular mi-
crorelief and terrain. Furthermore, the movement of a harvester tends to be rotational
most of the time, with constant changes in orientation along the work cycle and very
slow translational motion or even with no movement at all while trees are being
cut. These particularities of forest navigation bring both advantages and disadvan-
tages to standard development of autonomous mobile robots: idle periods and slow
motion aid at realizing real-time operation and data processing, while fast rotations
hinder accurate matching of scan data (odometry) owing to a large variability in the
distance between near and far objects, and multiple similar objects in different direc-
tions. A small error in the orientation estimation can significantly affect the mapping
of trees that are farther away [76; 81].
This work focuses on the real-time localization of an autonomous forwarder unit,
a forestry vehicle that collects felled logs and hauls them to a local loading area.
The required transport distance is usually short, in the range of 100–400 m, and hap-
pens along a rough-terrain track a forest harvester has previously made while per-
forming logging. To autonomously navigate in the forest, visual sensors present
significant challenges owing to the lack of a stable background from which con-
tours could be extracted, as well as demanding low-light and harsh weather condi-
tions [82]. Forest environments present difficult light conditions during winter and at
night, therefore we considered only the LiDAR technology. Nonetheless, it is worth
noting that visual-inertial odometry [41], and other methods for estimating structure
from motion [83], have seen significant advances over the past few years. For in-
stance, in [83] the authors report improvements in day-night recognition stability.
This work required long range up to 60 m for location and wide field of view, and we
have not tested any approach based on visual sensors.
3.1.2 LiDAR-Based Localization
Autonomous mobile robots meant to operate outdoors often rely on GNSS sensors
as the basic source of global localization data, and then integrate other sensor data
through sensor fusion techniques for local position estimation [74]. GNSS sensors
alone do not provide enough accuracy in urban or dense environments, with accuracy
often in the order of meters [84]. While GNSS sensors have increased their accuracy
in recent years, multiple challenges remain in environments with structures affecting
the signal path. In particular, GNSS signals are weak in forests, owing to the irregular
land contours and the coverage that tree foliage provides [85]. Therefore, we focus
on the design and development of localization methods that rely solely on LiDAR
34
Accurate Localization in Large-scale Forest Environments
data, with GNSS providing only the initial position before entering the forest or
starting the autonomous operation. The following are the main approaches enabling
the estimation of the movement of a mobile robot based on LiDAR scans by either
comparing consecutive scans (hereinafter also called point clouds) or a given scan
with a global point cloud. We refer to comparing scans as scan matching:
Full scan matching. In SLAM algorithms, a necessary step previous to update
the map is to estimate the relative movement of the robot with respect to its last
known position. This is equivalent to finding a geometrical transformation between
the corresponding point clouds. One of the most widely utilized methods for calculat-
ing such transformation is the iterative closes point (ICP) algorithm [86; 87]. In ICP,
pairs of points between the two point clouds being compared are iteratively selected
until a transformation with acceptable pairing error distance is found. In this case the
acceptable error is chosen to be 0.6 m, which is 30% of the mean nearest neighbor
tree distance. The mean nearest neighbor distance is a dynamical entity (2.0 m in this
case), which can be observed from the global map in each case.
Several variants of the ICP algorithm are available through the open-source point
cloud library (PCL) [88]. Other popular methods include the perfect match (PM) [89],
or normal distribution transforms (NDT) [90] algorithms. Most of these algorithms
can be extended and utilized for map-based localization, where a global point cloud
is available and a given LiDAR scan is matched with only a subset of the global
map. However, in these cases, an estimation of the previous position is needed, since
otherwise a standard ICP routine could enter a global search phase, which is compu-
tationally much more costly due to convergence safeguards, etc. See e.g., [91] about
the intricacies of the guaranteed global convergence. Since our objective is to enable
global localization without complete dependence on previous states, we develop a
novel method that takes into account the geometry of the environment instead of
using the full point cloud.
Feature-based matching. Generic point cloud matching algorithms do not take
into account the structure of any potential features within the scans that are be-
ing compared. When the environment where robots operate has known structures,
feature-based scan matching can significantly enhance the accuracy of the matching
process [92; 93; 94]. Moreover, owing to the preprocessing step in which raw data is
transformed into features, the computational time required to calculate the transfor-
mation can be reduced. In this direction, Zhang et al. presented the LiDAR odometry
and mapping (LOAM) method [13], which assumes a structured environment where
planes and edges can be detected. Then, the transformation between two point clouds
can be estimated by aligning the planar and edge feature points from each of them.
Multiple algorithms have extended the original LOAM method for other types of
features. Among them, Shan et al. presented the ground-optimized Lego-Loam [14],
which delivers optimized real-time three-dimensional point cloud matching in small
scale embedded devices. While feature-based matching algorithms yield accurate re-
35
Qingqing Li
sults, their applicability is limited in forests, with most methods focusing on urban or
indoor environments. In forests or other unstructured environments, edge and planar
features are either noisy or missing. In this work, we still rely on the state-of-the-art
Lego-Loam algorithm for building a map of the forest prior to the operation of the
robots. This process is done offline. However, we develop an alternative method for
localization because of our needs for real-time operation and global localization.
Geometrical matching methods. Geometry-based localization algorithms have
the potential to recognize the position in constant time, which is an attractive property
towards real-time localization with embedded processors. Geometrical methods are
often applied to graph-like structures to find geometric transformations that match
either the complete graph or a set of subgraphs. Thrun et al. developed the notions
of sparse extended information filters (SEIFs) which exploit the structure inherent
through the local web-like networks of features maps [95; 96]. In a similar direction,
Ulrich et al. proposed a global topological representation of places with object graphs
serving as local maps for place classification and place recognition [97; 98; 99].
Among the most relevant approaches to our work are place-recognition algorithms,
such as Bosse’s work on a keypoints similarities voting method in 2D and 3D LiDAR
data [100; 101; 102].
3.2 Delaunay Triangulation Based Localization
In this section we introduced the proposed localization algorithm. Figure 13 shows
an overview of our proposed method. The complete system takes as a sole input
the point cloud data from the 3D LiDAR, and outputs the position and orientation
estimation in the reference of the given global map. The system can be divided
into six steps as illustrated in Figure 14. The former two steps are done offline by
processing the point cloud data defining the global forest map. These steps have
a computational complexity that grows linearly with the number of LiDAR points
(first step) and with the number of trees in the map (second step). The latter four
steps are then done online on-board the harvester to estimate its position in real time.
The steps are:
1. Point-cloud trunk segmentation from the global map (offline).
2. Delaunay triangulation of the global map from the segmented trunk points
(offline).
3. Aggregation of 3D LiDAR scans into a local point cloud defining the robot’s
position (online).
4. Segmentation of trunks from the local point cloud (online).
5. Delaunay triangulation from the local segmented trunk points (online).
36
Accurate Localization in Large-scale Forest Environments
Real-time	Lidar
Point	Cloud Lidar	Mapping
Point	Cloud
Forest	Map Trunk	Extraction
Delaunay
Triangulation
Location
Estimation
Trunk	Extraction DelaunayTriangulation
(ONLINE)
(OFFLINE)
Figure 13. System overview: sensor data is matched with a global map to produce a global
position estimation. Consecutive runs are independent.
6. Estimation of a geometrical transformation that matches the local Delaunay
triangulation with a subset.
A 2D Delaunay triangulation (DT) process [103] takes in a set of planar points
𝑉 ⊂ R2 and produces a triplet 𝐺 = (𝑉,𝐸, 𝑇 ), where 𝐸 ⊂ 𝑉 2 is a set of edges and
𝑇 ⊂ 𝑉 3 is a set of triangles with the so called Delaunay property i.e., the circum-
scribed triangle associated with each triangle 𝑡 ∈ 𝑇 contains no points 𝑣 ∈ 𝑉 others
than the three vertex points of the triangle 𝑡.
The first step in the proposed method takes a global point cloud 𝑃𝐶𝑚𝑎𝑝 =
{𝑃𝑖}𝑖=1...𝑛𝑚 ⊂ R3, representing the map of the forest, and produces a robust set of
trunk positions 𝑃𝐶𝑡𝑟𝑢𝑛𝑘𝑠 = {𝑃𝑖}𝑖=1...𝑛𝑡 ⊂ R2. The accuracy of these positions de-
pends on the accuracy of the LiDAR being utilized and the size of the trees, as they
are calculated as the Euclidean average of all LiDAR points defining a tree trunk.
Note that 𝑃𝐶𝑡𝑟𝑢𝑛𝑘𝑠 is not a subset of 𝑃𝐶𝑚𝑎𝑝. The second step in our method then
subjects the horizontal plane projection 𝑃𝐶𝑚𝑎𝑝 to the Delaunay triangularization
graph 𝐺𝑚𝑎𝑝 = (𝑉𝑚𝑎𝑝, 𝐸𝑚𝑎𝑝, 𝑇𝑚𝑎𝑝) for matching. As we mentioned earlier, both of
these steps happen offline before the robot is deployed, or whenever the robot gets a
global map update. A global map update only happens when the robot changes to a
new location, or enters an area from which it previously had no information.
The third, fourth and fifth steps, which happen online, cover the generation of
a local, real-time Delaunay triangulation of the trees within the field of view of the
robot. First, we accumulate and aggregate several raw point cloud scans from the 3D
LiDAR and generate a local point cloud 𝑃𝐶𝑙𝑜𝑐𝑎𝑙 ⊂ R3. The aggregation relies on
real-time LiDAR odometry from LOAM [13]. Then, following the same procedure
as with the global map, we generate a robust two-dimensional set of trunk positions
𝐿𝑡𝑟𝑢𝑛𝑘𝑠 ⊂ R2. From the set of trunks, we can define the local DT graph 𝐺𝑙𝑜𝑐𝑎𝑙 =
(𝑉𝑙𝑜𝑐𝑎𝑙, 𝐸𝑙𝑜𝑐𝑎𝑙, 𝑇𝑙𝑜𝑐𝑎𝑙).
Finally, in the sixth step of the process we calculate a rigid body transformation
(also called Euclidean transformation or Euclidean isometry), defined by a rotation
𝜃 and a translation 𝑝𝑡, between the local DT graph 𝐺𝑙𝑜𝑐𝑎𝑙 and a subset of the global
DT graph 𝐺𝑚𝑎𝑡𝑐ℎ ⊂ 𝐺𝑚𝑎𝑝. The transformation relates directly to the robot posi-
tion and orientation in the global map. Instead of matching large quantities of the
37
Qingqing Li
3D point cloud, the proposed system seeks to match a triangularized representation
of the local data against a similar precomputed triangularized global representation.
The resulting process is computationally efficient and with a low memory demand.
The approach allows noise in the trunk detection in that the set of triangles that
we consider for matching are selected to be best matches (different sets at each loca-
tion), and the final transformation is averaged by this set.
In the rest of this section, we further describe the process outlined above and
delve into the details of the most critical steps. In general, the key idea is building a
unique 2D graph topology from the 3D point cloud map, and finding the best match
between local and global topology.
3.2.1 Trunk Point Cloud Segmentation
With a given map 𝑃𝐶𝑚𝑎𝑝 of the forest, an essential step in the proposed method is
to extract the trunks points that define 𝑃𝐶𝑡𝑟𝑢𝑛𝑘𝑠 as a set of landmarks from 𝑃𝐶𝑚𝑎𝑝.
Compared to the other environmental features such as branch structures, forest floor
vegetation and the bushes in the forest, tree trunks are a natural choice as geomet-
ric features.
To segment the trunk points, the first step is extracting the trunk point cloud
𝑃𝐶𝑡𝑟𝑢𝑛𝑘𝑠 from the map point cloud 𝑃𝐶𝑚𝑎𝑝. We employ the Kd-Tree space partition-
ing structure to accelerate the neighborhood search [104], together with a Euclidean
Cluster to find the trunk point cloud [105]. Instead of focusing on finding every trunk
in the 𝑃𝐶𝑚𝑎𝑝, we focus on extracting the most significant ones. We define the most
significant trunks in this work as those that show observational stability, i.e., they
can be assumed to be easy to observe by the robot in the near future locations. Thus,
we do not consider small trunks or bushes, which are not a worthwhile computa-
tional investment to locate and give them a landmark status. As a typical trunk is
an approximately vertical object, we assume that the point from a stable trunk has
a neighboring point located 2𝑚 above it. This naturally limits the detected trunks,
but allows the rest of the algorithm to proceed, since it is enough to get a number of
matches from the most prominent objects for the further triangle matches to succeed.
Another point is that the trunk detection scheme can be altered in a modular fashion
in the future.
As outlined in Algorithm 2, our objective extract all the tree points which are
above the ground from 𝑃𝐶𝑚𝑎𝑝. We traverse all points in the 𝑃𝐶𝑚𝑎𝑝, and for each
point 𝑝𝑖 = (𝑥𝑖, 𝑦𝑖, 𝑧𝑖) ∈ 𝑃𝐶𝑚𝑎𝑝 we find the nearest neighbor of a hypothetical point
𝑝′𝑖 = 𝑝𝑖 + 𝑒3𝑡ℎ, where 𝑒3 represents the vertical unit vector and 𝑡ℎ = 2m above
the point 𝑝𝑖. For all such points 𝑝𝑖 which do have other points above them, we set
the z-axis value to zero and add them to the set 𝑃𝑡. This latter set thus contains
the projections of all the points belong to the tree into two dimensions. Following
this, we remove the branch points in 𝑃𝑡. To do this, we traverse all points in 𝑃𝑡,
38
Accurate Localization in Large-scale Forest Environments
1)	Pre-processing	:	point	cloud	global	trunk	segmentation 2)	Pre-processing	:	global	DT	graph	generation
3)	Local	submap	building 5)	Local	DT	graph	generation
6)	Geometrical	transformation	estimation
Match	the	similar	corresponding
Triangles	vertices,	estimate	the
transltion	and	orientation	between
local	and	global
P2
P1
P3
P4
P6
S1
S0
S2
S3
P5
4)	Local		trunk	segmentation
Search	the	similar	Triangls
on	the	local	DT	graph	and	
global	DT	graph
x/m
y/m
x/m
y/m
Local and Global DT graph Matched Result Matched DT trangles in Global map
Figure 14. Schematic representation of the proposed method to predict the transformation
between a local observation sub-map and the global map.
and find all the points 𝑝 that have at least one other neighbor point at a distance less
than 𝑡𝐻 < 5 cm. All points meeting the aforementioned condition are added to the
set 𝑃𝐶𝑡. Then, we form Euclidean clusters (formed by close points) and drop the
clusters that have a size in points smaller than a threshold value 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑.
This value has been defined manually taking into account the existing knowledge
on the type of trees in the objective forest area. If larger or smaller trees of inter-
est are present in the forest, then this must be taken into account and the value of
𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 adjusted accordingly. Finally, we compute the mean 𝑝 ∈ R2 of
the horizontal projections of the cluster points to represent a landmark trunk. These
39
Qingqing Li
landmarks then compose the 2D trunk map 𝑀𝑡𝑟𝑢𝑛𝑘𝑠. Figure 15 show the proposed
trunk point cloud segmentation result. Figure 15a shows the map to be processed,
Figure 15b shows the extracted point cloud after removing the ground point cloud
and branch point cloud from the input map. Figure 15c shows the final extracted
stable trunk point cloud after cluster processing.
(a) (b) (c)
Figure 15. Trunk point cloud segmentation approach. (a) The original point cloud map. (b)
The extracted trunk point cloud (blue and green) after removing ground and branch point
cloud (red). (c) The final extracted trunk point cloud (green) after clustering process.
3.2.2 Global Map DT Graph Generation
A DT has multiple beneficial properties for our problem setting [106], including the
fact that it maximizes the minimal angle at all the vertices of the triangulation. This
means that the noise in angular values is minimal in a noisy point set. A DT also
defines the so called natural neighborhood around a point (the point set connected to
a given point along the triangle edges), which solves the problem of setting the local
point cloud feature scope or number of neighbors in an intuitive way [103].
Our objective is to match the local trunk graph 𝐺𝑙𝑜𝑐𝑎𝑙 against a subset of the
global trunk graph map 𝐺𝑚𝑎𝑝, and the prerequisite is that there exist some sets of
geometric structures that can be uniquely identified in both graphs, i.e., geometrically
similar structures such as triangles or sets of triangles that are defined utilizing the
trunks as vertices. From the trunk landmarks 𝑀𝑡𝑟𝑢𝑛𝑘𝑠 of the 𝑃𝐶𝑚𝑎𝑝 obtained as
described in Section 3.2.1, we get the Delaunay triangulation 𝑀𝑡𝑟𝑢𝑛𝑘𝑠 → 𝐺𝑚𝑎𝑝 =
(𝑉𝑚𝑎𝑝, 𝐸𝑚𝑎𝑝, 𝑇𝑚𝑎𝑝).
3.2.3 Local Map and Point Cloud aggregation
In a dense forest environment, nearby trunks may be blocked from the LiDAR view.
This may result in not enough trunks which can be reliably observed also in the
future locations. Thus, in our case, we will employ the LOAM method, which builds
a local map from aggregating several consecutive observations. However, we will
also explore the direct construction using only one frame observation.
40
Accurate Localization in Large-scale Forest Environments
Algorithm 2: Extracting trunks from the point cloud map.
Data: A 3D point cloud map 𝑃𝐶𝑚𝑎𝑝 = {𝑝𝑖}𝑖=1..𝑛𝑚
Result: Trunk segmentation point cloud 𝑃𝐶𝑡𝑟𝑢𝑛𝑘𝑠 = {𝑃𝑗}𝑗=1...𝑁𝑡 ,
2D Trunk map 𝑀𝑡𝑟𝑢𝑛𝑘𝑠 = {𝑀𝑘}𝑘=1...𝑛𝑡
foreach 𝑝 ∈ 𝑃𝐶𝑚𝑎𝑝 do
𝑆𝑒𝑎𝑟𝑐ℎ𝑃𝑜𝑖𝑛𝑡 : 𝑝𝑠𝑝 = (𝑝.𝑥, 𝑝.𝑦, 𝑝.𝑧 + ℎ𝑡ℎ);
𝑁𝑒𝑎𝑟𝑒𝑠𝑡𝑃𝑜𝑖𝑛𝑡𝑆𝑒𝑎𝑟𝑐ℎ : 𝑝𝑠𝑛 = 𝑛𝑒𝑎𝑟𝑒𝑠𝑡𝐾𝑆𝑒𝑎𝑟𝑐ℎ(𝑃𝑚, 𝑝𝑠𝑝) ;
𝐶𝑜𝑚𝑝𝑢𝑡𝑒𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒 : 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑝𝑠𝑝, 𝑝𝑠𝑛);
if 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 < 𝑑 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 then
𝑃𝑡 ← 𝑝𝑖 // Add point to trunks point cloud;
end
end
foreach 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 ∈ 𝐶𝑙𝑢𝑠𝑡𝑒𝑟𝑠 do
if 𝐶𝑙𝑢𝑠𝑡𝑒𝑟.𝑝𝑜𝑖𝑛𝑡𝑠.𝑠𝑖𝑧𝑒() < 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 then
Delete 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 from 𝐶𝑙𝑢𝑠𝑡𝑒𝑟𝑠
else
𝑝 = mean𝑝∈𝐶𝑙𝑢𝑠𝑡𝑒𝑟 𝑝 ;
𝑀𝑡𝑟𝑢𝑛𝑘𝑠 ← 𝑝 ;
end
end
Both the construction of the ground truth map and the aggregation of consecutive
LiDAR frames rely on the LOAM method [13]. It operates as follows. Let 𝑃𝑡 =
{𝑝𝑖}𝑖=1...𝑛 represent a raw scanned point cloud received at time 𝑡. All these raw
point cloud’s are processed with the LOAM algorithm to build a local map based on
one scan or the aggregation of several consecutive scans. The LOAM algorithm is
a state-of-the-art feature-based LiDAR odometry algorithm. LOAM receives the 3D
point cloud from the LiDAR, and projects the point cloud onto a range image for
feature extraction. By calculating the curvature and some features from each row of
the range image, the registration process selects subsets 𝑃𝑒 and 𝑃𝑝 (edge and plane
points, respectively). Instead of comparing all the points, LOAM utilizes only those
two subsets to find a transformation between scans, and then a two-step Levenberg–
Marquardt optimization method is employed to optimize the six-degree-of-freedom
transformation across consecutive scans. The complete LiDAR odometry problem
gets solved with a speed of 1 Hz resulting the local map 𝑃𝐶𝑙𝑜𝑐𝑎𝑙 with the computing
platform utilized in our experiments.
In this study, the local point cloud 𝑃𝐶𝑙𝑜𝑐𝑎𝑙 was generated with one or several
scans by utilizing the estimated position from the LOAM method. The raw LiDAR
output frequency is 30 Hz, but not every scan from the LiDAR stream needs to be
used to calculate the robot odometry. To reduce the accumulation error and balance
41
Qingqing Li
the computation burden, LOAM only considers those scans that the Euclidean dis-
tance between two observation positions is longer than a certain threshold distance
(e.g., 30 cm) or the angular change is larger than a certain threshold angle (e.g. 30°).
As the nearby trunks or objects may block a sector of the view, aggregating more
frames observation from consecutive positions helps to increase the number of the
stable trunks registered into the map 𝑃𝐶𝑙𝑜𝑐𝑎𝑙.
After we have obtained the local map 𝑃𝐶𝑙𝑜𝑐𝑎𝑙, the next step is to generate the DT
graph 𝐺𝑙𝑜𝑐𝑎𝑙 = {𝑉𝑙𝑜𝑐𝑎𝑙, 𝐸𝑙𝑜𝑐𝑎𝑙, 𝑇𝑙𝑜𝑐𝑎𝑙}, just as with the global map. The methodology
explained in Section 3.2.1 applies in this case as well, but this time producing a local
map 𝐿𝑙𝑜𝑐𝑎𝑙, its 2D projected subset 𝐿𝑡𝑟𝑢𝑛𝑘𝑠 and a local DT graph 𝐺𝑙𝑜𝑐𝑎𝑙.
3.2.4 Local and Global DT Graph Matching
The DT graph obtained from the global point cloud map, 𝐺𝑚𝑎𝑝 = (𝑉,𝐸, 𝑇 ), defines
a set 𝑉 of vertices, a set 𝐸 ⊆ 𝑉 2 of edges and a set 𝑇 ⊆ 𝑉 3. Each vertex represents
a trunk in the point cloud map. Edges establish a relation between a point and its
natural neighbors it is connected within the graph. For each local DT graph that we
obtain by aggregating consecutive LiDAR frames in real-time, our objective is to find
a matching subgraph for 𝐺𝑙𝑜𝑐𝑎𝑙 in the graph 𝐺𝑚𝑎𝑝. We proceed as follows in order to
obtain such subgraphs.
Triangle search based on dissimilarity. A dissimilarity 𝑑(., .) of two triangles
𝑡1 and 𝑡2 has two properties: it is always positive, 0 ≤ 𝑑(𝑡1, 𝑡2), and zero in case of
identity, 𝑑(𝑡1, 𝑡1) ≡ 0. Typical triangle dissimilarities use an intermediate vector of
two descriptors, and some sort of norm between these vectors. For instance, in [107]
the authors utilize the ratio of the lengths of the shortest and longest edges, and the
angle between those edges. We utilize the intermediate vector (𝐴, 𝑙) of a triangle 𝑡
with an area 𝐴 and 𝑙 = 𝐿2, where 𝐿 is the perimeter length. The vector components
have the dimensionality of area (in square meters). Then, the dissimilarity 𝑑(., .) is
defined as: 𝑑(𝑡1, 𝑡2) = |𝐴2 − 𝐴1|+|𝑙2 − 𝑙1|.
To speed up the real-time matching process, all triangle perimeters and areas
of the global DT graph have been computed offline prior to the real-time matching
process. The information utilized by our algorithm while operating for real-time
localization is therefore not the raw global map but instead the global DT graph
that has been precomputed. In terms of comparing triangles based on the magnitude
of the difference of (𝐴, 𝑙) vectors, the triangles composed by peripheral points in
a graph are usually different because of fewer observed points, so only triangles
𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 which not include a peripheral point in the local graph 𝐺𝑙𝑜𝑐𝑎𝑙 are selected.
Therefore, a set of candidate triangles {𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑} are selected and used to find a
matching subset in 𝐺𝑚𝑎𝑝. From {𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑}, we build the corresponding graph.
There may exist multiple closely similar triangles 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 in 𝐺𝑚𝑎𝑝 to any spe-
cific selected triangle 𝑇 ∈ {𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑} in 𝐺𝑙𝑜𝑐𝑎𝑙. Thus, the next step is to compare the
42
Accurate Localization in Large-scale Forest Environments
P2
P1
P3
P4
P6
S1
S0
S2
S3
P5
(\textbf{b})
(a) (b)
(c)
x/m
x/m
y/m
y/m
Figure 16. Semantic trunk triangulation on local and global map. (a) The target triangle 𝑆0
with its neighbors 𝑆𝑖, 𝑖 = 1, 2, 3. (b) All the similar triangles found in the local trunk map,
(c) the corresponding triangles found in global trunk map.
neighboring triangles too. As we have excluded the peripheral triangles, the global
subset 𝐺𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 that will be the match of {𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑} should also have three neighbor
triangles. The triangle neighborhood vote is performed using the same dissimilarity
measure. As we show in Figure 16, the triangle 𝑆0 is one of the selected triangles
𝑆0 ∈ 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 in the local graph 𝐺𝑙𝑜𝑐𝑎𝑙 and the three triangles 𝑆1, 𝑆2, 𝑆3 are its neigh-
bors. A feature vector of the triangle 𝑆0 used in the match process is combined from
the intermediary dissimilarity vectors of the triangle 𝑆0 and its neighbors (called a
triangle star) according to Equation (1).
𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠(𝑆0) = [𝐴0, 𝑙0, 𝐴1, 𝑙1, 𝐴2, 𝑙2, 𝐴3, 𝑙3] (1)
By comparing the 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠(𝑆0) of a triangle 𝑆0 to feature vectors in 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑,
a set of matching triangles 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 meeting the error tolerance may be found. Af-
ter a pair of similar triangles (𝑆0, 𝑆′0) ∈ 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 × 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 ⊂ 𝐺𝑙𝑜𝑐𝑎𝑙 × 𝐺𝑚𝑎𝑝 are
found, the next step would be to estimate the position 𝑝𝑡 and the orientation 𝜃, which
matches 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 with 𝐺𝑚𝑎𝑝.
43
Qingqing Li
Calculation of corresponding vertex pairs. In order to describe the matching
process, we use the following notation. A triangle 𝑆0 = 𝐴𝐵𝐶 consists of vertices 𝐴,
𝐵 and 𝐶, and an edge vector ?⃗? = 𝐴𝐵 of an edge 𝑒 = 𝐴𝐵 is oriented and signified
with its end points.
To solve the transformation parameters 𝑝𝑡 and 𝜃, we first need to find the corre-
spondence between vertices of triangle stars 𝒮 = {𝑆𝑖}𝑖=0..3 ∈ 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 and 𝒮 ′ =
{𝑆′𝑖}𝑖=0..3 ∈ 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒. The definition of a triangle star is easily seen from Fig-
ure 17a. As the figure shows, the vertex match between 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 and 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 can
be divided into three steps. The first step is illustrated in detail in Figure 17b. We then
find the first matching pair 𝑆0 = 𝐴𝐵𝐶 ∈ 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 and 𝑆′0 = 𝑀𝐻𝑁 ∈ 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒.
Through comparing the side lengths between two triangles, an edge 𝐵𝐶 of the 𝐴𝐵𝐶
can be selected so that there is an edge |𝑁𝐻| with similar length in the 𝑀𝐻𝑁 , so
the remaining vertices 𝐴 and 𝑀 are a pair of corresponding vertices. For the second
step, the goal is to find the other corresponding vertices in 𝐴𝐵𝐶 and 𝑀𝐻𝑁 . As Fig-
ure 17c shows, selecting one edge which has the vertex 𝐴 as already known, and one
edge contains the vertex 𝑀 in the△𝑀𝐻𝑁 , then we compute which side of the edge
the remaining vertices 𝐶 and 𝐻 are located. This happens by inspecting the value
𝑎 ∈ R of the following formula:
𝑎 = (𝐶𝐴× 𝐶𝐵) · (𝐻𝑀 ×𝐻𝑁) (2)
If 𝑎 > 0, the vertices 𝐶 and 𝐻 are on the same side of the edge 𝐴𝐵 and edge
𝑀𝑁 , otherwise the two vertices are on the opposite. In the case shown in Figure 17c,
the result is 𝑎 < 0, so the vertex pairs are (𝐶,𝑀) and (𝐵,𝐻). The last step is to
match vertices in the triangle neighbors. Since we already know the correspondence
between triangles 𝐴𝐵𝐶 and 𝑀𝐻𝑁 , we just need to get which two vertices are in-
cluded in the neighboring triangles. For example, as we get the corresponding pairs
of vertices (𝐴,𝑀) and (𝐵,𝐻) from last two steps, the last vertices 𝐸 and 𝑄 are a
match, too.
The match between vertices of two triangle sets is now complete. The match
relation is one-to-one, so it can be recorded as a permutation function: 𝑓 : [1, 6] ⊂
N → [1, 6] ⊂ N. This way each vertex 𝑉𝑖 ∈ 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 has a matching vertex 𝑉𝑓(𝑖) ∈
𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒. Next we will use the permutation 𝑓 while estimating the best possible
translation and rotation.
Rotation and translation estimation
By comparing the vertex orientations in the found two triangle sets 𝒮 = {𝑆𝑖}𝑖=0..3 ∈
𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 and 𝒮 ′ = {𝑆′𝑖}𝑖=0..3 ∈ 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒, we can find a unifying 2D rigid body trans-
formation 𝑇 [𝑝𝑡, 𝜃] between the local sample and the global map. The transformation
𝑇 consists of a rotation of the angle 𝜃 followed up with a translation 𝑝𝑡.
44
Accurate Localization in Large-scale Forest Environments
To calculate the transformation parameter 𝑝𝑡 we utilize the definitions in Equa-
tion (3), where the translation 𝑡𝑝 is the best possible one estimate, since it equates the
mean of two patterns. The angle 𝜃 is defined by measuring how large a rotation is
needed to transform a vectors 𝑉𝑖−𝑉 to 𝑉 ′𝑓(𝑖)−𝑉
′, where the 𝑉 represents the mean
of 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 and 𝑉 ′ is the mean of the 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒.
There are six pairs 𝑖 = 1...6 of vertices coupling 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑, 𝑇𝑙𝑜𝑐𝑎𝑙 and 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒
from 𝐺𝑚𝑎𝑝, and six possible candidates 𝛽𝑖 to be chosen as the final local rotation
𝜃. We choose the value of 𝛽𝑖 which fits the triangle patterns 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 and 𝑇𝑙𝑜𝑐𝑎𝑙 with
the least error. One important detail is arranging a signed angular measure between
two vectors. We use a rectangular rotation matrix 𝑃 = ( 0 −11 0 ) to get a positive
𝑠𝑖𝑔𝑛(𝑃𝑎 · 𝑏) for rotation angles, which move a vector 𝑎 to a vector 𝑏. We omit
possible fit minimizing values in between the angles {𝛽𝑖}𝑖=1..6 because the objective
at this stage is to only have a close estimate and not the final position estimation.
Using a zeroth power as a shorthand when producing unit vectors 𝑎0 = 𝑎/‖𝑎‖,
we define counter-clockwise oriented angles 𝛽𝑖, which rotate vectors 𝑉𝑖 parallel to
vectors 𝑉𝑓(𝑖). Finally, we define approximate values for the translation 𝑝𝑡 and the
rotation 𝜃 according to Equation (3).
⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
𝑉 = mean
𝑉 ∈𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑
𝑉
𝑉
′ = mean
𝑉 ′∈𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒
𝑉 ′
𝑝𝑡 = 𝑉
′ − 𝑉
𝛽𝑖 = 𝑠𝑖𝑔𝑛
(︁
𝑃𝑉𝑖 · 𝑉𝑓(𝑖)
)︁
arccos
(︁
𝑉 0𝑖 · 𝑉 0𝑓(𝑖)
)︁
, 𝑖 = 1..6
𝜃 = argmin
𝛽𝑖, 𝑖=1...6
∑︁
𝑗=1...6
⃦⃦⃦⃦
⃦⃦
⎡⎣cos 𝛽𝑖 − sin 𝛽𝑖
sin 𝛽𝑖 cos 𝛽𝑖
⎤⎦ [︁𝑉𝑗 − 𝑉 ]︁𝑇 − [︁𝑉𝑓(𝑗) − 𝑉 ′]︁𝑇
⃦⃦⃦⃦
⃦⃦
(3)
Geometric verification, final translation and rotation estimation
From the 𝐺𝑙𝑜𝑐𝑎𝑙 match against 𝐺𝑚𝑎𝑝, usually multiple matches {(𝒮𝑘,𝒮 ′𝑘)}𝑘=1...𝑚 be-
tween 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 and 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 can be found. Therefore, we need to select one among
𝑚 transformation candidates with rotation and translation parameters 𝜃𝑘 and 𝑡𝑘. This
final step is to find the best estimation between 𝐺𝑙𝑜𝑐𝑎𝑙 and 𝐺𝑚𝑎𝑝. Let the corre-
sponding points of a match candidate be (𝑉𝑙𝑜𝑐𝑎𝑙, 𝑉𝑚𝑎𝑝). Starting from the previously
computed initial guesses 𝜃 and 𝑝𝑡, we select the final solution from Equation (4).
Note that this does not ensure global convergence, and unexpected results might be
obtained. For instance, if the global map is too homogeneous, then different regions
45
Qingqing Li
A
D
E
F
(a)
B
C
P
M
N
H
P'
H
N
P'
M
B C
P
PA
B C
P
PA
H
M
N
P'
Q
R
T
A
D
E
N
H
P'
Q
R
T
M
F
B
C
P
(b) (c) (d)
Figure 17. Semantic triangle matching process between local and global maps. (a) The
selected triangle 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 in 𝐺𝑙𝑜𝑐𝑎𝑙 with candidate matched triangle 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 in 𝐺𝑚𝑎𝑝.
(b) Find the first corresponding point 𝐴 in 𝑇𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 with its 𝑀 in 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 by finding the
similar featured side length. (c) Find the rest points 𝐵, 𝐶 and its corresponding points 𝐻 , 𝑁 .
(d) Find the corresponding neighbor points 𝐸, 𝐷, 𝐹 and its corresponding points 𝑄, 𝑅, 𝑇 .
might have sets of trees that yield similar local maps. The risk for this can be re-
duced by taking into account the previous locations, or accumulating larger local
maps until the matching gives a single matched subgraph from the global map with
low error. Another risk is that not enough trees might be detected, which could again
result in multiple global subgraphs matching with similar error. Finally, if the map
is too large then multiple positions could also yield similar error. This latter scenario
can be avoided by creating a set of smaller (non-disjoint) maps from the global map.
Through our experiments, nonetheless, we have been able to confirm stable position
estimation when enough consecutive LiDAR scans were aggregated.
(𝜃, 𝑡𝑥, 𝑡𝑦) = arg𝛽,𝑥𝑡,𝑦𝑡 min𝛽∈𝐵
𝑥∈𝐷𝑥
𝑦∈𝐷𝑦
∑︁
𝑗=1...𝑚
⃦⃦⃦⃦
⃦⃦⃦
⎡⎢⎣cos 𝛽 − sin 𝛽 𝑥sin 𝛽 cos 𝛽 𝑦
0 0 1
⎤⎥⎦ [︁𝑉 𝑇𝑗 − 𝑉 𝑇𝑓(𝑗)]︁
⃦⃦⃦⃦
⃦⃦⃦ , (4)
where 𝑚 is the number of local matches, 𝛽 ∈ 𝐵 = [min 𝜃𝑗𝑗=1...𝑚,max 𝜃𝑗𝑗=1...𝑚] ⊂
R goes through the scope occurring in the local rotation angles and 𝑥 goes through
a similar scope of 𝐷𝑥 occurring in translations along the x axis. 𝐷𝑦 is defined cor-
respondingly on the 𝑦 axis. Note that both Equations (3) and (4) require the vertices
to be in a homogeneous form 𝑉 → [𝑉 1]. The local start point is (𝑥0, 𝑦0) with an
orientation of 𝜃0, and the final estimation of the robot position is (𝑥0 + 𝑡𝑥, 𝑦0 + 𝑡𝑦)
with an orientation 𝜃 in the global map.
46
Accurate Localization in Large-scale Forest Environments
0 100 200 300 400 500 600 700
Frames
0
25
50
75
100
125
150
175
Tr
un
ks
 O
bs
er
ve
d
1 frame Observation
3 frames Observation
5 frames Observation
10 frames Observation
Figure 18. Number of trunks observed for different number of consecutive aggregated
frames of LiDAR data.
3.3 Experimental Evaluation
This section presents our experimental results. We analyze the performance of our
method from the point of view of the trunk segmentation as well as the DT matching
between the local graph and a subset of the global graph.
3.3.1 Study Area & Hardware
The study area covers one mature pine stand ready for second thinning, which rep-
resents a typical second thinning Finnish forest. This second thinning increased the
mean distance of trees from 4.4 m to approximately 8 m judging by a before and after
point cloud maps. The location of the study area is illustrated in Figure 20. The ter-
rain profile was mostly flat, the maximum height difference over the site was 8 m
on the covered area of approximately 200 m by 300 m. The overall trail length is
approximately 200 m covering roughly 2200 trees.
The mobile platform was a Komatsu Forest 931.1 forest harvester with a GNSS
unit and the LiDAR unit attached to the top of the cabin. The harvester has physical
dimensions (length, width, height) of 7.6 m by 2.9 m by 3.9 m, and its mass is 19,610
kg.The maximum driving speed is 8 km/h off-road and 25 km/h on road.
The on-board LiDAR is Velodyne-16, a 16 channel LiDAR with 3 cm distance
accuracy, 360°horizontal field of view, 30°vertical field of view with ±15∘ up and
down, 5–20 Hz scanning frequency and 100 m scan range. As Figure 21 shows,
the LiDAR unit is fixed on the front of the harvester with a 210°horizontal view.
The harvester has a folding frame, which means the point cloud frames scanned have
47
Qingqing Li
0 100 200 300 400 500 600 700
Frames
0
10
20
30
40
Fe
at
ur
e 
H
it
1 frame Observation
3 frames Observation
5 frames Observation
10 frames Observation
Figure 19. Number of trunks matched between 𝐺𝑙𝑜𝑐𝑎𝑙 and 𝐺𝑚𝑎𝑝 for different number of
consecutive aggregated frames of LiDAR data.
(a) (b) (c)
Figure 20. (a) The test site in Lieksa, Eastern Finland. (b) The forest canopy map from
Google maps. (c) Point cloud map of the study area.
constantly alternating horizontal orientation during the work cycle.
48
Accurate Localization in Large-scale Forest Environments
(a) (b)
Figure 21. Sensor and harvester platform. (a) Komatsu Forest 931.1 forest harvester. The
LiDAR scanner is marked by a red circle in the front of the windshield of the cabin. (b) A
close-up of the LiDAR scanner.
The GNSS data was recorded by Spectra SP60 GNSS unit. This unit fully uti-
lized all six GNSS systems: GPS, GLONASS, BeiDou, Galileo, QZSS and SBAS.
In SBAS (WAAS/EGNOS/MSAS/GAGAN) mode, the horizontal position error smaller
than 50 cm, and the vertical error is smaller than 85 cm. In differential GNSS mode,
the accuracy is able to reach 25 cm in horizontal and 50 cm in vertical accuracy.
In Real-Time Kinematic position (RTK) mode, the accuracy can reach 8 mm in hori-
zontal and 15 mm in vertical accuracy. Nonetheless, these values are not achievable
under dense foliage in a forest environment. Regarding the computing platform,
the proposed methods have been tested on an Intel Core i7-9700K CPU (8 cores,
up to 3.60 GHz), and 16 GB of RAM. The GPU was not utilized for the implemen-
tation of any of the algorithms involved in the localization process introduced in
this work.
This study concerns the development of a localization algorithm for a forwarder
unit to use a tree map during autonomous navigation under potentially heavy canopy.
For that purpose, the forwarder scans were simulated from the existing LiDAR pro-
duced by a forest harvester. The harvester fells trees and a sector of the view is
constantly obstructed by a tree being processed. Effects of this disappear when the
initial tree map is being constructed using normal SLAM procedures. This study
concerns the utility of the local scans done by a forwarder to be used in comparison
to the existing map.
3.3.2 Trunk Segmentation Performance
Figure 18 shows the number of trunks that can be observed from a single LiDAR
scan, and when aggregating three, five or ten consecutive frames. Figrue 19 shows
49
Qingqing Li
Table 6. Number of trunks detected and success match for different number of consecutive
aggregated frames (average number of trunks matched per scan and total ratio of successful
matches).
#Frames #Trunks Average #Matching Triangles Success Rate
1 54.09 1.71 22.89%
3 122.98 4.97 88.79%
5 139.54 10.22 99.28%
10 133.96 14.37 100%
the number of features matched between 𝐺𝑙𝑜𝑐𝑎𝑙 and 𝐺𝑚𝑎𝑝 for different number of
consecutive aggregated frames. In Table 6 we show the average number of trunks
observed, as well as the average number of trunks that can be matched with the global
map after generating the DT graph. We also include the overall matching success
rate, which is defined as the number of local DT graphs that have been successfully
matched with the corresponding subsets of the global DT graph. From individual
LiDAR frames, we were only able to obtain an average of 54.09 valid trunks in total.
When aggregating consecutive scans, we were able to obtain approximately a twofold
increase in the number of detected trunks, and a tenfold increase in the number of
matched trunks. The number of aggregated frames thus has a significant quantitative
impact on the performance of our algorithm with a difference of up to an order of
magnitude in the number of matched trunks. As the number of trunks and their
positions determines the appearance of the local DT graph, this will directly affect the
number of features detected by our descriptor. The number of trunks observed is the
key factor influencing the overall results of our algorithm. From Table 6 we can see
the relationship between the DT matching success and the number of trunks. With a
single frame observation, only an average 1.71 trunks are successfully matched; with
three aggregated frames, there are in average 4.97 trunks matched; with 5 and 10
aggregated frames, there are more than 10 trunks in average that our algorithm can
match with the global map, taking the overall DT match success rate to 100% in the
latter case.
3.3.3 Computational Time and Accuracy
The key matching process is based on the calculation and matching of the Delaunay
triangulations, which is in turn based on the aggregation of 1, 3, 5 or 10 LiDAR
frames. These four possibilities with a varying number of aggregated frames are la-
beled as C1, C3, C5 and C10 in Figure 22. The aggregation of frames adds complex-
ity and requires computing time both in the local map creation phase (black line) and
the similarity matching time (blue line). We can see that the rotational location error
significantly improves when aggregating more frames. The computing time of the
match process grows approximately in terms of 𝑂(Δ−2𝑟 ), where Δ𝑟 is the angular ac-
50
Accurate Localization in Large-scale Forest Environments
curacy achieved by different choices C1, C3, C5, C10. In summary, the computation
time is about 0.1 seconds for a single frame (C1) and 0.6 seconds when aggregating
10 frames (C10).
To test our pose estimation accuracy, we used the original LOAM position as
the ground truth. Figure 22 depicts only the rotational error, since the translation
error was approx. 0.2 m with a standard deviation of 0.14 m in all cases. It is worth
noting that the rotational error is not cumulative, since the match between the local
and global DT graphs is being done separately and independently at each location.
The previous location is only used as the initial state. The maximum occurred trans-
lation error was approx. 0.5 m and the maximum rotation error 3.23°, both for the
C1 scenario.
We can also see that the match time is relatively stable in the C1-C5 scenar-
ios. The local map creation time (aggregation of frames) stabilizes as the number
of frames increases, owing to the stabilization in the number of trees that can be
observed. The segmentation algorithm is implemented in C++, while the matching
process is implemented in Python.
0.2 0.25 0.3 0.35 0.4 0.45
r
 (o)
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Co
m
pu
tin
g 
tim
e 
(se
c)
Accuracy vs. computing speed
local map creation
match time
C1
C1
C3
C3
C5
C5
C10
C10
Figure 22. Dependency of the rotational accuracy and computation time on the choice of
the number of aggregated frames. Both the local map creation time (black) and the
similarity matching time (blue) are depicted. Number of frames ranges from 1 to 10 (C1, C3,
C5, C10). Ellipses represent the 2 s.t.d. of the measurements. Δ𝑟 is the angular accuracy
achieved by different choices C1, C3, C5, C10.
51
Qingqing Li
3.4 Discussion
An application topic of the proposed navigation system is an automated transport
robot, which moves logs from the forest to the collection road. Since work happens in
two phases, where a forest harvester produces stripping roads and fells trees, and the
collection phase, a robot can assume to have the global map built by the harvester,
which is not necessarily autonomous. This way its task is just to orientate itself
along the strip road. A challenge arises from the high utility load of the transfer
robot, which can be up to 8000 kg. A transfer robot may not have an as big capacity,
but the ability to locate itself, and to detect possible stability risks is a requirement.
In addition, a transfer robot moves faster than the harvester going through its
work cycle. A good estimate in rough terrain is 1.0 m/s. Thus, a fast and accurate
odometric computation is essential.
3.4.1 Topology Mapping
In this work, the forest trunk topology map is generated from the previous LiDAR
data by the harvester. There are also two alternative ways to generate the global map:
a large-scale (e.g., nationwide) digital forest inventory, or an aerial laser scan (ALS)
from a drone working as an autonomous team member of the harvester and the robot
carrier. Compared to the point cloud map, the topology map has the potential to do
fast localization and dynamic update without much computation burden and makes
a three actor teams (harvester-drone-robot) an interesting target of further study.
Even the data utilized in our experiments were gathered in a relatively flat forest
stand (approximately 8 m height difference across the site), the proposed approach
does not make any assumptions regarding the inclination of the terrain, but instead
we assume that trees grow relatively upright. This is because the tree stem detection
is aware of the vertical orientation independent of the dominant terrain orientation.
There are only few relatively simple modifications to the presented algorithm, where
the triangle matches and triangle stars would be defined in actual 3D. The actual ef-
fect of topographic features has to do with the reduced scope of the LiDAR scanning.
This could be alleviated by an aided UAV unit, but our first approach is studying the
usability of the tree maps in relatively benign conditions.
3.4.2 Potential Accuracy Improvements
As illustrated in Figure 13, each location operation is independent, except naturally
each step will be initialized by the previous location to gain advantage in the corre-
sponding search operation. In addition, one can reduce the number of localizations
e.g., by assuming an interpolated movement in between sampled positions. Since
this is a registration task, there is no actual error accumulation per se. The location
52
Accurate Localization in Large-scale Forest Environments
accuracy and computation time are tightly related as seen in Figure 22.
From the experiments, we can see that the number of trunks detected in each
observation significantly influence the result of the success matching ratio. In our
method, due to the LiDAR sensor characteristic, far trunks are unable to be recog-
nized because of too few LiDAR points sampled from the appearance. Therefore, we
can employ other sensors producing photogrammetric point clouds which can obtain
more detail of the trunk appearance, which lets the system get more capability to
recognize the trunks from PC.
Another significant source of error can be found in the trunk position estima-
tion process. In our method, the 𝐺𝑚𝑎𝑝 and 𝐺𝑙𝑜𝑐𝑎𝑙 are generated from the segmented
trunks points in 𝑃𝐶𝑚𝑎𝑝 and 𝑃𝐶𝑙𝑜𝑐𝑎𝑙 by calculating the average coordinate of points.
However, it is impossible to get each individual trunk whole appearance observation
with several consecutive LiDAR observations. As the observation in different posi-
tions will get different point cloud data of different sides of the trunk, the estimation
pose of the trunk will be different in each local graph. However, the global graph has
a more accurate trunks position as more details of each trunk collect from different
sensors and different views, so the more accurate trunk position estimation in the
local map can increase the chance of successful matching between local graphs and
global graph. To reach a more robust system, we also can utilize other sensors like a
camera through sensor fusion to get more details about each individual trunk.
3.4.3 Potential Computational Improvements
We proposed an efficient, robust framework to locate the robot harvester in large
area forest and we tested it on a real harvester. In our case, the GNSS/GPS info is
not taken into account. The location accuracy is approx. 0.3 m from consequential
field measurements and approx. 2 m when comparing to general odometric result.
This accuracy is enough to give a robot an initial position, and this can accelerate the
matching process by conveniently initializing the search with a close match. For a
real application, the GNSS also can help decide which global point cloud map it will
access from the cloud server for the localization.
The local Delaunay triangularization generating the local trunk map 𝑇𝑙𝑜𝑐𝑎𝑙 can be
implemented in a radius-limited way by using the S-Hull method [108], which limits
the size 𝑛 of the computational task with the well-established complexity 𝑂(𝑛 log 𝑛).
A selection of an active subset of triangles can be made in the global map 𝑇𝑙𝑜𝑐𝑎𝑙 based
on the previous localization result. Together these improvements have the potential
to make the matching process much faster.
There is a possibility to speed the convergence of the search of the final transfor-
mation in Equation (4) by using a branch-and-bound algorithm [109] with properly
set estimates for local extreme. This possibility is left for future research.
All in all, the C1-C5 cases are applicable to the intended situation where the
53
Qingqing Li
robot moves at approx. 1 m/s and has a sampling rate 2 scans/sec. The case C10 is
within the reach after implementing the above-mentioned improvements.
3.5 Summary and conclusions
In this chapter, we proposed a simple yet effective segmentation-based approach
to detect trunk position and Delaunay triangulation (DT) graph-based localization
method for autonomous robots navigating in a forest environment. The proposed
methods can provide accurate positioning based only on real-time LiDAR data pro-
cessing in the unstructured and relatively complex environments that forests repre-
sent. The proposed method can be utilized for harvesters or other autonomous robots
enabling fast global localization and recognizing individual trunks in real-time.
The experiments show the proposed method reach accurate global localization
precision without a good initial pose or GNSS signal. The proposed method is simple
and efficient, and it is a sensible solution to meet localization needs of harvesting
operations in the forest. In future work, we plan to explore the forest localization
algorithm in the context of significantly larger forests and to apply the proposed
method at a system level for map updating or within the SLAM stack. Gathering
more data, we will also be able to further analyze the performance of our algorithm
when the sensors have different points of view, or when the global map is gathered in
different conditions than the real-time LiDAR data of the forwarder, such as different
vehicles or sensor settings.
54
4 Multi-Sensor Fusion for Accurate
Localization in Urban Environment
In dense urban environments, accurate localization is a paramount aspect of a robot’s
autonomous operation [110; 44]. However, smaller pathways restrict robot move-
ment, while dynamic environments require continuous adaptation to unpredictable
changes, posing significant technical challenges for autonomous navigation and map-
ping [45]. In addition, the robot mission often adds accuracy requirements, such as
in autonomous post delivery [111]. Camera and LiDAR-based perception systems
are extensively used in autonomous navigation and operation in urban environments.
While camera data captures visual details such as colors, textures, and shapes, offer-
ing more semantic and qualitative information [112; 113], LiDAR measurements are
more accurate and provide precise geometric descriptions of objects, including their
shape, size, and distance from the sensor [114].
The objective of this chapter is to analyze and compare different approaches for
vehicle localization estimation while focusing on developing a sensor fusion tech-
nique for precise localization in dense urban environment. Additionally, this research
aims to propose a strategy for reconstructing a section of a local map in scenarios
where data corruption or substantial environmental modifications have occurred.
4.1 SLAM and Autonomous Driving Vehicle
The past decade has seen a boost in the development of autonomous vehicles for
civilian use. Google started the development of its self-driving technology for cars
in 2009 [115], and since then a myriad of industry leaders [116; 117], start-ups [118],
and academic researchers [119] have joined the race in the technology sector, a race
to make everything autonomous. In any mobile robot or vehicle, SLAM algorithms
are an essential and crucial aspect of autonomous navigation [120; 44].
Autonomous robots or self-driving cars will potentially disrupt the logistics in-
dustry worldwide [121]. Autonomous trucks or autonomous cargo vessels are al-
ready in advanced stages of development and might be seen in operation within the
next five or ten years [122]. However, both technological and legal challenges remain
within the so called ”last-mile” delivery [123]. Last-mile refers to the last step in the
delivery of goods from a local logistics or supply center to the clients’ door. In this
chapter, we utilize data gathered using a small delivery robot from Jingdong, one of
55
Qingqing Li
the top two e-commerce platform and B2C online retailer in China.
The development of simultaneous localization and mapping (SLAM) algorithms
has seen a rapid evolution over the past two decades [13; 124]. In SLAM algo-
rithms, information from a wide range of sensors can be used to map the environment
and localize the vehicle or robot in real time. These include inertial measurement
units, monocular or binocular cameras, GNSS sensors, LiDARs, ultrasonic sensors
or radars, or wheel encoders [125]. Detailed 3D maps in the form of point clouds can
be generated, for instance, from 3D LiDARs or with stereo vision [114]. We focus
on the study and comparison of different localization methods for a small delivery
robot in dense urban environments. In these scenarios, an existing map of the oper-
ating environment is obtained in advance and either pre-loaded or accessible by the
autonomous robot. The map is used in order to obtain more accurate localization by
matching each scan with a certain area of the map in real-time [126; 44].
The local motion of a robot or vehicle can be estimated directly by integrat-
ing data from inertial measurement units, including accelerometers, gyroscopes and
compasses. Alternatively, different odometry methods can be applied based on non-
inertial sensors. Visual odometry algorithms utilize feature extraction and tracking,
while LiDAR-based odometry uses mostly geometric information [13; 124]. Iner-
tial measurement units can be easily combined with wheel encoders. Differential
GNSS measurements also provide accurate local motion estimation [127]. Global
localization can be estimated either with GNSS data, or by comparing sensor data
with predefined maps or information gathered a priori. For instance, different meth-
ods exist to match a LiDAR scan with section of a 3D point cloud that defines a
map of the operational area [126; 44]. Over the past decades, researchers from both
industry and academia have been exploring the utilization of these methods and their
combinations to obtain accurate mapping and localization. More concretely, scholars
often refer to the combination of different sensor data as sensor fusion or data fusion.
In this chapter, we compare different techniques and provide arguments on the best
sensor fusion techniques for a small delivery robot for last-mile delivery.
The algorithms, analysis and results presented in this chapter were mostly devel-
oped during the JD Digital (JDD) Globalization Challenge Competition in ”Multi-
sensor fusion localization for autonomous driving”. The challenge was a global com-
petition, with 4 classification divisions depending on the geographical location of
the team. Our team ranked first in the US division semi-final and qualified for the
global final in Beijing, China, where the four semi-final winners competed for the
first prize. The available sensor data during the competition was GNSS and gyro-
scope data, wheel odometry and the output from a 16-channel 3D LiDAR. A map
of the area was given as a 3D point cloud. Multiple datasets exist to test and bench-
mark different localization algorithms. However, the most accurate algorithms are
obtained through fine tuning and parameters specific for the dataset, with different
parameters being potentially necessary to achieve the optimal accuracy in a different
56
Multi-Sensor Fusion for Accurate Localization in Urban Environment
Figure 23. Illustration of the matching process between the pre-acquired map (red) and
current LiDAR scan (green-blue).
sensor or environment setup [128]. Therefore, in this chapter we have utilized the
data provided as part of the JDD competition as it was gathered from the sensors
on-board the vehicle in order to compare a variety of methods. This ensures that our
algorithms can be implemented on the same robot without a significant impact to
performance.
4.2 JDD Dataset and Challenge
In this section, we describe the dataset that we have utilized and the different local-
ization approaches. We also introduce a strategy for situations where the existing
map data is corrupted, outdated, or part of the data is missing.
The data utilized in this chapter was provided as part of the JD Discovery Global
Digitalization Challenge from December 2018 to January 2019. The data was gath-
ered using JD’s autonomous last-mile delivery robot. The data includes: (1) GNSS
directional and positional data referenced in the World Geodetic System (WGS84)
format; (2) LiDAR data as a 3D point cloud; (3) raw accelerometer and gyroscope
data; and (4) wheel speed meter. Ground truth data is provided as well. The output
of the 3D LiDAR is given at 10 Hz, IMU data is acquired at 100 Hz and GNSS data
is updated at 5 Hz. In addition, a map of the objective operation area is given. The
map represented as a point cloud is shown in red in Fig. 23. The dataset contains
sensor data recorded in an 800-second long closed loop movement.
57
Qingqing Li
Figure 24. Corrupted map reconstruction: On the top, a map with noise added in some
areas to simulate corrupted data. In the bottom, one of the two corrupted sections of the map
has been restored using GNSS, IMU, and LiDAR odometry.
In order to both read and process data, ROS has been utilized. ROS (Robot
Operating system) is an open source operating system for robots, which provides
a publish-subscribe communication framework that allows for rapid development of
distributed robotic systems [129]. ROS provides algorithm reuse support for robot re-
search and development, as well as abstraction of data models for easier integration
of different modules. PCL (Point Cloud Library) is a cross-platform open source
C++ library, which implements common algorithms and data structures of point
58
Multi-Sensor Fusion for Accurate Localization in Urban Environment
clouds [130]. It can realize point cloud acquisition, filtering, segmentation, regis-
tration, retrieval, feature extraction, recognition, tracking, surface reconstruction, vi-
sualization and so on. If OpenCV is the crystallization of 2D visual data acquisition
and processing, PCL has the same position in the 3D geometrical data domain.
4.3 Localization Methods
Based on the available sensor data, we have utilized five different approaches to
estimate the vehicle’s localization. In each approach, we use a different combination
of sensors and describe how the robot position is calculated based on their data.
4.3.1 GNSS-based localization
One of the most traditional methods for outdoor robot localization is to use a global
navigation satellite system. In this case, data from multiple satellite constellations
was available and used for increase accuracy. GNSS data error are mostly caused
by the atmospheric conditions and multi-path interference. The effect of the envi-
ronment in a larger scale and the atmospheric conditions can be minimized using
differential GNSS readings, and assuming that the real-time error is equivalent to
the error obtained in a near known location with which the system is synchronized.
However, in this work we have not relied on differential GNSS.
4.3.2 GNSS+IMU Localization
We can easily combine GNSS data with inertial data, including both accelerometer
and accuracy. As differential GNSS has not been implemented in this case, instead,
the results labelled as ”IMU” utilize the IMU readings for local motion estimation,
and the GNSS reading for an initial global estimation and estimations when the robot
movement is almost zero for a prolonged period of time.
4.3.3 LiDAR Odometry
Zhang et al. introduced LiDAR odometry as an alternative to the more classical vi-
sual odometry techniques [13]. As with many odometry approaches, features are
extracted from data and compared within consecutive frames, or scans in the case
of a LiDAR. Features extracted from LiDAR data are usually based on geometrical
aspects. These include corners and surfaces, for instance. Because LiDARs are able
to provide high accuracy distance measurements even for objects far away from the
sensors, LiDAR-based odometry is able to provide higher accuracy than visual-based
odometry in open space situations with clearly differentiated objects. An implemen-
tation based on Zhang’s algorithm has been used in this case.
59
Qingqing Li
4.3.4 NDT-Based Localization (NDT+)
The NDT algorithm is a kind of registration algorithm that uses the existing high-
precision point cloud map and real-time 3D LiDAR point cloud data to achieve high-
precision localization.
NDT algorithm does not directly compare the distance between points in point
clouds map and points in LiDAR point clouds. First, the NDT algorithm will trans-
form the point cloud map into the normal distribution in three-dimensional space.
If a variable 𝑋 is normal distribution 𝑋 ∼ (𝜇, 𝛿), then it can be described as:
𝑓(𝑥) = 1
𝛿
√
2𝜋
𝑒
−(𝑥−𝜇)2
2𝛿2 (5)
where 𝜇 is the mean of the variable distribution and 𝛿2 is the variance. For a
multivariate normal distribution, its probability density function can be expressed as:
𝑓(?⃗?) = 1√︁
(2𝜋)𝑘 |∑︀|𝑒−(?⃗?−?⃗?)𝑇
∑︀−1(?⃗?−?⃗?) (6)
Where ?⃗? represents the mean vector, 𝑘 is the variable amount, and
∑︀
represents
the covariance matrix.
The first step of the NDT algorithm is to divide the point cloud into a 3D grid
coordinate. For each cell, the probability distribution function(PDF) is calculated
based on the points distribution density in the grid.
?⃗? = 1
𝑚
𝑚∑︁
𝑘=1
𝑦𝑘 (7)
𝜎 = 1
𝑚
∑︁
𝑘=1
𝑚(𝑦𝑘 − ?⃗?)(𝑦𝑘 − ?⃗?)𝑇 (8)
Where 𝑦𝑘 = 𝑦1, 𝑦2, 𝑦3, ..., 𝑦𝑚 denotes all LiDAR points in a grid. ?⃗? and 𝜎 repre-
sent the average position and covariance matrix of the points in a grid.Then the PDF
can be expressed as:
𝑓(?⃗?) = 1
(2𝜋) 32
√︀|𝜎|𝑒−(?⃗?−?⃗?)𝑇
∑︀−1(?⃗?−?⃗?) (9)
We use the normal distribution to represent the discrete points of each grid. Each
probability density function can be considered as an approximation of a local surface.
It not only describes the location of the surface in space but also contains information
about the direction and smoothness of the surface.
After calculated the PDF of each grid, then our goal is to find the best trans-
formation. The LiDAR point cloud set is 𝑋 = 𝑥1, 𝑥2, ..., 𝑥𝑛, and the parameter
60
Multi-Sensor Fusion for Accurate Localization in Urban Environment
of transformation is 𝑝. The spatial transformation function 𝑇 (𝑝, 𝑥𝑘) represents us-
ing transformation 𝑝 to move point 𝑥𝑘, combined with the previous calculated state
density function(the PDF of each grid), so the best transformation 𝑝 should be the
transformation of maximum likelihood function:
𝐿𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑 : 𝜃 =
𝑛∏︁
𝑘=1
𝑓(𝑇 (𝑝, 𝑥𝑘))
And the maximum likelihood is equivalent to minimum negative logarithmic likeli-
hood:
− log 𝜃 = −
𝑛∑︁
𝑘=1
log 𝑓(𝑇 (𝑝, 𝑥𝑘))
The task now becomes to minimize the negative logarithmic likelihood by using
an optimization algorithm to adjust the transformation parameter 𝑝. We can use the
Gaussian Newton method to optimize the parameters.
The main problem of the NDT approach is its stability when used standalone.
As indicated by the authors of previous works, NDT alone has the disadvantage of
being unstable depending on the scenario [131]. Therefore, we utilize GNSS data
for setting the initial position as well as resetting the NDT localization method when
a sudden change in position or orientation estimation is detected. In the results, we
refer to this method as NDT+, and a close implementation to the one provided in the
existing NDT method has been used [131].
4.3.5 NDT+IMU Localization (NDT++)
The final localization utilized in our experiments consisted on integrating the IMU
data into the NDT+ method described above that uses LiDAR and GNSS data. With
this approach, we have been able to eradicate the instabilities of the NDT+ method
and increase its accuracy.
The algorithm workflow is as follows: first, on system start-up or reset, GNSS
data is used in order to obtain an initial estimation of the robot’s location. This es-
timation can be utilized in order to reduce the area of the map in which the NDT
matching will be looked for. Second, when the robot starts moving, an unscented
Kalman filter that uses IMU data as input serves as an estimation between LiDAR
scans. The Kalman filter output is then feeded to the NDT algorithm for scan match-
ing with the predefined map. The GNSS data is still used to avoid instabilities, even
though we have not detected any in the dataset utilized.
The NDT+ and NDT++ approaches have an additional benefit over the LiDAR
odometry method. In autonomous robots moving in an urban environment, it is essen-
tial to react on time to obstacles and to have localization information as frequently
as possible. A LiDAR-only approach has the disadvantage of receiving sensor up-
dates at only 10Hz in this case. With IMU readings having a refresh rate of 100Hz,
61
Qingqing Li
Table 7. Localization error mean and standard deviation
𝜇𝑟𝑜𝑡. 𝜎𝑟𝑜𝑡 𝜇𝑥 𝜎𝑥 𝜇𝑦 𝜎𝑦
GNSS 1.52 0.79 -0.46 0.22 10−4 0.18
IMU -1.24 0.62 N/A N/A N/A N/A
LOAM -0.50 1.88 0.40 0.49 0.10 0.49
NDT+ 0.03 0.87 -0.02 1.51 -0.05 1.13
NDT++ 10−3 0.31 -0.01 0.10 -0.05 0.10
the IMU can be utilized to obtain local movement estimation between LiDAR scan
matches using the NDT approach. This minimizes the possibilities of instabilities in
the NDT algorithm as the matching possibilities are reduced and the goal of the algo-
rithm partly shifts from coarse localization to increasing the accuracy of IMU-based
movement estimation.
4.4 Corrupted Map Reconstruction
In an urban scenario, it is impractical to propose a localization method that has a
high dependency on the existence of an accurate map of the operational area without
a strategy for operating in case the map data is corrupted or outdated. In Fig. 24, we
show the map of the operation area (on the left, in black and white), with two areas
where the data has either been removed completely or noise has been added to render
the NDT algorithm unusable. When the robot approaches these areas, we are able to
detect them by monitoring the difference between the NDT localization and GNSS
and IMU positioning. When part of the map cannot be matched with current scans,
we utilize LiDAR odometry and mapping in order to rebuild the corrupted or missing
data. The result of this process is shown on the right side of Fig. 24, where one of
the corrupted map areas has been restored in real-time while the robot was travelling
through it using LiDAR odometry and mapping. Even though it is not visible in the
image, there is a relatively small mismatch in the map in the area where the robot
emerges again into a mapped environment.
4.5 Experiment and Results
We have applied the five approaches proposed to the given dataset. The results show-
ing the translational and rotational localization and orientation error are shown in
Fig. 25b and Fig. 25a, respectively. In these figures, the NDT++ method shows a
stable and very small error though time, both in position and rotation estimation.
The NDT+ without taking into account inertial data shows a larger error but, more
importantly, shows several instabilities that are corrected from the GNSS data.
In order to be able to compare in more detail the different methods, and to see
whether there exist some background error or drifting, we show the variability of
62
Multi-Sensor Fusion for Accurate Localization in Urban Environment
(a) Rotational errors of the proposed approaches over time.
(b) Translation errors of the proposed approaches over time.
Figure 25. Rotational and Translation errors of the proposed approaches.
63
Qingqing Li
the localization error though two sets of boxplots in Fig. 26a and Fig. 26b. The
specific values are also listed in Table 7. We have omitted the location errors of stan-
dalone IMU motion estimation as the error is significantly higher than the proposed
approaches. However, inertial data is still valuable for local movement estimation
and for orientation estimation.
The translational errors are shown in Fig. 26a, where the red boxes show the
error in the 𝑥 coordinate, and the blue boxes refer to the 𝑦 coordinate error. We have
separated the error because in some cases the error mean differs between them. That
is the case of the GNSS data, which due to atmospheric or environmental conditions
shows a steady negative drift in the 𝑥 direction. If consistent through large periods
of time, it can be assumed that it is due to a sensing error in the device itself, or
to environment conditions such a specific multi-path occurring in the operating area.
Therefore, this value can be utilized to decrease the sensing error in real time during
operation. In order to have a deeper understanding of the distribution of the GNSS
error, we show the histogram of the three errors (two translational, one rotational) in
Fig. 27. Only the error in the 𝑦 direction has a mean of 0, while the distribution of
the 𝑥 error is symmetrical and narrow. Therefore, it is possible to fix the drift while
keeping the same variance for both components of the translational error. In the case
of the rotational error, it is more complex to correct even though the distribution is
still symmetric.
From Fig. 26a and Fig. 26b we can see that the most stable methods are NDT++
and GNSS, with the NDT+ method providing accurate results for rotation estima-
tion. However, the NDT+ is highly unstable for position estimation, with the highest
variance of all presented approaches. In position estimation, all approaches have a
relatively small error after 800 seconds of movement, except for the LOAM method,
which error drifts away from 0 towards the end of the available data set. Similarly,
the IMU constantly drifts in terms of orientation estimation but it provides a more
stable measurement than LOAM, NDT+ or GNSS.
In summary, LiDAR scan matching with a 3D map provides the highest accuracy
for localization, both in terms of position and orientation. Nonetheless, it is essential
to take into account other sensor data in order to implement a more robust approach
that is less prone to instabilities and depends less on the operational environment.
GNSS and inertial data are essential for increasing the localization accuracy but also
for minimizing the possibilities of unexpected behaviour in the algorithm.
4.6 Summary and Conclusion
Accurate localization in dense urban areas is paramount in order to solve the au-
tonomous last-mile delivery problem. Nonetheless, it still presents important chal-
lenges. We have explored the possibilities for localization in a city environment
using 3D LiDAR data complemented with GNSS and inertial data using a delivery
64
Multi-Sensor Fusion for Accurate Localization in Urban Environment
(a) Boxplots of translational errors for the different approaches: x-error (red) and
y-error (blue).
(b) Boxplots of rotational errors for the different approaches in degrees.
Figure 26. Boxplots of rotational and translational errors with selected localization
methods.
65
Qingqing Li
Figure 27. Distribution of GNSS signal errors in position (X, Y) and orientation (Yaw).
robot from JD. We have shown the accuracy of different approaches, assuming that a
map of the operation area is given in the form of a point cloud. In addition, we have
presented a strategy for situations where the map might be corrupted or the scenario
might have undergone significant changes that rendered the map outdated. We have
shown that 3D scan matching is the best approach for localization when properly
complimented with IMU data within an unscented Kalman filter, and GNSS data.
66
5 Robust Multi-Modal
Multi-LiDAR-Inertial Odometry and
Mapping
Multiple studies in the literature have concentrated on enhancing LiDAR maps by
integrating point clouds from multiple LiDAR sensors, aiming to improve the sit-
uational awareness capability [35; 40; 8; 10]. However, the low frame publishing
frequency of typical LiDARs (e.g., 10 Hz) can hinder an accurate 6-DOF pose esti-
mation in multi-LiDAR systems. Therefore, IMUs have been widely used in state-of-
the-art SLAM systems [41; 16], due to their ability to measure acceleration and angu-
lar velocity at a high frequency (e.g., 200 Hz) in three-dimensional space. Nonethe-
less, there remains a lack of methods that can effectively exploit multi-LiDAR iner-
tial systems for odometry estimations.
This chapter aim to improve the robustness of the SLAM system and introduce a
novel tightly-coupled multi-modal multi-LiDAR-inertial odometry and mapping sys-
tem, which takes advantage of both the large horizontal FoV from a spinning LiDAR
and the dense measurements from a solid-state LiDAR as Table 8 shows. The pro-
posed system first performs spatial-temporal calibration to align the timestamp and
calibrate the extrinsic parameters between sensors. Then, we extract two group fea-
ture points, edge and planar points, from LiDAR data. Next, with pre-integrated IMU
data, an un-distortion module is employed on LiDAR point cloud data. Finally, the
un-distorted point cloud is merged into one point cloud and sent to sliding window
based optimization module.
This proposed method, to the best of our knowledge, is the first multi-LiDAR-
inertial SLAM system able to effectively integrate multiple modal LiDAR sensors
with heterogeneous scan modalities within a single estimation and optimization frame-
work. This work is inspired by the limitations found in state-of-the-art algorithms for
different LiDAR sensors in our previous works [132; 39], where we show that low-
cost solid-state LiDARs outperform high-resolution spinning LiDAR in an outdoor
environment, while at the same time perform poorly in indoor environments. The
unique characteristics and main contributions of our work can be summarized as
follows:
1. Present a complete solution for multi-modal LiDAR spatio-temporal calibra-
tion and feature extraction. The method adopts an ICP-based scan-matching
67
Qingqing Li
(a) Mapping result with the proposed system at a hall environment. Thanks to the high resolution of
solid-state LiDAR with a non-repetitive scan pattern, the mapping result is able to show clear detail of
object’s surface.
(b) Hardware and Mapping result in long corridor environment. Our proposed methods show robust
performance in long corridor environments and survive in narrow spaces where 180° U-turn occurs.
Figure 28. Our proposed methods show high-resolution mapping results and robust
performance in challenging environment
approach to obtain the extrinsic parameters, split-and-merge based timestamp
alignment, and unified channel based feature extraction for both spinning and
solid-state-LiDAR.
2. Design and implementation of a novel tightly-coupled multi-modal multi-LiDAR-
inertial mapping framework that is able to combine LiDARs with different
68
Robust Multi-Modal Multi-LiDAR-Inertial Odometry and Mapping
Table 8. Characterization of off-the-shelf LiDAR sensors based on horizontal resolution (H.
Res.), vertical resolution (V. Res.), and cost.
LiDAR Types High H. Res. High V. Res. Low-cost
Spinning, 64+ channels ! ! %
Spinning, 16-32 channels ! % !
Solid-State % ! !
Ours (solid-state + spinning-16) ! ! !
scanning modalities and IMU for odometry estimations.
3. The demonstration of a SLAM method for taking advantage of low-cost spin-
ning LiDARs and solid-state LiDARs that outperform the state-of-the-art in
high-resolution mapping with high levels of detail.
Figure 29. The pipeline of proposed multi-modal LiDAR-inertial odometry and mapping
framework. The system starts with preprocessing module which takes the input from sensors
and performs IMU pre-integration, calibrations, and un-distortions. The scan registration
module extracts features and sent the features to a tightly coupled sliding window odometry.
Finally, a pose graph is built to maintain global consistency.
5.1 Multi modal LiDARs and IMU platform
To simplify the system design, we have made follow assumptions:1) The LiDARs are
synchronized at the software level. The time offset between sensors is not considered
in our system. 2) The extrinsic parameters between IMU and at least one LiDAR are
known. In our case, we use LiDAR’s built-in IMU and use the extrinsic parameters
from factory settings.
The design of our system is based on by our previous work in Chapter 2 where
solid-state LiDAR shows significant performance outdoors but failed all tests in-
69
Qingqing Li
Figure 30. Extracted features points in office room environment(left) and zoom-in view of
one gate object (right). Plane and edge feature points from Velodyne are in red and blue, and
from Horizon are in green and white separately.
Figure 31. Multi-LiDAR extrinsic parameters calibration result in an office room
environment. Red points come from Horizon and green points from VLP-16. Top view
(left), and detail of the matching (right).
doors [132]. To combine the high situation awareness ability and robust performance,
here we proposed multi-modal LiDAR-inertial odometry and mapping scheme. In
this chapter, we consider a perception system consisting of multiple modal LiDARs
and IMU, and LiDAR sensors are not triggered by hardware based external clock.
The pipeline of the proposed method is illustrated in Figure 29 and the hardware sys-
tem is shown in Figure 1. The hardware is composed of a spinning LiDAR Velodyne
VLP-16, a low-cost solid-state LiDAR Livox Horizon, and its built-in IMU.
70
Robust Multi-Modal Multi-LiDAR-Inertial Odometry and Mapping
5.1.1 System pipeline
The system starts with a data pre-processing module, in which IMU data are pre-
integrated, extrinsic parameters between sensors are calibrated, timestamps of clouds
with different starting times are aligned, the clouds from multiple LiDARs are un-
distorted with IMU pre-integrated results. After pre-processing, feature point clouds
representing plane and edge points are extracted and merged into one cloud. The
merged feature cloud will be sent to the sliding window odometry module where the
feature cloud will be matched against the local map. Together with pre-integrated
IMU, fixed size of feature clouds, and local map, six-DoF egomotions and IMU
parameters are estimated and optimized by keyframe-based sliding window opti-
mization. At the backend, the system maintains a global pose graph with selected
keyframes. Loop closure is detected in a keyframe basis graph using ICP, and a
global graph optimization is invoked to guarantee the reconstructed map is globally
consistent.
5.1.2 Problem Formulation
We treat IMU coordinate as the base local coordinate indicated as ()𝑏/()𝐼 . The
merged cloud will be transformed to ()𝑏/()𝐼 . We use the first keyframe received
by the system as the origin of the world coordinates denoted as ()𝑤. The coordinate
of spinning LiDAR is denoted as ()𝑣, and the coordinates of solid-state LiDAR are
denoted as ()ℎ. We use 𝒫𝑣𝑡1 = {p𝑣1,p𝑣2, ...p𝑣𝑛} be the point cloud acquired at time
𝑡1 with spinning LiDAR, and 𝒫ℎ𝑡2 = {pℎ1 ,pℎ2 , ...pℎ𝑛} be the cloud acquired at time 𝑡2
with solid state LiDAR, where p𝑣𝑖 and pℎ𝑗 are a point in 𝒫𝑣𝑡1 and 𝒫ℎ𝑡2 .
We denote F𝐸𝑘 and F𝑃 𝑘 as the edge and plane feature point cloud extracted from
the LiDARs’ data at time 𝑘. The transformation matrix is denoted as T𝑏𝑎 ∈ 𝑆𝐸(3),
which transforms a point from frame ()𝑎 into the frame ()𝑏. R𝑏𝑎 ∈ 𝑆𝑂(3) and t ∈ 𝑅3
are the rotation matrix and the translation vector of T𝑏𝑎 respectively. The quaternion
q𝑏𝑎 under Hamilton notation [133] is used, which corresponds to R𝑏𝑎. ⊗ is used for
the multiplication of two quaternions. q𝑏𝑎 and R𝑏𝑎 can be converted by Rodrigues
formula [134]. With a given point cloud from multi-modal LiDAR sensor and IMU
info, the state needs to be optimized for keyframe 𝑘 is defined as (10) where t𝑘 is the
translation vector, q𝑘 represents orientation in quaternion, v𝑘 is the velocity, b𝑎𝑘 and
b𝑔𝑘 are the bias vector of the accelerator and gyroscope.
X𝑘 = [p𝑘,q𝑘, v𝑘,b𝑎𝑘 ,b𝑔𝑘 ] ∈ R3 × S3 × R3 × R3 × R3 (10)
5.1.3 Spatial-temporal Calibration and Initialization
As the extrinsic parameters between IMU and one LiDAR is known, so here we focus
on calibrating the extrinsic parameters between two LiDARs. We assume the sensor
71
Qingqing Li
platform is stationary during the extrinsic calibration process. As solid-state LiDAR
with the non-repetitive pattern is able to obtain more details from the environment
within the FoV, therefore, we integrated 𝑛 consecutive frames(e.g., ten) to new point
cloud 𝒫ℎ𝑡1∼𝑡𝑛 for the calibration process. Let 𝒫𝑣𝑡𝑚 be one cloud data obtained at 𝑡𝑚 ∈
(𝑡1, 𝑡𝑛) from spinning LiDAR. Generalized Iterated Closest Point (GICP)[12] method
is employed to caculate the relative transformation matrix Tℎ𝑣 between 𝒫ℎ𝑡1∼𝑡𝑛 and
𝒫𝑣𝑡𝑚 from the overlapped region. The extrinsic calibration results with data collected
from a classroom environment are shown in Figure 31. As the transformation matrix
T𝑖ℎ between IMU and one LiDAR given by the factory, we can get T𝑖𝑣 by T𝑖𝑣 =
Tℎ𝑣 * T𝑖ℎ.
We consider a system where LiDARs are not triggered with an external clock
(e.g., GNSS) as discussed in [35]. Each cloud 𝒫ℎ and 𝒫𝑣 is collected at different
starting timestamps. To merge these clouds into one combined cloud 𝒫𝑚, it is neces-
sary to align the starting and ending timestamps. We adopt a split-and-merge method
similar to [40]. The individual timestamp of 𝑝ℎ𝑖 ∈ 𝒫ℎ and 𝑝𝑣𝑖 ∈ 𝒫𝑣 can be obtained
from the sensor’s driver. If the timestamp for a point 𝑝𝑣𝑖 ∈ 𝒫𝑣 is not available, it can
be calculated using orientation difference as described in [13]. When a new cloud
𝒫ℎ𝑡𝑘 is received at time 𝑡𝑘, we put all points 𝑝𝑣𝑖 ∈ 𝒫ℎ𝑡𝑘 into a queue Qℎ ordered by
timestamp. When a new cloud 𝒫𝑣𝑡𝑚 is received at time 𝑡𝑚, we first obtain its start
time 𝑡𝑚𝑠 and end time 𝑡𝑚𝑒 . Then, all points in Qℎ with timestamp 𝑡𝑖 < 𝑡𝑚𝑠 will be
dropped, and points with timestamps 𝑡𝑖 ∈ (𝑡𝑚𝑠 , 𝑡𝑚𝑒) will be moved to a new frame
𝒫ℎ𝑡𝑚 , which shares the same time domain with 𝒫𝑣𝑡𝑚 .
5.1.4 IMU Initialization and Preintegration
The IMU sensor output angular velocity and acceleration measurements are defined
as ?˜?𝑡 and a˜𝑡 using equations. 11 and 12:
𝜔𝑡 = 𝜔𝑡 + b𝜔𝑡 + n𝜔𝑡 (11)
a˜𝑡 = R𝑊𝐿𝑡 (a𝑡 − g) + bta + nta (12)
Where b𝑡 is the measurement bias and n𝑡 is white noise. R𝑊𝐿𝑡 is the rotation
matrix from World coordinate ()𝑊 to local coordinate ()𝐿, g is the gravity vector in
world coordinate. Based on the raw measurement 𝜔𝑡 and a˜𝑡, we can infer the motion
of the robot as follows:
p𝑡+Δ𝑡 = p𝑡+v𝑡Δ𝑡+
1
2gΔ𝑡
2+
1
2R𝑡(a^𝑡 − b
a
𝑡 − na𝑡 )Δ𝑡2
(13)
v𝑡+Δ𝑡 = v𝑡 + gΔ𝑡+R𝑡(a^𝑡 − ba𝑡 − na𝑡 )Δ𝑡 (14)
72
Robust Multi-Modal Multi-LiDAR-Inertial Odometry and Mapping
q𝑡+Δ𝑡 = q𝑡 ⊗ qΔ𝑡
= q𝑡 ⊗
[︃
𝑒𝑥𝑝(12Δ𝑡(?˜?𝑡 − b𝜔𝑡 − n𝜔𝑡 ))
1
]︃
(15)
Where the p𝑡,v𝑡 and q𝑡 are the estimated position, velocity and orientation in
quaternion at time 𝑡, p𝑡+Δ𝑡,v𝑡+Δ𝑡 and q𝑡+Δ𝑡 are the estimated state at time 𝑡 + Δ𝑡.
We apply the IMU preintegration method proposed in [41]. The relative motion
between two timestamp ΔV,ΔP,ΔQ can be calculated based on equations 13∼15
and will be used for initial guess in 5.2.3.
5.2 Multi-modal LiDAR Pose Estimation
5.2.1 Feature Extraction
For computing efficiency, feature extraction is essential for the SLAM system. We
focus on extracting the general features that exist in different modal LiDARs and can
be shared in the optimization process. Here we extract feature points based on [16]
that selects a set of feature points from measurements according to their continuous
and surface normal vector. We extend the method and adapt it to both spinning and
solid-state LiDARs. The set of extracted features consists of two subsets: plane
points F𝑃 and edge points F𝐸 . By checking the continuity, the edge feature included
two types of points F𝐸𝑙 and F𝐸𝑏 where F𝐸𝑙 represents the line feature where two
surface meets, and F𝐸𝑏 represents breaking points where plane end.
Let 𝒫𝑣𝑡 be the point cloud acquired at time 𝑡 from spinning LiDAR, 𝒫ℎ𝑡 be the
point cloud acquired from solid-state LiDAR at the same time domain after temporal
alignment. If the channel number of each point 𝑝𝑣 ∈ 𝒫𝑣𝑡 is unavailable, then first
project points in 𝒫𝑣𝑡 onto a range image based on the horizontal and vertical angle
w.r.t. the origin. Each row represents data from one channel of the spinning LiDAR.
Then the points are divided into 𝑁 subsets {L𝑣𝑖 }𝑖∈𝑁 where 𝑁 is the total channel
numbers of spinning LiDAR. For point cloud 𝒫ℎ𝑡 , we divide the point based on line
number which can be obtained from Livox ROS driver 1. Similarly, we first divided
the points into 𝑀 subsets {Lℎ𝑖 }𝑖∈𝑀 where 𝑀 is the total line numbers of solid-state
LiDAR. The points in each L𝑣𝑖 𝑖∈𝑁 and Lℎ𝑖 𝑖∈𝑀 are ordered by timestamp. For each
subset L𝑣𝑖 in {L𝑣𝑖 }𝑖∈𝑁 and Lℎ𝑗 {Lℎ𝑗 }𝑗∈𝑀 , we first extract the continues points 𝒫𝑣𝑖 𝐶
and 𝒫ℎ𝑗 𝐶 by checking the depth difference with its neighbour points. If the depth
difference between the point in L𝑣𝑖 or Lℎ𝑗 and nearest neighbor points within the same
subset is smaller than the depth threshold 𝑑𝑡ℎ, then the point is added to continuous
points subset 𝒫ℎ𝑖 𝐶 or 𝒫𝑣𝑗 𝐶 . Then we follow the feature extraction methods in [40],
where a scatter matrix Σ is calculated based on neighbor points. By analyzing the
1https://github.com/Livox-SDK/livox_ros_driver
73
Qingqing Li
two largest eigenvalues 𝜆1 and 𝜆2 of Σ, the plane points are detected and labeled
as a plane, the point where two plane meet is labeled as corner features, the points
where one plane ends and neighboring with discontinuous points are labeled as break
points. We merge the corner points and break points together and use them as edge
feature points. After this process, plane feature points Fℎ𝑃 𝑡 and F𝑣𝑃 𝑡, edge feature
points Fℎ𝐸𝑡 and F𝑣𝐸𝑡 are extracted. We show extracted feature points in the office
room environment in Figure 30
5.2.2 Feature Clouds Merging
Instead of keeping two different system state X𝑤𝑡 for clouds𝒫ℎ𝑡 and𝒫𝑣𝑡 , we merge the
same type of feature point cloud into one frame and use a single system state. As the
cloud 𝒫ℎ𝑡 from solid-state LiDAR can be more easily blocked by near objects, with
extreme cases leading to all points reflecting on a single plane (e.g., a close wall) in an
indoor environment, this means that there is a certain probability of most of the points
in Fℎ𝑃 𝑡 belonging to a single plane. This means that insufficient Fℎ𝐸𝑡 points can be
extracted. In this case, we treat the 𝒫ℎ𝑡 as bad frame, with the corresponding feature
cloud not being considered to the union feature cloud F𝑖𝐸𝑡 and F𝑖𝑃 𝑡. This ensures
more consistent behavior and robust estimation across environments and over time.
To detect such bad frames, We first remove the points in Fℎ𝑃 𝑡 and Fℎ𝐸𝑡 that close to
the origin of the sensor (e.g., 2 m threshold), and then check the amount 𝑛𝑒 of edge
point in the feature cloud Fℎ𝐸𝑡. If 𝑛𝑒 is smaller than the edge feature threshold 𝜏𝑒 (e.g.,
100 in our experiments), then the cloud 𝒫ℎ𝑡 is treated as a bad frame. If no such bad
frame is detected, we transform the complete feature clouds Fℎ𝑃 𝑘, F𝑣𝑃 𝑡, Fℎ𝐸𝑡 and F𝑣𝐸𝑡
to the ()𝑖 coordinate frame and merge them to fused, unified feature clouds F𝑖𝐸𝑘 and
F𝑖𝑃 𝑘 using the extrinsic transformation matrices T
𝑖
𝑣 and T𝑖ℎ, calculated as described
in Section. 5.1.3 with Eq. (16).
F𝑖𝐸 = T𝑖𝑣 * F𝑣𝐸 + T𝑖ℎ * Fℎ𝐸 ,F𝑖𝑃 = T𝑖𝑣 * F𝑣𝑃 + T𝑖ℎ * Fℎ𝑃 . (16)
If Pℎ𝑡 is ”bad frame”, then F𝑖𝐸 = T𝑖𝑣 * F𝑣𝐸 and F𝑖𝑃 = T𝑖𝑣 * F𝑣𝑃 . The union feature
clouds F𝑖𝐸 and F𝑖𝑃 will be down-sampled before sending them into sliding window
optimization module.
5.2.3 Keyframe Selection & Undistortion
Given feature point cloud and preintegrated IMU within the same time domain F𝑖𝐸𝑘,F𝑖𝑃 𝑘, I𝑖𝑝𝑟𝑒𝑔𝑘
and a point cloud feature map M𝑤𝑘 in world coordinate, the registration problem can
be formulated as solving a non-linear least square problem. The initial guess of the
state X𝑤 is estimated with Eq. (17):
p˜𝑘 = p¯𝑘−1 + q¯𝑘−1 *ΔP𝑘𝑘−1
74
Robust Multi-Modal Multi-LiDAR-Inertial Odometry and Mapping
q˜𝑘 = q¯𝑘−1 *ΔQ𝑘𝑘−1, b˜g𝑘 = b¯g𝑘−1 (17)
v˜𝑘 = v¯𝑘−1 + q¯𝑘−1 *ΔV𝑘𝑘−1, b˜a𝑘 = b¯a𝑘−1
As sliding window optimization is a relatively heavy process, therefore, maintain
the sparsity of the frames in the window can significantly affect the real-time per-
formance. Here we check the IMU drift during the time interval of two consecutive
keyframes, and select the frame as keyframe if the orientation difference is higher
than a certain degree (e.g., 30°) or the time difference between a current frame
and the last keyframe larger then certain time (e.g., 2 seconds). Then each point
in the selected keyframe will be un-distorted with Δ𝑄 and Δ𝑃 provided by IMU
pre-integration. Each keyframe contains deskewed feature clouds F𝑖𝐸𝑘 and F𝑖𝑃 𝑘, pre-
integrated IMU I𝑖𝑝𝑟𝑒𝑔𝑘, and initial guess of state X˜
𝑤
𝑘 ∼ [p˜𝑘, q˜𝑘, v˜𝑘, b˜𝑎𝑘, b˜𝑔𝑘] which
will be optimized by sliding window optimization.
5.2.4 Sliding Window Optimization
In this work, we follow keyframe based tightly coupled LiDAR-inertial sliding win-
dow optimization strategy in [16]. The merged feature points F𝑖𝐸𝑘 and F𝑖𝑃 𝑘 of
keyframe 𝑘 are treated as feature clouds that are extracted from single LiDAR sensor
as in [16]. We build a window with 𝜏 consecutive keyframes where the states that
need to be optimized for each frame are X˜𝑤 = [X˜𝑤1 , X˜
𝑤
2 , ..., X˜
𝑤
𝜏 ]. The optimal state
can be obtained by minimizing the function:
min
X˜
{||D𝑝𝑟𝑖𝑜𝑟(X˜𝑤)||2 +
𝜏∑︁
𝑘=1
D𝐿(X˜𝑤𝑘 ) +
𝜏∑︁
𝑘=1
D𝐼(X˜𝑤𝑘 )} (18)
Where ||D𝑝𝑟𝑖𝑜𝑟(X˜𝑤)||2 represents the prior residual term which is generated by
marginalizing oldest frames before the current window via Schur-complement [41],D𝐼(𝑋𝑤)
represents the pre-integrated IMU terms as defined in [16].D𝐿(X˜𝑤𝑘 ) is LiDAR term
defined as (19).
𝑚∑︁
𝑎=1
(D𝑒(X𝑤𝑘 ,p𝑖𝑘,𝑎,M𝑤𝑘 ))2 +
𝑛∑︁
𝑏=1
(D𝑠(X𝑤𝑘 ,p𝑖𝑘,𝑏,M𝑤𝑘 ))2 (19)
D𝑒 is the point-to-edge residual term defined as (20) and D𝑠 point-to-plane resid-
ual term defined as (21).
D𝑒(X𝑤,p𝑖,M𝑤)) =
||(p𝑤 − 𝑒𝑤)× (p𝑤 − 𝑒𝑤)||
||𝑒𝑤 − 𝑒𝑤|| (20)
D𝑠(X𝑤,p𝑖,M𝑤)) = |n𝑇𝑠 p𝑤 + 1/||n𝑠||| (21)
75
Qingqing Li
where p𝑖 represents a feature point belonging to F𝐸𝑘,F𝑃 𝑘. Then, p𝑤 = R(q)p𝑖+
t represents the scan point p𝑖 at local frame ()𝑖, which is transformed to world frame
()𝑤 given the state estimation [q, t] in X𝑤. We denote by 𝑒𝑤 and 𝑒𝑤 the two closest
corresponding edge feature points on the feature map M𝑤, while n𝑤𝑠 is the plane
normal vector that is calculated by neighbor plane feature points in the M𝑤 cloud.
We solve the non-linear Eq. (18) using the Ceres Solver toolbox [135]. To ensure
global consistency, we also maintain a pose-graph structure with optimized states
X𝑤 and pre-integrated IMU measurements as optimization constraints.
5.3 Experimental Evaluation
5.3.1 Sensor Configuration and Implementation
We implement the proposed multi-modal multi-LiDAR-inertial odometry and map-
ping system in C++ with ROS Melodic environment that can be shared within the
robotic community. The system shown in Figure 29 is structured in four nodes: pre-
processing, feature extraction, scan registration, and graph optimization. The factor
graph optimization is maintained by GTSAM 4.0 [136], and non-linear optimization
is performed by Ceres Solver 2.0 [135]. The framework proposed in this work is
validated using datasets gathered by Velodyne VLP-16 (V), Livox Horizon (H) 3D
LiDAR, and its built-in IMU (I). The VLP-16 measurement range is up to 100 m
with an accuracy of ± 3 cm. It has a vertical FoV of 30°(± 15°) and a horizontal FoV
of 360°. The 16-channel sensor provides a vertical angular resolution of 2°and The
horizontal angular resolution varies from 0.1° to 0.4°. For solid-state LiDAR, we se-
lected Livox Horizon, which is designed with an FoV of 81.7° × 25.1°. Horizon was
scanning at 10 Hz and reaches a similar but more uniform FoV coverage compared
with typical 64-line mechanical LiDARs. The extrinsic parameters between Horizon
and its built-in IMU is provided by factory instruction. The sensors are connected to
a laptop directly with Ethernet and synchronized with software-based precise times-
tamp protocol (PTP) [137]. We run ROS drivers for Velodyne and Horizon and
recorded the data in rosbag format.
76
Robust Multi-Modal Multi-LiDAR-Inertial Odometry and Mapping
(a) Map results generated by our methods in HVI mode.
(b) Map results generated by Fast LIO in VI mode.
Figure 32. Qualitative comparison of map details in the office room dataset sequence. The
color of the points represents the reflectivity provided by raw sensor data. The point size is
1 𝑐𝑚3, and transparency is set to 0.05. The middle two columns show the zoom-in view of
the wall (top) and TV (bottom).
77
Qingqing Li
5.3.2 Qualitative Experiment
Figure 33. The trajectory result on dataset Hall. Our proposed methods show the smallest
error when returning to the start point. The trajectory from different methods (bottom), the
zoom-in view of starting and ending point(top left), the changes along Z-axis(top right)
From our previous research [132; 39], a tightly coupled solid-state LiDAR-inertial
system shows competitive performance outdoors but performs poorly in indoor en-
vironments. Therefore, here we aim to compare our proposed system with a typical
and challenging indoor environment: an office room, a long corridor, and a large
hall. The data are gathered with the platform as Figure 1 shows at ICT-City in Turku,
Finland.
We compare our proposed method with several state-of-the-art SLAM algorithms:
LeGO-LOAM [14] 2, Fast-LIO [17] 3 and LILI-OM [16] 4. LeGO-LOAM is a
LiDAR-only odometry, while Fast-LIO and LILI-OM are tightly coupled LiDAR in-
ertial odometry systems capable of working with both solid-state LiDARs and spin-
ning LiDARs. Fast-LIO features a tightly-coupled iterated extended Kalman filter
framework and an iKD-tree data structure, which demonstrate efficient and robust
performance [17]. Similar to our proposed method, LILI-OM employs keyframe-
based sliding window optimization but only fuses single LiDAR and pre-integrated
IMU measurements. Like Fast-LIO, during the experiments, we utilize the default
2https://github.com/RobustFieldAutonomyLab/LeGO-LOAM
3https://github.com/hku-mars/FAST_LIO
4https://github.com/KIT-ISAS/lili-om
78
Robust Multi-Modal Multi-LiDAR-Inertial Odometry and Mapping
configurations from the official GitHub repository, and loop closure detection is
turned off for each method. To compare the odometry accuracy, all three datasets
started and ended at the same place. The mean square distance (MSE) between the
starting and ending positions is treated as the error. The results generated by the
selected methods for all datasets are presented in Table 9.
Figure 34. Mapping results with the proposed method outdoors: urban street (left) and
forest environment (right).
Hall
The data was recorded in a hall environment measuring approximately 127 m ×
35 m, aimed at comparing odometry and mapping performance in a relatively large
indoor environment. The recording commenced in a narrow space and proceeded
with a 180°U-turn where most of the field of view (FoV) of the solid-state LiDAR
was covered by nearby walls. Consequently, solid-state LiDAR-only based methods
may not receive enough features, leading to potential significant drift. The trajectory
is depicted in Figure 33, and mapping results are shown in Figure 28a. According
to the position error displayed in Table 9, our proposed methods exhibit the best
performance, achieving a position error of 5.1 cm, with Fast-LIO (VI) showing the
second-best performance.
Corridor
The corridor environment is another challenging environment as low-resolution Li-
DARs might not get enough feature points from the environment to perform robust
localization. The data sequence was recorded at a 60 m long corridor. The mapping
results are shown in Figure 28b. Fast-LIO (VI) performs the best, while our proposed
method follows closely in performance.
79
Qingqing Li
Office
Another data sequence is recorded in a small office room measuring 12,𝑚× 3.7,𝑚.
To make the mapping task more challenging in this environment, we performed sev-
eral fast 180°U-turns during the data recording. From the results, we can see that
LIOM (VI) exhibits the best performance, while our proposed method with HVI
ranks second.
5.3.3 Outdoor Mapping
For the sake of completeness, we also test the proposed method with an outdoor
data sequence on a city road and a forest environment. The resulting map is shown
in Figure 34. A qualitative analysis of the map point cloud shows a high level of
detail. However, the extrinsic calibration method has been designed for indoor envi-
ronments with rich edge and planar features within the overlapped FoV between the
two LiDAR sensors; therefore, the extrinsic parameters are not well calibrated in the
forest environment, which leads to a decrease in sharpness in the final map. In any
case, this demonstrates the potential for generalization to more unstructured environ-
ments and enables high-density mapping even with sub-optimal extrinsic parameters
calibration
Our proposed multi-modal multi-LiDAR-inertial method demonstrates competi-
tive and consistent performance compared to other selected methods. However, we
observed that our method did not always outperform other approaches, despite uti-
lizing more LiDAR data from the environment. One possible reason for this could
be inaccurate time synchronization between the sensors, our time synchronization
between LiDARs is on the software level with sub-microsecond accuracy. The inac-
curate timestamp for each point will bring the error to the system during the cloud
undistortion and cloud merging steps, which is hard to eliminate.
From the results, the VI-based method is capable of tracking the position of the
sensor in all dataset sequences. However, the HI-based methods fail in most of the
sequences, except for Fast LIO(VI) in the large hall environment. To understand the
difference between each LiDAR, we also tested our methods with Velodyne-Inertial
odometry. The results indicate that the accuracy of the proposed VI method is less
accurate than the HVI method, which suggests that horizon LiDAR can improve the
system’s performance. It is also worth noting that our focus is on integrating various
multi-modal LiDAR sensors at the point cloud registration and feature extraction
stages while aiming for a more consistent framework across environments.
80
Robust Multi-Modal Multi-LiDAR-Inertial Odometry and Mapping
Table 9. End-to-end position error in meters (N/A when odometry estimations diverge; V:
Velodyne VLP-16, H: Livox Horizon, I: IMU). Numbers in bold indicate the best
performance, while underscored numbers indicate the second best in each environment.
Dataset Hall Corridor Office
LeGo (V) 0.567 0.336 0.127
FLIO (HI / VI) 0.109 / 0.069 N/A / 0.062 N/A / 0.188
LIOM (HI / VI) N/A / 0.736 N/A / 1.951 NA / 0.102
Ours (HI / VI) N/A / 0.107 N/A / 0.132 NA / 0.165
Ours (HVI) 0.051 0.085 0.124
Table 10. Analysis of processing time (ms) for the different algorithm stages on an Intel
i7-10875H CPU.
Hall Corridor Office
Pre-processing stage 4.14 4.44 4.37
Multi-LiDAR feature extraction 80.32 86.74 94.21
Pose estimation, optimization 139.71 123.51 132.73
5.3.4 Mapping Quality Comparison
One of the key benefits of the multi-LiDAR system is its high perception awareness
ability. Here we compare the mapping quality in terms of resolution. Part of the
mapping results by our proposed methods has shown in Figure. 28a and 28b where
the color represents intensity value. From the result, we can see many objects (e.g.,
door, wall letters). We compare the mapping result between our proposed method
in Figure 32a and Fast LIO (VI) in Figure 32b. Our method shows the most stable
performance in experiments in an office room environment. By zooming in the same
area, we can see a more uniform point cloud from the wall and a TV in the map
generated with our method.
5.3.5 Runtime Analysis
Our evaluations were conducted on a laptop with an Intel Core i7-10875H CPU
and 64 GB RAM on Unbuntu 18.04.6 LTS system. We show the average runtime
per frame in Table 10 and feature numbers in Table 11. Preprocessing and feature
extraction are lightweight. Runtime is dominated by the sliding-window-based pose
estimation and optimization.
5.4 Summary and Conclusion
We present in this chapter a tightly coupled multi-modal multi-LiDAR-inertial odom-
etry and mapping framework with sliding window optimization for pose estimation
and dense mapping. To our knowledge, this is the first SLAM algorithm that har-
81
Qingqing Li
Table 11. Average number of feature points per frame extracted from the different LiDARs
in the three tested environments. Only points in a range between 2 m, and 50 m are
considered.
V-raw V-edge V-plane H-raw H-edge H-plane
Hall 17370 201 15680 21109 405 1934
Corridor 5965 101 5493 14623 302 4007
Office 20235 329 18430 17462 409 1029
nesses the benefits of both spinning LiDARs and solid-state LiDARs within a unified
framework. Our primary focus is to demonstrate the feasibility of achieving high-
robustness odometry and producing high-quality maps by utilizing an optimal com-
bination of cost-effective sensors. The proposed system effectively integrates the
advantages of dense point clouds generated by solid-state LiDAR for enhanced situ-
ational awareness and the larger Field of View (FoV) offered by spinning LiDARs.
Although our odometry accuracy in some testing environments may similar to other
methods, our approach excels in consistency across varying environments. Notably,
it produces higher-quality maps capable of capturing intricate environmental details.
82
6 Dynamic Object Detection and Tracking
Object detection and tracking play a significant role in the field of robotic. The object
detection and tracking can benefit for static mapping by removing moving objects in
SLAM area [138], novel human-robot interaction algorithm [139; 140] and multi-
robot coorperation tasks [141]. This enable robots to interact with the world at an
object level rather than dealing with pixel-level information in images or individual
points in point cloud data. In this chapter, we mainly discussed two problems in
tracking. First, we explored human detection and tracking with deep learning meth-
ods in Search and Rescue (SAR) tasks with a camera mounted on UAV, enhancing
the robot’s ability to locate and monitor human subjects effectively. Second, we
conducted experiments using novel LiDAR technology to track small challenging
objects such as micro UAVs for multi-robot cooperation purpose, highlighting the
versatility of LiDAR sensors in object tracking scenarios.
UAVs have been playing an increasingly active role in supporting SAR opera-
tions in recent years. The benefits are multiple such as enhanced situational aware-
ness, status assessment, or mapping of the operational area through aerial imagery.
Most of these application scenarios require the UAVs to cover a certain area. If the
objective is to detect people or other objects, or analyze in detail the area, then there is
a trade-off between speed (higher altitude coverage) and perception accuracy (lower
altitude). An optimal point in between requires active perception on-board the UAV
to dynamically adjust the flight altitude and path planning. As an initial step towards
active vision in UAV search in maritime SAR scenarios, in this chapter we analyzed
how the flight altitude affects the performance of object detection algorithms.
From the perspective of deployment within multi-robot systems, being able to
track UAVs from ground robots, e.g., UGVs, enables miniaturization and higher
degrees of flexibility relaxing the need for high-accuracy onboard localization. A
recent and significant example of multi-robot system deployment in GNSS-denied
environments is the DARPA Subterranean challenge [142; 143]. Reports from par-
ticipating teams indicate that localization and collaborative sensing were among the
key challenges, with UAVs being deployed from UGVs dynamically during the chal-
lenge. Since UAVs often rely on visual-inertial odometry (VIO) for self and relative
estate estimation [144], relying on external LiDAR-based tracking can also extend
the operability to low-visibility or other domains where VIO has inherent limita-
tions [145; 146].
83
Qingqing Li
In this chapter, the focus is on two tracking scenarios: human tracking from a
UAV equipped with a camera and UAV tracking by a ground robot equipped with
novel solid-state LiDAR. It begins with an investigation into human detection and
tracking using a UAV-mounted camera, primarily applied in search and rescue mis-
sions. The objective is to evaluate the detection performance at varying altitudes,
specifically to enhance the efficiency of search and rescue operations. After demon-
strating the UAV’s tracking abilities, the discussion transitions to addressing relative
spatial pose estimation for ground-aerial robot collaboration through novel adaptive
tracking method with solid-state LiDAR.
6.1 Human Detection and Tracking with UAVs
Recent years have seen an increasingly wider adoption of unmanned aerial vehicles
(UAVs) to support search and rescue (SAR) operations. Owing to their fast deploy-
ment, speed and aerial point of view, UAVs can aid quick response teams, but also
in longer-term monitoring and surveillance [147]. Some of the main applications
of UAVs in these scenarios are real-time mapping of the operational area or deliv-
ery of emergency supplies. In particular, UAVs can bring a significant increase of
the response team’s situational awareness and detect objects and people from the air,
specially those in need of rescue [148]. An overview of recent research in this area
is available in [149], where UAVs for SAR operations are characterized based on the
operational environment, the type of robotic systems in use, and the onboard sensing
capabilities of the UAVs.
We are interested in optimizing the support that UAVs can provide in maritime
SAR operations (see Figure 35), but also for monitoring and surveillance in mar-
itime environments, where they have already been widely utilized [150]. Maritime
SAR operations might occur in both normal and harsh environments. For example,
according to the Spanish national drowning report [151], in 2019 over 40% of drown-
ings happened on a beach, around 60% of the incidents happened between 10:00 and
18:00, and in 20% of the cases lifeguards were present in the area. Therefore, there
is still a need for better solutions for monitoring and supporting SAR operations in
safeguarded beaches, lakes or rivers even with favorable weather conditions, which
can then be extended towards rougher environments as the technology evolves. In
this work, we study the detection of people in mostly still waters at different altitudes.
In the future, we aim to utilize this information within an active vision algorithm that
can dynamically adapt the flight plan of UAVs towards optimization of search speed
and reliability.
In terms of UAV-based perception, deep learning (DL) methods have become
the de-facto standards in object detection and image segmentation with great success
across multiple domains [152; 153]. We utilize the YOLOv3 [152] architecture and
characterize its performance for human detection on still water surfaces. Within the
84
Dynamic Object Detection and Tracking
Figure 35. Illustration of active-vision-based search in maritime environments with UAVs.
A single UAV can first fly higher to cover larger areas and descend in the event of a positive
detection to increase reliability. Search time can then be optimized by dynamically adjusting
the altitude depending on the perception confidence.
machine perception field, active vision has been a topic of interest that has gained
increasing research interest, owing to the multiple advances in DL and accessibility
of UAV platforms for research. Active vision has been successfully applied for single
and multi-agent tracking [154], but we have observed a gap in the literature in terms
of active vision for search and area coverage. The most active research direction in
active perception is currently reinforcement learning (RL) [155]. However, an RL
approach can be challenging owing to the lack of realistic simulators to train models
for sea SAR.
Deep learning for perception in maritime environments is limited by the lack
of realistic training datasets openly available. Moreover, a key challenge for UAV-
based person search and detection in these environments is the relatively small size
of objects to be detected in comparatively large areas to be searched [156]. There is
an evident trade-off between speed and area coverage, and reliability of both positive
and negative detection. An additional challenge is that the view of people at sea from
the air is only partial, as a significant portion of the body is immersed in the water.
Water reflection and refraction effects might also distort the shape. In order to train
YOLOv3 to adapt to this scenario, and owing to the lack of open data for detecting
people in water, we collected over 450 high-resolution images to train, validate and
test our model. The images have been taken at altitudes ranging from 20 m to 120 m.
85
Qingqing Li
This is, to the best of our knowledge, the first work to analyze the perception
accuracy for UAVs with RGB cameras in maritime environments as a function of
their altitude. The results can be generalized by accounting for the size in pixels of
the persons to be detected assuming well-focused images. Moreover, the retrained
YOLO model outperforms the state-of-the-art in object classification, as it has been
trained to detect people even when only their head emerges above the water level.
The retrained YOLO model can be applied for people swimming but also standing
near the shore in a beach.
6.1.1 Human Detection for SAR Task
Multiple works have demonstrated the benefits of integrating UAVs to maritime SAR
operations [157; 158]. Typical sensors onboard UAVs are RGB, RGB-D and thermal
cameras, 3D LiDARs, and inertial/positional sensors for GNSS and altitude estima-
tion [159; 160]. With these sensors, UAVs can aid in SAR operations by mapping
the environment, locating victims and survivors, and recognising and classifying dif-
ferent objects [159]. From the perception point of view, DL methods have become
the predominant solution for detecting humans or other objects [153; 161; 162].
Human detection is a sub-task of object detection that is of particular interest
for SAR robotics [163]. Some of the most popular neural network architectures for
object detection are R-CNN [164], Fast-RCNN [165], and YOLO [152]. In particular,
YOLOv3 is the current state-of-the-art for real-time detection, able of fast inference
and high accuracy [152]. In this work, we re-train the YOLOv3 network with a new
dataset for detecting people in the water.
Active perception has been defined as:
An agent is an active perceiver if it knows why it wishes to sense, and
then chooses what to perceive, and determines how, when, and where to
achieve that perception. [166]
In UAV-aided maritime SAR operations, algorithms for area coverage and human
search incorporating active vision need to be aware that their main objective is to
find humans (why), and need to be able to dynamically adjust their path planning
and orientation to achieve higher-confidence results (what). This latter aspect can
be achieved by, for instance, adjusting their height and camera pitch, or by moving
around the person to get a better angle (how, where and when).
Active vision has been increasingly adopted in different object detection tasks.
However, no previous research has, to the best of our knowledge, focused on ac-
tive vision for detection of humans in SAR scenarios. We therefore list here some
other relevant works in the area. Ammirato et al. presented a dataset for robotic vi-
sion tasks in indoor environments using RGBD cameras with the introduction of an
active vision strategy using Deep RL to predict the next best move for object detec-
86
Dynamic Object Detection and Tracking
tion [167]. Juan et al. presented an autonomous Sequential Decision Process (SDP)
for active perception of targets in uncertain and cluttered environments, with exper-
iments conducted in a simulated SAR scenario [168]. Davide et al. applied active
vision to a path planing algorithms that enabled quadrotor flight through narrow gaps
in indoor complex environments [169]. Manuela et al. applied bio-inspired active
vision for object avoidance with wheel robots in indoor environments [170]. In SAR
operations, once a target has been identified, continuously updating the position of
target is essential, so that path planning for the rescue teams can be adjusted. This
can be achieved though active tracking [171].
In terms of detecting people in maritime environments, Eleftherios et al. pre-
sented a real-time human detection system using DL models that run on-board UAVs
to detect open water swimmers [172]. The authors, however, do not study the ac-
curacy of the perception for different altitudes or positions. In this work, we focus
on analyzing human detection as a trade-off between larger area coverage (higher
altitude) and higher amount of detail in the images (lower altitudes).
In general, we see a clear trend towards a more widespread utilization of UAVs in
SAR missions and DL models for perception (either onboard or offloading computa-
tion). We have found, however, no previous works exploring the correlation between
the altitude at which UAVs fly and the detection accuracy in maritime SAR scenarios.
6.1.2 Deep Learning-Based Huamn Detection
This section describes the data and details of the training process for the perception
algorithm. We also outline the metrics that are analyzed in our experiments.
Table 12. mAP-scores for different IoU-thresholds.
IoU-threshold
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
TS model 0.698 0.698 0.697 0.695 0.693 0.688 0.670 0.638 0.542 0.000
YOLOv3 0.054 0.053 0.053 0.052 0.051 0.051 0.051 0.050 0.044 0.000
6.1.3 Data Acquisition
Owing to the lack of labeled open data showing people in water, and in particular
data labeled with the flight altitude, we have collected data from people swimming
and standing in a lake. The dataset contains 458 labeled photos that are taken by the
camera mounted on the UAV. The camera has a fixed focal length of 24 mm (35 mm
format equivalent) with a field of view of 83° and an aperture f/2.8. The images have
a resolution of 9 MP (4000 by 2250 pixels), and were recorded near the beach area of
Littoistenja¨rvi Lake (60.4582 N, 22.3766 E), shown in Figure 36 (a), Turku, Finland.
87
Qingqing Li
(a) Beach view (b) Top view
(c) Low altitude (d) Back light
Figure 36. (a) Example images of terrain at Littoinen Lake, Finland, (b) The top view of
swimmer, (c) The far view of swimmers, (d) The close view of swimmer
Each photo captures one or more people that are either swimming or standing in
the lake at different heights and angles. Some examples are shown in Figure 36 (b),
(c) and (d). However, the majority of pictures were taken with a gimbal pitch of
-90° (top-view images). The dataset contains 2D bounding boxes for two classes:
persons and other objects, the latter one being used for animals in the water and other
floating objects. In addition to the bounding boxes, each image contains information
about the GPS position, relative altitude to the take-off point (just above the water
level), and pitch angle of the camera gimbal (from horizontal images with 0° pitch
to top-view images with -90° pitch). The relative altitude ranges from 0 m to 143
m. While the dataset has been acquired with good weather conditions and mostly
still waters, variable light conditions are also introduced. This results in different
colors for both water and people, as can be seen in Figure 36 (b) and (d). Some
of the swimmers use swimming caps of different colors and wear different types of
swimming suits.
6.1.4 Modal Training and Test Setup
Training and testing were done with the YOLOv3 real-time object detection model [152].
The YOLOv3 model pre-trained with ImageNet [173] was trained again with our
dataset using transfer learning. Training is done in a way where all but the last three
88
Dynamic Object Detection and Tracking
layers are frozen for the first 50 epochs and then unfrozen and trained further for
another 50 epochs with batch size of 32 and learning rate of 0.001.
Each image contains between 1 and 50 object instances. The objects are divided
into two classes: ’person’, containing 2454 instances, and ’something else’, contain-
ing 238 instances, mostly birds but also some other objects floating in the water.
All the images were labeled manually, using bounding boxes with the Labelbox an-
notation tool [174]. Training and testing were done using 4-fold cross-validation,
randomly splitting the images using a 75/25 train/test split. We refer to the re-trained
model as the task-specific model hereinafter.
6.1.5 Evaluation Metrics
Object detection performance was evaluated using PASCAL VOC challenge met-
rics [175] provided by [176]. We calculated average perception (AP) for both classes
separately and mean average perception (mAP) over both classes using different in-
tersection over union (IoU) thresholds. The comparison in performance was done
between the pre-trained YOLOv3 model and the task-specific model with our data
using transfer learning. Furthermore, since our objective is to analyze the correla-
tion between the performance of the human detection and the altitude, we also ana-
lyze how the detection confidence and the ratio of false positives and false negatives
changes as a function of the altitude.
6.1.6 Experimental Results
In this section, we assess the performance of the trained model as a classifier using
the mean average precision for different IoU thresholds, but also its usability for
active-vision-based control where the input to the algorithm is the confidence of the
model on each of its detections.
Some representative example detections made by the task-specific model are il-
lustrated in Figure 37. In Figure 37a, we observe how the network is able to pinpoint
the location of people in the image, but the bounding box appears around the turbu-
lent water rather than around the person itself. However, not all objects or turbulent
areas are detected as people, as other objects are also properly identified (Figure 37b.
In Figure 37b, we also observe that people can be located far away when the gim-
bal pitch is closer to 0°. Finally, we see that even at high altitudes, the confidence
remains high and people are detected also when immersed (Figure 37c).
The performance of the task-specific model compared to the pre-trained YOLOv3
network is shown in Table 12, where we see that the task-specific model is clearly
superior. In terms of the precision × recall curves, those corresponding to classes
’person’ and ’something else’ are provided in Figure 38a.
Next, we analyze performance at different altitudes. The significance of the al-
89
Qingqing Li
(a) Detection of one person (high confidence),
and turbulent water next to another (lower
confidence). Altitude: 37 m. Pitch: -80°.
(b) Detection of other objects but missing two
persons in the distance. Altitude: 12 m. Pitch:
-25°.
(c) Successful detection of three people at high altitude, one of them fully immersed in the water (only
a portion of the original image is shown). Altitude: 86 m. Gimbal pitch: -90°.
Figure 37. Samples of detections made using the task-specific model.
titude is, however, relative to the resolution of the camera and its ability to produce
clear images. The camera pitch is also important as illustrated. In order to provide
results that are more generalizable, Figure 38b shows the size in pixels of the ground
truth bounding boxes.
Figure 39a(a) shows all the person detections plotted in terms of their confidence
against the altitude, using 𝐼𝑜𝑈 = 0.1 to consider true positives. We have set the
IoU to 0.1 because we are only interested in pointing to the approximate location of
persons but not their exact size and place. For altitudes under 60 m, over 98.8% of
the detections with a confidence above 0.5 are correct. A clear threshold appears at
an altitude of 90 m. Above 90 m, 83.3% of the detections are correct.
In some of the test images, we have noticed that the model detects turbulence in
the water created by people as persons, and not the full bodies of the people them-
selves. Because we are not interested in analyzing how capable the task-specific
model is of generating accurate bounding boxes, but instead on pointing to the ap-
proximate location of people at sea, we might also want to consider as correct detec-
tion boxes that are just adjacent to actual people. In Figure 39a(b), we have plotted
the confidence as a function of the altitude, but now using a distance in pixels of less
than 100 between the ground truth and the predicted box (DIST) to assume that a
detection is correct. We now see that all except one of the positive detections with
90
Dynamic Object Detection and Tracking
(a) Precision x Recall curve for class ’person’
and ’something˙else’ using IoU-threshold 0.5
with the task-specific model.
(b) Side length of the ground truth bouding
boxes, in pixels, based on the altitude.
(a) Confidence with IOU (a) and Confidence with DIST (b).
Figure 39. Confidence of individual detections as a function of the relative UAV altitude.
We observe a clear difference between high-confidence and true positives under the
threshold of 100 m, with lower confidence and higher rate of false positives above it.
a confidence of over 70% are correct for an altitude up to 100 m. For a confidence
above 45%, all but one detections are correct up to an altitude of 70 m. The distribu-
tions of the true positives and false positives for each of the two metrics (IoU, DIST)
are shown in Figure 40a. There is a clear threshold just under a confidence of 0.6,
with almost 75% of true positive having a confidence over 0.6, and almost 75% of
false positives having a lower confidence.
In order to evaluate this model within its context for SAR missions, we also need
to take into account that false positives do not necessarily have a significant impact
on the search performance, but false negatives do, as they mean that the UAV misses
a person. We have therefore plotted in Figure 40b the proportion of false negatives
over true positives. If we use the pixel distance to consider a detection as correct,
91
Qingqing Li
(a) Distributions for the confidence of true
positive (TP) and false positive (FP) detections
(DIST and 𝐼𝑜𝑈 = 0.1).
(b) Proportion of false negatives (FN) over true
positives (TP) and FN. This gives an idea of the
probability of missing a person.
then the proportion remains under 10% for all altitudes. With 𝐼𝑜𝑈 = 0.5, however,
over 50% of the people in the water are undetected. However, we do not consider
this an effective way of evaluating a detection in this scenario.
6.1.7 Summary and Conclusion
With UAVs increasingly penetrating multiple civil domains and, among them, search
and rescue operations, more complex control mechanisms are required for more au-
tonomous UAVs. To that end, active perception is one of the most promising research
directions. In UAV search, active vision can be exploited to optimize the flight plan
based on the confidence of the DL vision algorithms. We have presented preliminary
work that studies the confidence of a re-trained YOLOv3 model for detecting people
in the water for altitudes ranging from 20 m to 120 m. With a custom dataset, we
have seen a major performance increase with respect to the pre-trained YOLOv3 net-
work. Our results show a clear correlation between the altitude and the confidence
of the detections and between the confidence and the correctness of the detections.
When considering as true positives detections near actual people (e.g., over water
turbulence created by people), we have seen that the proportion of false negatives
remains low even for high altitudes, and the proportion of false positives over true
positives drops significantly for all predictions with a confidence over 60%. Finally,
we have observed a clear altitude threshold at around 100 m after which confidence
and accuracy drop.
6.2 Micro UAV Detection and Tracking with Solid-state
LiDAR
Micro-aerial vehicles (MAVs) have seen an increasing adoption across a variety of
application domains in recent years [177]. Multiple works have been devoted to the
navigation of MAVs in GNSS-denied environments [178], and state estimation in
both single [179] and multi-MAV systems [180]. In this chapter, we are particularly
92
Dynamic Object Detection and Tracking
t=0s t=1ms t=2ms t=4ms t=20ms
10 cm
7 cm
Sparse point cloud 
representing the MAV
The UGV is equipped 
with a limited field-of-
view lidar featuring a 
non-repetitive scan 
pattern.
Figure 41. Illustration of the field of view (FoV) coverage with different point cloud
integration times in a non-repetitive LiDAR scanning device.
interested in tracking and state estimation from an exter al system, for those appli-
cations where MAVs are deployed together with or from unmanned ground vehicles
(UGVs) [181; 182].
Tracking and detecting UAVs has been a topic of interest for researchers in recent
years. First, owing to the increasing need of identifying and detecting foreign objects
or drones in areas with controlled airspace such as airports [183; 184]. Second, to
optimize the utilization of UAVs as flexible mobile sensing platforms. This chapter
focuses on the latter use. Compared to the existing literature, which relies mainly
on vis on-based techniques [185], we provide a LiDAR-based solution that can be
utilized more independently of the environmental conditions. Until recently, most 3D
LiDARs provided relatively sparse point clouds in terms of object recognition [186],
with limited vertical resolution in inexpensive devices. However, solid-state LiDARs
have recently emerged as state-of-the-art in terms of long-range scanners featuring
high-density point clouds [16]. The main caveat is the limited field of view (FoV)
in most of these devices [15], but solutions include utilizing multiple LiDARs or
correspondingly adjusting the position and orientation of the robot base where the
LiDAR is installed. In this section, we also put our focus on single and known
micro UAV detection and present generic methods that can be extended to multi-
MAV tracking.
Here, WE are particularly interested in the problem of tracking a MAV that is
deployed from a ground robot. We assume thus that the initial position of the MAV
after take-off is known. We also assume that its shape and size are known a priori.
We develop methods targeting solid-state LiDARs owing to the higher density of
the resulting point cloud even with more limited FoV. Moreover, in these LiDARs,
the concept of a frame or scan frequency changes considerably. Similarly, as in
rotating 3D LiDARs, a frame in solid-state LiDARs can be naturally related to a
single revolution. With non-repetitive scan patterns, LiDARs can output point clouds
93
Qingqing Li
t=0s t=1ms t=2ms t=4ms t=20ms
10 cm
7 cm
Sparse point cloud 
representing the MAV
The UGV is equipped 
with a limited field-of-
view lidar featuring a 
non-repetitive scan 
pattern.
Figure 42. Illustration of a ground robot tracking a micro-aerial vehicle (MAV) using a
limited-FoV solid-state LiDAR.
at adjustable frequencies with varying FoV coverage, as illustrated in Figure 41. This
opens the door to new LiDAR perception methods that exploit the possibilities of
adaptively adjust the frame integration time to better sense the objects. To the best
of our knowledge, this approach has not been previously studied. We apply the
proposed adaptive LiDAR scan integration methods within the problem of a UGV
tracking a MAV for external state estimation, as conceptualized in Figure 42. While
our focus is on MAVs, the proposed methods can also be easily adapted to detect
foreign objects or intruder MAVs more accurately. We first put our focus on single
and known MAV detection, but present generic methods that can be extended to
multi-MAV tracking as long as FoV limitations are accounted for.
6.2.1 Problem Definition
We consider the problem of tracking a MAV from a ground robot. The objective
is to improve the collaboration between the robots and the ability of the MAV to
navigate in complex environments aided by the UGV. The rest of this work delves
into the definition, design, and implementation of methods for tracking a single MAV.
Nonetheless, these can be extended to multiple MAVs. The main limitation when
tracking multiple units is the FoV of the LiDAR sensors onboard the ground vehicle,
and therefore assumptions have to be made to the spatial distribution of the MAVs
(always within the FoV of the ground robot). Alternatively, more LiDAR scanners
can be installed to increase the FoV.
The majority of 3D laser scanners available to date are multi-channel, rotating
LiDARs. While devices with 64 or 128 vertical channels can provide high angular
94
Dynamic Object Detection and Tracking
resolution in both horizontal and vertical dimensions, these high-end devices are not
the most common. Moreover, the scanning pattern is in general repetitive, which
has benefited from a geometric perspective in terms of data processing but does not
enable a higher FoV coverage with longer exposure if the position of the sensor is
fixed. New solid-state LiDARs featuring non-repetitive scan patterns, albeit having
more limited FoV, can provide more dense point clouds and often feature longer de-
tection ranges. In particular, we are interested in the possibilities of dynamically
adjusting the FoV coverage and density in the point cloud to be processed for de-
tection and tracking. Among the benefits of these new LiDARs and the possibilities
of adaptive scanning rates is also higher resilience against one of the challenges in
LiDAR-based perception: motion-induced distortion [187]. In general, the literature
targeting tracking of MAVs using LiDAR scanners is scarce, and existing methods in
point cloud object detection and tracking considering mainly static frames. We aim
to define more optimal settings for generating point clouds based on the state (speed
and distance to the sensor) of the MAV being tracked.
6.2.2 Tracking System Overview
We propose three simultaneous tracking modalities with three processes analyzing
point cloud frames resulting in integration times ranging several orders of magnitude.
A general view of the multi-modal tracking processes is shown in Figure 43. In more
detail, the three modalities are described below:
1. Adaptive high-frequency tracking. In this first process, sparse point clouds are
integrated at frequencies up to 100 Hz. The MAV is only trackable through a
reduced number of points, but we are able to estimate its position and speed
with high accuracy. In this process, the MAV is not necessarily recognizable
in all processed frames.
2. Adaptive medium-frequency tracking. The second process operates at frequen-
cies within the range of typical LiDAR scanners (i.e., 5 to 20 Hz). The fre-
quency within that same range is dynamically adjusted to optimize the density
of the point cloud. At these frequencies, the extracted point cloud representing
the MAV is distorted by motion, and thus the localization and speed estimation
accuracy is lower. However, this process enables more robust and persistent
tracking as the MAV can be recognized in most if not all frames.
3. Low-frequency trajectory and object validation. The third and last process that
runs in parallel to the previous two performs long-term tracking and validates
the reconstructed trajectory of the MAV based on predefined dimensional con-
straints. An illustration of such trajectory reconstruction is shown in Figure 44
95
Qingqing Li
Let 𝒫𝑘(𝐼𝑘𝑟 ) be the point cloud generated by the LiDAR with an integration time
𝐼𝑘𝑟 . We also denote by s𝑘𝑀𝐴𝑉 ={p𝑘𝑀𝐴𝑉 ,p˙𝑘𝑀𝐴𝑉 } the position and speed of the MAV.
We use discrete steps represented by 𝑘 owing to the discrete nature of the set of
consecutive point clouds. The output of the main tracking algorithm is to extract
from 𝒫𝑘(𝐼𝑘𝑟 ) the set of points representing the MAV, which we denote by 𝒫𝑘𝑀𝐴𝑉 ,
and to adjust the integration time for the next point cloud, 𝐼𝑘𝐻𝐹 , 𝐼
𝑘
𝑀𝐹 .
Algorithm 3: MAV tracking with adaptive scan integration
Input:
High- and medium-freq int. rates: {𝐼𝑘−1𝐻𝐹 , 𝐼𝑘−1𝑀𝐹 }
3D lidar point clouds: {𝒫𝑘(𝐼𝑘−1𝐻𝐹 ),𝒫𝑘(𝐼𝑘−1𝑀𝐹 ) }
Last known MAV state: (p𝑘−1𝑀𝐴𝑉 , p˙
𝑘−1
𝑀𝐴𝑉 )
Output:
MAV state: {p𝑘𝑀𝐴𝑉 , p˙𝑘𝑀𝐴𝑉 }
UGV control: q˙𝑘𝑈𝐺𝑉
Int. rates: {𝐼𝑘𝐻𝐹 , 𝐼𝑘𝑀𝐹 }
Function 𝑜𝑏𝑗𝑒𝑐𝑡 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛
(︁
𝒫 , 𝐼, p𝑘−1𝑀𝐴𝑉 , p˙𝑘−1𝑀𝐴𝑉
)︁
:
Ground removal: 𝒫 ′ ← 𝒫;
Generate KD Tree: 𝑘𝑑𝑡𝑟𝑒𝑒← 𝒫 ′ ;
MAV pos estimation: p^𝑘𝑀𝐴𝑉 ← p𝑘−1𝑀𝐴𝑉 +
p˙𝑘−1
𝑀𝐴𝑉
𝐼
;
MAV points: 𝒫𝑘𝑀𝐴𝑉 = 𝐾𝑁𝑁(𝑘𝑑𝑡𝑟𝑒𝑒, p^𝑘𝑀𝐴𝑉 );
MAV state estimation: p𝑘𝑀𝐴𝑉 = 1|𝒫𝑘
𝑀𝐴𝑉
|
∑︀
𝑝∈𝒫𝑘
𝑀𝐴𝑉
𝑝;
return p𝑘𝑀𝐴𝑉 ;
// Coarse but persistent tracking
while new 𝒫𝑘(𝐼𝑘𝑀𝐹 ) do
p𝑘
′
𝑀𝐴𝑉 = 𝑜𝑏𝑗𝑒𝑐𝑡 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛(𝒫𝑘(𝐼𝑘𝑀𝐹 ), 𝐼𝑘𝑀𝐹 , p𝑘−1𝑀𝐴𝑉 , p˙𝑘−1𝑀𝐴𝑉 );
// Fine-grained estimation
while new 𝒫𝑘(𝐼𝑘𝐻𝐹 ) do
p𝑘
′′
𝑀𝐴𝑉 = 𝑜𝑏𝑗𝑒𝑐𝑡 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛(𝒫𝑘(𝐼𝑘𝐻𝐹 ), 𝐼𝑘𝐻𝐹 , p𝑘−1𝑀𝐴𝑉 , p˙𝑘−1𝑀𝐴𝑉 );
p𝑘𝑀𝐴𝑉 , p˙
𝑘
𝑀𝐴𝑉 ← 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒
(︁
p𝑘
′
𝑀𝐴𝑉 , p
𝑘′′
𝑀𝐴𝑉
)︁
;
{𝐼𝑘𝐻𝐹 , 𝐼𝑘𝑀𝐹 } ← 𝑎𝑑𝑗𝑢𝑠𝑡 𝑖𝑛𝑡𝑒𝑔𝑟𝑎𝑡𝑖𝑜𝑛 𝑓𝑟𝑒𝑞𝑠
(︀
p𝑘𝑀𝐴𝑉 , p˙
𝑘
𝑀𝐴𝑉
)︀
;
q˙𝑘𝑈𝐺𝑉 ← 𝑘𝑒𝑒𝑝 𝑤𝑖𝑡ℎ𝑖𝑛 𝐹𝑜𝑉
(︀
p𝑘𝑀𝐴𝑉 , p˙
𝑘
𝑀𝐴𝑉
)︀
;
6.2.3 Adaptive Scan Integration
Since we assume that the state of the MAV (p𝑘−1𝑀𝐴𝑉 , p˙𝑘−1𝑀𝐴𝑉 ) is initially known, the
point cloud processing proceeds as follows. First, we perform ground removal based
on the last-known altitude of the MAV. We then proceed with finding the nearest
neighbor points to a predicted MAV position. This step is repeated for both the high
and medium frequency scans, the former one providing a more accurate position
96
Dynamic Object Detection and Tracking
estimation while the latter is more persistent in time. Finally, these two estimations
are combined, and the results are utilized to adjust the integration rates based on
the point cloud density expected for the given distance and speed. This process is
outlined in Algorithm 3.
Algorithm 4: Trajectory validation
Input:
Low-freq int. rate: 𝐼𝑘−1𝐿𝐹
3D lidar point cloud: 𝒫𝑘
(︀
𝐼𝑘−1𝐿𝐹
)︀
MAV state history: (p𝑀𝐴𝑉 , p˙𝑀𝐴𝑉 )
Output: Trajectory validation (𝑏𝑜𝑜𝑙)
while new 𝒫𝑘
(︁
𝐼𝑘−1𝐿𝐹
)︁
do
// Generate cubic splines
// with position and speed constraints
{𝐵𝑖} ← {p𝑀𝐴𝑉 , p˙𝑀𝐴𝑉 };
// Estimate expected point cloud from
// known density at given distance and speed
𝒫𝑘 ← {{𝐵𝑖}, p𝑀𝐴𝑉 , p˙𝑀𝐴𝑉 } ;
// Calculate IoU
𝐼𝑜𝑈 = 𝑐𝑎𝑙𝑐 𝐼𝑜𝑈
(︀
𝒫𝑘
(︀
𝐼𝑘−1𝐿𝐹
)︀
,𝒫𝑘
)︀
;
if 𝐼𝑜𝑈 > 𝑡ℎ then
return True
else
return False
The main purpose of the low-frequency scan stream is to validate the extracted
MAV’s trajectory. While the tracking with adaptive scan integration only takes into
account the MAV size roughly in terms of distance within which nearest neighbors
are looked for, the extracted point cloud is not validated against its known dimen-
sions. This is done when enough points are accumulated into a reconstructed trajec-
tory. As exposed in Algorithm 4, we first perform a cubic spline interpolation based
on the history of estimated positions and speeds. To calculate the parameters of the
cubic spline, we utilize constraints on the first derivative based on the speed, rather
than forcing the first and second derivative to be continuous. Indeed, the acceleration
of the MAV can suddenly change. Based on predetermined values of point cloud den-
sity as a function of the MAV’s distance to the LiDAR and its speed, we then produce
an expected point cloud. We validate the original point cloud given a threshold for
the IoU measure with the generated estimate.
97
Qingqing Li
High-speed and high-
accuracy localization
Robust and persistent
long-term tracking
Trajectory and 
object identification
Adaptive lidar scan 
integration
MAV estate 
estimation
20-100 Hz
0.1 – 5 Hz
5-20 Hz
Continuous, 
non-repetitive 
laser scan point cloud
Figure 43. Overview of the proposed methods, where tracking is simultaneously performed
at three different scan frequencies. Within each of these three threads, the scan frame
integration is adjusted based on the distance to the target MAV and its speed.
Figure 44. Integration trajectory recovery example
98
Dynamic Object Detection and Tracking
6.2.4 Experimental Evaluation
Experimental platforms
The experimental platforms consist on a single ground robot and a commercially
available Ryze Tello MAV. The ground robot is an EAI Dashgo platform equipped
with a Livox Horizon LiDAR (81.7° × 25.1° FoV). The LiDAR is able to output
scanned pointcloud up to 100 Hz, featuring a non-repetitive pattern. A pair of ultra-
wideband (UWB) transceivers is used to obtain a single range between the robot
and the MAV at frequencies ranging from 10 Hz to 100 Hz. The UWB ranging is
used in aiding the manual validation of the extracted trajectory in places where there
was no external positioning system. In the future, it could be incorporated as part
of the tracking algorithm as well, as is becoming increasing adopted in multi-robot
systems [188; 189].
UWB Transceiver
Livox 3D Lidar with 81.7 o x25.1o FoV
Onboard Computer
Wi-Fi
UWB Transceiver
EAI Dashgo Ground Robot Ryze Tello MAV
Figure 45. Ground robot and MAV utilized in the experiments.
Software
The system has been implemented using ROS Melodic under Ubuntu 18.04. The al-
gorithms are running in the main computer onboard the ground robot. The computer
runs the Tello MAV driver1, the Livox LiDAR driver2, and our open-source MAV
tracking package3. The latter is a multi-threaded node able to process the different
point clouds in real time. The point cloud library (PCL) [190] is utilized to extract
the position of the MAV from the LiDAR’s point cloud.
1https://github.com/TIERS/tello-driver-ros
2https://github.com/Livox-SDK/livox_ros_driver
3https://github.com/TIERS/adaptive-LiDAR-tracking
99
Qingqing Li
Metrics
Due to the lack of an accurate external positioning system, such as a motion capture
system for large environments, our focus is instead on measuring the performance
of the tracking at different scan integration rates and manually validating the overall
trajectory. The experimental flights are carried out in large indoor halls with multiple
columns and objects, as shown in Figure 44. Another set of experiments is carried out
in a small flying area where an external UWB positioning system was available and
used to fly the MAV over a predefined trajectory. A characterization of the accuracy
of such a system can be found in [191].
Figure 46. Density of the point cloud representing the MAV based on the distance to the
LiDAR scanner and the scanning frequency.
Figure 47. Distance between consecutive MAV detections based on its speed and the
LiDAR’s scanning frequency.
Adaptive Scan Integration Results
The first objective of our experiments was to assess the tracking performance at dif-
ferent scan frequencies in order to better model the adaptiveness of our algorithm.
In order to adapt the scanning frequency to optimize the tracking performance, key
100
Dynamic Object Detection and Tracking
Figure 48. Accumulated point cloud for the circular trajectory.
parameters are the point cloud density at different distances and the reliability of the
detections at different speeds.
The point cloud density for different scanning frequencies as a function of the
distance between the LiDAR and the MAV is shown in Figure 46. This measure
refers only to the density of the points representing the MAV and not the overall
density including the rest of the scene. The darker lines represent the average point
cloud density, while the band with higher transparency represents the values within
the standard deviation. The size of the Tello MAV is about 500 cubic centimeters.
Based on our experiments, reliable tracking at high speeds can be achieved with at
least 4 points, while we require at least 20 points at medium scanning frequency.
This, however, only applies in free space. As can be seen in Figure 48, significant
noise appears in the point cloud between the MAV and walls in the environment
when flying nearby. We discuss further this issue at the end of this section.
In terms of the tracking performance based on the speed, we plot in Figure 47
the distance between consecutive detections at different scanning frequencies. The
results in this particular figure cannot be directly utilized to model the adaptive nature
of our tracking algorithm. Nonetheless, they can be leveraged to better understand
what are the speed limits under which given scanning frequencies do not provide the
expected distance between detection that can be inferred from the MAV speed and
the scan frequency.
The results included in Figure 46 and Figure 47 have been obtained flying the
MAV in a long, straight corridor with a length of about 35 m. The MAV was flying
mostly in straight lines and the speed was estimated using both visual odometry and
the position history extracted from the LiDAR data in a partially manual manner.
101
Qingqing Li
Figure 49. Estimated trajectories at different frequencies and with adaptive approach (top
five plots), and trajectory estimated from our algorithm (bottom plot).
Qualitative Trajectory Validation
In order to validate the performance of the tracking algorithm and better understand
the limitations of our tracking approach at different scanning frequencies, we com-
pare two different types of trajectories. Owing to the lack of a system to obtain
ground truth (e.g., a motion capture system), we provide qualitative analysis for one
of the trajectories and compare it with a UWB positioning system in the other one.
First, we test the tracking algorithm through a trajectory where the MAV flies
in a large open area at distances from 2 m to over 17 m far from the LiDAR scan-
ner and variable speeds. In this scenario, the analysis is mostly qualitative, with the
trajectories shown in Figure 49. However, the UWB ranging data and the LiDAR
data has been both manually confirmed, so the maximum positioning error along the
track is at worst around 20 cm. Qualitatively, the main results from this experiment
are the ability of the tracking algorithm to keep track of the MAV over changes in
speed, direction, and at longer distances. The figure only shows frequencies equal to
or above 5 Hz because at lower scanning frequencies the speed estimation was highly
inaccurate during the early stages of the flight. We can see that only at the highest
frequency we are able to track the MAV along the completed trajectory, while the
trajectory itself is noisier. The higher level of error when estimating the MAV po-
sition is due to a lower number of points being detected, which can correspond to
different parts of the MAV in consecutive scans. The last subplot shows the overall
estimated trajectory where our algorithm has combined the different scanning fre-
quencies to obtain the smoothness of the medium frequencies and the performance
of the higher frequencies. The trajectory also employs the cubic spline interpolation
102
Dynamic Object Detection and Tracking
Figure 50. Reference trajectory (UWB) and estimated positions at different fixed
frequencies.
from the validation algorithm.
Second, we perform a continuous flight with a predefined circular trajectory in a
small flying arena where the UWB positioning system is available. The results for
this flight are shown in Figure 50. The leftmost plot shows the reference position.
However, it is worth noticing that the accuracy of the LiDAR, of around 2 cm for
distances shorter than 20 m, is higher than the average accuracy of 10 to 15 cm in
the UWB positioning system. Therefore, the trajectory is mere as a reference and
only a qualitative discussion is possible with these results. In any case, owing to the
continuous change in the speed of the MAV, which is a prior unknown to the tracking
algorithm, again only at frequencies equal or over 5 Hz are we able to track the MAV.
Nonetheless, at 5 Hz the tracking stops before the fourth revolution is completed, and
persistent tracking is only possible when higher frequencies are taken into account.
6.2.5 Summary and Conclusion
We have shown in this section qualitative results that show the performance of the
adaptive tracking algorithm and the same approach applied only to specific scanning
frequencies. From both sets of experiments, the main conclusion is that the adaptive
approach is able to accommodate a wider variety of scenarios. We have been able to
put together the flexibility of high-speed tracking with the robustness of medium fre-
quencies, avoiding the frequent errors of the former, and the lower tracking capacity
of the latter is more challenging conditions.
One key limitation when tracking MAVs, as visualized in the circular trajectory
103
Qingqing Li
experiments, is the low density of the point cloud and the inability to tell the dif-
ference between the MAV’s points and LiDAR noise. This is also due to the low
reflectively of the MAV, and there is thus the potential for mitigation with more re-
flective surfaces that could aid in separating the sparse MAV point cloud from the
LiDAR noise originated due to near objects. As we can see in Figure 48, the point
cloud density near the rear wall is very sparse in some areas, therefore being unable
to reconstruct a robust trajectory as there are multiple options available that would
meet the dynamics and dimensional constraints of the MAV.
104
7 Conclusions
This thesis aimed to investigate robust localization, mapping, and tracking algorithms
through the exploration of novel sensors and fusion methods. This encompassed cre-
ating a multi-LiDAR dataset to advance SLAM research, addressing mapping com-
plexities in environments like large-scale forests and GNSS-denied urban areas. The
work also involved devising methods for dynamic object detection and tracking and
enhancing dense mapping capabilities using spinning LiDAR and solid-state LiDAR
technologies.
7.1 Summary and Contributions
The thesis start by studying different modal LiDAR sensors and investigate the state-
of-the-art SLAM algorithms with presented dataset in Chapter 2. The comprehen-
sive LiDAR dataset that includes data from five LiDAR sensors, a LiDAR camera,
and stereo fish-eye cameras, spanning various environments. The evaluation of nine
sequences on two computing platforms focuses on LiDAR Odometry and power con-
sumption. The dataset serves as a benchmark for different localization and mapping
algorithms, facilitating the development of generalized methods for diverse LiDAR
sensors and environments. From the experimental results, it was observed that the
solid-state LiDAR-based SLAM algorithms present the most detailed and clear map.
However, the spinning LiDAR-based method exhibits less odometry drift across dif-
ferent environments. This indicates that each LiDAR has its own advantages, and
fusing these sensors into one LiDAR system can contribute to building a more robust
mapping system
Chapter 3 addressed LiDAR-based global localization challenges in GNSS-denied
unstructured forest environments. An effective segmentation-based approach was
proposed for accurate trunk position detection and Delaunay triangulation-based lo-
calization. This method enables real-time estimation of precise 3D position and
rotation in forest environments, aiding autonomous vehicle localization and trunk
recognition. From the real experiments with data from spinning LiDAR attached
to a working harvester, our semantic graph search approach demonstrated promis-
ing performance in accurately localizing vehicles in challenging forest environments
without GNSS data. This can greatly benefit future fully autonomous harvesters op-
erating in forest environments.
105
Qingqing Li
Chapter 4 delves into another global localization problem, focusing on urban en-
vironments for delivery robots, where pre-built maps are altered, and accurate GNSS
data is unavailable. A sensor fusion method using 3D LiDAR data complemented
with GNSS and inertial data was proposed to rectify corrupted maps. Algorithms
and analysis were tested in the JD Digital Globalization Challenge, where our team
excelled. From the results, LiDAR scan matching against a 3D map provides the
highest accuracy for localization, both in terms of position and orientation. There-
fore, it is essential to take into account different modality sensor data to implement
a more robust approach that is less prone to instabilities and depends less on the
operational environment. While GNSS and inertial data are essential for increasing
localization accuracy, their performance can be affected by the environment. Hence,
it is crucial to detect and minimize the possibilities of unexpected behavior of sensors
for a robust localization algorithm.
Based on the findings from studying different modal LiDARs and IMU sensors
in Chapter 2, and robust localization solutions in Chapters 3 and 4, Chapter 5 intro-
duces a multi-modal multi-LiDAR sensor fusion framework utilizing novel LiDAR
sensors for high-performance localization and mapping in challenging indoor envi-
ronments. The proposed system effectively integrates the advantages of dense point
clouds generated by solid-state LiDAR for enhanced situational awareness and the
larger Field of View (FoV) offered by spinning LiDARs, achieving high-robustness
odometry and producing high-quality maps by utilizing an optimal combination of
cost-effective sensors. Additionally, the proposed framework adopts novel time syn-
chronization, IMU pre-integration, as well as sliding window optimization methods
for state-of-the-art performance. The methods are modularly designed and can be
adapted to other LiDAR combinations systems.
Chapter 6 explores dynamic object detection and tracking techniques for accurate
mapping and multi-robot cooperation, focusing on deep learning-based object detec-
tion and adaptive tracking algorithms in various scenarios, notably human detection
during search and rescue operations and non-repetitive scanning patterns with solid-
state LiDAR. Our experiments with camera-based human detection reveal correla-
tions between altitude and detection confidence, as well as between confidence and
detection correctness, particularly for detections near actual people and over water
turbulence. Solid-state LiDAR with non-repetitive scanning patterns enables track-
ing of small objects, despite challenges arising from the continuous change in target
speed. Integration at high frequencies results in higher error levels in MAV position
estimation due to fewer detected points, while lower scanning frequencies lead to
highly inaccurate speed estimation. In conclusion, persistent tracking is achievable
only with consideration of higher frequencies.
In summary, this research addressed key challenges in localization, mapping, and
tracking across different sensor modalities, providing solutions for challenging envi-
ronments. Additionally, a large-scale multi-modal multi-sensor dataset was curated
106
Conclusions
to aid in the development of generalized localization and mapping algorithms.
7.2 Future Research
This thesis extensively investigates innovative methodologies employing advanced
sensor fusion algorithms to significantly enhance various LiDAR perception tasks,
including global localization, multi-sensor fusion, and object tracking. Despite ad-
dressing some key research problems, further efforts are required to develop a robust,
accurate, and reliable localization and mapping system capable of operating effec-
tively in diverse challenging environments. In the subsequent section, we primarily
explore potential future research directions and address open questions based on our
findings in these areas.
7.2.1 Adaptive Multi Sensor Online Calibration for Life Long Au-
tonomous System
In Chapter 2, we manually measured the intrinsic and extrinsic parameters of pro-
posed multi-sensor data collection platform. The research findings highlight the
potential of integrating multiple sensors to significantly enhance a system’s per-
ception and environmental understanding. However, this integration presents chal-
lenges. The growing number of sensors increases system complexity, posing difficul-
ties for long-term monitoring of intrinsic and extrinsic parameter, which are pivotal
for achieving precision and robustness in state estimation within multi-sensor fusion
algorithms.
Previous studies have introduced various offline approaches for estimating spatial
and temporal sensor relationships [192; 193]. Recent efforts have delved into online
extrinsic calibration for multi-LiDAR systems, as proposed by Jiao et al. [35]. Lee et
al. presented an efficient real-time Multi-State Constraint Kalman Filter (MSCKF)-
based multi-sensor aided inertial navigation system (MINS) [194], integrating data
from cameras, wheel encoders, GPS, and 3D LiDAR with online calibration to en-
sure optimal performance. Online intrinsic calibration has also been addressed in
[195], and camera-IMU extrinsic calibration discussed in [196] for visual-inertial
navigation systems (VINS).
Despite these advancements, the escalating complexity of multi-sensor platforms
underscores the need for further research in adaptive online sensor calibration. Ad-
dressing this challenge requires developing techniques that facilitate real-time and
effective calibration of intrinsic, extrinsic, and temporal parameters across diverse
sensor types and modalities.
107
Qingqing Li
7.2.2 Global Localization in Dynamic Environments
Chapter 3 of the thesis introduces the implementation of real-time, low-cost segmen-
tation, and graph search-based global localization methods in challenging forest en-
vironments. The rapid progress in deep learning techniques has led to the emergence
of more data-driven approaches utilizing semantic object-level global localization
methods.
Multiple approaches involve the conversion of raw point cloud data into a unique
graph representation by aggregating semantic information. These methods employ
deep learning-based techniques to compare the similarity of semantic graphs for
place recognition tasks [197; 23]. Several research studies tackle global localization
challenges by computing discriminative 3D point cloud descriptors using advanced
deep learning techniques [198; 199]. These methods leverage deep neural networks,
such as PointNet, as an initial processing step to extract local features, which are
then aggregated into powerful global descriptors [200; 201]. Additionally, beyond
point-based descriptors, sparse voxelized point cloud descriptors and the 3D FPN ar-
chitecture [202; 203]. These deep learning-based methods introduce powerful tools
to extract both local and global features and have demonstrated promising results.
However, a fundamental requirement for creating a robust, lifelong autonomous
system is the capacity to adapt to dynamic and changing surroundings. Chapter 3
challenges the assumption of a static environment, acknowledging real-world scenar-
ios involving significant alterations, such as terrain changes due to harvesters navigat-
ing through forests or construction robots modifying their surroundings [204; 205].
Devising strategies to achieve robust global localization in such dynamic environ-
ments remains an open challenge that has not yet been adequately addressed.
7.2.3 Enhancing Human-robot Interaction Through Multi-Modal
Multi-LiDAR Fusion
In chapters 5 and chapter 6, our research successfully implemented tightly coupled
multi-LiDAR inertial dense mapping and dynamic object detection and tracking
methods, utilizing dense measurements from solid-state LiDARs. Notably, solid-
state LiDAR holds significant potential in the domain of human-robot interaction,
offering heightened situational awareness capabilities.
Advancements in accurate and user-friendly technologies have progressed along-
side sensor technology. Integrating hand gesture recognition into human-robot in-
teraction has the potential to transform communication, streamlining collaboration.
This innovation promises increased efficiency across applications and tackles chal-
lenges. Li et al. introduced pioneering work [206], presenting a LiDAR-captured
human motion dataset, surpassing RGB-based methods with a deep learning-based
approach [207; 208]. Similarly, Chamorro et al. showcased real-time robot tele-
108
Conclusions
portation using deep learning-based LiDAR gesture recognition [209]. They ef-
fectively predicted gestures using Long Short-Term Memory networks. Addition-
ally, Moysiadis et al. proposed an integrated real-time hand gesture recognition
framework for agricultural human-robot interaction [210]. This technology promises
smarter and more user-friendly human-robot interactions.
Future research may emphasize solid-state LiDAR or multi-modal LiDAR-based
gesture recognition to enhance human-computer interaction. Leveraging solid-state
LiDAR capabilities can strengthen collaboration between humans and robots, fos-
tering more intuitive interactions. This progress holds the potential to revolutionize
human-robot collaboration, enabling seamless interactions in dynamic environments
across industries.
109
List of References
[1] Diksha Singh, Esha Trivedi, Yukti Sharma, and Vandana Niranjan. Turtlebot: Design and hard-
ware component selection. In 2018 International Conference on Computing, Power and Com-
munication Technologies (GUCON), pages 805–809. IEEE, 2018.
[2] Shinpei Kato, Shota Tokunaga, Yuya Maruyama, Seiya Maeda, Manato Hirabayashi, Yuki Kit-
sukawa, Abraham Monrroy, Tomohito Ando, Yusuke Fujii, and Takuya Azumi. Autoware on
board: Enabling autonomous vehicles with embedded systems. In 2018 ACM/IEEE 9th Interna-
tional Conference on Cyber-Physical Systems (ICCPS), pages 287–296. IEEE, 2018.
[3] Alexander Carballo, Jacob Lambert, Abraham Monrroy, David Wong, Patiphon Narksri, Yuki
Kitsukawa, Eijiro Takeuchi, Shinpei Kato, and Kazuya Takeda. Libre: The multiple 3d lidar
dataset. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 1094–1101. IEEE, 2020.
[4] Thinal Raj, Fazida Hanim Hashim, Aqilah Baseri Huddin, Mohd Faisal Ibrahim, and Aini Hus-
sain. A survey on lidar scanning mechanisms. Electronics, 9(5):741, 2020.
[5] Martin Buehler, Karl Iagnemma, and Sanjiv Singh. The DARPA urban challenge: autonomous
vehicles in city traffic, volume 56. springer, 2009.
[6] Hartmut Surmann, Andreas Nu¨chter, and Joachim Hertzberg. An autonomous mobile robot with
a 3d laser range finder for 3d exploration and digitalization of indoor environments. Robotics
and Autonomous Systems, 45(3-4):181–198, 2003.
[7] Qingqing Li, Jorge Pen˜a Queralta, Tuan Nguyen Gia, Zhuo Zou, and Tomi Westerlund. Multi-
sensor fusion for navigation and mapping in autonomous vehicles: Accurate localization in urban
environments. Unmanned Systems, 8(03):229–237, 2020.
[8] Nina Varney, Vijayan K Asari, and Quinn Graehling. Dales: a large-scale aerial lidar data set for
semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops, pages 186–187, 2020.
[9] Xu Liu, Guilherme V Nardari, Fernando Cladera Ojeda, Yuezhan Tao, Alex Zhou, Thomas
Donnelly, Chao Qu, Steven W Chen, Roseli AF Romero, Camillo J Taylor, and Vijay Kumar.
Large-scale autonomous flight with real-time semantic slam under dense forest canopy. IEEE
Robotics and Automation Letters (RA-L), 2022.
[10] Juntao Yang, Zhizhong Kang, Sai Cheng, Zhou Yang, and Perpetual Hope Akwensi. An in-
dividual tree segmentation method based on watershed algorithm and three-dimensional spatial
distribution analysis from airborne lidar point clouds. IEEE Journal of Selected Topics in Applied
Earth Observations and Remote Sensing, 13:1055–1067, 2020.
[11] Ignacio Vizzo, Tiziano Guadagnino, Benedikt Mersch, Louis Wiesmann, Jens Behley, and Cyrill
Stachniss. Kiss-icp: In defense of point-to-point icp simple, accurate, and robust registration if
done the right way. IEEE Robotics and Automation Letters, 2023.
[12] Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. Generalized-icp. In Robotics: science and
systems, volume 2, page 435. Seattle, WA, 2009.
[13] Ji Zhang and Sanjiv Singh. Loam: Lidar odometry and mapping in real-time. In Robotics:
Science and systems, volume 2, pages 1–9. Berkeley, CA, 2014.
[14] Tixiao Shan and Brendan Englot. Lego-loam: Lightweight and ground-optimized lidar odometry
and mapping on variable terrain. In 2018 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 4758–4765. IEEE, 2018.
110
LIST OF REFERENCES
[15] Jiarong Lin and Fu Zhang. Loam livox: A fast, robust, high-precision LiDAR odometry and
mapping package for LiDARs of small FoV. In IEEE International Conference on Robotics and
Automation (ICRA), pages 3126–3131. IEEE, 2020.
[16] Kailai Li, Meng Li, and Uwe D Hanebeck. Towards high-performance solid-state-lidar-inertial
odometry and mapping. IEEE Robotics and Automation Letters, 6(3):5167–5174, 2021.
[17] Wei Xu and Fu Zhang. Fast-lio: A fast, robust lidar-inertial odometry package by tightly-coupled
iterated kalman filter. IEEE Robotics and Automation Letters, 6(2), 2021.
[18] Jiarong Lin, Xiyuan Liu, and Fu Zhang. A decentralized framework for simultaneous calibration,
localization and mapping with multiple lidars. In 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 4870–4877. IEEE, 2020.
[19] Huan Yin, Xuecheng Xu, Sha Lu, Xieyuanli Chen, Rong Xiong, Shaojie Shen, Cyrill Stachniss,
and Yue Wang. A survey on global lidar localization: Challenges, advances and open problems.
arXiv preprint arXiv:2302.07433, 2023.
[20] Rainer Ku¨mmerle, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. g
2 o: A general framework for graph optimization. In 2011 IEEE International Conference on
Robotics and Automation, pages 3607–3613. IEEE, 2011.
[21] Tixiao Shan, Brendan Englot, Drew Meyers, Wei Wang, Carlo Ratti, and Daniela Rus. Lio-sam:
Tightly-coupled lidar inertial odometry via smoothing and mapping. In 2020 IEEE/RSJ interna-
tional conference on intelligent robots and systems (IROS), pages 5135–5142. IEEE, 2020.
[22] Hao Dong, Xieyuanli Chen, Simo Sa¨rkka¨, and Cyrill Stachniss. Online pole segmentation on
range images for long-term lidar localization in urban environments. Robotics and Autonomous
Systems, 159:104283, 2023.
[23] Xin Kong, Xuemeng Yang, Guangyao Zhai, Xiangrui Zhao, Xianfang Zeng, Mengmeng Wang,
Yong Liu, Wanlong Li, and Feng Wen. Semantic graph based place recognition for 3d point
clouds. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pages 8216–8223. IEEE, 2020.
[24] Lin Li, Xin Kong, Xiangrui Zhao, Tianxin Huang, Wanlong Li, Feng Wen, Hongbo Zhang, and
Yong Liu. Ssc: Semantic scan context for large-scale place recognition. In 2021 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pages 2092–2099. IEEE,
2021.
[25] Giseop Kim, Sunwook Choi, and Ayoung Kim. Scan context++: Structural place recognition
robust to rotation and lateral variations in urban environments. IEEE Transactions on Robotics,
38(3):1856–1874, 2021.
[26] Marwan Hussein, Matthew Renner, and Karl Iagnemma. Global localization of autonomous
robots in forest environments. Photogrammetric Engineering & Remote Sensing, 81(11):839–
846, 2015.
[27] Georgi Tinchev, Simona Nobili, and Maurice Fallon. Seeing the wood for the trees: Reliable
localization in urban and natural environments. In 2018 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 8239–8246. IEEE, 2018.
[28] Haoyang Ye, Yuying Chen, and Ming Liu. Tightly coupled 3d lidar inertial odometry and map-
ping. In 2019 International Conference on Robotics and Automation (ICRA), pages 3144–3150.
IEEE, 2019.
[29] Tixiao Shan, Brendan Englot, Carlo Ratti, and Daniela Rus. Lvi-sam: Tightly-coupled lidar-
visual-inertial odometry via smoothing and mapping. In 2021 IEEE international conference on
robotics and automation (ICRA), pages 5692–5698. IEEE, 2021.
[30] Shibo Zhao, Hengrui Zhang, Peng Wang, Lucas Nogueira, and Sebastian Scherer. Super odome-
try: Imu-centric lidar-visual-inertial estimator for challenging environments. In 2021 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pages 8729–8736. IEEE,
2021.
[31] Jiarong Lin and Fu Zhang. Loam livox: A fast, robust, high-precision lidar odometry and map-
ping package for lidars of small fov. In 2020 IEEE International Conference on Robotics and
Automation (ICRA), pages 3126–3131. IEEE, 2020.
111
Qingqing Li
[32] Zheng Liu and Fu Zhang. Balm: Bundle adjustment for lidar mapping. IEEE Robotics and
Automation Letters, 6(2):3184–3191, 2021.
[33] Wei Xu and Fu Zhang. Fast-lio: A fast, robust lidar-inertial odometry package by tightly-coupled
iterated kalman filter. IEEE Robotics and Automation Letters, 6(2):3317–3324, 2021.
[34] Yuewen Zhu, Chunran Zheng, Chongjian Yuan, Xu Huang, and Xiaoping Hong. Camvox: A
low-cost and accurate lidar-assisted visual slam system. In 2021 IEEE International Conference
on Robotics and Automation (ICRA), pages 5049–5055. IEEE, 2021.
[35] Jianhao Jiao, Haoyang Ye, Yilong Zhu, and Ming Liu. Robust odometry and mapping for multi-
lidar systems with online extrinsic calibration. IEEE Transactions on Robotics, 2021.
[36] Thien-Minh Nguyen, Shenghai Yuan, Muqing Cao, Lyu Yang, Thien Hoang Nguyen, and Lihua
Xie. Miliom: Tightly coupled multi-input lidar-inertia odometry and mapping. IEEE Robotics
and Automation Letters, 6(3), 2021.
[37] Haoyang Ye, Yuying Chen, and Ming Liu. Tightly coupled 3d lidar inertial odometry and map-
ping. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
[38] Yan Xu, Junyi Lin, Jianping Shi, Guofeng Zhang, Xiaogang Wang, and Hongsheng Li. Ro-
bust self-supervised lidar odometry via representative structure discovery and 3d inherent error
modeling. IEEE Robotics and Automation Letters, 2022.
[39] Ha Sier, Qingqing Li, Xianjia Yu, Jorge Pen˜a Queralta, Zhuo Zou, and Tomi Westerlund. A
benchmark for multi-modal lidar slam with ground truth in gnss-denied environments. Remote
Sensing, 15(13):3314, 2023.
[40] Pengxin Chen, Wenzhong Shi, Sheng Bao, Muyang Wang, Wenzheng Fan, and Haodong Xiang.
Low-drift odometry, mapping and ground segmentation using a backpack lidar system. IEEE
Robotics and Automation Letters, 6(4), 2021.
[41] Tong Qin, Peiliang Li, and Shaojie Shen. Vins-mono: A robust and versatile monocular visual-
inertial state estimator. IEEE Transactions on Robotics, 34(4):1004–1020, 2018.
[42] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The
kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[43] Li Qingqing, Yu Xianjia, Jorge Pena Queralta, and Tomi Westerlund. Multi-modal lidar dataset
for benchmarking general-purpose localization and mapping algorithms. In 2022 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pages 3837–3844. IEEE,
2022.
[44] Naoki Akai, Luis Yoichi Morales, Eijiro Takeuchi, Yuki Yoshihara, and Yoshiki Ninomiya. Ro-
bust localization using 3d ndt scan matching with experimentally determined uncertainty and
road marker matching. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 1356–1363.
IEEE, 2017.
[45] Jesse Levinson, Michael Montemerlo, and Sebastian Thrun. Map-based precision vehicle local-
ization in urban environments. In Robotics: Science and Systems, volume 4, page 1. Citeseer,
2007.
[46] Qingqing Li, Paavo Nevalainen, Jorge Pen˜a Queralta, Jukka Heikkonen, and Tomi Westerlund.
Localization in unstructured environments: Towards autonomous robots in forests with delaunay
triangulation. Remote Sensing, 12(11):1870, 2020.
[47] Wei Wang, Jun Liu, Chenjie Wang, Bin Luo, and Cheng Zhang. Dv-loam: Direct visual lidar
odometry and mapping. Remote Sensing, 13(16):3340, 2021.
[48] Junyi Ma, Jun Zhang, Jintao Xu, Rui Ai, Weihao Gu, and Xieyuanli Chen. Overlaptrans-
former: An efficient and yaw-angle-invariant transformer network for lidar-based place recog-
nition. IEEE Robotics and Automation Letters, 7(3):6958–6965, 2022.
[49] Qingqing Li. Multi-modal lidars dataset for benchmarking general-purpose localization and
mapping algorithms, 2022. URL https://github.com/TIERS/tiers-lidars-dataset.
[50] Qingqing Li. A benchmark for multi-modal lidar slam with ground truth in gnss-denied environ-
ments, 2023. URL https://github.com/TIERS/tiers-lidars-dataset-enhanced.
[51] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The oxford
robotcar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
112
LIST OF REFERENCES
[52] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush
Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset
for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition(CVPR), pages 11621–11631, 2020.
[53] Zhi Yan, Li Sun, Toma´sˇ Krajnı´k, and Yassine Ruichek. Eu long-term dataset with multiple
sensors for autonomous driving. In 2020 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 10697–10704. IEEE, 2020.
[54] Nicholas Carlevaris-Bianco, Arash K Ushani, and Ryan M Eustice. University of michigan north
campus long-term vision and lidar dataset. The International Journal of Robotics Research, 35
(9):1023–1035, 2016.
[55] Milad Ramezani, Yiduo Wang, Marco Camurri, David Wisth, Matias Mattamala, and Maurice
Fallon. The newer college dataset: Handheld lidar, inertial and vision with ground truth. In 2020
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4353–4360.
IEEE, 2020.
[56] Jie Yin, Ang Li, Tao Li, Wenxian Yu, and Danping Zou. M2dgr: A multi-sensor and multi-
scenario slam dataset for ground robots. IEEE Robotics and Automation Letters, 7(2):2266–2273,
2021.
[57] Maggie Wigness, Sungmin Eum, John G Rogers, David Han, and Heesung Kwon. A rugd dataset
for autonomous navigation and visual perception in unstructured outdoor environments. In 2019
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5000–5007.
IEEE, 2019.
[58] Chaoyue Niu, Danesh Tarapore, and Klaus-Peter Zauner. Low-viewpoint forest depth dataset
for sparse rover swarms. In 2020 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 8035–8040. IEEE, 2020.
[59] Gabriela Gresenz, Jules White, and Douglas C Schmidt. An off-road terrain dataset including
images labeled with measures of terrain roughness. In 2021 IEEE International Conference on
Autonomous Systems (ICAS), pages 1–5. IEEE, 2021.
[60] Zachary Pezzementi, Trenton Tabor, Peiyun Hu, Jonathan K Chang, Deva Ramanan, Carl
Wellington, Benzun P Wisely Babu, and Herman Herman. Comparing apples and oranges:
Off-road pedestrian detection on the national robotics engineering center agricultural person-
detection dataset. Journal of Field Robotics, 35(4):545–563, 2018.
[61] Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li,
Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driv-
ing. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages
3095–3101. IEEE, 2021.
[62] Jinyong Jeong, Younggun Cho, Young-Sik Shin, Hyunchul Roh, and Ayoung Kim. Complex
urban lidar data set. In 2018 IEEE International Conference on Robotics and Automation (ICRA),
pages 6344–6351. IEEE, 2018.
[63] Peter Biber and Wolfgang Straer. The normal distributions transform: A new approach to laser
scan matching. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS 2003)(Cat. No. 03CH37453), volume 3, pages 2743–2748. IEEE, 2003.
[64] Ju¨rgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A
benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ International Conference
on Intelligent Robots and Systems(IROS), pages 573–580. IEEE, 2012.
[65] Yun Chang, Kamak Ebadi, Christopher E Denniston, Muhammad Fadhil Ginting, Antoni Rosi-
nol, Andrzej Reinke, Matteo Palieri, Jingnan Shi, Arghya Chatterjee, Benjamin Morrell, et al.
Lamp 2.0: A robust multi-robot slam system for operation in challenging large-scale under-
ground environments. IEEE Robotics and Automation Letters, 7(4):9175–9182, 2022.
[66] Ville Kankare, Jari Vauhkonen, Topi Tanhuanpaa, Markus Holopainen, Mikko Vastaranta, Mari-
anna Joensuu, Anssi Krooks, Juha Hyyppa, Hannu Hyyppa, Petteri Alho, and Risto Viitala. Ac-
curacy in estimation of timber assortments and stem distribution - a comparison of airborne and
113
Qingqing Li
terrestrial laser scanning techniques. ISPRS Journal of Photogrammetry and Remote Sensing,
97:89–97, 2014. ISSN 0924-2716. doi: 10.1016/j.isprsjprs.2014.08.008.
[67] Thomas Hellstro¨m, P. La¨rkeryd, Tomas Nordfjell, and Ola Ringdahl. Autonomous forest vehi-
cles: Historic, envisioned, and state-of-the-art. International Journal of Forest Engineering, 20:
31–38, 01 2009.
[68] Fang Liao, Shupeng Lai, Yuchao Hu, Jinqiang Cui, Jian Liang Wang, Rodney Teo, and Feng
Lin. 3d motion planning for uavs in gps-denied unknown forest environment. In 2016 IEEE
Intelligent Vehicles Symposium (IV), pages 246–251. IEEE, 2016.
[69] Yulun Tian, Katherine Liu, Kyel Ok, Loc Tran, Danette Allen, Nicholas Roy, and Jonathan P
How. Search and rescue under the forest canopy using multiple uas. In International Symposium
on Experimental Robotics, pages 140–152. Springer, 2018.
[70] Ola Ringdahl, Ola Lindroos, Thomas Hellstro¨m, Dan Bergstro¨m, Dimitris Athanassiadis, and
Tomas Nordfjell. Path tracking in forest terrain by an autonomous forwarder. Scandinavian
Journal of Forest Research, 26(4):350–359, 2011. doi: 10.1080/02827581.2011.566889. URL
https://doi.org/10.1080/02827581.2011.566889.
[71] Xiaorui Zhu, Youngshik Kim, Mark A. Minor, and Chunxin Qiu. Autonomous Mobile Robots in
Unknown Outdoor Environments. CRC Press, Inc., USA, 1st edition, 2017. ISBN 1498740553.
[72] Claudine Badue, Raˆnik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B Car-
doso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M Paixao, Filipe Mutz, et al. Self-
driving cars: A survey. Expert Systems with Applications, 165:113816, 2021.
[73] Rajeev Thakur. Scanning lidar in advanced driver assistance systems and beyond: Building a
road map for next-generation lidar technology. IEEE Consumer Electronics Magazine, 5(3):
48–54, 2016.
[74] Li Qingqing, Jorge Pen˜a Queralta, Tuan Nguyen Gia, Zhuo Zou, and Tomi Westerlund. Multi
sensor fusion for navigation and mapping in autonomous vehicles: Accurate localization in urban
environments. Unammned Systems (to appear). Accepted and presented at the 9th IEEE CIS-
RAM conference., 2020.
[75] Wuming Zhang, Peng Wan, Tiejun Wang, Shangshu Cai, Yiming Chen, Xiuliang Jin, and
Guangjian Yan. A novel approach for the detection of standing tree stems from plot-level terres-
trial laser scanning data. Remote sensing, 11(2):211, 2019.
[76] Marek Pierzchała, Philippe Gigue´re, and Rasmus Astrup. Mapping forests using an unmanned
ground vehicle with 3D LiDAR and graph-SLAM. Computers and Electronics in Agriculture,
145:217 – 225, 2018. ISSN 0168-1699. doi: https://doi.org/10.1016/j.compag.2017.12.034.
URL http://www.sciencedirect.com/science/article/pii/S0168169917301631.
[77] Xinlian Liang, Ville Kankare, Juha Hyyppa¨, Yunsheng Wang, Antero Kukko, Henrik Haggre´n,
Xiaowei Yu, Harri Kaartinen, Anttoni Jaakkola, Fengying Guan, Markus Holopainen, and Mikko
Vastaranta. Terrestrial laser scanning in forest inventories. ISPRS Journal of Photogrammetry
and Remote Sensing, 115:63 – 77, 2016. ISSN 0924-2716. doi: https://doi.org/10.1016/
j.isprsjprs.2016.01.006. URL http://www.sciencedirect.com/science/article/pii/
S0924271616000204. Theme issue ’State-of-the-art in photogrammetry, remote sensing and
spatial information science’.
[78] Mikko Miettinen, Matti O¨hman, Arto Visala, and Pekka Forsman. Simultaneous localization and
mapping for forest harvesters. In IEEE International Conference on Robotics and Automation,
Rooma April 2007, pages 517–522, 2007.
[79] Jian Tang, Yuwei Chen, Antero Kukko, Harri Kaartinen, Anttoni Jaakkola, Ehsan Khoramshahi,
Teemu Hakala, Juha Hyyppa¨, Markus Holopainen, and Hannu Hyyppa¨. Slam-aided stem map-
ping for forest inventory with small-footprint mobile lidar. Forests, 6(12):4588–4606, 2015.
[80] Ayumu Tominaga, Hayashi Eiji, and Abbe Mowshowitz. Development of navigation system in
field robot for forest management. In 2018 Joint 10th International Conference on Soft Comput-
ing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent
Systems (ISIS), pages 1142–1147. IEEE, 2018.
114
LIST OF REFERENCES
[81] Steven W Chen, Guilherme V Nardari, Elijah S Lee, Chao Qu, Xu Liu, Roseli Ap Francelin
Romero, and Vijay Kumar. Sloam: Semantic lidar odometry and mapping for forest inventory.
IEEE Robotics and Automation Letters, 5(2):612–619, 2020.
[82] Keisuke Yoneda, Naoki Suganuma, Ryo Yanase, and Mohammad Aldibaja. Automated driving
recognition technologies for adverse weather conditions. IATSS Research, 2019.
[83] Aji Resindra Widya, Akihiko Torii, and Masatoshi Okutomi. Structure from motion using dense
cnn features with keypoint relocalization. IPSJ Transactions on Computer Vision and Applica-
tions, 10:1–7, 2018.
[84] Tomastik et al. Horizontal accuracy and applicability of smartphone gnss positioning in forests.
Forestry: An International Journal of Forest Research, 90(2):187–198, 2016.
[85] Eloise G. Zimbelman and Robert F. Keefe. Real-time positioning in logging: Effects of forest
stand characteristics, topography, and line-of-sight obstructions on gnss-rf transponder accuracy
and radio signal propagation. PLOS ONE, 13(1):1–17, 01 2018. doi: 10.1371/journal.pone.
0191017. URL https://doi.org/10.1371/journal.pone.0191017.
[86] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the icp algorithm. In Proceedings
Third International Conference on 3-D Digital Imaging and Modeling, pages 145–152. IEEE,
2001.
[87] Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. Generalized-icp. In Robotics: science and
systems, volume 2, page 435. Seattle, WA, 2009.
[88] Dirk Holz, Alexandru E Ichim, Federico Tombari, Radu B Rusu, and Sven Behnke. Registra-
tion with the point cloud library: A modular framework for aligning in 3-d. IEEE Robotics &
Automation Magazine, 22(4):110–124, 2015.
[89] Martin Lauer, Sascha Lange, and Martin Riedmiller. Calculating the perfect match: an efficient
and accurate approach for robot self-localization. In Robot Soccer World Cup. Springer, 2005.
[90] Peter Biber and Wolfgang Straßer. The normal distributions transform: A new approach to laser
scan matching. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS 2003)(Cat. No. 03CH37453), volume 3. IEEE, 2003.
[91] Jiaolong Yang, Hongdong Li, Dylan Campbell, and Yunde Jia. Go-icp: A globally opti-
mal solution to 3d icp point-set registration. IEEE Trans. Pattern Anal. Mach. Intell., 38
(11):2241–2254, 11 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2015.2513405. URL
https://doi.org/10.1109/TPAMI.2015.2513405.
[92] Pedro Nunez, Ricardo Vazquez-Martin, Jose C del Toro, Antonio Bandera, and Francisco San-
doval. Feature extraction from laser scan data based on curvature estimation for mobile robotics.
In Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA
2006., pages 1167–1172. IEEE, 2006.
[93] Aparajithan Sampath and Jie Shan. Clustering based planar roof extraction from lidar data. In
American Society for Photogrammetry and Remote Sensing Annual Conference, Reno, Nevada,
May, pages 1–6, 2006.
[94] Jing Liang, Jixian Zhang, Kazhong Deng, Zhengjun Liu, and Qunshan Shi. A new power-line
extraction method based on airborne lidar point cloud data. In 2011 International Symposium on
Image and Data Fusion, pages 1–4. IEEE, 2011.
[95] Sebastian Thrun, Daphne Koller, Zoubin Ghahmarani, and Hugh Durrant-Whyte. Slam updates
require constant time. In Workshop on the Algorithmic Foundations of Robotics, pages 1–20.
Citeseer, 2002.
[96] Yufeng Liu and Sebastian Thrun. Results for outdoor-slam using sparse extended information fil-
ters. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422),
volume 1, pages 1227–1233. IEEE, 2003.
[97] Iwan Ulrich and Illah Nourbakhsh. Appearance-based place recognition for topological local-
ization. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference
on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 2, pages
1023–1029. Ieee, 2000.
115
Qingqing Li
[98] Jun Chen, Chaomin Luo, Mohan Krishnan, Mark Paulik, and Yipeng Tang. An enhanced dy-
namic delaunay triangulation-based path planning algorithm for autonomous mobile robot navi-
gation. In Intelligent Robots and Computer Vision XXVII: Algorithms and Techniques, volume
7539, page 75390P. International Society for Optics and Photonics, 2010.
[99] Marian Himstedt, Jan Frost, Sven Hellbach, Hans-Joachim Bo¨hme, and Erik Maehle. Large
scale place recognition in 2d lidar scans using geometrical landmark relations. In 2014 IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 5030–5035. IEEE, 2014.
[100] Simon Lynen, Michael Bosse, and Roland Siegwart. Trajectory-based place-recognition for
efficient large scale localization. International Journal of Computer Vision, 124(1):49–64, 2017.
[101] Michael Bosse and Robert Zlot. Place recognition using keypoint voting in large 3d lidar datasets.
In 2013 IEEE International Conference on Robotics and Automation, pages 2677–2684. IEEE,
2013.
[102] Michael Bosse and Robert Zlot. Keypoint design and evaluation for place recognition in 2d lidar
maps. Robotics and Autonomous Systems, 57(12):1211–1224, 2009.
[103] Herbert Edelsbrunner. Triangulations and meshes in computational geometry. Acta Numerica
2000, 9:133 – 213, 01 2000. doi: 10.1017/S0962492900001331.
[104] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com-
munications of the ACM, 18(9):509–517, 1975.
[105] Radu Bogdan Rusu. Semantic 3d object maps for everyday manipulation in human living envi-
ronments. KI-Ku¨nstliche Intelligenz, 24(4):345–348, 2010.
[106] Der-Tsai Lee and Bruce J Schachter. Two algorithms for constructing a delaunay triangulation.
International Journal of Computer & Information Sciences, 9(3):219–242, 1980.
[107] Z. Arzoumanian, Jason Holmberg, and Brad Norman. An astronomical pattern-matching algo-
rithm for computer-aided identification of whale sharks rhincodon typus. Journal of Applied
Ecology, 42:999 – 1011, 09 2005. doi: 10.1111/j.1365-2664.2005.01117.x.
[108] David Sinclair. S-hull: a fast radial sweep-hull routine for Delaunay triangulation. arXiv e-prints,
art. arXiv:1604.01428, 2016.
[109] Fani Boukouvala, Ruth Misener, and Christodoulos Floudas. Global optimization advances in
mixed-integer nonlinear programming, minlp, and constrained derivative-free optimization, cdfo.
European Journal of Operational Research, 252, 12 2015. doi: 10.1016/j.ejor.2015.12.018.
[110] Ryan W Wolcott and Ryan M Eustice. Visual localization within lidar maps for automated urban
driving. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages
176–183. IEEE, 2014.
[111] Ahti Heinla, Allan Martinson, Kalle-rasmus Volkov, Andrew Macks, Lindsay Roberts, Indrek
Mandre, Ma¨rt Liivik, Tiit Liivik, and Ivo Liivik. Method and system for autonomous or semi-
autonomous delivery, 11 2021. US Patent 11,164,273.
[112] Stephen Se, David G Lowe, and James J Little. Vision-based global localization and mapping
for mobile robots. IEEE Transactions on robotics, 21(3):364–375, 2005.
[113] Emilio Garcia-Fidalgo and Alberto Ortiz. Vision-based topological mapping and localization
methods: A survey. Robotics and Autonomous Systems, 64, 2015.
[114] Cristiano Premebida, Luis Garrote, Alireza Asvadi, A Pedro Ribeiro, and Urbano Nunes. High-
resolution lidar-based depth mapping using bilateral filter. In 2016 IEEE 19th international
conference on intelligent transportation systems (ITSC), pages 2469–2474. IEEE, 2016.
[115] M. John. Google cars drive themselves. The New York Times. URL http://www. nytimes.
com/2010/10/10/science/10google. html, 2010.
[116] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon
Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning
for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[117] Mark Harris. Documents confirm apple is building self-driving car. The Guardian, 14, 2015.
[118] Nathan A Greenblatt. Self-driving cars and the law. IEEE spectrum, 53(2):46–51, 2016.
[119] Shinpei Kato, Eijiro Takeuchi, Yoshio Ishiguro, Yoshiki Ninomiya, Kazuya Takeda, and
Tsuyoshi Hamada. An open approach to autonomous vehicles. IEEE Micro, 35(6):60–68, 2015.
116
LIST OF REFERENCES
[120] Turn Zhang. Lidar-based road and road-edge detection. In 2010 IEEE Intelligent Vehicles Sym-
posium, pages 845–848, 06 2010. doi: 10.1109/IVS.2010.5548134.
[121] Erik Hofmann and Marco Ru¨sch. Industry 4.0 and the current status as well as future prospects
on logistics. Computers in Industry, 89:23–34, 2017.
[122] Heike Fla¨mig. Autonomous vehicles and autonomous driving in freight transport. In Autonomous
driving, pages 365–385. Springer, 2016.
[123] Nils Boysen, Stefan Schwerdfeger, and Felix Weidinger. Scheduling last-mile deliveries with
truck-based autonomous robots. European Journal of Operational Research, 271(3):1085–1099,
2018.
[124] Ji Zhang and Sanjiv Singh. Visual-lidar odometry and mapping: Low-drift, robust, and fast.
In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 2174–2181.
IEEE, 2015.
[125] Hyunggi Cho, Young-Woo Seo, BVK Vijaya Kumar, and Ragunathan Raj Rajkumar. A multi-
sensor fusion system for moving object detection and tracking in urban driving environments.
In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1836–1843.
IEEE, 2014.
[126] Pavel Gonza´lez Prieto, Fernando Martı´n, Luis Moreno, and Juan Carballeira. Dendt: 3d-ndt scan
matching with differential evolution. In 2017 25th Mediterranean Conference on Control and
Automation (MED), pages 719–724, 07 2017. doi: 10.1109/MED.2017.7984203.
[127] Chris Rizos, Volker Janssen, Craig Roberts, and Th Grinter. Precise point positioning: Is the era
of differential gnss positioning drawing to an end? University Of Tasmania, 2012.
[128] The KITTI Vision Benchmark Suite. Visual odometry and slam evaluation, 2012. URL http:
//www.cvlibs.net/datasets/kitti/eval_odometry.php.
[129] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob
Wheeler, and Andrew Y Ng. Ros: an open-source robot operating system. In ICRA workshop
on open source software. Kobe, Japan, 2009.
[130] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point cloud library (PCL). In 2011 IEEE
International Conference on Robotics and Automation, 05 2011. doi: 10.1109/ICRA.2011.
5980567.
[131] Weisong Wen, Li-Ta Hsu, and Guohao Zhang. Performance analysis of ndt-based graph slam
for autonomous vehicle in diverse typical driving scenarios of hong kong. Sensors, 18(11):3928,
2018.
[132] Li Qingqing, Yu Xianjia, Jorge Pena Queralta, and Tomi Westerlund. Multi-modal lidar dataset
for benchmarking general-purpose localization and mapping algorithms. In 2022 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022.
[133] TY Lam. Hamilton’s quaternions. In Handbook of algebra, volume 3, pages 429–454. Elsevier,
2003.
[134] Jian S Dai. Euler–rodrigues formula variations, quaternion conjugation and intrinsic connections.
Mechanism and Machine Theory, 92:144–152, 2015.
[135] Si Agarwal and Ki Mierle. Ceres solver, 2023. URL http://ceres-solver.org.
[136] Frank Dellaert. Factor graphs and gtsam: A hands-on introduction. Georgia Institute of Technol-
ogy, Tech. Rep, 2:4, 2012.
[137] Marco Lixia, Andrea Benigni, Alessandra Flammini, Carlo Muscas, Ferdinanda Ponci, and An-
tonello Monti. A software-only ptp synchronization for power system state estimation with pmus.
IEEE Transactions on Instrumentation and Measurement, 61(5):1476–1485, 2012.
[138] Hanjie Liu, Guoliang Liu, Guohui Tian, Shiqing Xin, and Ze Ji. Visual slam based on dynamic
object removal. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO),
pages 596–601. IEEE, 2019.
[139] Li Hou, Wanggen Wan, Jenq-Neng Hwang, Rizwan Muhammad, Mingyang Yang, and Kang
Han. Human tracking over camera networks: a review. EURASIP Journal on Advances in
Signal Processing, 2017(1):1–20, 2017.
117
Qingqing Li
[140] Mahmudul Hasan, Junichi Hanawa, Riku Goto, Ryota Suzuki, Hisato Fukuda, Yoshinori Kuno,
and Yoshinori Kobayashi. Lidar-based detection, tracking, and property estimation: A contem-
porary review. Neurocomputing, 2022.
[141] Li Qingqing, Yu Xianjia, Jorge Pena Queralta, and Tomi Westerlund. Adaptive lidar scan frame
integration: Tracking known mavs in 3d point clouds. In 2021 20th International Conference on
Advanced Robotics (ICAR), pages 1079–1086. IEEE, 2021.
[142] Toma´sˇ Roucˇek, Martin Pecka, Petr Cˇı´zˇek, Toma´sˇ Petrˇı´cˇek, Jan Bayer, Vojteˇch Sˇalansky`, Daniel
Herˇt, Mateˇj Petrlı´k, Toma´sˇ Ba´cˇa, Vojeˇch Spurny`, et al. Darpa subterranean challenge: Multi-
robotic exploration of underground environments. In Modelling and Simulation for Autonomous
Systems: 6th International Conference, MESAS 2019, Palermo, Italy, October 29–31, 2019, Re-
vised Selected Papers 6, pages 274–290. Springer, 2020.
[143] Mateˇj Petrlı´k, Toma´sˇ Ba´cˇa, Daniel Herˇt, Matousˇ Vrba, Toma´sˇ Krajnı´k, and Martin Saska. A
robust uav system for operations in a constrained environment. IEEE Robotics and Automation
Letters, 5(2):2169–2176, 2020.
[144] Ty Nguyen, Kartik Mohta, Camillo J Taylor, and Vijay Kumar. Vision-based multi-mav local-
ization with anonymous relative measurements using coupled probabilistic data association filter.
In IEEE International Conference on Robotics and Automation (ICRA), pages 3349–3355. IEEE,
2020.
[145] Jorge Pen˜a Queralta, Qingqing Li, Fabrizio Schiano, and Tomi Westerlund. Vio-uwb-based col-
laborative localization and dense scene reconstruction within heterogeneous multi-robot systems.
In 2022 International Conference on Advanced Robotics and Mechatronics (ICARM), pages 87–
94. IEEE, 2022.
[146] Li Qingqing, Jorge Pena Queralta, Tuan Nguyen Gia, and Tomi Westerlund. Offloading monoc-
ular visual odometry with edge computing: Optimizing image quality in multi-robot systems. In
Proceedings of the 2019 5th International Conference on Systems, Control and Communications,
pages 22–26, 2019.
[147] Hazim Shakhatreh, Ahmad H Sawalmeh, Ala Al-Fuqaha, Zuochao Dou, Eyad Almaita, Issa
Khalil, Noor Shamsiah Othman, Abdallah Khreishah, and Mohsen Guizani. Unmanned aerial
vehicles (uavs): A survey on civil applications and key research challenges. IEEE Access, 7:
48572–48634, 2019.
[148] Jorge Pen˜a Queralta, Jussi Taipalmaa, Bilge Can Pullinen, Victor Kathan Sarker, Tuan Nguyen
Gia, Hannu Tenhunen, Moncef Gabbouj, Jenni Raitoharju, and Tomi Westerlund. Collaborative
multi-robot search and rescue: Coordination and perception. arXiv preprint arXiv:2008.12610
[cs.RO], 2020.
[149] Sean Grogan, R Pellerin, and M Gamache. The use of unmanned aerial vehicles and drones in
search and rescue operations–a survey. Proceedings of the PROLOG, 2018.
[150] William Roberts, Kelly Griendling, Anthony Gray, and D Mavris. Unmanned vehicle collabora-
tion research environment for maritime search and rescue. In 30th Congress of the International
Council of the Aeronautical Sciences. International Council of the Aeronautical Sciences (ICAS)
Bonn, Germany, 2016.
[151] Royal Spanish Federation of First Aid and Rescue. National Drownings Report - Informe Na-
cional de Ahogamientos (INA), 2019.
[152] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv, 2018.
[153] Suk-Ju Hong, Yunhyeok Han, Sang-Yeon Kim, Ah-Yeong Lee, and Ghiseok Kim. Application
of deep-learning methods to bird detection using unmanned aerial vehicle imagery. Sensors, 19
(7):1651, 2019.
[154] Rahul Tallamraju, Eric Price, Roman Ludwig, Kamalakar Karlapalem, Heinrich H Bu¨lthoff,
Michael J Black, and Aamir Ahmad. Active perception based formation control for multiple
aerial vehicles. IEEE Robotics and Automation Letters, 4(4):4491–4498, 2019.
[155] D. Gallos and F. Ferrie. Active vision in the era of convolutional neural networks. In 2019 16th
Conference on Computer and Robot Vision (CRV), pages 81–88, 2019.
118
LIST OF REFERENCES
[156] Jorge Pen˜a Queralta, Jenni Raitoharju, Tuan Nguyen Gia, Nikolaos Passalis, and Tomi West-
erlund. Autosos: Towards multi-uav systems supporting maritime search and rescue with
lightweight ai and edge computing. arXiv preprint arXiv:2005.03409, 2020.
[157] Anibal Matos, Alfredo Martins, Andre Dias, Bruno Ferreira, Jose´ Miguel Almeida, Hugo Fer-
reira, Guilherme Amaral, Andre Figueiredo, Rui Almeida, and Filipe Silva. Multiple robot
operations for maritime search and rescue in eurathlon 2015 competition. In OCEANS 2016-
Shanghai, pages 1–7. IEEE, 2016.
[158] Johannes Gu¨ldenring, Lucas Koring, Philipp Gorczak, and Christian Wietfeld. Heterogeneous
multilink aggregation for reliable uav communication in maritime search and rescue missions.
In 2019 International Conference on Wireless and Mobile Computing, Networking and Commu-
nications (WiMob), pages 215–220. IEEE, 2019.
[159] Rudin Konrad, Daniel Serrano, and Pascal Strupler. Unmanned aerial systems. Search and
Rescue Robotics—From Theory to Practice, pages 37–52, 2017.
[160] Hartmut Surmann, Rainer Worst, Tim Buschmann, Artur Leinweber, Alexander Schmitz, Ger-
hard Senkowski, and Niklas Goddemeier. Integration of uavs in urban search and rescue missions.
In 2019 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pages
203–209. IEEE, 2019.
[161] Themistoklis Giitsidis, Evangelos G Karakasis, Antonios Gasteratos, and G Ch Sirakoulis. Hu-
man and fire detection from high altitude uav images. In 2015 23rd Euromicro International Con-
ference on Parallel, Distributed, and Network-Based Processing, pages 309–315. IEEE, 2015.
[162] S. Yong and Y. Yeong. Human object detection in forest with deep learning based on drone’s
vision. In 2018 4th International Conference on Computer and Information Sciences (ICCOINS),
pages 1–5. IEEE, 2018.
[163] Sankula Likhith Krishna, Guduru Sai Rama Chaitanya, Abbasani Sree Hari Reddy,
Arasada Manoj Naidu, SS Poorna, and K Anuraj. Autonomous human detection system mounted
on a drone. In 2019 International Conference on Wireless Communications Signal Processing
and Networking (WiSPNET), pages 335–338. IEEE, 2019.
[164] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for ac-
curate object detection and semantic segmentation. In Computer Vision and Pattern Recognition,
2014.
[165] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer
vision, pages 1440–1448, 2015.
[166] R. Bajcsy, Y. Aloimonos, and J.K Tsotsos. Revisiting active perception. Autonomous Robots,
42:177—-196, 2018.
[167] P. Ammirato, P. Poirson, E. Park, J. Kosˇecka´, and A. C. Berg. A dataset for developing and
benchmarking active vision. In 2017 IEEE International Conference on Robotics and Automa-
tion (ICRA), pages 1378–1385, 2017.
[168] Juan Sandino, Fernando Vanegas, Felipe Gonza´lez, and Fre´de´ric Maire. Autonomous uav nav-
igation for active perception of targets in uncertain and cluttered environments. In 2020 IEEE
Aerospace Conference, 2020.
[169] D. Falanga, E. Mueggler, M. Faessler, and D. Scaramuzza. Aggressive quadrotor flight through
narrow gaps with onboard sensing and computing using active vision. In IEEE International
Conference on Robotics and Automation (ICRA), pages 5774–5781, 2017.
[170] M. Chessa, S. Murgia, L. Nardelli, S. P. Sabatini, and F. Solari. Bio-inspired active vision
for obstacle avoidance. In 2014 International Conference on Computer Graphics Theory and
Applications (GRAPP), pages 1–8, 2014.
[171] Fangwei Zhong, Peng Sun, Wenhan Luo, Tingyun Yan, and Yizhou Wang. AD-VAT: An asym-
metric dueling mechanism for learning visual active tracking. In International Conference on
Learning Representations, 2019.
[172] Eleftherios Lygouras, Nicholas Santavas, Anastasios Taitzoglou, Konstantinos Tarchanidis,
Athanasios Mitropoulos, and Antonios Gasteratos. Unsupervised human detection with an em-
119
Qingqing Li
bedded vision system on a fully autonomous uav for search and rescue operations. Sensors, 19
(16):3542, 2019.
[173] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierar-
chical Image Database. In CVPR09, 2009.
[174] Labelbox. Labelbox, 2019. [Online]. Available: https://labelbox.com.
[175] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual
object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June
2010.
[176] R. Padilla, S. L. Netto, and E. A. B. da Silva. A survey on performance metrics for object-
detection algorithms. In 2020 International Conference on Systems, Signals and Image Process-
ing (IWSSIP), pages 237–242, 2020.
[177] Michael Blo¨sch, Stephan Weiss, Davide Scaramuzza, and Roland Siegwart. Vision based mav
navigation in unknown and unstructured environments. In 2010 IEEE International Conference
on Robotics and Automation, pages 21–28. IEEE, 2010.
[178] Matthias Nieuwenhuisen, David Droeschel, Marius Beul, and Sven Behnke. Autonomous navi-
gation for micro aerial vehicles in complex gnss-denied environments. Journal of Intelligent &
Robotic Systems, 84:199–216, 2016.
[179] Yu Song, Stephen Nuske, and Sebastian Scherer. A multi-sensor fusion mav state estimation
from long-range stereo, imu, gps and barometric sensors. Sensors, 17(1):11, 2017.
[180] Hao Xu, Luqi Wang, Yichen Zhang, Kejie Qiu, and Shaojie Shen. Decentralized visual-inertial-
uwb fusion for relative state estimation of aerial swarm. arXiv preprint arXiv:2003.05138, 2020.
[181] Abel Gawel, Yukai Lin, The´odore Koutros, Roland Siegwart, and Cesar Cadena. Aerial-ground
collaborative sensing: Third-person view for teleoperation. In IEEE International Symposium
on Safety, Security, and Rescue Robotics (SSRR), pages 1–7. IEEE, 2018.
[182] Jorge Pena Queralta, Jussi Taipalmaa, Bilge Can Pullinen, Victor Kathan Sarker, Tuan Nguyen
Gia, Hannu Tenhunen, Moncef Gabbouj, Jenni Raitoharju, and Tomi Westerlund. Collaborative
multi-robot search and rescue: Planning, coordination, perception, and active vision. Ieee Access,
8:191617–191643, 2020.
[183] Ismail Guvenc, Farshad Koohifar, Simran Singh, Mihail L Sichitiu, and David Matolak. De-
tection, tracking, and interdiction for amateur drones. IEEE Communications Magazine, 56(4):
75–81, 2018.
[184] Sebastien Hengy, Martin Laurenzis, Ste´phane Schertzer, Alexander Hommes, Franck Kloep-
pel, Alex Shoykhetbrod, Thomas Geibig, Winfried Johannes, Oussama Rassy, and Frank Christ-
nacher. Multimodal uav detection: Study of various intrusion scenarios. In Electro-Optical
Remote Sensing XI, volume 10434, pages 203–212. SPIE, 2017.
[185] Matousˇ Vrba and Martin Saska. Marker-less micro aerial vehicle detection and localization using
convolutional neural networks. IEEE Robotics and Automation Letters (RA-L), 5(2):2459–2466,
2020.
[186] Jan Razlaw, Jan Quenzel, and Sven Behnke. Detection and tracking of small objects in sparse 3d
laser range data. In IEEE International Conference on Robotics and Automation (ICRA), pages
2967–2973. IEEE, 2019.
[187] Frank Neuhaus, Tilman Koß, Robert Kohnen, and Dietrich Paulus. Mc2slam: Real-time inertial
lidar odometry using two-scan motion compensation. In Pattern Recognition: 40th German
Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings 40, pages 60–
72. Springer, 2019.
[188] Wang Shule, Carmen Martı´nez Almansa, Jorge Pen˜a Queralta, Zhuo Zou, and Tomi Westerlund.
Uwb-based localization for multi-uav systems and collaborative heterogeneous multi-robot sys-
tems. Procedia Computer Science, 175:357–364, 2020.
[189] Carmen Martı´nez Almansa, Wang Shule, Jorge Pena Queralta, and Tomi Westerlund. Autocali-
bration of a mobile uwb localization system for ad-hoc multi-robot deployments in gnss-denied
environments. arXiv preprint arXiv:2004.06762, 2020.
120
LIST OF REFERENCES
[190] Radu Bogdan Rusu and Steve Cousins. 3d is here: Point cloud library (pcl). In 2011 IEEE
international conference on robotics and automation, pages 1–4. IEEE, 2011.
[191] Jorge Pen˜a Queralta, Carmen Martı´nez Almansa, Fabrizio Schiano, Dario Floreano, and Tomi
Westerlund. Uwb-based system for uav localization in gnss-denied environments: Characteriza-
tion and dataset. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pages 4521–4528. IEEE, 2020.
[192] Paul Furgale, Joern Rehder, and Roland Siegwart. Unified temporal and spatial calibration for
multi-sensor systems. In 2013 IEEE/RSJ International Conference on Intelligent Robots and
Systems, pages 1280–1286. IEEE, 2013.
[193] Guohang Yan, Zhuochun Liu, Chengjie Wang, Chunlei Shi, Pengjin Wei, Xinyu Cai, Tao Ma,
Zhizheng Liu, Zebin Zhong, Yuqian Liu, et al. Opencalib: A multi-sensor calibration toolbox
for autonomous driving. Software Impacts, 14:100393, 2022.
[194] Woosik Lee, Yulin Yang, and Guoquan Huang. Efficient multi-sensor aided inertial navigation
with online calibration. In 2021 IEEE International Conference on Robotics and Automation
(ICRA), pages 5706–5712. IEEE, 2021.
[195] Yulin Yang, Patrick Geneva, Xingxing Zuo, and Guoquan Huang. Online imu intrinsic calibra-
tion: Is it necessary? 2020 Robotics: Science and Systems, 2020.
[196] Zhenfei Yang and Shaojie Shen. Monocular visual–inertial state estimation with online initial-
ization and camera–imu extrinsic calibration. IEEE Transactions on Automation Science and
Engineering, 14(1):39–51, 2016.
[197] Renaud Dube´, Daniel Dugas, Elena Stumm, Juan Nieto, Roland Siegwart, and Cesar Cadena.
Segmatch: Segment based place recognition in 3d point clouds. In 2017 IEEE International
Conference on Robotics and Automation (ICRA), pages 5266–5272. IEEE, 2017.
[198] Mikaela Angelina Uy and Gim Hee Lee. Pointnetvlad: Deep point cloud based retrieval for
large-scale place recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4470–4479, 2018.
[199] Wenxiao Zhang and Chunxia Xiao. Pcan: 3d attention map learning using contextual information
for point cloud based retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 12436–12445, 2019.
[200] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point
sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 652–660, 2017.
[201] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical
feature learning on point sets in a metric space. Advances in neural information processing
systems, 30, 2017.
[202] Tsung-Yi Lin, Piotr Dolla´r, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie.
Feature pyramid networks for object detection. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 2117–2125, 2017.
[203] Jacek Komorowski. Minkloc3d: Point cloud based large-scale place recognition. In Proceedings
of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1790–1799,
2021.
[204] Marco Hutter, Philipp Leemann, Stefan Stevsic, Andreas Michel, Dominic Jud, Mark
Hoepflinger, Roland Siegwart, Ruedi Figi, Christian Caduff, Markus Loher, et al. Towards
optimal force distribution for walking excavators. In 2015 international conference on advanced
robotics (ICAR), pages 295–301. IEEE, 2015.
[205] Nived Chebrolu, Philipp Lottes, Thomas La¨be, and Cyrill Stachniss. Robot localization based
on aerial images for precision agriculture tasks in crop fields. In 2019 International Conference
on Robotics and Automation (ICRA), pages 1787–1793. IEEE, 2019.
[206] Jialian Li, Jingyi Zhang, Zhiyong Wang, Siqi Shen, Chenglu Wen, Yuexin Ma, Lan Xu, Jingyi
Yu, and Cheng Wang. Lidarcap: Long-range marker-less 3d human motion capture with lidar
point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 20502–20512, 2022.
121
Qingqing Li
[207] Lan Xu, Weipeng Xu, Vladislav Golyanik, Marc Habermann, Lu Fang, and Christian Theobalt.
Eventcap: Monocular 3d capture of high-speed human motions using an event camera. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4968–
4978, 2020.
[208] Ruilong Li, Yuliang Xiu, Shunsuke Saito, Zeng Huang, Kyle Olszewski, and Hao Li. Monoc-
ular real-time volumetric performance capture. In Computer Vision–ECCV 2020: 16th Euro-
pean Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pages 49–67.
Springer, 2020.
[209] Simon Chamorro, Jack Collier, and Franc¸ois Grondin. Neural network based lidar gesture recog-
nition for realtime robot teleoperation. In 2021 IEEE International Symposium on Safety, Secu-
rity, and Rescue Robotics (SSRR), pages 98–103. IEEE, 2021.
[210] Vasileios Moysiadis, Dimitrios Katikaridis, Lefteris Benos, Patrizia Busato, Athanasios Anag-
nostis, Dimitrios Kateris, Simon Pearson, and Dionysis Bochtis. An integrated real-time hand
gesture recognition framework for human–robot interaction in agriculture. Applied Sciences, 12
(16):8160, 2022.
122
Q
ingqing Li
F 40
A
N
N
A
LES U
N
IV
ERSITATIS TU
RK
U
EN
SIS
ISBN 978-951-29-9735-0 (PRINT)
ISBN 978-951-29-9736-7 (PDF)
ISSN 2736-9390 (PRINT)
ISSN 2736-9684 (ONLINE)
Pa
in
os
al
am
a,
 T
ur
ku
, F
in
la
nd
 2
02
4