End-to-End Learned Visual Odometry Based on Vision Transformer

dc.contributor.author: Vyas, Aman Manishbhai
dc.contributor.department: Department of Computing
dc.contributor.faculty: Faculty of Technology
dc.contributor.studysubject: Information and Communication Technology
dc.date.accessioned: 2024-08-03T21:02:24Z
dc.date.available: 2024-08-03T21:02:24Z
dc.date.issued: 2024-07-30
dc.description.abstract: Estimating camera pose from the images of a single camera, a task known as monocular visual odometry, is fundamental to mobile robots and autonomous vehicles. Traditional approaches rely on geometric methods that require significant engineering effort tailored to specific scenarios, whereas deep learning methods generalize given extensive training data and have shown promising results. Recently, transformer-based architectures, which have been highly successful in natural language processing and computer vision, have proven effective for this task as well. In this study, we introduce a Vision Transformer (ViT) model that leverages spatio-temporal self-attention to extract features from image sequences and estimate camera motion in an end-to-end manner. Extensive experiments on the KITTI visual odometry dataset demonstrate that ViT achieves competitive state-of-the-art performance, surpassing both traditional geometry-based methods and existing deep learning approaches, including DeepVO, MagicVO, and PoseNet. Across five route trajectories with varying environmental conditions, ViT achieves up to an 8% improvement in translation error and a 4% improvement in rotation error over previous deep learning methods. These results underscore the effectiveness of transformer-based architectures in capturing the complex spatio-temporal dependencies essential for accurate visual odometry, and highlight ViT's potential to enhance pose estimation in dynamic environments and advance autonomous navigation technologies.
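The pipeline the abstract describes, self-attention over spatio-temporal patch tokens from consecutive frames, followed by a head that regresses a 6-DoF camera motion, can be sketched as follows. This is only an illustration of the general technique: the token count, embedding size, random weights, and the mean-pooled linear pose head are assumptions, not details taken from the thesis.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token affinities
    return softmax(scores) @ v                # context-mixed token features

# Hypothetical sizes: two consecutive frames, each split into 4 patches,
# embedded into d = 16 dimensions -> 8 spatio-temporal tokens in total.
rng = np.random.default_rng(0)
d = 16
tokens = rng.standard_normal((8, d))          # patch embeddings of a frame pair
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
attended = self_attention(tokens, Wq, Wk, Wv)  # shape (8, d)

# Illustrative pose head: mean-pool the tokens, then a linear map to a
# 6-DoF motion (3 translation + 3 rotation parameters).
W_pose = rng.standard_normal((d, 6))
pose = attended.mean(axis=0) @ W_pose          # shape (6,)
```

A trained model would stack many such attention layers with learned weights and train the whole stack end-to-end on pose supervision; the single random-weight layer above only shows how spatial and temporal tokens are mixed before pose regression.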
dc.format.extent: 63
dc.identifier.olddbid: 195797
dc.identifier.oldhandle: 10024/178848
dc.identifier.uri: https://www.utupub.fi/handle/11111/18880
dc.identifier.urn: URN:NBN:fi-fe2024080263484
dc.language.iso: eng
dc.rights: This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
dc.rights.accessrights: open (avoin)
dc.source.identifier: https://www.utupub.fi/handle/10024/178848
dc.title: End-to-End Learned Visual Odometry Based on Vision Transformer
dc.type.ontasot: Master's thesis

Files

Name: Aman_Manishbhai_Vyas_Thesis.pdf
Size: 1.92 MB
Format: Adobe Portable Document Format