End-to-End Learned Visual Odometry Based on Vision Transformer

dc.contributor.author: Vyas, Aman Manishbhai
dc.contributor.department: Department of Computing
dc.contributor.faculty: Faculty of Technology
dc.contributor.studysubject: Information and Communication Technology
dc.date.accessioned: 2024-08-03T21:02:24Z
dc.date.available: 2024-08-03T21:02:24Z
dc.date.issued: 2024-07-30
dc.description.abstract: Estimating camera pose from the images of a single camera, a task known as monocular visual odometry, is fundamental to mobile robots and autonomous vehicles. Traditional approaches rely on geometric methods that require significant engineering effort tailored to specific scenarios, whereas deep learning methods generalize given extensive training data and have shown promising results. Recently, transformer-based architectures, which have been highly successful in natural language processing and computer vision, have proven effective for this task as well. In this study, we introduce a Vision Transformer (ViT) model that leverages spatio-temporal self-attention to extract features from image sequences and estimate camera motion in an end-to-end manner. Extensive experiments on the KITTI visual odometry dataset demonstrate that ViT achieves competitive state-of-the-art performance, surpassing both traditional geometry-based methods and existing deep learning approaches, including DeepVO, MagicVO, and PoseNet. Across five route trajectories with varying environmental conditions, ViT achieves up to an 8% improvement in translation error and a 4% improvement in rotation error over previous deep learning methods. These results underscore the effectiveness of transformer-based architectures in capturing the complex spatio-temporal dependencies essential for accurate visual odometry, and highlight ViT's potential to enhance pose estimation in dynamic environments and advance autonomous navigation technologies.
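The pipeline the abstract describes, self-attention over spatio-temporal patch tokens from consecutive frames, followed by a head that regresses a 6-DoF camera motion, can be sketched as follows. This is only an illustration of the general technique: the token count, embedding size, random weights, and the mean-pooled linear pose head are assumptions, not details taken from the thesis.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token affinities
    return softmax(scores) @ v                # context-mixed token features

# Hypothetical sizes: two consecutive frames, each split into 4 patches,
# embedded into d = 16 dimensions -> 8 spatio-temporal tokens in total.
rng = np.random.default_rng(0)
d = 16
tokens = rng.standard_normal((8, d))          # patch embeddings of a frame pair
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
attended = self_attention(tokens, Wq, Wk, Wv)  # shape (8, d)

# Illustrative pose head: mean-pool the tokens, then a linear map to a
# 6-DoF motion (3 translation + 3 rotation parameters).
W_pose = rng.standard_normal((d, 6))
pose = attended.mean(axis=0) @ W_pose          # shape (6,)
```

A trained model would stack many such attention layers with learned weights and train the whole stack end-to-end on pose supervision; the single random-weight layer above only shows how spatial and temporal tokens are mixed before pose regression.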
dc.format.extent: 63
dc.identifier.olddbid: 195797
dc.identifier.oldhandle: 10024/178848
dc.identifier.uri: https://www.utupub.fi/handle/11111/18880
dc.identifier.urn: URN:NBN:fi-fe2024080263484
dc.language.iso: eng
dc.rights: This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
dc.rights.accessrights: open (avoin)
dc.source.identifier: https://www.utupub.fi/handle/10024/178848
dc.title: End-to-End Learned Visual Odometry Based on Vision Transformer
dc.type.ontasot: Master's thesis

Files

Name: Aman_Manishbhai_Vyas_Thesis.pdf
Size: 1.92 MB
Format: Adobe Portable Document Format