A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human-robot interaction

dc.contributor.author: Jalayer, Reza
dc.contributor.author: Jalayer, Masoud
dc.contributor.author: Orsenigo, Carlotta
dc.contributor.author: Tomizuka, Masayoshi
dc.contributor.organization: fi=materiaalitekniikka|en=Materials Engineering|
dc.contributor.organization-code: 1.2.246.10.2458963.20.80931480620
dc.converis.publication-id: 500332936
dc.converis.url: https://research.utu.fi/converis/portal/Publication/500332936
dc.date.accessioned: 2026-01-21T12:24:54Z
dc.date.available: 2026-01-21T12:24:54Z
dc.description.abstract: Hand-based analysis, including hand detection, segmentation, and gesture recognition, plays a pivotal role in enabling natural and intuitive human-robot interaction (HRI). Recent advances in vision-based deep learning (DL) have significantly improved robots' ability to interpret hand cues across diverse settings. However, previous reviews have neither addressed all three tasks collectively nor focused on recent DL architectures. Filling this gap, we review recent studies at the intersection of DL and hand-based interaction in HRI. We structure the literature around three core tasks, namely hand detection, segmentation, and gesture recognition, highlighting the DL models, dataset characteristics, evaluation metrics, and key challenges for each. We further examine the application of these models across industrial, assistive, social, aerial, and space robotics domains. We identify the dominant role of Convolutional and Recurrent Neural Networks (CNNs and RNNs), as well as emerging approaches such as attention-based models (Transformers), uncertainty-aware models, Graph Neural Networks (GNNs), and foundation models, namely Vision-Language Models (VLMs) and Large Language Models (LLMs). Our analysis reveals several gaps: the scarcity of HRI-specific datasets, underrepresentation of multi-hand and multi-user scenarios, limited use of RGB-D and multi-modal inputs, weak cross-dataset generalization, and inconsistent real-time benchmarking. Dynamic and long-range gestures, multi-view setups, and context-aware understanding also remain relatively underexplored. Despite these limitations, promising directions have emerged, such as multi-modal fusion, the use of foundation models for intent reasoning, and the development of lightweight architectures for deployment. This review offers a consolidated foundation to support future research on robust and context-aware DL systems for hand-centric HRI.
dc.identifier.eissn: 1879-2537
dc.identifier.jour-issn: 0736-5845
dc.identifier.olddbid: 212445
dc.identifier.oldhandle: 10024/195463
dc.identifier.uri: https://www.utupub.fi/handle/11111/52122
dc.identifier.url: https://doi.org/10.1016/j.rcim.2025.103110
dc.identifier.urn: URN:NBN:fi-fe202601215875
dc.language.iso: en
dc.okm.affiliatedauthor: Jalayer, Masoud
dc.okm.discipline [en_GB]: 216 Materials engineering
dc.okm.discipline [fi_FI]: 216 Materiaalitekniikka
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A2 Scientific Article
dc.publisher: PERGAMON-ELSEVIER SCIENCE LTD
dc.publisher.country [en_GB]: United States
dc.publisher.country [fi_FI]: Yhdysvallat (USA)
dc.publisher.country-code: US
dc.relation.articlenumber: 103110
dc.relation.doi: 10.1016/j.rcim.2025.103110
dc.relation.ispartofjournal: Robotics and Computer-Integrated Manufacturing
dc.relation.volume: 97
dc.source.identifier: https://www.utupub.fi/handle/10024/195463
dc.title: A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human-robot interaction
dc.year.issued: 2026

Files

Name: 1-s2.0-S0736584525001644-main.pdf
Size: 4.55 MB
Format: Adobe Portable Document Format