A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human-robot interaction

dc.contributor.author: Jalayer, Reza
dc.contributor.author: Jalayer, Masoud
dc.contributor.author: Orsenigo, Carlotta
dc.contributor.author: Tomizuka, Masayoshi
dc.contributor.organization: fi=materiaalitekniikka|en=Materials Engineering|
dc.contributor.organization-code: 1.2.246.10.2458963.20.80931480620
dc.converis.publication-id: 500332936
dc.converis.url: https://research.utu.fi/converis/portal/Publication/500332936
dc.date.accessioned: 2026-01-21T12:24:54Z
dc.date.available: 2026-01-21T12:24:54Z
dc.description.abstract: Hand-based analysis, including hand detection, segmentation, and gesture recognition, plays a pivotal role in enabling natural and intuitive human-robot interaction (HRI). Recent advances in vision-based deep learning (DL) have significantly improved robots' ability to interpret hand cues across diverse settings. However, previous reviews have neither addressed all three tasks collectively nor focused on recent DL architectures. Filling this gap, we review recent studies at the intersection of DL and hand-based interaction in HRI. We structure the literature around three core tasks, namely hand detection, segmentation, and gesture recognition, highlighting the DL models, dataset characteristics, evaluation metrics, and key challenges for each. We further examine the application of these models across industrial, assistive, social, aerial, and space robotics domains. We identify the dominant role of Convolutional and Recurrent Neural Networks (CNNs and RNNs), as well as emerging approaches such as attention-based models (Transformers), uncertainty-aware models, Graph Neural Networks (GNNs), and foundation models, namely Vision-Language Models (VLMs) and Large Language Models (LLMs). Our analysis reveals several gaps: the scarcity of HRI-specific datasets, underrepresentation of multi-hand and multi-user scenarios, limited use of RGB-D and multi-modal inputs, weak cross-dataset generalization, and inconsistent real-time benchmarking. Dynamic and long-range gestures, multi-view setups, and context-aware understanding also remain relatively underexplored. Despite these limitations, promising directions have emerged, such as multi-modal fusion, the use of foundation models for intent reasoning, and the development of lightweight architectures for deployment. This review offers a consolidated foundation to support future research on robust and context-aware DL systems for hand-centric HRI.
dc.identifier.eissn: 1879-2537
dc.identifier.jour-issn: 0736-5845
dc.identifier.olddbid: 212445
dc.identifier.oldhandle: 10024/195463
dc.identifier.uri: https://www.utupub.fi/handle/11111/52122
dc.identifier.url: https://doi.org/10.1016/j.rcim.2025.103110
dc.identifier.urn: URN:NBN:fi-fe202601215875
dc.language.iso: en
dc.okm.affiliatedauthor: Jalayer, Masoud
dc.okm.discipline [en_GB]: 216 Materials engineering
dc.okm.discipline [fi_FI]: 216 Materiaalitekniikka
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A2 Scientific Article
dc.publisher: PERGAMON-ELSEVIER SCIENCE LTD
dc.publisher.country [en_GB]: United States
dc.publisher.country [fi_FI]: Yhdysvallat (USA)
dc.publisher.country-code: US
dc.relation.articlenumber: 103110
dc.relation.doi: 10.1016/j.rcim.2025.103110
dc.relation.ispartofjournal: Robotics and Computer-Integrated Manufacturing
dc.relation.volume: 97
dc.source.identifier: https://www.utupub.fi/handle/10024/195463
dc.title: A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human-robot interaction
dc.year.issued: 2026

Files

Name: 1-s2.0-S0736584525001644-main.pdf
Size: 4.55 MB
Format: Adobe Portable Document Format