<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2024.45.2.035</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">1574</article-id>
      <title-group>
        <article-title xml:lang="ru">Исследование эффективности моделей глубокого обучения в задаче распознавания технологических операций, как последовательности движений кистей рук</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Study of the effectiveness of deep learning models in the task of recognizing technological operations as a sequence of hand movements</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0003-2866-4864</contrib-id>
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Штехин</surname>
              <given-names>Сергей Евгеньевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Shtekhin</surname>
              <given-names>Sergei Evgenievich</given-names>
            </name>
          </name-alternatives>
          <email>shs77@bk.ru</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Стадник</surname>
              <given-names>Алексей Викторович</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Stadnik</surname>
              <given-names>Alexey Viktorovich</given-names>
            </name>
          </name-alternatives>
          <email>i@lxstd.ru</email>
          <xref ref-type="aff">aff-2</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">«Отраслевой центр разработки и внедрения информационных систем» Сириус, филиал № 11</aff>
        <aff xml:lang="en">"Industry center for the development and implementation of information systems" Sirius, branch No. 11</aff>
      </aff-alternatives>
      <aff-alternatives id="aff-2">
        <aff xml:lang="ru">«Отраслевой центр разработки и внедрения информационных систем» Сириус, филиал № 11</aff>
        <aff xml:lang="en">"Industry center for the development and implementation of information systems" Sirius, branch No. 11</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <elocation-id>035</elocation-id>
      <permissions>
        <copyright-statement>Copyright © The Authors, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=1574"/>
      <abstract xml:lang="ru">
        <p>В работе изучаются методы распознавания на видео специфического класса технологических операций ручного труда, который представляет собой последовательности движений кистей и пальцев рук. Технологическая операция здесь определяется как последовательность новых специфических символов жестового языка. Рассмотрены различные методы распознавания жестов на видео. Исследован двухэтапный подход: на первом этапе распознаются ключевые точки рук на каждом кадре с помощью открытой библиотеки mediapipe, на втором этапе покадровая последовательность ключевых точек трансформируется в текст с помощью обученной нейросети архитектуры трансформер. Основное внимание уделено обучению модели нейросети архитектуры трансформер на базе открытого датасета американского жестового языка (ASL) для распознавания предложений жестового языка на видео. Затронут вопрос применимости данного подхода и обученной модели ASL для распознавания технологических операций ручного труда с мелкой моторикой в виде текстовой последовательности. Полученные результаты могут быть полезны при исследовании трудовых процессов с быстрыми движениями и малыми отрезками времени в алгоритмах распознавания технологических операций ручного труда на видеоданных.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>In this paper, we study methods for recognizing in video a specific class of technological manual labor operations that constitute sequences of hand and finger movements. A technological operation is treated here as a sequence of new, task-specific sign language symbols. Various methods of gesture recognition in video are reviewed. A two-stage approach is investigated: at the first stage, hand keypoints are detected in each frame using the open mediapipe library; at the second stage, the frame-by-frame sequence of keypoints is transformed into text by a trained neural network of the transformer architecture. The main attention is paid to training a transformer model on the open American Sign Language (ASL) dataset to recognize sign language sentences in video. The applicability of this approach and of the trained ASL model to recognizing fine-motor manual labor operations as a text sequence is also considered. The results obtained can be useful for studying labor processes with fast movements over short time intervals in algorithms that recognize technological manual labor operations in video data.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>видеоанализ движений рук</kwd>
        <kwd>распознавание жестов</kwd>
        <kwd>распознавание действий</kwd>
        <kwd>глубокие нейронные сети</kwd>
        <kwd>трансформер</kwd>
        <kwd>технологические операции</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>video analysis of hand movements</kwd>
        <kwd>gesture recognition</kwd>
        <kwd>action recognition</kwd>
        <kwd>deep neural networks</kwd>
        <kwd>transformer</kwd>
        <kwd>technological operations</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Hou Z., Peng X., Qiao Y., Tao D. Visual Compositional Learning for Human-Object Interaction Detection. In: Computer Vision – ECCV 2020: 16th European Conference: Proceedings: Part XV, 23-28 August 2020, Glasgow, United Kingdom. Cham: Springer; 2020. P. 584–600. https://doi.org/10.1007/978-3-030-58555-6_35</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Lin T.-Y., Dollár P., Girshick R., He K., Hariharan B., Belongie S. Feature Pyramid Networks for Object Detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 21-26 July 2017, Honolulu, HI, USA. IEEE; 2017. P. 936–944. https://doi.org/10.1109/CVPR.2017.106</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C.-Y., Berg A.C. SSD: Single Shot MultiBox Detector. In: Computer Vision – ECCV 2016: 14th European Conference: Proceedings: Part I, 11-14 October 2016, Amsterdam, The Netherlands. Cham: Springer; 2016. P. 21–37. https://doi.org/10.1007/978-3-319-46448-0_2</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Nie J., Anwer R.M., Cholakkal H., Khan F.S., Pang Y., Shao L. Enriched Feature Guided Refinement Network for Object Detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 27 October 2019 – 02 November 2019, Seoul, Korea (South). IEEE; 2019. P. 9536–9545. https://doi.org/10.1109/ICCV.2019.00963</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Pang Y., Xie J., Khan M.H., Anwer R.M., Khan F.S., Shao L. Mask-Guided Attention Network for Occluded Pedestrian Detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 27 October 2019 – 02 November 2019, Seoul, Korea (South). IEEE; 2019. P. 4966–4974. https://doi.org/10.1109/ICCV.2019.00507</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Gupta J., Malik J. Visual Semantic Role Labeling. URL: https://doi.org/10.48550/arXiv.1505.04474 (Accessed 19th March 2024).</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Штехин С.Е., Карачёв Д.К., Иванова Ю.А. Разработка алгоритма распознавания движений человека методами компьютерного зрения в задаче нормирования рабочего времени. Труды Института системного программирования РАН. 2020;32(1):121–136. https://doi.org/10.15514/ISPRAS-2020-32(1)-7</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Mitchell R.E., Young T.A., Bachleda B., Karchmer M.A. How Many People Use ASL in the United States? Why Estimates Need Updating. Sign Language Studies. 2006;6(3):306–335. https://doi.org/10.1353/sls.2006.0019</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Kim T. American Sign Language fingerspelling recognition from video: Methods for unrestricted recognition and signer-independence. URL: https://doi.org/10.48550/arXiv.1608.08339 (Accessed 19th March 2024).</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Suresh S., Mithun H.T.P, Supriya M.H. Sign Language Recognition System Using Deep Neural Network. In: 2019 5th International Conference on Advanced Computing &amp; Communication Systems (ICACCS), 15-16 March 2019, Coimbatore, India. IEEE; 2019. P. 614–618. https://doi.org/10.1109/ICACCS.2019.8728411</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Kim S., Ji Y., Lee K.-B. An Effective Sign Language Learning with Object Detection Based ROI Segmentation. In: 2018 Second IEEE International Conference on Robotic Computing (IRC), 31 January 2018 – 02 February 2018, Laguna Hills, CA, USA. IEEE; 2018. P. 330–333. https://doi.org/10.1109/IRC.2018.00069</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Shivashankara S., Srinath S. A Review on Vision Based American Sign Language Recognition, its Techniques, and Outcomes. In: 2017 7th International Conference on Communication Systems and Network Technologies (CSNT), 11-13 November 2017, Nagpur, India. IEEE; 2017. P. 293–299. https://doi.org/10.1109/CSNT.2017.8418554</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Kumar R., Bajpai A., Sinha A. Mediapipe and CNNs for Real-Time ASL Gesture Recognition. URL: https://doi.org/10.48550/arXiv.2305.05296 (Accessed 19th March 2024).</mixed-citation>
      </ref>
      <ref id="cit14">
        <label>14</label>
        <mixed-citation xml:lang="ru">Akandeh A. Sentence-Level Sign Language Recognition Framework. URL: https://doi.org/10.48550/arXiv.2211.14447 (Accessed 23rd March 2024).</mixed-citation>
      </ref>
      <ref id="cit15">
        <label>15</label>
        <mixed-citation xml:lang="ru">Lee C.K.M. et al. American sign language recognition and training method with recurrent neural network. Expert Systems with Applications. 2021;167. https://doi.org/10.1016/j.eswa.2020.114403</mixed-citation>
      </ref>
      <ref id="cit16">
        <label>16</label>
        <mixed-citation xml:lang="ru">Jayanthi P., Ponsy R.K., Bhama S.P.R., Madhubalasri B. Sign Language Recognition using Deep CNN with Normalised Keyframe Extraction and Prediction using LSTM. Journal of Scientific and Industrial Research. 2023;82(7):745–755.</mixed-citation>
      </ref>
      <ref id="cit17">
        <label>17</label>
        <mixed-citation xml:lang="ru">Рюмин Д. Метод автоматического видеоанализа движений рук и распознавания жестов в человеко-машинных интерфейсах. Научно-технический вестник информационных технологий, механики и оптики. 2020;20(4):525–531. https://doi.org/10.17586/2226-1494-2020-20-4-525-531</mixed-citation>
      </ref>
      <ref id="cit18">
        <label>18</label>
        <mixed-citation xml:lang="ru">Tekin B., Bogo F., Pollefeys M. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15-20 June 2019, Long Beach, CA, USA. IEEE; 2019. P. 4506–4515. https://doi.org/10.1109/CVPR.2019.00464</mixed-citation>
      </ref>
      <ref id="cit19">
        <label>19</label>
        <mixed-citation xml:lang="ru">Li D., Opazo C.R., Yu X., Li H. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 01-05 March 2020, Snowmass, CO, USA.  IEEE; 2020. P. 1448–1458. https://doi.org/10.1109/WACV45572.2020.9093512</mixed-citation>
      </ref>
      <ref id="cit20">
        <label>20</label>
        <mixed-citation xml:lang="ru">Supančič Ja.S., Rogez G., Yang Yi., Shotton Ja., Ramanan D. Depth-Based Hand Pose Estimation: Methods, Data, and Challenges. International Journal of Computer Vision. 2018;126(11):1180–1198. https://doi.org/10.1007/s11263-018-1081-7</mixed-citation>
      </ref>
      <ref id="cit21">
        <label>21</label>
        <mixed-citation xml:lang="ru">Ivashechkin M., Mendez O., Bowden R. Improving 3D Pose Estimation for Sign Language. URL: https://doi.org/10.48550/arXiv.2308.09525 (Accessed 25th March 2024).</mixed-citation>
      </ref>
      <ref id="cit22">
        <label>22</label>
        <mixed-citation xml:lang="ru">Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł. Polosukhin I. Attention Is All You Need. In: NIPS'17: 31st International Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 30 (NIPS 2017), 4-9 December 2017, Long Beach, CA, USA. Montreal: Curran Associates; 2017. P. 5998–6008.</mixed-citation>
      </ref>
      <ref id="cit23">
        <label>23</label>
        <mixed-citation xml:lang="ru">Goodfellow I., Bengio Y., Courville A. Deep Learning. Cambridge: MIT Press; 2016. 800 p.</mixed-citation>
      </ref>
      <ref id="cit24">
        <label>24</label>
        <mixed-citation xml:lang="ru">Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A. et al. Language Models are Few-Shot Learners. In: NeurIPS 2020: 34th Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 06-12 December 2020, Vancouver, Canada. Curran Associates; 2020. P. 1877–1901.</mixed-citation>
      </ref>
      <ref id="cit25">
        <label>25</label>
        <mixed-citation xml:lang="ru">Touvron H. et al. LLaMA: Open and Efficient Foundation Language Models. URL: https://doi.org/10.48550/arXiv.2302.13971 (Accessed 5th April 2024).</mixed-citation>
      </ref>
      <ref id="cit26">
        <label>26</label>
        <mixed-citation xml:lang="ru">Lugaresi C., Tang J., Nash H. et al. MediaPipe: A Framework for Building Perception Pipelines. URL: https://doi.org/10.48550/arXiv.1906.08172  (Accessed 5th April 2024).</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The authors declare that there are no conflicts of interest.</p>
      </fn>
    </fn-group>
  </back>
</article>