<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2024.45.2.035</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">1574</article-id>
      <title-group>
        <article-title xml:lang="ru">Исследование эффективности моделей глубокого обучения в задаче распознавания технологических операций, как последовательности движений кистей рук</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Study of the effectiveness of deep learning models in the task of recognizing technological operations as a sequence of hand movements</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0003-2866-4864</contrib-id>
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Штехин</surname>
              <given-names>Сергей Евгеньевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Shtekhin</surname>
              <given-names>Sergei Evgenievich</given-names>
            </name>
          </name-alternatives>
          <email>shs77@bk.ru</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Стадник</surname>
              <given-names>Алексей Викторович</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Stadnik</surname>
              <given-names>Alexey Viktorovich</given-names>
            </name>
          </name-alternatives>
          <email>i@lxstd.ru</email>
          <xref ref-type="aff">aff-2</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">«Отраслевой центр разработки и внедрения информационных систем» Сириус, филиал № 11</aff>
        <aff xml:lang="en">"Industry center for the development and implementation of information systems" Sirius, branch No. 11</aff>
      </aff-alternatives>
      <aff-alternatives id="aff-2">
        <aff xml:lang="ru">«Отраслевой центр разработки и внедрения информационных систем» Сириус, филиал № 11</aff>
        <aff xml:lang="en">"Industry center for the development and implementation of information systems" Sirius, branch No. 11</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <elocation-id>035</elocation-id>
      <permissions>
        <copyright-statement>Copyright © The Authors, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=1574"/>
      <abstract xml:lang="ru">
        <p>В работе изучаются методы распознавания на видео специфического класса технологических операций ручного труда, который представляет собой последовательности движений кистей и пальцев рук. Технологическая операция здесь определяется как последовательность новых специфических символов жестового языка. Рассмотрены различные методы распознавания жестов на видео. Исследован двухэтапный подход: на первом этапе распознаются ключевые точки рук на каждом кадре с помощью открытой библиотеки mediapipe, на втором этапе покадровая последовательность ключевых точек трансформируется в текст с помощью обученной нейросети архитектуры трансформер. Основное внимание уделено обучению модели нейросети архитектуры трансформер на базе открытого датасета американского жестового языка (ASL) для распознавания предложений жестового языка на видео. Затронут вопрос применимости данного подхода и обученной модели ASL для распознавания технологических операций ручного труда с мелкой моторикой в виде текстовой последовательности. Полученные результаты могут быть полезны при исследовании трудовых процессов с быстрыми движениями и малыми отрезками времени в алгоритмах распознавания технологических операций ручного труда на видеоданных.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>In this paper, we study methods for recognizing in video a specific class of technological manual labor operations that constitute sequences of hand and finger movements. A technological operation is treated here as a sequence of new, task-specific sign language symbols. Various methods of gesture recognition in video are reviewed. A two-stage approach is investigated: at the first stage, hand keypoints are detected in each frame using the open mediapipe library; at the second stage, the frame-by-frame sequence of keypoints is transformed into text by a trained neural network of the transformer architecture. The main attention is paid to training a transformer model on the open American Sign Language (ASL) dataset to recognize sign language sentences in video. The applicability of this approach and of the trained ASL model to recognizing fine-motor manual labor operations as a text sequence is also considered. The results obtained can be useful for studying labor processes with fast movements over short time intervals in algorithms that recognize technological manual labor operations in video data.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>видеоанализ движений рук</kwd>
        <kwd>распознавание жестов</kwd>
        <kwd>распознавание действий</kwd>
        <kwd>глубокие нейронные сети</kwd>
        <kwd>трансформер</kwd>
        <kwd>технологические операции</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>video analysis of hand movements</kwd>
        <kwd>gesture recognition</kwd>
        <kwd>action recognition</kwd>
        <kwd>deep neural networks</kwd>
        <kwd>transformer</kwd>
        <kwd>technological operations</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Hou Z., Peng X., Qiao Y., Tao D. Visual Compositional Learning for Human-Object Interaction Detection. In: Computer Vision – ECCV 2020: 16th European Conference: Proceedings: Part XV, 23-28 August 2020, Glasgow, United Kingdom. Cham: Springer; 2020. P. 584–600. https://doi.org/10.1007/978-3-030-58555-6_35</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Lin T.-Y., Dollár P., Girshick R., He K., Hariharan B., Belongie S. Feature Pyramid Networks for Object Detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 21-26 July 2017, Honolulu, HI, USA. IEEE; 2017. P. 936–944. https://doi.org/10.1109/CVPR.2017.106</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C.-Y., Berg A.C. SSD: Single Shot MultiBox Detector. In: Computer Vision – ECCV 2016: 14th European Conference: Proceedings: Part I, 11-14 October 2016, Amsterdam, The Netherlands. Cham: Springer; 2016. P. 21–37. https://doi.org/10.1007/978-3-319-46448-0_2</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Nie J., Anwer R.M., Cholakkal H., Khan F.S., Pang Y., Shao L. Enriched Feature Guided Refinement Network for Object Detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 27 October 2019 – 02 November 2019, Seoul, Korea (South). IEEE; 2019. P. 9536–9545. https://doi.org/10.1109/ICCV.2019.00963</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Pang Y., Xie J., Khan M.H., Anwer R.M., Khan F.S., Shao L. Mask-Guided Attention Network for Occluded Pedestrian Detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 27 October 2019 – 02 November 2019, Seoul, Korea (South). IEEE; 2019. P. 4966–4974. https://doi.org/10.1109/ICCV.2019.00507</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Gupta J., Malik J. Visual Semantic Role Labeling. URL: https://doi.org/10.48550/arXiv.1505.04474 (Accessed 19th March 2024).</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Штехин С.Е., Карачёв Д.К., Иванова Ю.А. Разработка алгоритма распознавания движений человека методами компьютерного зрения в задаче нормирования рабочего времени. Труды Института системного программирования РАН. 2020;32(1):121–136. https://doi.org/10.15514/ISPRAS-2020-32(1)-7</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Mitchell R.E., Young T.A., Bachleda B., Karchmer M.A. How Many People Use ASL in the United States? Why Estimates Need Updating. Sign Language Studies. 2006;6(3):306–335. https://doi.org/10.1353/sls.2006.0019</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Kim T. American Sign Language fingerspelling recognition from video: Methods for unrestricted recognition and signer-independence. URL: https://doi.org/10.48550/arXiv.1608.08339 (Accessed 19th March 2024).</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Suresh S., Mithun H.T.P, Supriya M.H. Sign Language Recognition System Using Deep Neural Network. In: 2019 5th International Conference on Advanced Computing &amp; Communication Systems (ICACCS), 15-16 March 2019, Coimbatore, India. IEEE; 2019. P. 614–618. https://doi.org/10.1109/ICACCS.2019.8728411</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Kim S., Ji Y., Lee K.-B. An Effective Sign Language Learning with Object Detection Based ROI Segmentation. In: 2018 Second IEEE International Conference on Robotic Computing (IRC), 31 January 2018 – 02 February 2018, Laguna Hills, CA, USA. IEEE; 2018. P. 330–333. https://doi.org/10.1109/IRC.2018.00069</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Shivashankara S., Srinath S. A Review on Vision Based American Sign Language Recognition, its Techniques, and Outcomes. In: 2017 7th International Conference on Communication Systems and Network Technologies (CSNT), 11-13 November 2017, Nagpur, India. IEEE; 2017. P. 293–299. https://doi.org/10.1109/CSNT.2017.8418554</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Kumar R., Bajpai A., Sinha A. Mediapipe and CNNs for Real-Time ASL Gesture Recognition. URL: https://doi.org/10.48550/arXiv.2305.05296 (Accessed 19th March 2024).</mixed-citation>
      </ref>
      <ref id="cit14">
        <label>14</label>
        <mixed-citation xml:lang="ru">Akandeh A. Sentence-Level Sign Language Recognition Framework. URL: https://doi.org/10.48550/arXiv.2211.14447 (Accessed 23rd March 2024).</mixed-citation>
      </ref>
      <ref id="cit15">
        <label>15</label>
        <mixed-citation xml:lang="ru">Lee C.K.M. et al. American sign language recognition and training method with recurrent neural network. Expert Systems with Applications. 2021;167. https://doi.org/10.1016/j.eswa.2020.114403</mixed-citation>
      </ref>
      <ref id="cit16">
        <label>16</label>
        <mixed-citation xml:lang="ru">Jayanthi P., Ponsy R.K., Bhama S.P.R., Madhubalasri B. Sign Language Recognition using Deep CNN with Normalised Keyframe Extraction and Prediction using LSTM. Journal of Scientific and Industrial Research. 2023;82(7):745–755.</mixed-citation>
      </ref>
      <ref id="cit17">
        <label>17</label>
        <mixed-citation xml:lang="ru">Рюмин Д. Метод автоматического видеоанализа движений рук и распознавания жестов в человеко-машинных интерфейсах. Научно-технический вестник информационных технологий, механики и оптики. 2020;20(4):525–531. https://doi.org/10.17586/2226-1494-2020-20-4-525-531</mixed-citation>
      </ref>
      <ref id="cit18">
        <label>18</label>
        <mixed-citation xml:lang="ru">Tekin B., Bogo F., Pollefeys M. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15-20 June 2019, Long Beach, CA, USA. IEEE; 2019. P. 4506–4515. https://doi.org/10.1109/CVPR.2019.00464</mixed-citation>
      </ref>
      <ref id="cit19">
        <label>19</label>
        <mixed-citation xml:lang="ru">Li D., Opazo C.R., Yu X., Li H. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 01-05 March 2020, Snowmass, CO, USA.  IEEE; 2020. P. 1448–1458. https://doi.org/10.1109/WACV45572.2020.9093512</mixed-citation>
      </ref>
      <ref id="cit20">
        <label>20</label>
        <mixed-citation xml:lang="ru">Supančič Ja.S., Rogez G., Yang Yi., Shotton Ja., Ramanan D. Depth-Based Hand Pose Estimation: Methods, Data, and Challenges. International Journal of Computer Vision. 2018;126(11):1180–1198. https://doi.org/10.1007/s11263-018-1081-7</mixed-citation>
      </ref>
      <ref id="cit21">
        <label>21</label>
        <mixed-citation xml:lang="ru">Ivashechkin M., Mendez O., Bowden R. Improving 3D Pose Estimation for Sign Language. URL: https://doi.org/10.48550/arXiv.2308.09525 (Accessed 25th March 2024).</mixed-citation>
      </ref>
      <ref id="cit22">
        <label>22</label>
        <mixed-citation xml:lang="ru">Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł. Polosukhin I. Attention Is All You Need. In: NIPS'17: 31st International Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 30 (NIPS 2017), 4-9 December 2017, Long Beach, CA, USA. Montreal: Curran Associates; 2017. P. 5998–6008.</mixed-citation>
      </ref>
      <ref id="cit23">
        <label>23</label>
        <mixed-citation xml:lang="ru">Goodfellow I., Bengio Y., Courville A. Deep Learning. Cambridge: MIT Press; 2016. 800 p.</mixed-citation>
      </ref>
      <ref id="cit24">
        <label>24</label>
        <mixed-citation xml:lang="ru">Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A. et al. Language Models are Few-Shot Learners. In: NeurIPS 2020: 34th Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 06-12 December 2020, Vancouver, Canada. Curran Associates; 2020. P. 1877–1901.</mixed-citation>
      </ref>
      <ref id="cit25">
        <label>25</label>
        <mixed-citation xml:lang="ru">Touvron H. et al. LLaMA: Open and Efficient Foundation Language Models. URL: https://doi.org/10.48550/arXiv.2302.13971 (Accessed 5th April 2024).</mixed-citation>
      </ref>
      <ref id="cit26">
        <label>26</label>
        <mixed-citation xml:lang="ru">Lugaresi C., Tang J., Nash H. et al. MediaPipe: A Framework for Building Perception Pipelines. URL: https://doi.org/10.48550/arXiv.1906.08172  (Accessed 5th April 2024).</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The authors declare that there are no conflicts of interest.</p>
      </fn>
    </fn-group>
  </back>
</article>