<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2021.32.1.004</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">929</article-id>
      <title-group>
        <article-title xml:lang="ru">Метод распознавания эмоций человека по двигательной активности тела в видеопотоке на основе нейронных сетей</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Method of human emotion recognition through analysis of body motor activity in a video stream using neural networks</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes">
          <contrib-id contrib-id-type="orcid">0000-0002-7032-0291</contrib-id>
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Уздяев</surname>
              <given-names>Михаил Юрьевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Uzdiaev</surname>
              <given-names>Mikhail Yurievich</given-names>
            </name>
          </name-alternatives>
          <email>m.y.uzdiaev@gmail.com</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
        <contrib contrib-type="author" corresp="yes">
          <contrib-id contrib-id-type="orcid">0000-0002-9509-178X</contrib-id>
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Дударенко</surname>
              <given-names>Дмитрий Михайлович</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Dudarenko</surname>
              <given-names>Dmitry Mikhailovich</given-names>
            </name>
          </name-alternatives>
          <email>dmitry@dudarenko.net</email>
          <xref ref-type="aff">aff-2</xref>
        </contrib>
        <contrib contrib-type="author" corresp="yes">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Миронов</surname>
              <given-names>Виктор Николаевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Mironov</surname>
              <given-names>Viktor Nikolaevich</given-names>
            </name>
          </name-alternatives>
          <email>vmn20@mail.ru</email>
          <xref ref-type="aff">aff-3</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">Федеральное государственное бюджетное учреждение науки «Санкт-Петербургский Федеральный исследовательский центр Российской академии наук» (СПб ФИЦ РАН), Санкт-Петербургский институт информатики и автоматизации Российской академии наук</aff>
        <aff xml:lang="en">St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences</aff>
      </aff-alternatives>
      <aff-alternatives id="aff-2">
        <aff xml:lang="ru">Федеральное государственное бюджетное учреждение науки «Санкт-Петербургский Федеральный исследовательский центр Российской академии наук» (СПб ФИЦ РАН), Санкт-Петербургский институт информатики и автоматизации Российской академии наук</aff>
        <aff xml:lang="en">St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences</aff>
      </aff-alternatives>
      <aff-alternatives id="aff-3">
        <aff xml:lang="ru">Федеральное государственное бюджетное учреждение науки «Санкт-Петербургский Федеральный исследовательский центр Российской академии наук» (СПб ФИЦ РАН), Санкт-Петербургский институт информатики и автоматизации Российской академии наук</aff>
        <aff xml:lang="en">St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2021</year>
      </pub-date>
      <volume>9</volume>
      <issue>1</issue>
      <elocation-id>10.26102/2310-6018/2021.32.1.004</elocation-id>
      <permissions>
        <copyright-statement>Copyright © The Authors, 2021</copyright-statement>
        <copyright-year>2021</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=929"/>
      <abstract xml:lang="ru">
        <p>В данной статье рассматривается применение различных нейросетевых моделей для решения задачи распознавания эмоций человека по двигательной активности его тела на кадрах видеопотока без сложной предварительной обработки этих кадров. В работе представлены трехмерные сверточные нейронные сети: Inception 3D (I3D), Residual 3D (R3D), а также сверточно-рекуррентные нейросетевые архитектуры, использующие сверточную нейронную сеть архитектуры ResNet и рекуррентные нейросети архитектур LSTM и GRU (ResNet+LSTM, ResNet+GRU), которые не требуют предварительной обработки изображений или видеопотока и при этом потенциально позволяют достичь высокой точности распознавания эмоций. На основе рассмотренных архитектур предложен метод распознавания эмоций человека по двигательной активности тела в видеопотоке. Обсуждаются архитектурные особенности используемых моделей, способы обработки моделями кадров видеопотока, а также результаты распознавания эмоций по следующим метрикам качества: доля верно распознанных экземпляров (accuracy), точность (precision), полнота (recall). Результаты апробации предложенных в работе нейросетевых моделей I3D, R3D, ResNet+LSTM, ResNet+GRU на наборе данных FABO показали высокое качество распознавания эмоций по двигательной активности тела человека. Так, модель R3D показала лучшую долю верно распознанных экземпляров, равную 91 %. Другие предложенные модели: I3D, ResNet+LSTM, ResNet+GRU – показали точность распознавания 88 %, 80 % и 80 % соответственно. Таким образом, согласно полученным результатам экспериментальной оценки предложенных нейросетевых моделей, наиболее предпочтительными для использования при решении задачи распознавания эмоционального состояния человека по двигательной активности, с точки зрения совокупности показателей точности классификации эмоций, являются трехмерные сверточные модели I3D и R3D. При этом, предложенные модели, в отличие от большинства существующих решений, позволяют реализовывать распознавание эмоций на основе анализа RGB кадров видеопотока без выполнения их предварительной ресурсозатратной обработки, а также с высокой точностью выполнять распознавание эмоций в реальном масштабе времени.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>This paper considers the use of various neural network models for recognizing human emotions from body motor activity in video stream frames, without complex preprocessing of these frames. The paper presents three-dimensional convolutional neural networks, Inception 3D (I3D) and Residual 3D (R3D), as well as convolutional-recurrent architectures that combine a ResNet convolutional neural network with LSTM and GRU recurrent networks (ResNet+LSTM, ResNet+GRU); these models require no preliminary processing of images or the video stream and at the same time can potentially achieve high emotion recognition accuracy. Based on the considered architectures, a method for recognizing human emotions from body motor activity in a video stream is proposed. The architectural features of the models and the ways in which they process video stream frames are discussed, along with emotion recognition results under the following quality metrics: the proportion of correctly recognized instances (accuracy), precision, and recall. Evaluation of the proposed neural network models I3D, R3D, ResNet+LSTM and ResNet+GRU on the FABO dataset showed high-quality recognition of emotions from human body motor activity. The R3D model achieved the best accuracy, 91%, while the other proposed models, I3D, ResNet+LSTM and ResNet+GRU, reached 88%, 80% and 80%, respectively. Thus, according to the experimental evaluation, the three-dimensional convolutional models I3D and R3D are the most suitable for recognizing a person's emotional state from motor activity in terms of the combined emotion classification metrics. Moreover, unlike most existing solutions, the proposed models recognize emotions directly from RGB frames of a video stream without resource-intensive preprocessing and can perform recognition in real time with high accuracy.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>нейросетевая модель</kwd>
        <kwd>распознавание эмоций</kwd>
        <kwd>сверточные нейронные сети</kwd>
        <kwd>машинное обучение</kwd>
        <kwd>обработка изображений</kwd>
        <kwd>видеопоток</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>neural network model</kwd>
        <kwd>emotion recognition</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>machine learning</kwd>
        <kwd>image processing</kwd>
        <kwd>video stream</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Ватаманюк И.В., Яковлев Р.Н. Алгоритмическая модель распределенной системы корпоративного информирования в рамках киберфизической системы организации. Моделирование, оптимизация и информационные технологии. 2019;7(4). Доступно по: https://moit.vivt.ru/wp-content/uploads/2019/11/VatamanukSoavtori_4_19_1.pdf. DOI: 10.26102/2310-6018/2019.27.4.026 (дата обращения: 20.10.2020).</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Letenkov M., Levonevskiy D. Fast Face Features Extraction Based on Deep Neural Networks for Mobile Robotic Platforms. International Conference on Interactive Collaborative Robotics. Springer, Cham. 2020:200-211. DOI: 10.1007/978-3-030-60337-3_20.</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Ватаманюк И.В., Яковлев Р.Н. Обобщенные теоретические модели киберфизических систем. Известия Юго-Западного государственного университета. 2019;23(6):161-175. Доступно по: https://science.swsu.ru/jour/article/view/666/489. DOI: 10.21869/2223-1560-2019-23-6-161-175 (дата обращения: 20.10.2020).</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Frijda N.H. Emotions and action. Feelings and emotions: The Amsterdam symposium. 2004:158-173.</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">He G., Liu X., Fan F., You J. Image2Audio: Facilitating Semi-supervised Audio Emotion Recognition with Facial Expression Image. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020:912-913.</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Kalsum T., Anwar S.M., Majid M., Khan B., Ali S.M. Emotion recognition from facial expressions using hybrid feature descriptors. IET Image Processing. 2018;12(6):1004-1012.</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Levonevskii D., Shumskaya O., Velichko A., Uzdiaev M., Malov D. Methods for Determination of Psychophysiological Condition of User Within Smart Environment Based on Complex Analysis of Heterogeneous Data. Proceedings of 14th International Conference on Electromechanics and Robotics «Zavalishin's Readings». Springer, Singapore. 2020:511-523.</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Уздяев М.Ю., Левоневский Д.К., Шумская О.О., Летенков М.А. Методы детектирования агрессивных пользователей информационного пространства на основе генеративно-состязательных нейронных сетей. Информационно-измерительные и управляющие системы. 2019;17(5):60-68.</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Uzdiaev M. Methods of Multimodal Data Fusion and Forming Latent Representation in the Human Aggression Recognition Task. 2020 IEEE 10th International Conference on Intelligent Systems (IS). IEEE. 2020:399-403.</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Thakur N., Han C.Y. A complex activity based emotion recognition algorithm for affect aware systems. 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC). IEEE. 2018:748-753.</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Wu J., Zhang Y., Ning L. The Fusion Knowledge of Face, Body and Context for Emotion Recognition. 2019 IEEE International Conference on Multimedia &amp; Expo Workshops (ICMEW). IEEE. 2019:108-113.</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Piana S., Staglianò A., Odone F., Camurri A. Adaptive body gesture representation for automatic emotion recognition. ACM Transactions on Interactive Intelligent Systems (TiiS). 2016;6(1):1-31.</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Ly S.T., Lee G.S., Kim S.H., Yang H.J. Emotion Recognition via Body Gesture: Deep Learning Model Coupled with Keyframe Selection. Proceedings of the 2018 International Conference on Machine Learning and Machine Intelligence. 2018:27-31.</mixed-citation>
      </ref>
      <ref id="cit14">
        <label>14</label>
        <mixed-citation xml:lang="ru">Shen Z., Cheng J., Hu X., Dong Q. Emotion Recognition Based on Multi-View Body Gestures. 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019:3317-3321.</mixed-citation>
      </ref>
      <ref id="cit15">
        <label>15</label>
        <mixed-citation xml:lang="ru">Targ S., Almeida D., Lyman K. Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029. 2016.</mixed-citation>
      </ref>
      <ref id="cit16">
        <label>16</label>
        <mixed-citation xml:lang="ru">Carreira J., Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:6299-6308.</mixed-citation>
      </ref>
      <ref id="cit17">
        <label>17</label>
        <mixed-citation xml:lang="ru">Hara K., Kataoka H., Satoh Y. Learning spatio-temporal features with 3D residual networks for action recognition. Proceedings of the IEEE International Conference on Computer Vision Workshops. 2017:3154-3160.</mixed-citation>
      </ref>
      <ref id="cit18">
        <label>18</label>
        <mixed-citation xml:lang="ru">Deng J., Dong W., Socher R., Li L. J., Li K., Fei-Fei L. Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009:248-255.</mixed-citation>
      </ref>
      <ref id="cit19">
        <label>19</label>
        <mixed-citation xml:lang="ru">Vinyals O., Toshev A., Bengio S., Erhan D. Show and tell: A neural image caption generator. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015:3156-3164.</mixed-citation>
      </ref>
      <ref id="cit20">
        <label>20</label>
        <mixed-citation xml:lang="ru">Xu K., Ba J., Kiros R., Cho K., Courville A., Salakhudinov R., Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. International conference on machine learning. 2015:2048-2057.</mixed-citation>
      </ref>
      <ref id="cit21">
        <label>21</label>
        <mixed-citation xml:lang="ru">Yao L., Torabi A., Cho K., Ballas N., Pal C., Larochelle H., Courville A. Describing videos by exploiting temporal structure. Proceedings of the IEEE international conference on computer vision. 2015:4507-4515.</mixed-citation>
      </ref>
      <ref id="cit22">
        <label>22</label>
        <mixed-citation xml:lang="ru">Hori C., Hori T., Lee T. Y., Zhang Z., Harsham B., Hershey J. R., Sumi K. Attention-based multimodal fusion for video description. Proceedings of the IEEE international conference on computer vision. 2017:4193-4202.</mixed-citation>
      </ref>
      <ref id="cit23">
        <label>23</label>
        <mixed-citation xml:lang="ru">Yue-Hei Ng, J., Hausknecht M., Vijayanarasimhan S., Vinyals O., Monga R., Toderici G. Beyond short snippets: Deep networks for video classification. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015:4694-4702.</mixed-citation>
      </ref>
      <ref id="cit24">
        <label>24</label>
        <mixed-citation xml:lang="ru">Ullah A., Ahmad J., Muhammad K., Sajjad M., Baik S. W. Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access. 2017;6:1155-1166.</mixed-citation>
      </ref>
      <ref id="cit25">
        <label>25</label>
        <mixed-citation xml:lang="ru">Girshick R. Fast r-cnn. Proceedings of the IEEE international conference on computer vision. 2015:1440-1448.</mixed-citation>
      </ref>
      <ref id="cit26">
        <label>26</label>
        <mixed-citation xml:lang="ru">Ren S., He K., Girshick R., Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems. 2015:91-99.</mixed-citation>
      </ref>
      <ref id="cit27">
        <label>27</label>
        <mixed-citation xml:lang="ru">Redmon J., Divvala S., Girshick R., Farhadi A. You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016:779-788.</mixed-citation>
      </ref>
      <ref id="cit28">
        <label>28</label>
        <mixed-citation xml:lang="ru">Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C.Y., Berg A.C. Ssd: Single shot multibox detector. European conference on computer vision. Springer, Cham, 2016:21-37.</mixed-citation>
      </ref>
      <ref id="cit29">
        <label>29</label>
        <mixed-citation xml:lang="ru">Pan S.J., Yang Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering. 2009;22(10):1345-1359.</mixed-citation>
      </ref>
      <ref id="cit30">
        <label>30</label>
        <mixed-citation xml:lang="ru">Weiss K., Khoshgoftaar T.M., Wang D.D. A survey of transfer learning. Journal of Big data. 2016;3(1):9.</mixed-citation>
      </ref>
      <ref id="cit31">
        <label>31</label>
        <mixed-citation xml:lang="ru">He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016:770-778.</mixed-citation>
      </ref>
      <ref id="cit32">
        <label>32</label>
        <mixed-citation xml:lang="ru">Hochreiter S., Schmidhuber J. Long short-term memory. Neural computation. 1997;9(8):1735-1780.</mixed-citation>
      </ref>
      <ref id="cit33">
        <label>33</label>
        <mixed-citation xml:lang="ru">Chung J., Gulcehre C., Cho K., Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. 2014.</mixed-citation>
      </ref>
      <ref id="cit34">
        <label>34</label>
        <mixed-citation xml:lang="ru">Tran D., Bourdev L., Fergus R., Torresani L., Paluri M. Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE international conference on computer vision. 2015:4489-4497.</mixed-citation>
      </ref>
      <ref id="cit35">
        <label>35</label>
        <mixed-citation xml:lang="ru">Hara K., Kataoka H., Satoh Y. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018:6546-6555.</mixed-citation>
      </ref>
      <ref id="cit36">
        <label>36</label>
        <mixed-citation xml:lang="ru">Saveliev A., Uzdiaev M., Dmitrii M. Aggressive Action Recognition Using 3D CNN Architectures. 2019 12th International Conference on Developments in eSystems Engineering (DeSE). IEEE. 2019:890-895.</mixed-citation>
      </ref>
      <ref id="cit37">
        <label>37</label>
        <mixed-citation xml:lang="ru">Kay W., Carreira J., Simonyan K., Zhang B., Hillier C., Vijayanarasimhan S., Suleyman M. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. 2017.</mixed-citation>
      </ref>
      <ref id="cit38">
        <label>38</label>
        <mixed-citation xml:lang="ru">Szegedy C., Liu W., Jia Y., Sermanet P., Reed S., Anguelov D., Rabinovich A. Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015:1-9.</mixed-citation>
      </ref>
      <ref id="cit39">
        <label>39</label>
        <mixed-citation xml:lang="ru">Gunes H., Piccardi M. A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. 18th International Conference on Pattern Recognition (ICPR'06). IEEE. 2006;1:1148-1153.</mixed-citation>
      </ref>
      <ref id="cit40">
        <label>40</label>
        <mixed-citation xml:lang="ru">Kingma D. P., Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.</mixed-citation>
      </ref>
      <ref id="cit41">
        <label>41</label>
        <mixed-citation xml:lang="ru">Gunes H., Piccardi M. Automatic temporal segment detection and affect recognition from face and body display. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2008;39(1):64-84.</mixed-citation>
      </ref>
      <ref id="cit42">
        <label>42</label>
        <mixed-citation xml:lang="ru">Chen S., Tian Y., Liu Q., Metaxas D.N. Recognizing expressions from face and body gesture by temporal normalized motion and appearance features. Image and Vision Computing. 2013;31(2):175-185.</mixed-citation>
      </ref>
      <ref id="cit43">
        <label>43</label>
        <mixed-citation xml:lang="ru">Barros P., Jirak D., Weber C., Wermter S. Multimodal emotional state recognition using sequence-dependent deep hierarchical features. Neural Networks. 2015;72:140-151.</mixed-citation>
      </ref>
      <ref id="cit44">
        <label>44</label>
        <mixed-citation xml:lang="ru">Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 2014.</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The authors declare that they have no conflict of interest.</p>
      </fn>
    </fn-group>
  </back>
</article>