moitvivt

Моделирование, оптимизация и информационные технологии

Modeling, Optimization and Information Technology

2310-6018

Publisher

10.26102/2310-6018/2025.50.3.017

1940

A measure of semantic text similarity

Valery Igorevich Shiyan

kubsuteam01@gmail.com

Kuban State University

01 01 2026

2026

This work is licensed under a Creative Commons Attribution 4.0 International License

The article addresses the automatic determination of semantic similarity between texts, with the goal of identifying original sources and borrowings in news materials. It presents a two-phase algorithm: the first phase uses cosine similarity to pre-filter texts, and the second computes an asymmetric weighted measure of semantic similarity using RuBERT models. The algorithm analyzes texts comprehensively, taking into account their morphological, syntactic, and semantic features, and is robust to the typical errors found in news materials. It comprises linguistic text processing, construction of inverted indexes, and calculation of similarity measures from various linguistic features. Particular attention is paid to sentence processing: TF-IDF weighting, duplicate removal, and intersection analysis. The semantic similarity of sentences is assessed with a system of weighted scores that account for lexical, morphological, syntactic, and semantic characteristics. The experimental part of the study determines the algorithm's optimal parameters, such as threshold values and weight coefficients for the different linguistic features. The results show that the proposed algorithm effectively detects borrowings, including cases where the text has been substantially reworked, with high recall at the filtering stage and improved precision after semantic analysis. The algorithm is particularly useful for automatically generating news digests and for monitoring borrowings in regional media.
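The two-phase idea described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not give the formulas, so the sentence-level score below uses plain TF-IDF cosine similarity as a stand-in for the RuBERT-based weighted measure, tokenization is a naive word split rather than morphological analysis, and all function names and threshold values (`filter_threshold`, `sent_threshold`) are hypothetical.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'«»"

def tokenize(text):
    # Naive lowercase word split; a real pipeline would lemmatize (assumption).
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def idf_table(docs_tokens):
    # Smoothed inverse document frequency over the given token lists.
    n = len(docs_tokens)
    df = Counter(t for toks in docs_tokens for t in set(toks))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(tokens, idf):
    # Sparse TF-IDF vector as a dict: term -> weight.
    return {t: c * idf.get(t, 1.0) for t, c in Counter(tokens).items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def split_sentences(text):
    # Crude sentence splitter on terminal punctuation (assumption).
    flat = text.replace("!", ".").replace("?", ".")
    return [s.strip() for s in flat.split(".") if s.strip()]

def asymmetric_similarity(candidate, source, idf, sent_threshold=0.5):
    # Share of the candidate's TF-IDF sentence weight that is covered by
    # the source. Swapping the arguments generally changes the result.
    src_vecs = [tfidf(tokenize(s), idf) for s in split_sentences(source)]
    covered = total = 0.0
    for sent in split_sentences(candidate):
        vec = tfidf(tokenize(sent), idf)
        weight = sum(vec.values())
        best = max((cosine(vec, sv) for sv in src_vecs), default=0.0)
        total += weight
        if best >= sent_threshold:
            covered += weight * best
    return covered / total if total else 0.0

def find_borrowings(candidate, corpus, filter_threshold=0.2, sent_threshold=0.5):
    docs = [tokenize(candidate)] + [tokenize(d) for d in corpus]
    idf = idf_table(docs)
    cand_vec = tfidf(docs[0], idf)
    results = []
    for i, doc in enumerate(corpus):
        # Phase 1: cheap document-level cosine filter.
        if cosine(cand_vec, tfidf(docs[i + 1], idf)) < filter_threshold:
            continue
        # Phase 2: sentence-level asymmetric score for the survivors.
        score = asymmetric_similarity(candidate, doc, idf, sent_threshold)
        results.append((i, score))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The asymmetry is what lets such a measure hint at direction: a short note fully contained in a longer article scores high against it, while the longer article scores lower against the note.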

semantic similarity, text processing, neural networks, RuBERT, morphological analysis, syntactic analysis, semantic analysis, borrowings, original source

The study was performed without external funding.

The author declares no conflicts of interest.