Keywords: semantic similarity, text processing, neural networks, RuBERT, morphological analysis, syntactic analysis, semantic analysis, borrowings, original source
A measure of semantic text similarity
UDC 004.912
DOI: 10.26102/2310-6018/2025.50.3.017
The article addresses the task of automatically determining the semantic similarity of texts in order to identify original sources and instances of borrowing in news materials. A two-stage algorithm is presented: the first stage uses cosine similarity for preliminary text filtering, and the second stage calculates an asymmetric weighted measure of semantic similarity using RuBERT models. The algorithm analyzes the morphological, syntactic, and semantic features of texts and is robust to errors typical of news materials. It comprises linguistic text processing, inverted index construction, and similarity calculation over several linguistic features. Special attention is given to sentence processing: TF-IDF weighting, duplicate removal, and intersection analysis. Sentence-level semantic similarity is assessed with a weighted scoring system that combines lexical, morphological, syntactic, and semantic characteristics. The experimental part of the study determines the algorithm's optimal parameters, such as threshold values and weight coefficients for the different linguistic features. The results show that the proposed algorithm effectively detects borrowings, including cases of substantial text modification, achieving high recall at the filtering stage and improved precision after semantic analysis. The algorithm is particularly useful for automated news digest generation and for monitoring text reuse in regional media.
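The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: all function names are hypothetical, the IDF smoothing is one common variant chosen for illustration, a token-level Jaccard overlap stands in for the RuBERT-based sentence scoring, and the sentence weights are passed in by the caller rather than derived from TF-IDF. The sketch is only meant to show how a symmetric cosine filter differs from an asymmetric, weighted coverage measure.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) per tokenized document.

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1, picked for illustration.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Stage 1: symmetric cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(s, t):
    """Placeholder sentence similarity (assumption: stands in for model-based scoring)."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if a | b else 0.0


def asymmetric_similarity(candidate, source, weights, sentence_sim=jaccard):
    """Stage 2: weighted share of candidate sentences that are covered by the source.

    Each candidate sentence is matched to its best source sentence, so swapping
    the arguments generally gives a different value.
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    covered = sum(
        w * max((sentence_sim(s, t) for t in source), default=0.0)
        for s, w in zip(candidate, weights)
    )
    return covered / total
```

With a one-sentence candidate `[("a", "b")]` and a two-sentence source `[("a", "b"), ("c", "d")]` under uniform weights, the candidate is fully covered by the source (1.0), while the source is only half covered by the candidate (0.5). This directionality is what makes such a measure usable for telling a borrowing apart from its original source, whereas the cosine score of the first stage is the same in both directions.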
For citation: Shiyan V. A measure of semantic text similarity. Modeling, Optimization and Information Technology. 2025;13(3). URL: https://moitvivt.ru/ru/journal/pdf?id=1940 DOI: 10.26102/2310-6018/2025.50.3.017 (In Russ.).
Received 01.05.2025
Revised 10.07.2025
Accepted 14.07.2025