<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2024.47.4.038</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">1763</article-id>
      <title-group>
        <article-title xml:lang="ru">Оценка качества интеллектуального перефразирования текстов на русском языке</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Evaluation of the quality of intelligent text paraphrasing in Russian</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Дагаев</surname>
              <given-names>Александр Евгеньевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Dagaev</surname>
              <given-names>Alexander Evgenevich</given-names>
            </name>
          </name-alternatives>
          <email>a.e.dagaev@staff.mospolytech.ru</email>
          <xref ref-type="aff" rid="aff-1"/>
        </contrib>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Попов</surname>
              <given-names>Дмитрий Иванович</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Popov</surname>
              <given-names>Dmitry Ivanovich</given-names>
            </name>
          </name-alternatives>
          <email>damitry.popov@gmail.com</email>
          <xref ref-type="aff" rid="aff-2"/>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">Московский политехнический университет</aff>
        <aff xml:lang="en">Moscow Polytechnic University</aff>
      </aff-alternatives>
      <aff-alternatives id="aff-2">
        <aff xml:lang="ru">Сочинский государственный университет</aff>
        <aff xml:lang="en">Sochi State University</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>47</volume>
      <issue>4</issue>
      <elocation-id>038</elocation-id>
      <permissions>
        <copyright-statement>Copyright © Авторы, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=1763"/>
      <abstract xml:lang="ru">
        <p>Данное исследование посвящено разработке интегральной метрики для оценки качества моделей перефразирования текстов, что отвечает актуальной задаче создания комплексных и объективных методов оценки. В отличие от предыдущих исследований, преимущественно фокусирующихся на англоязычных наборах данных, настоящее исследование акцентирует внимание на наборах данных русского языка, которые до настоящего времени оставались недостаточно изученными. Использование таких датасетов, как Gazeta, XL-Sum и WikiLingua (для русского языка), а также CNN Dailymail и XSum (для английского языка), обеспечивает многоязычную применимость предложенного подхода. Предлагаемая метрика сочетает лексические (ROUGE, BLEU), структурные (ROUGE-L) и семантические (BERTScore, METEOR, BLEURT) критерии оценки с распределением весов, исходя из важности каждой метрики. Результаты демонстрируют превосходство моделей ChatGPT-4 на русскоязычных наборах и GigaChat на англоязычных наборах, тогда как модели Gemini и YouChat показывают ограниченные возможности в достижении семантической точности вне зависимости от языка датасета. Оригинальность исследования заключается в объединении метрик в единую систему, что делает возможным более объективное и комплексное сравнение языковых моделей. Исследование вносит вклад в область обработки естественного языка, предлагая инструмент для оценки качества языковых моделей.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>The study focuses on the development of an integral metric for evaluating the quality of text paraphrasing models, addressing the pressing need for comprehensive and objective evaluation methods. Unlike previous research, which predominantly focuses on English-language datasets, this study emphasizes Russian-language datasets, which have remained underexplored until now. The inclusion of datasets such as Gazeta, XL-Sum, and WikiLingua (for Russian) as well as CNN Dailymail and XSum (for English) ensures the multilingual applicability of the proposed approach. The proposed metric combines lexical (ROUGE, BLEU), structural (ROUGE-L), and semantic (BERTScore, METEOR, BLEURT) evaluation criteria, with weights assigned based on the importance of each metric. The results highlight the superiority of ChatGPT-4 on Russian datasets and GigaChat on English datasets, whereas models such as Gemini and YouChat exhibit limited capabilities in achieving semantic accuracy regardless of the dataset language. The originality of this research lies in the integration of multiple metrics into a unified system, enabling more objective and comprehensive comparisons of language models. The study contributes to the field of natural language processing by providing a tool for assessing the quality of language models.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>обработка естественного языка</kwd>
        <kwd>перефразирование текста</kwd>
        <kwd>GigaChat</kwd>
        <kwd>YandexGPT 2</kwd>
        <kwd>ChatGPT-3.5</kwd>
        <kwd>ChatGPT-4</kwd>
        <kwd>Gemini</kwd>
        <kwd>Bing AI</kwd>
        <kwd>YouChat</kwd>
        <kwd>Mistral Large</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>natural language processing</kwd>
        <kwd>text paraphrasing</kwd>
        <kwd>GigaChat</kwd>
        <kwd>YandexGPT 2</kwd>
        <kwd>ChatGPT-3.5</kwd>
        <kwd>ChatGPT-4</kwd>
        <kwd>Gemini</kwd>
        <kwd>Bing AI</kwd>
        <kwd>YouChat</kwd>
        <kwd>Mistral Large</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Xie J., Agrawal A. Emotion and Sentiment Guided Paraphrasing. In: Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, &amp; Social Media Analysis, 13 July 2023, Toronto, Canada. Association for Computational Linguistics; 2023. pp. 58–70. https://doi.org/10.18653/v1/2023.wassa-1.7</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Krishna K., Song Y., Karpinska M., Wieting J., Iyyer M. Paraphrasing Evades Detectors of AI-Generated Text, but Retrieval is an Effective Defense. In: Advances in Neural Information Processing Systems: 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 10–16 December 2023, New Orleans, USA. Curran Associates; 2024. https://doi.org/10.48550/arXiv.2303.13408</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Sadasivan V.S., Kumar A., Balasubramanian S., Wang W., Feizi S. Can AI-Generated Text be Reliably Detected? arXiv. URL: https://doi.org/10.48550/arXiv.2303.11156 [Accessed 14th November 2024].</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Verma D., Lal Y.K., Sinha S., Van Durme B., Poliak A. Evaluating Paraphrastic Robustness in Textual Entailment Models. arXiv. URL: https://doi.org/10.48550/arXiv.2306.16722 [Accessed 14th November 2024].</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Shen L., Liu L., Jiang H., Shi S. On the Evaluation Metrics for Paraphrase Generation. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 07–11 December 2022, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics; 2022. pp. 3178–3190.</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Weston J., Lenain R., Meepegama U., Fristed E. Generative Pretraining for Paraphrase Evaluation. arXiv. URL: https://doi.org/10.48550/arXiv.2107.08251 [Accessed 14th November 2024].</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Sharma S., Joshi A., Mukhija N., Zhao Y., Bhathena H., Singh P., Santhanam S., Biswas P. Systematic review of effect of data augmentation using paraphrasing on Named entity recognition. In: NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 28 November – 09 December 2022, New Orleans, USA.</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Han T., Li D., Ma X., Hu N. Comparing product quality between translation and paraphrasing: Using NLP-assisted evaluation frameworks. Frontiers in Psychology. 2022;13. https://doi.org/10.3389/fpsyg.2022.1048132</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Ahn J., Khosmood F. Evaluation of Automatic Text Summarization using Synthetic Facts. arXiv. URL: https://doi.org/10.48550/arXiv.2204.04869 [Accessed 14th November 2024].</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Nicula B., Dascalu M., Newton N., Orcutt E., McNamara D.S. Automated Paraphrase Quality Assessment Using Recurrent Neural Networks and Language Models. In: Intelligent Tutoring Systems: 17th International Conference, ITS 2021: Proceedings, 07–11 June 2021, Online. Cham: Springer; 2021. pp. 333–340. https://doi.org/10.1007/978-3-030-80421-3_36</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Gusev I. Dataset for Automatic Summarization of Russian News. In: Artificial Intelligence and Natural Language: 9th Conference, AINL 2020: Proceedings, 07–09 October 2020, Helsinki, Finland. Cham: Springer; 2020. pp. 122–134. https://doi.org/10.1007/978-3-030-59082-6_9</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Hasan T., Bhattacharjee A., Islam M.S., Mubasshir K., Li Y.-F., Kang Y.-B., Rahman M.S., Shahriyar R. XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 01–06 August 2021, Online. Association for Computational Linguistics; 2021. pp. 4693–4703. https://doi.org/10.18653/v1/2021.findings-acl.413</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Ladhak F., Durmus E., Cardie C., McKeown K. WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization. In: Findings of the Association for Computational Linguistics: EMNLP 2020, 16–20 November 2020, Online. Association for Computational Linguistics; 2020. pp. 4034–4048. https://doi.org/10.18653/v1/2020.findings-emnlp.360</mixed-citation>
      </ref>
      <ref id="cit14">
        <label>14</label>
        <mixed-citation xml:lang="ru">Nallapati R., Zhou B., Dos Santos C., Gülçehre Ç., Xiang B. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. In: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, 11–12 August 2016, Berlin, Germany. Berlin: Association for Computational Linguistics; 2016. pp. 280–290. https://doi.org/10.18653/v1/K16-1028</mixed-citation>
      </ref>
      <ref id="cit15">
        <label>15</label>
        <mixed-citation xml:lang="ru">Narayan S., Cohen S.B., Lapata M. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 31 October – 04 November 2018, Brussels, Belgium. Association for Computational Linguistics; 2018. pp. 1797–1807. https://doi.org/10.18653/v1/D18-1206</mixed-citation>
      </ref>
      <ref id="cit16">
        <label>16</label>
        <mixed-citation xml:lang="ru">Patil O., Singh R., Joshi T. Understanding Metrics for Paraphrasing. arXiv. URL: https://doi.org/10.48550/arXiv.2205.13119 [Accessed 14th November 2024].</mixed-citation>
      </ref>
      <ref id="cit17">
        <label>17</label>
        <mixed-citation xml:lang="ru">Lin C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 25–26 July 2004, Barcelona, Spain. Association for Computational Linguistics; 2004. pp. 74–81.</mixed-citation>
      </ref>
      <ref id="cit18">
        <label>18</label>
        <mixed-citation xml:lang="ru">Zhang T., Kishore V., Wu F., Weinberger K.Q., Artzi Y. BERTScore: Evaluating Text Generation with BERT. In: Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, 26–30 April 2020, Addis Ababa, Ethiopia. Addis Ababa: International Conference on Learning Representations; 2020. pp. 1–43. https://doi.org/10.48550/arXiv.1904.09675</mixed-citation>
      </ref>
      <ref id="cit19">
        <label>19</label>
        <mixed-citation xml:lang="ru">Banerjee S., Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 29 June 2005, Ann Arbor, USA. Association for Computational Linguistics; 2005. pp. 65–72.</mixed-citation>
      </ref>
      <ref id="cit20">
        <label>20</label>
        <mixed-citation xml:lang="ru">Post M. A Call for Clarity in Reporting BLEU Scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers, 31 October – 01 November 2018, Brussels, Belgium. Association for Computational Linguistics; 2018. pp. 186–191. https://doi.org/10.18653/v1/W18-6319</mixed-citation>
      </ref>
      <ref id="cit21">
        <label>21</label>
        <mixed-citation xml:lang="ru">Sellam T., Das D., Parikh A. BLEURT: Learning Robust Metrics for Text Generation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 05–10 July 2020, Online. Association for Computational Linguistics; 2020. pp. 7881–7892. https://doi.org/10.18653/v1/2020.acl-main.704</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The authors declare that there are no conflicts of interest.</p>
      </fn>
    </fn-group>
  </back>
</article>