References

moitvivt

Моделирование, оптимизация и информационные технологии

Modeling, Optimization and Information Technology

2310-6018

Publisher

10.26102/2310-6018/2026.54.3.008

2207

Метод извлечения информации на основе экстрактивных вопросно-ответных моделей и стратегий оценки и агрегации релевантных фрагментов текста

A method for information extraction based on extractive question-answering models and strategies for evaluating and aggregating relevant text fragments

0000-0002-2429-1805

Мартынюк

Полина Антоновна

Martynyuk

Polina Antonovna

martynyuk.pa@bmstu.ru

Московский государственный технический университет имени Н.Э. Баумана Bauman Moscow State Technical University

01 01 2026

1 1

2026

This work is licensed under a Creative Commons Attribution 4.0 International License

As the volume of heterogeneous textual data grows rapidly, universal approaches to information extraction that do not depend on the structure or subject domain of the source texts are becoming increasingly important. Despite the widespread adoption of large generative language models, accurate and resource-efficient information extraction from text remains an open problem. Generative models, for all their broad capabilities, are often excessive for specialized information retrieval tasks and may yield results with low interpretability. This study is part of a research effort to develop an alternative method for extracting information from unstructured texts in order to build a structural model of a text document. The proposed approach identifies semantically rich text fragments by analyzing their relevance to given thematic aspects of the text. The paper proposes an information extraction method that uses an extractive question-answering model and relies on multi-level answer aggregation, combining strategies for assessing the relevance of text fragments, semantic clustering, and selection of the final answer to a given question. The approach identifies the words in a text that are most relevant to the target thematic aspects; these words can subsequently be used to extract reliable information from the document. The article reports experimental results that confirm the effectiveness of the proposed method in identifying semantically relevant elements of a text document. The results are of practical value for developing systems that automatically construct semantic structures of text and can be applied to document analysis, information retrieval, and intelligent text processing.
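The aggregation stages named in the abstract (relevance assessment, clustering of similar answers, final answer selection) might be sketched as follows. This is an illustrative sketch only, not the author's implementation. The `Candidate` type, the function names, the thresholds, the token-overlap (Jaccard) similarity used here in place of semantic clustering, and the sum-of-scores cluster ranking are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str     # answer span returned by a QA model for one text window (assumed)
    score: float  # model confidence for that span, assumed to lie in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cluster(cands, threshold=0.5):
    """Greedy clustering: a candidate joins the first cluster whose
    highest-scoring member is similar enough, else it starts a new cluster."""
    clusters = []
    for c in sorted(cands, key=lambda c: c.score, reverse=True):
        for group in clusters:
            if jaccard(group[0].text, c.text) >= threshold:
                group.append(c)
                break
        else:
            clusters.append([c])
    return clusters


def select_answer(cands, min_score=0.1, threshold=0.5):
    """Filter weak candidates, cluster the rest, and return the top member
    of the cluster with the largest total score (None if nothing survives)."""
    kept = [c for c in cands if c.score >= min_score]
    if not kept:
        return None
    groups = cluster(kept, threshold)
    best = max(groups, key=lambda g: sum(c.score for c in g))
    return best[0].text  # clusters are filled in descending score order


candidates = [
    Candidate("Bauman Moscow State Technical University", 0.55),
    Candidate("Moscow State Technical University", 0.50),
    Candidate("the journal", 0.60),
    Candidate("2026", 0.05),
]
print(select_answer(candidates))  # → Bauman Moscow State Technical University
```

In this toy input the single highest-scoring span ("the journal") loses to two mutually consistent spans whose combined score is higher, which is the kind of effect an aggregation step is meant to capture.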

natural language processing; information extraction; unstructured text; question-answering model; self-attention mechanism

The study was performed without external funding.

References

Xu D., Chen W., Peng W., et al. Large language models for generative information extraction: A survey. Frontiers of Computer Science. 2024;18(6). https://doi.org/10.1007/s11704-024-40555-y

Huang L., Yu W., Ma W., et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems. 2025;43(2). https://doi.org/10.1145/3703155

Zhao H., Chen H., Yang F., et al. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology. 2024;15(2). https://doi.org/10.1145/3639372

Cong X., Yu B., Fang M., et al. Universal information extraction with meta-pretrained self-retrieval. In: Findings of the Association for Computational Linguistics: ACL 2023, 09–14 July 2023, Toronto, Canada. Association for Computational Linguistics; 2023. P. 4084–4100. https://doi.org/10.18653/v1/2023.findings-acl.251

Dagdelen J., Dunn A., Lee S., et al. Structured information extraction from scientific text with large language models. Nature Communications. 2024;15. https://doi.org/10.1038/s41467-024-45563-x

Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019: Volume 1, 02–07 June 2019, Minneapolis, MN, USA. Association for Computational Linguistics; 2019. P. 4171–4186.

Karpukhin V., Oguz B., Min S., et al. Dense Passage Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, 16–20 November 2020, Online. Association for Computational Linguistics; 2020. P. 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550

Izacard G., Grave E. Distilling Knowledge from Reader to Retriever for Question Answering. arXiv. URL: https://doi.org/10.48550/arXiv.2012.04584 [Accessed 12th January 2026].

Mondal I., Yuan M., Natarajan A., et al. ADAPTIVE IE: Investigating the Complementarity of Human-AI Collaboration to Adaptively Extract Information on-the-fly. In: Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, 19–24 January 2025, Abu Dhabi, UAE. Association for Computational Linguistics; 2025. P. 5870–5889.

Ngo N.T., Min B., Nguyen Th.H. Unsupervised domain adaptation for joint information extraction. In: Findings of the Association for Computational Linguistics: EMNLP 2022, 07–11 December 2022, Abu Dhabi, UAE. Association for Computational Linguistics; 2022. P. 5894–5905. https://doi.org/10.18653/v1/2022.findings-emnlp.434

Arzideh K., Schäfer H., Allende-Cid H., et al. From BERT to generative AI – Comparing encoder-only vs. large language models in a cohort of lung cancer patients for named entity recognition in unstructured medical reports. Computers in Biology and Medicine. 2025;195. https://doi.org/10.1016/j.compbiomed.2025.110665

Berezkin D.V., Kozlov I.A., Martynyuk P.A., Panfilkin A.M. A method for creating structural models of text documents using neural networks. Bulletin of the South Ural State University. Series: Computational Mathematics and Informatics. 2023;12(1):28–45. (In Eng.). https://doi.org/10.14529/cmse230102

Jain S., Van Zuylen M., Hajishirzi H., Beltagy I. SciREX: A challenge dataset for document-level information extraction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, 05–10 July 2020, Online. Association for Computational Linguistics; 2020. P. 7506–7516. https://doi.org/10.18653/v1/2020.acl-main.670

Graesser A.C., McNamara D.S., Louwerse M.M., Cai Zh. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers. 2004;36(2):193–202. https://doi.org/10.3758/BF03195564

Prentice Sh., Knight J., Rayson P., Haj M.E., Rutherford N. Problematising characteristicness: a biomedical association case study. International Journal of Corpus Linguistics. 2021;26(3):305–335. https://doi.org/10.1075/ijcl.19019.pre

The author declares no conflicts of interest.