moitvivt

Моделирование, оптимизация и информационные технологии

Modeling, Optimization and Information Technology

2310-6018

Publisher

10.26102/2310-6018/2024.44.1.001

1489

Методы отбора признаков в задаче определения авторства в контексте кибербезопасности

Feature selection methods for authorship attribution in cybersecurity context

0000-0002-2587-2222

Романов

Александр Сергеевич

Romanov

Aleksandr Sergeevich

alexx.romanov@gmail.com

Томский государственный университет систем управления и радиоэлектроники

Tomsk State University of Control Systems and Radioelectronics

01 01 2026

1 1

2026

This work is licensed under a Creative Commons Attribution 4.0 International License

В работе рассмотрены методы определения авторства естественных и искусственно-сгенерированных текстов, важных в контексте кибербезопасности и защиты интеллектуальной собственности с целью предотвращения дезинформации и мошенничества. Использование методов определения автора текста обосновано выводами об эффективности рассмотренных в прошлых исследованиях fastText и метода опорных векторов (SVM). Алгоритм отбора признаков выбран на основе сравнения пяти различных методов – генетического алгоритма, прямого и обратного последовательных методов, регуляризационного отбора и метода Шепли. Рассмотренные алгоритмы отбора включают эвристические методы, элементы теории игр и итерационные алгоритмы. Наиболее эффективным методом признан алгоритм, основанный на регуляризации, в то время как методы, основанные на полном переборе, признаны неэффективными для любого множества авторов. Точность отбора на основе регуляризации и SVM в среднем составила 77 %, что превосходит другие методы от 3 до 10 % при идентичном количестве признаков. При тех же задачах средняя точность fastText – 84 %. Было проведено исследование, направленное на устойчивость разработанного подхода к генеративным образцам. SVM оказался более устойчив к запутыванию модели. Максимальная потеря точности для fastText составила 16 %, а для SVM – 12 %.

This paper examines methods for attributing the authorship of natural and artificially generated texts, a task relevant to cybersecurity and intellectual property protection as a means of preventing disinformation and fraud. The choice of attribution methods is based on earlier studies that demonstrated the effectiveness of fastText and the support vector machine (SVM). The feature selection algorithm was chosen by comparing five methods: a genetic algorithm, forward and backward sequential selection, regularization-based selection, and the Shapley method. Together these cover heuristic methods, elements of game theory, and iterative algorithms. Regularization-based selection proved the most effective, while methods based on exhaustive search were found to be ineffective for any set of authors. Regularization-based selection combined with SVM achieved an average accuracy of 77 %, exceeding the other methods by 3 to 10 % for the same number of features. On the same tasks, the average accuracy of fastText was 84 %. The robustness of the developed approach to generated samples was also studied. SVM proved more robust to attempts to confuse the model: the maximum loss of accuracy was 16 % for fastText and 12 % for SVM.
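The forward sequential method named in the abstract can be illustrated as a greedy wrapper that adds, at each step, the feature that most improves a classifier's score. The sketch below is only an illustration under stated assumptions, not the author's implementation: a leave-one-out nearest-centroid scorer stands in for SVM, and the function names and the toy data are hypothetical.

```python
from statistics import mean


def centroid_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier
    that only looks at the given feature indices (stand-in for SVM)."""
    correct = 0
    for i in range(len(X)):
        # Group the remaining samples by label, keeping only the chosen features.
        groups = {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j != i:
                groups.setdefault(label, []).append([row[f] for f in features])
        # One centroid per label: column-wise mean of its samples.
        centroids = {
            label: [mean(col) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        point = [X[i][f] for f in features]
        predicted = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def forward_selection(X, y, k):
    """Greedily pick up to k features, one at a time, by best score."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: centroid_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: feature 0 separates the two authors,
# feature 1 is noise, feature 2 is constant.
X = [
    [1.0, 3.0, 0.0],
    [1.1, 7.0, 0.0],
    [0.9, 5.0, 0.0],
    [5.0, 5.0, 0.0],
    [5.2, 3.0, 0.0],
    [4.8, 7.0, 0.0],
]
y = ["A", "A", "A", "B", "B", "B"]

print(forward_selection(X, y, 2))
```

Backward sequential selection follows the same pattern in reverse: it starts from the full feature set and repeatedly removes the feature whose removal hurts the score least.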

отбор признаков определение автора машинное обучение нейронные сети анализ текста информационная безопасность

feature selection authorship attribution machine learning neural networks text analysis information security

Данная работа выполнена при финансовой поддержке Министерства науки и высшего образования РФ в рамках базовой части государственного задания ТУСУРа на 2023–2025 гг. (проект № FEWM-2023-0015).

This research was supported by the Ministry of Science and Higher Education of the Russian Federation within the basic part of the state assignment of TUSUR for 2023–2025 (project No. FEWM-2023-0015).

References

Romanov A., Kurtukova A., Shelupanov A., Fedotova A., Goncharov V. Authorship identification of a Russian-language text using support vector machine and deep neural networks. Future Internet. 2020;13(1):3. DOI: 10.3390/fi13010003.

Fedotova A., Romanov A., Kurtukova A., Shelupanov A. Authorship attribution of social media and literary Russian-language texts using machine learning methods and feature selection. Future Internet. 2021;14(1):4. DOI: 10.3390/fi14010004.

Wu H., Zhang Z., Wu Q. Exploring syntactic and semantic features for authorship attribution. Applied Soft Computing. 2021;111:107815–107822. DOI: 10.1016/j.asoc.2021.107815.

Khomytska I., Bazylevych I., Teslyuk V. The statistical parameters of Ivan Franko’s authorial style determined by the chi-square test. 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 2022. p. 73–76. DOI: 10.1109/CSIT56902.2022.10000491.

Chekhovich Y.V., Khazov A.V. Analysis of duplicated publications in Russian journals. Journal of Informetrics. 2022;16(1):101246. DOI: 10.1016/j.joi.2021.101246.

Isachenko V.V., Apanovich Z.V. System of analysis and visualization for cross-language identification of authors of scientific publications. Vestnik of Novosibirsk State University. Series: Information Technologies. 2018;16(2):49–61. (In Russ.). DOI: 10.25205/1818-7900-2018-16-2-49-61.

Agun H.V., Yilmazel O. Incorporating topic information in a global feature selection schema for authorship attribution. IEEE Access. 2019;7:98522–98529. DOI: 10.1109/ACCESS.2019.2930536.

Kou G., Yang P., Peng Y., Xiao F., Chen Y., Alsaadi F.E. Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Applied Soft Computing. 2020;86:105836. DOI: 10.1016/j.asoc.2019.105836.

Bardamova M., Hodashinsky I. Hybrid algorithm for tuning feature weights in a fuzzy classifier. 2021 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), Yekaterinburg, Russia, 2021. p. 0354–0357. DOI: 10.1109/USBEREIT51232.2021.9455030.

Yaseen A., Laftah W., Kadhum I., Hamad A. Wrapper feature selection method based differential evolution and extreme learning machine for intrusion detection system. Pattern Recognition. 2022;108912. DOI: 10.1016/j.patcog.2022.108912.

Uchendu A., Le T., Lee D. Attribution and obfuscation of neural text authorship: A data mining perspective. ACM SIGKDD Explorations Newsletter. 2023;25(1):1–18. DOI: 10.48550/arXiv.2210.10488.

Shamardina T. et al. Findings of the RuATD shared task 2022 on artificial text detection in Russian. arXiv preprint arXiv:2206.01583; 2022. DOI: 10.48550/arXiv.2206.01583.

Xu W., Yuan K., Li W., Ding W. An emerging fuzzy feature selection method using composite entropy-based uncertainty measure and data distribution. IEEE Transactions on Emerging Topics in Computational Intelligence. 2022;7(1):76–88. DOI: 10.1109/TETCI.2022.3171784.

Yao G., Xiaojian H., Guanxiong W. A novel ensemble feature selection method by integrating multiple ranking information combined with an SVM ensemble model for enterprise credit risk prediction in the supply chain. Expert Systems with Applications. 2022;117002. DOI: 10.1016/j.eswa.2022.117002.

Abu Khurma R., Aljarah I., Sharieh A., Abd Elaziz M., Damaševičius R., Krilavičius T. A review of the modification strategies of the nature inspired algorithms for feature selection problem. Mathematics. 2022;10(3):464. DOI: 10.3390/math10030464.

Borboudakis G., Tsamardinos I. Forward-backward selection with early dropping. The Journal of Machine Learning Research. 2019;20(1):276–314. DOI: 10.5555/3322706.3322714.

Le N.Q.K., Ho Q.T., Nguyen V.N., Chang J.S. BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection. Computational Biology and Chemistry. 2022;99:107732. DOI: 10.1016/j.compbiolchem.2022.107732.

Новый частотный словарь русской лексики [New Frequency Dictionary of Russian Vocabulary]. URL: http://dict.ruslang.ru/freq.php (accessed on 04.12.2023).

The authors declare no conflicts of interest.