<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2024.44.1.001</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">1489</article-id>
      <title-group>
        <article-title xml:lang="ru">Методы отбора признаков в задаче определения авторства в контексте кибербезопасности</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Feature selection methods for authorship attribution in the context of cybersecurity</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes">
          <contrib-id contrib-id-type="orcid">0000-0002-2587-2222</contrib-id>
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Романов</surname>
              <given-names>Александр Сергеевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Romanov</surname>
              <given-names>Aleksandr Sergeevich</given-names>
            </name>
          </name-alternatives>
          <email>alexx.romanov@gmail.com</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">Томский государственный университет систем управления и радиоэлектроники</aff>
        <aff xml:lang="en">Tomsk State University of Control Systems and Radioelectronics</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <elocation-id>10.26102/2310-6018/2024.44.1.001</elocation-id>
      <permissions>
        <copyright-statement>Copyright © Авторы, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=1489"/>
      <abstract xml:lang="ru">
        <p>В работе рассмотрены методы определения авторства естественных и искусственно-сгенерированных текстов, важных в контексте кибербезопасности и защиты интеллектуальной собственности с целью предотвращения дезинформации и мошенничества. Использование методов определения автора текста обосновано выводами об эффективности рассмотренных в прошлых исследованиях fastText и метода опорных векторов (SVM). Алгоритм отбора признаков выбран на основе сравнения пяти различных методов – генетического алгоритма, прямого и обратного последовательных методов, регуляризационного отбора и метода Шепли. Рассмотренные алгоритмы отбора включают эвристические методы, элементы теории игр и итерационные алгоритмы. Наиболее эффективным методом признан алгоритм, основанный на регуляризации, в то время как методы, основанные на полном переборе, признаны неэффективными для любого множества авторов. Точность отбора на основе регуляризации и SVM в среднем составила 77 %, что превосходит другие методы от 3 до 10 % при идентичном количестве признаков. При тех же задачах средняя точность fastText – 84 %. Было проведено исследование, направленное на устойчивость разработанного подхода к генеративным образцам. SVM оказался более устойчив к запутыванию модели. Максимальная потеря точности для fastText составила 16 %, а для SVM – 12 %.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>This paper examines methods for attributing the authorship of natural and artificially generated texts, a task relevant to cybersecurity and intellectual property protection as a means of preventing disinformation and fraud. The choice of attribution methods rests on earlier studies that found fastText and the support vector machine (SVM) to be effective. The feature selection algorithm was chosen by comparing five methods: a genetic algorithm, forward and backward sequential selection, regularization-based selection, and the Shapley method. These algorithms span heuristic methods, elements of game theory, and iterative algorithms. The regularization-based algorithm proved the most effective, whereas methods based on exhaustive search were found to be ineffective for any set of authors. Regularization-based selection combined with SVM achieved an average accuracy of 77%, exceeding the other methods by 3 to 10% with the same number of features. On the same tasks, the average accuracy of fastText was 84%. The robustness of the developed approach to generated samples was also studied. SVM proved more robust to attempts to confuse the model: the maximum loss of accuracy was 16% for fastText and 12% for SVM.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>отбор признаков</kwd>
        <kwd>определение автора</kwd>
        <kwd>машинное обучение</kwd>
        <kwd>нейронные сети</kwd>
        <kwd>анализ текста</kwd>
        <kwd>информационная безопасность</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>feature selection</kwd>
        <kwd>authorship attribution</kwd>
        <kwd>machine learning</kwd>
        <kwd>neural networks</kwd>
        <kwd>text analysis</kwd>
        <kwd>information security</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Данная работа выполнена при финансовой поддержке Министерства науки и высшего образования РФ в рамках базовой части государственного задания ТУСУРа на 2023–2025 гг. (проект № FEWM-2023-0015).</funding-statement>
        <funding-statement xml:lang="en">This research was supported by the Ministry of Science and Higher Education of the Russian Federation within the basic part of the state assignment of TUSUR for 2023–2025 (project No. FEWM-2023-0015).</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Romanov A., Kurtukova A., Shelupanov A., Fedotova A., Goncharov V. Authorship identification of a Russian-language text using support vector machine and deep neural networks. Future Internet. 2020;13(1):3. DOI: 10.3390/fi13010003.</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Fedotova A., Romanov A., Kurtukova A., Shelupanov A. Authorship attribution of social media and literary Russian-language texts using machine learning methods and feature selection. Future Internet. 2021;14(1):4. DOI: 10.3390/fi14010004.</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Wu H., Zhang Z., Wu Q. Exploring syntactic and semantic features for authorship attribution. Applied Soft Computing. 2021;111:107815–107822. DOI: 10.1016/j.asoc.2021.107815.</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Khomytska I., Bazylevych I., Teslyuk V. The statistical parameters of Ivan Franko’s authorial style determined by the chi-square test. 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 2022. p. 73–76. DOI: 10.1109/CSIT56902.2022.10000491.</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Chekhovich Y.V., Khazov A.V. Analysis of duplicated publications in Russian journals. Journal of Informetrics. 2022;16(1):101246. DOI: 10.1016/j.joi.2021.101246.</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Исаченко В.В., Апанович З.В. Система анализа и визуализации для кросс-языковой идентификации авторов научных публикаций. Вестник Новосибирского государственного университета. Серия: Информационные технологии. 2018;16(2):49–61. DOI: 10.25205/1818-7900-2018-16-2-49-61.</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Agun H.V., Yilmazel O. Incorporating topic information in a global feature selection schema for authorship attribution. IEEE Access. 2019;7:98522–98529. DOI: 10.1109/ACCESS.2019.2930536.</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Kou G., Yang P., Peng Y., Xiao F., Chen Y., Alsaadi F.E. Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Applied Soft Computing. 2020;86:105836. DOI: 10.1016/j.asoc.2019.105836.</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Bardamova M., Hodashinsky I. Hybrid algorithm for tuning feature weights in a fuzzy classifier. 2021 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), Yekaterinburg, Russia, 2021. p. 0354–0357. DOI: 10.1109/USBEREIT51232.2021.9455030.</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Yaseen A., Laftah W., Kadhum I., Hamad A. Wrapper feature selection method based differential evolution and extreme learning machine for intrusion detection system. Pattern Recognition. 2022;108912. DOI: 10.1016/j.patcog.2022.108912.</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Uchendu A., Le T., Lee D. Attribution and obfuscation of neural text authorship: A data mining perspective. ACM SIGKDD Explorations Newsletter. 2023;25(1):1–18. DOI: 10.48550/arXiv.2210.10488.</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Shamardina T. et al. Findings of the RuATD shared task 2022 on artificial text detection in Russian. arXiv preprint arXiv:2206.01583. 2022. DOI: 10.48550/arXiv.2206.01583.</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Xu W., Yuan K., Li W., Ding W. An emerging fuzzy feature selection method using composite entropy-based uncertainty measure and data distribution. IEEE Transactions on Emerging Topics in Computational Intelligence. 2022;7(1):76–88. DOI: 10.1109/TETCI.2022.3171784.</mixed-citation>
      </ref>
      <ref id="cit14">
        <label>14</label>
        <mixed-citation xml:lang="ru">Yao G., Xiaojian H., Guanxiong W. A novel ensemble feature selection method by integrating multiple ranking information combined with an SVM ensemble model for enterprise credit risk prediction in the supply chain. Expert Systems with Applications. 2022;117002. DOI: 10.1016/j.eswa.2022.117002.</mixed-citation>
      </ref>
      <ref id="cit15">
        <label>15</label>
        <mixed-citation xml:lang="ru">Abu Khurma R., Aljarah I., Sharieh A., Abd Elaziz M., Damaševičius R., Krilavičius T. A review of the modification strategies of the nature inspired algorithms for feature selection problem. Mathematics. 2022;10(3):464. DOI: 10.3390/math10030464.</mixed-citation>
      </ref>
      <ref id="cit16">
        <label>16</label>
        <mixed-citation xml:lang="ru">Borboudakis G., Tsamardinos I. Forward-backward selection with early dropping. The Journal of Machine Learning Research. 2019;20(1):276–314. DOI: 10.5555/3322706.3322714.</mixed-citation>
      </ref>
      <ref id="cit17">
        <label>17</label>
        <mixed-citation xml:lang="ru">Le N.Q.K., Ho Q.T., Nguyen V.N., Chang J.S. BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection. Computational Biology and Chemistry. 2022;99:107732. DOI: 10.1016/j.compbiolchem.2022.107732.</mixed-citation>
      </ref>
      <ref id="cit18">
        <label>18</label>
        <mixed-citation xml:lang="ru">Новый частотный словарь русской лексики. URL: http://dict.ruslang.ru/freq.php (дата обращения 04.12.2023).</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The authors declare no conflicts of interest.</p>
      </fn>
    </fn-group>
  </back>
</article>