<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2024.45.2.041</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">1588</article-id>
      <title-group>
        <article-title xml:lang="ru">Использование одновременной многопоточности в высокопроизводительных вычислительных алгоритмах</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Using simultaneous multithreading in high-performance numerical algorithms</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Буевич</surname>
              <given-names>Евгений Андреевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Buevich</surname>
              <given-names>Evgeniy Andreevich</given-names>
            </name>
          </name-alternatives>
          <email>gftregs@gmail.com</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">Московский государственный технологический университет “СТАНКИН”</aff>
        <aff xml:lang="en">Moscow State Technological University "STANKIN"</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <elocation-id>041</elocation-id>
      <permissions>
        <copyright-statement>Copyright © Авторы, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=1588"/>
      <abstract xml:lang="ru">
        <p>Технология одновременной многопоточности считается малоприменимой в программах, занимающихся интенсивными вычислениями, в частности, при умножении матриц – одной из основных операций машинного обучения. Целью данной работы является определение границ применимости этого типа многопоточности к интенсивным вычислениям на примере блочного матричного умножения. В работе выделен ряд характеристик кода умножения матриц и архитектуры процессора, влияющих на эффективность использования одновременной многопоточности. Предложен способ определения наличия структурных ограничений процессора при исполнении более чем одного потока и их количественной оценки. Рассмотрено влияние используемого примитива синхронизации и его особенности применительно к одновременной многопоточности. Рассмотрен существующий алгоритм разделения матриц на блоки, предложено изменение размеров блоков и параметров циклов для лучшей утилизации вычислительных модулей ядра процессора двумя потоками. Создана модель оценки производительности выполнения идентичного кода двумя потоками на одном физическом ядре. Создан критерий определения возможности оптимизации кода с интенсивными вычислениями с помощью этого типа многопоточности. Показано, что разделение вычислений между логическими потоками с использованием общего кэша L1 оправдано как минимум на одной из распространенных архитектур процессоров.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>Simultaneous multithreading is considered to be of little use in computationally intensive programs, in particular in matrix multiplication – one of the basic operations of machine learning. The purpose of this work is to determine the limits of applicability of this type of multithreading to high-performance numerical code, using block matrix multiplication as an example. The paper identifies a number of characteristics of matrix multiplication code and of the processor architecture that affect the efficiency of simultaneous multithreading. A method is proposed for detecting structural limitations of the processor when it executes more than one thread and for estimating them quantitatively. The influence of the synchronization primitive used and its specifics with respect to simultaneous multithreading are considered. An existing algorithm for dividing matrices into blocks is examined, and changes to the block sizes and loop parameters are proposed so that two threads make better use of the execution units of a processor core. A model has been created to evaluate the performance of identical code executed by two threads on one physical core. A criterion has been formulated for determining whether computationally intensive code can be optimized using this type of multithreading. It is shown that dividing computations between logical threads that share a common L1 cache is beneficial on at least one widespread processor architecture.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>одновременная многопоточность</kwd>
        <kwd>умножение матриц</kwd>
        <kwd>интенсивные вычисления</kwd>
        <kwd>микроядро</kwd>
        <kwd>BLAS</kwd>
        <kwd>BLIS</kwd>
        <kwd>синхронизация</kwd>
        <kwd>иерархия кэша</kwd>
        <kwd>спинлок</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>simultaneous multithreading</kwd>
        <kwd>matrix multiplication</kwd>
        <kwd>computationally intensive</kwd>
        <kwd>microkernel</kwd>
        <kwd>BLAS</kwd>
        <kwd>BLIS</kwd>
        <kwd>synchronization</kwd>
        <kwd>cache hierarchy</kwd>
        <kwd>spinlock</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Tullsen D.M., Eggers S.J., Levy H.M. Simultaneous multithreading: maximizing on-chip parallelism. In: ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture, 22-24 June 1995, Santa Margherita Ligure, Italy. New York: Association for Computing Machinery; 1995. P. 392–403. https://doi.org/10.1145/223982.224449</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Marr D.T., Binns F., Hill D.L., Hinton G., Koufaty D.A., Miller J.A., Upton M.  Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal. 2002;6(1):4–15.</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Leng T., Ali R., Hsieh J., Mashayekhi V., Rooholamini R. An Empirical Study of Hyper-Threading in High Performance Computing Clusters. In: Proceedings of The 3rd LCI International Conference on Linux Clusters: The HPC Revolution 2002, 23-25 October 2002, Saint Petersburg, FL, USA. Linux Clusters Institute; 2002.</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Smith T.M., Van De Geijn R., Smelyanskiy M., Hammond J.R., Van Zee F.G. Anatomy of High-Performance Many-Threaded Matrix Multiplication. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 19-23 May 2014, Phoenix, AZ, USA. IEEE Computer Society; 2024. P. 1049–1059. https://doi.org/10.1109/IPDPS.2014.110</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Xu R.G., Van Zee F.G., Van De Geijn R.A. GEMMFIP: Unifying GEMM in BLIS. URL: https://doi.org/10.48550/arXiv.2302.08417 (Accessed 15th April 2024).</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Goto K., Van De Geijn R.A. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software. 2008;34(3). https://doi.org/10.1145/1356052.1356053</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Van Zee F.G., Van De Geijn R.A. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Transactions on Mathematical Software. 2015;41(3). https://doi.org/10.1145/2764454</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Huang J., Van De Geijn R.A. BLISlab: A Sandbox for Optimizing GEMM. URL: https://doi.org/10.48550/arXiv.1609.00076 (Accessed 15th April 2024)</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Low T.M., Igual F.D., Smith T.M., Quintana-Orti E.S. Analytical Modeling Is Enough for High-Performance BLIS. ACM Transactions on Mathematical Software. 2016;43(2). https://doi.org/10.1145/2925987</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Fog A. The Microarchitecture of Intel, AMD and VIA CPUs. URL: https://www.agner.org/optimize/microarchitecture.pdf (Accessed 17th April 2024).</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Ren X., Moody L., Taram M., Jordan M., Tullsen D.M., Venkat A. I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches. In: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 14-18 June 2021, Valencia, Spain. IEEE; 2021. P. 361–374. https://doi.org/10.1109/ISCA52012.2021.00036</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Kim D., Liao S.S.-W., Wang P.H., Del Cuvillo J., Tian X., Zou X., Wang H., Yeung D., Girkar M., Shen J.P. Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. In: International Symposium on Code Generation and Optimization (CGO 2004), 20-24 March 2004, San Jose, CA, USA. IEEE; 2004. P. 27–38. https://doi.org/10.1109/CGO.2004.1281661</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Tullsen D.M., Lo J.L., Eggers S.J., Levy H.M. Supporting fine-grained synchronization on a simultaneous multithreading processor. In: HPCA '99: Proceedings of the 5th International Symposium on High-Performance Computer Architecture, 09-13 January 1999, Orlando, FL, USA. NW Washington: IEEE Computer Society; 1999. P. 54–58. https://doi.org/10.1109/HPCA.1999.744326</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The author declares that there is no conflict of interest.</p>
      </fn>
    </fn-group>
  </back>
</article>