<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2024.45.2.041</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">1588</article-id>
      <title-group>
        <article-title xml:lang="ru">Использование одновременной многопоточности в высокопроизводительных вычислительных алгоритмах</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Using simultaneous multithreading in high-performance numerical algorithms</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Буевич</surname>
              <given-names>Евгений Андреевич</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Buevich</surname>
              <given-names>Evgeniy Andreevich</given-names>
            </name>
          </name-alternatives>
          <email>gftregs@gmail.com</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">Московский государственный технологический университет “СТАНКИН”</aff>
        <aff xml:lang="en">Moscow State Technological University "STANKIN"</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <elocation-id>041</elocation-id>
      <permissions>
        <copyright-statement>Copyright © Авторы, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=1588"/>
      <abstract xml:lang="ru">
        <p>Технология одновременной многопоточности считается малоприменимой в программах, занимающихся интенсивными вычислениями, в частности, при умножении матриц – одной из основных операций машинного обучения. Целью данной работы является определение границ применимости этого типа многопоточности к интенсивным вычислениям на примере блочного матричного умножения. В работе выделен ряд характеристик кода умножения матриц и архитектуры процессора, влияющих на эффективность использования одновременной многопоточности. Предложен способ определения наличия структурных ограничений процессора при исполнении более чем одного потока и их количественной оценки. Рассмотрено влияние используемого примитива синхронизации и его особенности применительно к одновременной многопоточности. Рассмотрен существующий алгоритм разделения матриц на блоки, предложено изменение размеров блоков и параметров циклов для лучшей утилизации вычислительных модулей ядра процессора двумя потоками. Создана модель оценки производительности выполнения идентичного кода двумя потоками на одном физическом ядре. Создан критерий определения возможности оптимизации кода с интенсивными вычислениями с помощью этого типа многопоточности. Показано, что разделение вычислений между логическими потоками с использованием общего кэша L1 оправдано как минимум на одной из распространенных архитектур процессоров.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>Simultaneous multithreading is considered to be of little use in computationally intensive programs, in particular in matrix multiplication – one of the basic operations of machine learning. The purpose of this work is to determine the limits of applicability of this type of multithreading to high-performance numerical code, using block matrix multiplication as an example. The paper identifies a number of characteristics of matrix multiplication code and of the processor architecture that affect the efficiency of simultaneous multithreading. A method is proposed for detecting structural limitations of the processor when it executes more than one thread and for estimating them quantitatively. The influence of the synchronization primitive used and its specifics with respect to simultaneous multithreading are considered. An existing algorithm for dividing matrices into blocks is examined, and changes to the block sizes and loop parameters are proposed so that two threads make better use of the execution units of a processor core. A model has been created to evaluate the performance of identical code executed by two threads on one physical core. A criterion has been formulated for determining whether computationally intensive code can be optimized using this type of multithreading. It is shown that dividing computations between logical threads that share a common L1 cache is beneficial on at least one widespread processor architecture.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>одновременная многопоточность</kwd>
        <kwd>умножение матриц</kwd>
        <kwd>интенсивные вычисления</kwd>
        <kwd>микроядро</kwd>
        <kwd>BLAS</kwd>
        <kwd>BLIS</kwd>
        <kwd>синхронизация</kwd>
        <kwd>иерархия кэша</kwd>
        <kwd>спинлок</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>simultaneous multithreading</kwd>
        <kwd>matrix multiplication</kwd>
        <kwd>computationally intensive</kwd>
        <kwd>microkernel</kwd>
        <kwd>BLAS</kwd>
        <kwd>BLIS</kwd>
        <kwd>synchronization</kwd>
        <kwd>cache hierarchy</kwd>
        <kwd>spinlock</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Tullsen D.M., Eggers S.J., Levy H.M. Simultaneous multithreading: maximizing on-chip parallelism. In: ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture, 22-24 June 1995, Santa Margherita Ligure, Italy. New York: Association for Computing Machinery; 1995. P. 392–403. https://doi.org/10.1145/223982.224449</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Marr D.T., Binns F., Hill D.L., Hinton G., Koufaty D.A., Miller J.A., Upton M.  Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal. 2002;6(1):4–15.</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Leng T., Ali R., Hsieh J., Mashayekhi V., Rooholamini R. An Empirical Study of Hyper-Threading in High Performance Computing Clusters. In: Proceedings of The 3rd LCI International Conference on Linux Clusters: The HPC Revolution 2002, 23-25 October 2002, Saint Petersburg, FL, USA. Linux Clusters Institute; 2002.</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Smith T.M., Van De Geijn R., Smelyanskiy M., Hammond J.R., Van Zee F.G. Anatomy of High-Performance Many-Threaded Matrix Multiplication. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 19-23 May 2014, Phoenix, AZ, USA. IEEE Computer Society; 2024. P. 1049–1059. https://doi.org/10.1109/IPDPS.2014.110</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Xu R.G., Van Zee F.G., Van De Geijn R.A. GEMMFIP: Unifying GEMM in BLIS. URL: https://doi.org/10.48550/arXiv.2302.08417 (Accessed 15th April 2024).</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Goto K., Van De Geijn R.A. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software. 2008;34(3). https://doi.org/10.1145/1356052.1356053</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Van Zee F.G., Van De Geijn R.A. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Transactions on Mathematical Software. 2015;41(3). https://doi.org/10.1145/2764454</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Huang J., Van De Geijn R.A. BLISlab: A Sandbox for Optimizing GEMM. URL: https://doi.org/10.48550/arXiv.1609.00076 (Accessed 15th April 2024)</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Low T.M., Igual F.D., Smith T.M., Quintana-Orti E.S. Analytical Modeling Is Enough for High-Performance BLIS. ACM Transactions on Mathematical Software. 2016;43(2). https://doi.org/10.1145/2925987</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Fog A. The Microarchitecture of Intel, AMD and VIA CPUs. URL: https://www.agner.org/optimize/microarchitecture.pdf (Accessed 17th April 2024).</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Ren X., Moody L., Taram M., Jordan M., Tullsen D.M., Venkat A. I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches. In: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 14-18 June 2021, Valencia, Spain. IEEE; 2021. P. 361–374. https://doi.org/10.1109/ISCA52012.2021.00036</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Kim D., Liao S.S.-W., Wang P.H., Del Cuvillo J., Tian X., Zou X., Wang H., Yeung D., Girkar M., Shen J.P. Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. In: International Symposium on Code Generation and Optimization (CGO 2004), 20-24 March 2004, San Jose, CA, USA. IEEE; 2004. P. 27–38. https://doi.org/10.1109/CGO.2004.1281661</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Tullsen D.M., Lo J.L., Eggers S.J., Levy H.M. Supporting fine-grained synchronization on a simultaneous multithreading processor. In: HPCA '99: Proceedings of the 5th International Symposium on High-Performance Computer Architecture, 09-13 January 1999, Orlando, FL, USA. NW Washington: IEEE Computer Society; 1999. P. 54–58. https://doi.org/10.1109/HPCA.1999.744326</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The author declares that there is no conflict of interest.</p>
      </fn>
    </fn-group>
  </back>
</article>