Study of the problem of automated matching of audio files
UDC 004.032.26
DOI: 10.26102/2310-6018/2025.51.4.004
The volume of audio data has grown significantly and continues to grow, which complicates its processing due to the presence of numerous duplicates, noisy recordings, and truncated audio clips. This article presents a solution to the problem of detecting fuzzy duplicates in large-scale audio datasets. The proposed method is based on a cascaded ensemble: Convolutional Neural Networks (CNN), Temporal Shift Networks (TSN), and Siamese networks were used for feature extraction, analysis of temporal parameters, and evaluation of similarity between recordings. The input data were first converted into mel-spectrogram images using the Short-Time Fourier Transform (STFT). Each audio file was segmented at a fixed sampling rate with temporal continuity preserved, transformed with the STFT, and then passed through the ensemble of models. The study focuses on the behavior of the ensemble when processing recordings that have undergone various modifications, such as noise addition, distortion, and trimming. Experiments on the assembled dataset demonstrated a high degree of correlation between the method's results and the assessments of human evaluators, confirming the effectiveness of the proposed solution. The method showed strong robustness to different types of audio modifications, including tempo changes, noise injection, and clipping. Future research may adapt the ensemble to other types of data, including video and image data, which would expand the applicability of the proposed approach.
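To make the preprocessing step concrete, the following minimal sketch in Python converts an audio file into per-segment mel-spectrograms via the STFT, mirroring the pipeline outlined above; the librosa library, the segment length, and the mel-filter parameters are illustrative assumptions rather than the exact configuration used in the study.

# Minimal sketch (not the authors' exact pipeline): split an audio file into
# fixed-length segments that preserve temporal order and convert each segment
# into a log-scaled mel-spectrogram via the STFT. Library choice (librosa),
# segment length, hop length, and the number of mel bands are assumptions.
import numpy as np
import librosa

def audio_to_mel_segments(path, segment_sec=5.0, sr=22050, n_fft=2048,
                          hop_length=512, n_mels=128):
    """Return one log-mel-spectrogram per consecutive segment of the file."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    samples_per_segment = int(segment_sec * sr)
    spectrograms = []
    for start in range(0, len(y) - samples_per_segment + 1, samples_per_segment):
        segment = y[start:start + samples_per_segment]
        # STFT -> mel filter bank -> decibel scale
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        spectrograms.append(librosa.power_to_db(mel, ref=np.max))
    # Each entry would then be passed to the CNN/TSN/Siamese ensemble.
    return spectrograms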
Keywords: audio duplicates, convolutional networks, Fourier transform, audio noise, model robustness, mel-spectrogram, Siamese architecture, temporal features, comparison of audio recordings
For citation: Levshin D.V., Bystryakov D.V., Zubkov A.V. Study of the problem of automated matching of audio files. Modeling, Optimization and Information Technology. 2025;13(4). URL: https://moitvivt.ru/ru/journal/pdf?id=1903 DOI: 10.26102/2310-6018/2025.51.4.004 (In Russ.).
Received 17.04.2025
Revised 09.09.2025
Accepted 25.09.2025