Study of the problem of automated matching of audio files
UDC 004.032.26
DOI: 10.26102/2310-6018/2025.51.4.004
The volume of audio data has grown significantly and continues to grow, which complicates its processing due to the presence of numerous duplicates, noisy recordings, and truncated audio clips. This article presents a solution to the problem of detecting fuzzy duplicates in large-scale audio datasets. The proposed method is based on a cascaded ensemble: Convolutional Neural Networks (CNN), Temporal Shift Networks (TSN), and Siamese networks were used for feature extraction, analysis of temporal parameters, and evaluation of similarity between recordings. The input data were first converted into mel-spectrogram images using the Short-Time Fourier Transform (STFT). Each audio file was segmented at a fixed sampling rate with temporal continuity preserved, transformed with the STFT, and then passed through the ensemble of models. The study focuses on the behavior of the ensemble when processing recordings that have undergone various modifications, such as noise addition, distortion, and trimming. Experiments on the assembled dataset demonstrated a high degree of correlation between the method's results and the assessments of human evaluators, confirming the effectiveness of the proposed solution. The method showed strong robustness to different types of audio modifications, including tempo changes, noise injection, and clipping. Future research may adapt the ensemble to other types of data, including video and image data, which would expand the applicability of the proposed approach.
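To make the preprocessing step concrete, the following minimal sketch in Python converts an audio file into per-segment mel-spectrograms via the STFT, mirroring the pipeline outlined above; the librosa library, the segment length, and the mel-filter parameters are illustrative assumptions rather than the exact configuration used in the study.

# Minimal sketch (not the authors' exact pipeline): split an audio file into
# fixed-length segments that preserve temporal order and convert each segment
# into a log-scaled mel-spectrogram via the STFT. Library choice (librosa),
# segment length, hop length, and the number of mel bands are assumptions.
import numpy as np
import librosa

def audio_to_mel_segments(path, segment_sec=5.0, sr=22050, n_fft=2048,
                          hop_length=512, n_mels=128):
    """Return one log-mel-spectrogram per consecutive segment of the file."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    samples_per_segment = int(segment_sec * sr)
    spectrograms = []
    for start in range(0, len(y) - samples_per_segment + 1, samples_per_segment):
        segment = y[start:start + samples_per_segment]
        # STFT -> mel filter bank -> decibel scale
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        spectrograms.append(librosa.power_to_db(mel, ref=np.max))
    # Each entry would then be passed to the CNN/TSN/Siamese ensemble.
    return spectrograms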
Keywords: audio duplicates, convolutional networks, Fourier transform, audio noise, model robustness, mel-spectrogram, Siamese architecture, temporal features, comparison of audio recordings
For citation: Levshin D.V., Bystryakov D.V., Zubkov A.V. Study of the problem of automated matching of audio files. Modeling, Optimization and Information Technology. 2025;13(4). URL: https://moitvivt.ru/ru/journal/pdf?id=1903 DOI: 10.26102/2310-6018/2025.51.4.004 (In Russ.).
Received 17.04.2025
Revised 09.09.2025
Accepted 25.09.2025