A Hybrid Model for Acoustic Source Separation Based on Deep Clustering

Authors

  • Dgiakh M. Shahoud
  • Evgeniy D. Agafonov

DOI:

https://doi.org/10.24160/1993-6982-2026-2-146-155

Keywords:

acoustic source separation, hybrid model, microphone array, reverberant environment, bidirectional recurrent neural network, ideal binary mask, clustering algorithm

Abstract

The need to separate acoustic sources arises in many areas of engineering, technology, and digital processing of acoustic signals, such as sound separation in music signals, audio coding, speech recognition, automatic transcription of speech and music, and filtering of unwanted sounds. Simultaneous localization of several overlapping sources is of greatest interest. The article presents a hybrid model for separating signals obtained with a small orthogonal microphone array in a closed reverberant environment. The proposed approach is based on a bidirectional recurrent deep neural network. An ideal binary mask, computed from the known signals of each source, is used in calculating the loss function, defined as the Frobenius norm of the difference between the estimated and target affinity matrices. At the next stage, a clustering algorithm is applied to the outputs of the trained model to estimate the target mask and reconstruct the signals of the individual sources. The model was trained on three data sets corresponding to different simulation scenarios and then tested on short acoustic signals of 500 ms duration. The model trained on all possible source locations in the room, with the corresponding room impulse responses, showed effective generalization ability: it outperformed the same model trained on fixed source locations, improving the PESQ and STOI metrics by 2.8% and 11.5%, and the SDR, SIR, and SAR metrics by 3.1, 3.9, and 2.3 dB, respectively.
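The loss described in the abstract is the standard deep clustering objective [11]: given a unit-norm embedding matrix V and a one-hot ideal-binary-mask matrix Y over all time-frequency bins, it compares the estimated affinity matrix VVᵀ with the target affinity matrix YYᵀ under the Frobenius norm. Below is a minimal sketch of this loss and of the subsequent clustering step; the NumPy/scikit-learn implementation, the embedding dimension, and the use of k-means as the clustering algorithm are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of the deep-clustering objective and mask estimation
# described in the abstract. Shapes, the embedding dimension D, and the
# choice of k-means are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def deep_clustering_loss(V, Y):
    """Frobenius norm between estimated and target affinity matrices.

    V : (TF, D) unit-norm embeddings, one per time-frequency bin.
    Y : (TF, C) one-hot ideal binary mask for C sources.
    Expanding ||V V^T - Y Y^T||_F^2 into three small Gram matrices
    avoids ever forming the TF x TF affinity matrices explicitly.
    """
    return (np.linalg.norm(V.T @ V, "fro") ** 2
            - 2.0 * np.linalg.norm(V.T @ Y, "fro") ** 2
            + np.linalg.norm(Y.T @ Y, "fro") ** 2)

def estimate_masks(V, n_sources):
    """Cluster embeddings to obtain one binary mask per source."""
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(V)
    return np.eye(n_sources)[labels]  # (TF, C) binary masks

# Toy usage: 1000 time-frequency bins, 20-dim embeddings, 2 sources.
rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 20))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(2)[rng.integers(0, 2, size=1000)]
print(deep_clustering_loss(V, Y), estimate_masks(V, 2).shape)
```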

Author Biographies

Dgiakh M. Shahoud

3rd-Year Postgraduate Student of the Automation Systems, Automated Control, and Design Dept., Institute of Space and Information Technologies, Siberian Federal University, Krasnoyarsk, e-mail: ghiathlovealaa@gmail.com

Evgeniy D. Agafonov

Dr.Sci. (Techn.), Professor; Professor of the Automation Systems, Automated Control, and Design Dept., Institute of Space and Information Technologies, Siberian Federal University; Professor of the System Analysis and Operations Research Dept., Siberian State University of Science and Technology named after Academician M.F. Reshetnev, Krasnoyarsk, e-mail: evgeny.agafonov@mail.ru

References

1. Virtanen T. Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria // IEEE Trans. Audio, Speech, and Language Proc. 2007. V. 15(3). Pp. 1066—1074.

2. Yu H., Finke M., Waibel A. Progress in Automatic Meeting Transcription // Proc. Eurospeech. Budapest, 1999.

3. Haeb-Umbach R. e. a. Speech Processing for Digital Home Assistants: Combining Signal Processing with Deep-learning Techniques // IEEE Signal Proc. Magazine. 2019. V. 36(6). Pp. 111—124.

4. Vincent E., Virtanen T. Audio Source Separation and Speech Enhancement. N.-Y.: John Wiley & Sons, 2018.

5. Huang P.S. e. a. Deep Learning for Monaural Speech Separation // Proc. IEEE Intern. Conf. Acoustics, Speech and Signal Proc. 2014. Pp. 1562—1566.

6. Erdogan H. e. a. Phase-sensitive and Recognition-boosted Speech Separation Using Deep Recurrent Neural Networks // Proc. IEEE Intern. Conf. Acoustics, Speech and Signal Proc. 2015. Pp. 708—712.

7. Roweis S. One Microphone Source Separation // Advances in Neural Information Processing Systems. 2000. V. 13. Pp. 1—7.

8. Wang Y., Narayanan A., Wang D. On Training Targets for Supervised Speech Separation // IEEE/ACM Trans. Audio, Speech, and Language Proc. 2014. V. 22(12). Pp. 1849—1858.

9. Issa R.J., Al-Irhaym Y.F. Audio Source Separation Using Supervised Deep Neural Network // J. Phys.: Conf. Series. 2021. V. 1879(2). P. 022077.

10. Chen Z. e. a. Speech Enhancement and Recognition Using Long-short Term Memory Recurrent Neural Network // Proc. Interspeech. Dresden, 2015. Pp. 1—7.

11. Isik Y. e. a. Single-channel Multi-Speaker Separation Using Deep Clustering // Proc. Interspeech. San Francisco, 2016. Pp. 545—549.

12. Shakhod D.M., Ibryaeva O.L. An Acoustic Echo Suppression Method Based on a Recurrent Neural Network and a Clustering Algorithm // Vestnik YuUrGU. Seriya «Vychislitel'naya Matematika i Informatika». 2022. V. 11. No. 2. Pp. 43—58 (in Russian).

13. Nugraha A.A., Liutkus A., Vincent E. Deep Neural Network Based Multichannel Audio Source Separation // Audio Source Separation. Signals and Communication Technol. 2018. Pp. 157—185.

14. Shakhod D.M., Agafonov E.D. Analysis of Approaches and Methods for Localizing Acoustic Sources // Tekhnika i Tekhnologii. 2024. V. 17. No. 3. Pp. 380—398 (in Russian).

15. Shakhod D.M., Agafonov E.D. A Combined Model of Acoustic Source Localization Using Deep Learning Technology // Vestnik Tomskogo Gosudarstvennogo Universiteta. Seriya «Upravlenie, Vychislitel'naya Tekhnika i Informatika». 2024. No. 68. Pp. 100—111 (in Russian).

16. Liu N. e. a. Deep Learning Assisted Sound Source Localization Using Two Orthogonal First-order Differential Microphone Arrays // J. Acoustical Soc. of America. 2021. V. 149(2). Pp. 1069—1084.

17. Ciaburro G., Iannace G. Acoustic Characterization of Rooms Using Reverberation Time Estimation Based on Supervised Learning Algorithm // Appl. Sci. 2021. V. 11(4). P. 1661.

18. Naithani G. e. a. Low-latency Sound Source Separation Using Deep Neural Networks // Proc. IEEE Global Conf. Signal and Information Proc. 2016. Pp. 272—276.

19. Siano D., Viscardi M., Panza M.A. Experimental Acoustic Measurements in Far Field and Near Field Conditions: Characterization of a Beauty Engine Cover // Proc. XII Intern. Conf. Fluid Mechanics and Aerodynamics. 2014. V. 12. Pp. 50—57.

20. RIR-Generator [Electronic resource]. https://github.com/ehabets/RIR-Generator (accessed 05.02.2025).

21. Allen J.B., Berkley D.A. Image Method for Efficiently Simulating Small-room Acoustics // J. Acoustical Soc. of America. 1979. V. 65(4). Pp. 943—950.

22. Shchetinin E.Yu., Sevast'yanov L.A. On Deep Transfer Learning Methods in Biomedical Image Classification Problems // Informatika i ee Primeneniya. 2021. V. 15. No. 4. Pp. 59—64 (in Russian).

23. Lewkowycz A., Gur-Ari G. On the Training Dynamics of Deep Networks with L2 Regularization // Advances in Neural Information Proc. Systems. 2020. V. 33. Pp. 4790—4799.

24. Fu S.W., Liao C.F., Tsao Y. Learning with Learned Loss Function: Speech Enhancement with Quality-net to Improve Perceptual Evaluation of Speech Quality // IEEE Signal Proc. Letters. 2019. V. 27. Pp. 26—30.

25. Vincent E., Gribonval R., Févotte C. Performance Measurement in Blind Audio Source Separation // IEEE Trans. Audio, Speech, and Language Proc. 2006. V. 14(4). Pp. 1462—1469.

26. Wang S., Naithani G., Virtanen T. Low-latency Deep Clustering for Speech Separation // Proc. IEEE Intern. Conf. Acoustics, Speech and Signal Proc. Brighton, 2019. Pp. 76—80.

---

For citation: Shahoud D.M., Agafonov E.D. A Hybrid Model for Acoustic Source Separation Based on Deep Clustering. Bulletin of MPEI. 2026;2:146—155. (in Russian). DOI: 10.24160/1993-6982-2026-2-146-155

---

Conflict of interests: the authors declare no conflict of interest

Published

2026-04-20

Issue

No. 2 (2026)

Section

System Analysis, Management and Information Processing (2.3.1)