Source Separation via Spectral Masking for Speech Recognition Systems

Gustavo Fernandes Rodrigues, Thiago de Souza Siqueira, Ana Cláudia Silva de Souza, Hani Camille Yehia

Abstract


In this paper we examine the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech recognition. Speech recognition systems perform poorly in noisy environments or in the presence of competing speech signals. We discuss the limits of these masking techniques at different signal-to-noise ratios and evaluate their robustness against four types of noise: white, pink, brown, and human speech noise (babble noise). The main contribution of this work is an analysis of the performance limits of recognition systems that use spectral masking. Applying ideal binary masks to speech corrupted by competing speech or babble noise increases the speech hit rate by 18% at signal-to-noise ratios of approximately 1, 10, and 20 dB. In contrast, applying the ideal binary masks to mixtures corrupted by white, pink, and brown noise yields an average improvement of 9% in the speech hit rate at the same signal-to-noise ratios. The experimental results suggest that spectral masking techniques are better suited to babble noise, which is produced by human speech, than to white, pink, or brown noise.
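
For concreteness, the following is a minimal sketch of how an ideal binary mask of the kind studied here can be computed and applied in the time-frequency domain. It assumes oracle access to the clean speech and the noise signals separately (which is what makes the mask "ideal" rather than estimated), a 0 dB local SNR criterion, and SciPy's STFT/ISTFT routines; the sampling rate, frame length, and function names are illustrative choices, not parameters taken from the paper.

    # Minimal sketch of ideal binary masking in the time-frequency domain.
    # Assumes the clean speech and the noise are available separately,
    # which is what makes the mask "ideal" (oracle) rather than estimated.
    import numpy as np
    from scipy.signal import stft, istft

    def ideal_binary_mask(speech, noise, fs=16000, nperseg=512, lc_db=0.0):
        # STFTs of the target speech, the noise, and their mixture.
        _, _, S = stft(speech, fs=fs, nperseg=nperseg)
        _, _, N = stft(noise, fs=fs, nperseg=nperseg)
        _, _, X = stft(speech + noise, fs=fs, nperseg=nperseg)

        # Local SNR in each time-frequency cell; epsilon avoids division by zero.
        local_snr_db = 20.0 * np.log10((np.abs(S) + 1e-12) / (np.abs(N) + 1e-12))

        # Binary mask: keep cells where speech dominates the noise
        # by more than the local criterion lc_db (here 0 dB).
        mask = (local_snr_db > lc_db).astype(float)

        # Mask the mixture spectrogram and resynthesize the waveform.
        _, enhanced = istft(mask * X, fs=fs, nperseg=nperseg)
        return enhanced, mask

The masked output can then be fed to the recognizer in place of the noisy mixture; in a practical system the mask would have to be estimated from the mixture alone, so the figures obtained with the ideal mask are an upper bound on the achievable improvement.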


DOI: http://dx.doi.org/10.11601/ijates.v1i2-3.16
