In spite of recent advances in automatic speech recognition, the performance of state-of-the-art speech recognisers fluctuates depending on the speaker. Speaker normalisation aims to reduce the differences between the acoustic space of a new speaker and the training acoustic space of a given speech recogniser, thereby improving performance. Normalisation is based on an acoustic feature transformation, to be estimated from a small amount of speech signal. This paper introduces a mixture of recurrent neural networks as an effective regression technique to approach the problem. A suitable Viterbi-based time alignment procedure is proposed for generating the adaptation set. The mixture is compared with linear regression and single-model connectionist approaches. Speaker-dependent and speaker-independent continuous speech recognition experiments with a large vocabulary, using Hidden Markov Models, are presented. Results show that the mixture improves recognition performance, yielding a 21% relative reduction of the word error rate, i.e. comparable with that obtained with model-adaptation approaches.
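As a rough illustration of the idea summarised above, the following is a minimal, hypothetical sketch of a mixture of recurrent regressors that maps a new speaker's acoustic feature vectors towards the recogniser's training acoustic space. The layer sizes, the frame-level gating scheme, and the single-layer Elman-style recurrence are illustrative assumptions, not the authors' exact architecture or training procedure (which relies on a Viterbi-aligned adaptation set).

```python
# Hedged sketch: mixture of recurrent regressors for per-frame feature normalisation.
# All dimensions and the gating network are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 13      # e.g. MFCC frame dimension (assumption)
HIDDEN = 32        # recurrent state size per expert (assumption)
N_EXPERTS = 3      # number of recurrent networks in the mixture (assumption)


class RecurrentExpert:
    """One Elman-style recurrent regressor: acoustic frame in, normalised frame out."""

    def __init__(self):
        self.W_in = rng.normal(0, 0.1, (HIDDEN, FEAT_DIM))
        self.W_rec = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
        self.W_out = rng.normal(0, 0.1, (FEAT_DIM, HIDDEN))
        self.h = np.zeros(HIDDEN)

    def step(self, x):
        # Update the recurrent state, then emit a regressed (normalised) frame.
        self.h = np.tanh(self.W_in @ x + self.W_rec @ self.h)
        return self.W_out @ self.h


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


class MixtureNormalizer:
    """Convex combination of the experts' outputs, weighted by a frame-level gate."""

    def __init__(self):
        self.experts = [RecurrentExpert() for _ in range(N_EXPERTS)]
        self.W_gate = rng.normal(0, 0.1, (N_EXPERTS, FEAT_DIM))

    def normalize_frame(self, x):
        gate = softmax(self.W_gate @ x)                    # mixing weights for this frame
        outs = np.stack([e.step(x) for e in self.experts])  # each expert's regressed frame
        return gate @ outs                                 # weighted sum of expert outputs


# Usage: transform a short utterance (T frames) from a new speaker.
utterance = rng.normal(size=(50, FEAT_DIM))
normalizer = MixtureNormalizer()
normalized = np.array([normalizer.normalize_frame(x) for x in utterance])
print(normalized.shape)  # (50, 13)
```

In practice the parameters of such a mixture would be trained as a regression on adaptation pairs (new-speaker frame, reference-space frame) obtained from a time alignment; the sketch only shows the forward transformation.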
|Title:||A mixture of recurrent neural networks for speaker normalization|
|Journal:||NEURAL COMPUTING & APPLICATIONS|
|Citation:||Trentin, E., & Giuliani, D. (2001). A mixture of recurrent neural networks for speaker normalization. NEURAL COMPUTING & APPLICATIONS, 10(2), 120-135.|
|Type:||1.1 Journal article|