Trentin, E., Giuliani, D. (2001). A mixture of recurrent neural networks for speaker normalization. Neural Computing & Applications, 10(2), 120-135 [10.1007/s005210170004].
A mixture of recurrent neural networks for speaker normalization
Trentin E.; Giuliani D.
2001-01-01
Abstract
In spite of recent advances in automatic speech recognition, the performance of state-of-the-art speech recognisers fluctuates depending on the speaker. Speaker normalisation aims to reduce the differences between the acoustic space of a new speaker and the training acoustic space of a given speech recogniser, thereby improving performance. Normalisation is based on an acoustic feature transformation, to be estimated from a small amount of speech signal. This paper introduces a mixture of recurrent neural networks as an effective regression technique to approach the problem. A suitable Viterbi-based time alignment procedure is proposed for generating the adaptation set. The mixture is compared with linear regression and single-model connectionist approaches. Speaker-dependent and speaker-independent continuous speech recognition experiments with a large vocabulary, using Hidden Markov Models, are presented. Results show that the mixture improves recognition performance, yielding a 21% relative reduction of the word error rate, comparable with that obtained with model-adaptation approaches.
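The abstract describes the normalisation step as a regression problem: a mixture of recurrent networks learns a transformation from a new speaker's acoustic feature frames towards the recogniser's training acoustic space, using frame pairs obtained via Viterbi-based time alignment. The sketch below is only an illustration of that general idea, not the authors' implementation: it assumes PyTorch, a frame-level soft gating network, and dummy adaptation data standing in for the aligned frame pairs; all names, dimensions, and hyperparameters are hypothetical.

```python
# Illustrative sketch (assumed PyTorch): a gated mixture of small recurrent regressors
# mapping a new speaker's acoustic frames towards a reference (training) acoustic space.
# Not the paper's implementation; all names and settings are hypothetical.
import torch
import torch.nn as nn


class RecurrentExpert(nn.Module):
    """One expert: a small RNN followed by a linear readout to the target feature space."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.rnn = nn.RNN(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(x)          # (batch, time, hidden_dim)
        return self.out(h)          # (batch, time, feat_dim)


class MixtureOfRNNs(nn.Module):
    """Mixture of recurrent experts combined by a frame-level gating network."""

    def __init__(self, feat_dim: int, hidden_dim: int = 32, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            RecurrentExpert(feat_dim, hidden_dim) for _ in range(n_experts)
        )
        self.gate = nn.Linear(feat_dim, n_experts)  # soft mixture weights per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)             # (batch, time, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], -1)   # (batch, time, feat_dim, n_experts)
        return (outputs * weights.unsqueeze(2)).sum(-1)           # weighted sum over experts


# Toy adaptation loop: in the paper the (new_speaker, reference) frame pairs come from a
# Viterbi-based time alignment of the adaptation utterances; random tensors stand in here.
feat_dim = 13                                        # e.g. MFCC-like frame vectors (assumed)
model = MixtureOfRNNs(feat_dim)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
new_speaker_feats = torch.randn(8, 100, feat_dim)    # dummy adaptation batch
reference_feats = torch.randn(8, 100, feat_dim)      # dummy aligned target frames

for _ in range(10):
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(new_speaker_feats), reference_feats)
    loss.backward()
    optimiser.step()
```

At recognition time, a transformation trained this way would be applied to each incoming feature frame before decoding with the unchanged HMM recogniser, which is what distinguishes feature-space normalisation from model-adaptation approaches.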
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| 06-TrentinGiuliani.pdf | Post-print | Public, with copyright | 185.45 kB | Adobe PDF | Not available (request a copy) |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11365/11372