A mixture of recurrent neural networks for speaker normalization

Trentin, Edmondo; Giuliani, Diego
2001-01-01

Abstract

In spite of recent advances in automatic speech recognition, the performance of state-of-the-art speech recognisers fluctuates depending on the speaker. Speaker normalisation aims to reduce the differences between the acoustic space of a new speaker and the training acoustic space of a given speech recogniser, thereby improving recognition performance. Normalisation relies on an acoustic feature transformation that must be estimated from a small amount of speech signal. This paper introduces a mixture of recurrent neural networks as an effective regression technique for the problem. A suitable Viterbi-based time alignment procedure is proposed for generating the adaptation set. The mixture is compared with linear regression and single-model connectionist approaches. Speaker-dependent and speaker-independent continuous speech recognition experiments with a large vocabulary, using Hidden Markov Models, are presented. Results show that the mixture improves recognition performance, yielding a 21% relative reduction of the word error rate, comparable with that obtained with model-adaptation approaches.
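To illustrate the idea described in the abstract, the following is a minimal sketch of a mixture of recurrent networks used as a regression model for speaker normalisation: several recurrent experts map the new speaker's acoustic feature frames towards the recogniser's training acoustic space, and a gating network combines their outputs frame by frame, trained on pairs produced by a time-alignment procedure. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the PyTorch framework, the softmax gating, and parameters such as feat_dim, hidden_dim and num_experts are assumptions.

```python
# Hypothetical sketch: mixture of recurrent networks as a frame-wise
# regression for speaker normalisation (not the original system).
import torch
import torch.nn as nn

class MixtureOfRNNs(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=32, num_experts=4):
        super().__init__()
        # Expert recurrent regressors: each maps input frames to normalised frames.
        self.experts = nn.ModuleList(
            [nn.RNN(feat_dim, hidden_dim, batch_first=True) for _ in range(num_experts)]
        )
        self.readouts = nn.ModuleList(
            [nn.Linear(hidden_dim, feat_dim) for _ in range(num_experts)]
        )
        # Gating network: per-frame mixing weights over the experts (assumed softmax gating).
        self.gate = nn.Sequential(nn.Linear(feat_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, x):                      # x: (batch, time, feat_dim)
        outs = []
        for rnn, lin in zip(self.experts, self.readouts):
            h, _ = rnn(x)                      # (batch, time, hidden_dim)
            outs.append(lin(h))                # (batch, time, feat_dim)
        outs = torch.stack(outs, dim=-1)       # (batch, time, feat_dim, K)
        w = self.gate(x).unsqueeze(2)          # (batch, time, 1, K)
        return (outs * w).sum(dim=-1)          # gate-weighted sum of expert outputs

# Toy adaptation step on (new-speaker frame, aligned target frame) pairs,
# standing in for an adaptation set obtained by Viterbi-based time alignment.
if __name__ == "__main__":
    model = MixtureOfRNNs()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(8, 50, 13)                 # new speaker's feature sequences
    y = torch.randn(8, 50, 13)                 # aligned training-space target frames
    for _ in range(5):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
```

In this sketch the transformation is estimated by minimising the squared error between transformed new-speaker frames and their aligned counterparts in the training acoustic space, which is the regression formulation the abstract refers to.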
2001
Trentin, E., Giuliani, D. (2001). A mixture of recurrent neural networks for speaker normalization. NEURAL COMPUTING & APPLICATIONS, 10(2), 120-135 [10.1007/s005210170004].
Files in this item:
06-TrentinGiuliani.pdf — Post-print, Adobe PDF, 185.45 kB
Licence: PUBBLICO - Pubblico con Copyright (public, with copyright)
Access: not available (copy on request)


Use this identifier to cite or link to this document: https://hdl.handle.net/11365/11372