Troubleshooting

Some common issues can affect the adaptation process and impair the performance of an adapted acoustic model.

Non-verbatim transcripts

If the transcripts do not match the audio word for word, mismatches occur between the audio frames and the models being adapted. Mismatches can produce a poor-quality adaptation.

Music, noise, or stuttering in the audio

As with non-verbatim transcripts, an excessive amount of noise or music in the audio can lead to mismatches during adaptation. Stuttering and poorly articulated speech can also cause issues, especially because these might not be represented in the transcripts.
Model over-trained to the adaptation data

If the quantity of the adaptation data is low, there might not be enough example data to reliably reestimate the model parameters. If you set the MinEgs parameter to a low value, the model might still be updated, but based on very little data. This is even more of an issue if you set the Relevance parameter to a low value. The resulting model might perform very well when you run speech-to-text against the adaptation data (because the model is very well fitted to that data); however, the performance on any other audio is likely to be poor.

For best results, run the AmTrain task in rapid adaptation mode when using small amounts of adaptation data. This mode is designed to work with minutes of data, rather than hours.
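As a sketch of this advice, a task configuration might combine rapid adaptation mode with conservative update thresholds. The MinEgs and Relevance parameter names come from this guide; the section name, the rapid adaptation setting, and all values shown are illustrative assumptions, so check the configuration reference for the exact names and defaults for your version.

```ini
[AmTrainTask]
; Illustrative AmTrain task settings for small amounts of adaptation data.
; With only minutes of audio, run in rapid adaptation mode
; (setting name assumed; see the configuration reference).
RapidAdapt = True
; Do not lower MinEgs or Relevance: keeping them at (or above) their
; defaults prevents model parameters from being re-estimated from
; very little data and over-fitting to the adaptation set.
MinEgs = 100
Relevance = 500
```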

Poor initial alignments

If the initial timestamps are inaccurate, the adaptation process might not be able to recover the difference. This leads to a mismatch between the audio examples and the models, and therefore a poor adaptation.

HPE recommends that you check the timestamps produced by the TranscriptAlign task when you prepare the transcriptions.

Poor alignments during adaptation

Even given accurate initial alignments, the adaptation system can make mistakes when aligning audio frames to models. For more information about how to view the alignments calculated during the adaptation process, see Acoustic Adaptation Diagnostics.

Poor word alignment at the adaptation stage can be caused by audio quality issues (music, noise, poor articulation, and so on), issues with the input transcript files (errors in either the transcribed word sequence or the time positions), or search failures in the alignment algorithm. After you check the audio and transcription data, consider running the adaptation process with a higher beam. Relaxing or tightening the time constraints might also help, depending on how accurate the input timestamps are (if accurate, tighten; if inaccurate, relax).
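To illustrate the two adjustments above, the configuration changes might look like the following. The section name [amadaptadddata] is taken from this guide, but both parameter names and values here are assumptions for illustration only; look up the real beam and time-constraint settings in the configuration reference before use.

```ini
[amadaptadddata]
; Widen the alignment search if it fails on checked, clean data
; (parameter name and value are illustrative).
AdaptBeam = 500
; Tighten the permitted deviation from the input timestamps when they
; are accurate; relax it when they are not (name and value illustrative).
TimeTolerance = 0.5
```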

Finally, if the files are long, consider manually splitting the audio and transcript files (you can use the timestamps produced by the TranscriptAlign task as a guide).

Silence model pollution

During adaptation, the silence model is likely to pull in all non-speech audio data, such as noise or music. This might not be a problem, because during speech-to-text it is the silence model that should be matching over such periods. However, speech frames might also be pulled into the silence model, especially if the transcript is missing some spoken words.

With this in mind, unless you suspect the silence model in the original acoustic model to be weak, HPE recommends that you disable silence model adaptation. Disabling it prevents the new model from suffering from silence model pollution. Silence model adaptation is disabled by default, and is controlled by the AdaptSil parameter in the [amadaptadddata] module configuration.
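The AdaptSil parameter and the [amadaptadddata] section name are taken from this guide; as a sketch, leaving silence model adaptation at its default looks like this (setting it explicitly is optional, since disabled is the default):

```ini
[amadaptadddata]
; Keep silence model adaptation disabled (the default) to avoid
; silence model pollution. Enable it only if you suspect the silence
; model in the original acoustic model is weak.
AdaptSil = False
```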
