Ensure that the text you use to build a custom language model is reasonably clean before you start.
Japanese, Korean, Mandarin, and Taiwanese Mandarin text requires segmentation before HPE IDOL Speech Server can process it. Text segmentation inserts whitespace between words. The LanguageModelBuild task segments text if you set the DoSegment parameter to True (see Build the Language Model).
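To illustrate what inserting whitespace between words means, the following is a minimal sketch of a forward maximum-matching segmenter. It is not the segmentation method that HPE IDOL Speech Server uses; the LEXICON word list and the segment function are hypothetical names introduced only for this example.

```python
# Toy forward maximum-matching segmenter.
# LEXICON is a hypothetical word list for illustration only; it is not the
# dictionary or algorithm used by HPE IDOL Speech Server.
LEXICON = {"我", "喜欢", "语音", "识别", "语音识别"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text: str) -> str:
    """Return text with a single space inserted between the words found."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no lexicon entry matches at this position.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return " ".join(words)


print(segment("我喜欢语音识别"))
```

In practice, set DoSegment to True and let the LanguageModelBuild task perform segmentation; the sketch only shows the kind of output that segmentation produces.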
You must normalize the text used for language model building before processing so that word representations are standardized. For example, ‘1’ and ‘one’ are two different representations of the same digit, and normalization standardizes them to a single form. For more information about the normalization scheme used in HPE IDOL Speech Server, see Audio Transcript Requirements.
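The effect of normalization can be sketched as follows. This is an illustration under stated assumptions, not the HPE IDOL Speech Server normalization scheme: the DIGIT_WORDS table, the normalize function, and the rules it applies (lowercasing, spelling out each ASCII digit separately, dropping punctuation) are all hypothetical simplifications.

```python
import re

# Hypothetical mapping for illustration; the real scheme is described in
# Audio Transcript Requirements and may differ.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize(text: str) -> str:
    """Map different written forms of a word to one standard form."""
    text = text.lower()
    # Spell out each ASCII digit individually (a simplification: "42"
    # becomes "four two", not "forty two").
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # Replace anything other than letters, apostrophes, and spaces.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


print(normalize("Press 1, then ONE."))
```

After this step both ‘1’ and ‘ONE’ appear as the same token, which is the property that language model building relies on.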
Many of the HPE IDOL Speech Server text operations require you to normalize input text as an initial step. For details of the normalization procedure, see Run Text Normalization.