Prepare Text for Training
You might have to clean up your sample text before you can use it for training.
To prepare sample text
- Remove anything that does not occur in spoken language, such as HTML tags and tables.
- Ensure that sentence breaks (periods) are present in appropriate places.
- Ensure that there are no duplicated sections in the text.
- Ensure the text is encoded in UTF-8.
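The cleanup steps above can be partly automated. The following is a minimal sketch in Python, not part of Media Server. The function name clean_sample_text is hypothetical, and it assumes that duplicated sections appear as verbatim repeated paragraphs separated by blank lines. It strips HTML tags and decodes entities, but it does not remove the cell contents of tables or insert missing sentence breaks, so review the output by hand.

```python
import html
import re


def clean_sample_text(raw: str) -> str:
    """Strip HTML remnants and drop verbatim duplicate paragraphs (illustrative helper)."""
    # Replace tags with a space, then decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)

    seen = set()
    kept = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text):
        # Collapse runs of whitespace so that duplicates compare equal.
        para = " ".join(para.split())
        if para and para not in seen:
            seen.add(para)
            kept.append(para)
    return "\n\n".join(kept)
```

To satisfy the encoding requirement, write the result with an explicit encoding, for example open("sample.txt", "w", encoding="utf-8"), where sample.txt is a placeholder file name.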
Text must also be normalized. Media Server automatically normalizes the text for most languages. You need to normalize text yourself only when automatic normalization is not supported for your language, or when you set normalize=false while training the custom language model.
To normalize text
- Change digits to words. For example, replace "2" with "two". Replace "37" with "thirty seven" (not "three seven"). Replace the year "1997" with "nineteen ninety seven".
- Write individually pronounced character sequences as spaced characters; for example, “the word rules is spelled R U L E S”.
- Write pronounced punctuation as it sounds; for example, “sales at Micro Focus dot com”.
- Replace the period (.) at each sentence break with <s>, and remove all other punctuation.
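As an illustration of the mechanical parts of manual normalization, here is a rough sketch in Python. It is not how Media Server normalizes text internally, and the function names are hypothetical. It makes several assumptions: only English is handled, one- and two-digit numbers are spelled out, four-digit numbers whose halves are both 10 or greater are read as years in pairs, and any other number is left unchanged for manual review. Apostrophes and hyphens are kept, which is also an assumption. Spelled-out character sequences and pronounced punctuation (the second and third items above) depend on context and still need to be handled by hand.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell a number from 0 to 99 in words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def spell_number(match: re.Match) -> str:
    """Replace a run of digits with words where the reading is unambiguous."""
    digits = match.group(0)
    n = int(digits)
    if len(digits) <= 2:
        return two_digits(n)
    # Assumption: a four-digit number such as 1997 is a year, read in pairs.
    if len(digits) == 4 and n // 100 >= 10 and n % 100 >= 10:
        return f"{two_digits(n // 100)} {two_digits(n % 100)}"
    return digits  # left unchanged for manual review


def normalize(text: str) -> str:
    """Apply digit spelling, sentence-break markers, and punctuation removal."""
    text = re.sub(r"\d+", spell_number, text)
    # A period followed by whitespace or the end of the text is a sentence break.
    text = re.sub(r"\.(?=\s|$)", " <s> ", text)
    # Remove the remaining punctuation; apostrophes and hyphens are kept.
    text = re.sub(r"[.,;:!?\"“”()\[\]]", " ", text)
    return " ".join(text.split())
```

For example, normalize("In 1997 we sold 37 units, not 2.") returns "In nineteen ninety seven we sold thirty seven units not two <s>".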