Introduction
Transcript alignment takes a transcript of the speech in a media file and, by processing the speech, assigns timestamps to all the words in the transcript. This is useful because it allows an application to provide search results from the transcript and open the media file at the correct position. You can also synchronize manually created subtitle text with the speech in a video file.
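To illustrate how an application might use aligned words to jump to the right position in a media file, the following is a minimal sketch. It does not show Media Server's output format: the AlignedWord class, the find_phrase function, and the sample timestamps are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One transcript word with illustrative timestamps (hypothetical structure)."""
    word: str
    start: float  # seconds from the start of the media file
    end: float    # seconds from the start of the media file


def find_phrase(words, phrase):
    """Return the start time of every occurrence of `phrase` in the aligned words."""
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [w.word.lower() for w in words]
    hits = []
    # Slide a window of the phrase length across the transcript tokens.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i].start)
    return hits


# Made-up sample data, not real alignment output.
aligned = [
    AlignedWord("good", 0.50, 0.72),
    AlignedWord("evening", 0.72, 1.10),
    AlignedWord("and", 1.10, 1.21),
    AlignedWord("welcome", 1.21, 1.65),
]

# A player could seek to each returned time to open the media at the match.
positions = find_phrase(aligned, "and welcome")
```

An application could pass each returned start time to its media player as the seek position for a search result.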
The transcript does not need to be an exact match for the speech. Media Server can tolerate small numbers of errors in the transcript, and some background noise and music in the audio. However, transcript alignment is intended to be used when you already have a transcript that contains the words that are spoken (such as the script for a television broadcast). If you do not have a transcript, you can create one manually; alternatively, you might prefer to run speech-to-text instead. There is no advantage in running transcript alignment on a transcript that was produced by speech-to-text (unless you find and correct any errors in the speech-to-text output).
The transcript must be normalized, but Media Server automatically normalizes transcripts in many languages. You only need to normalize the transcript yourself when automatic normalization is not supported for the language you are using. If you need to normalize the transcript yourself, use the same procedure as described for custom language models (see Prepare Text for Training). Then, run the AlignAudioTranscript action again but send the normalized text and set normalize=false.
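As a rough sketch of the last step, the following builds a query string for the AlignAudioTranscript action with normalize=false. Only the action name and the normalize setting come from the text above. The build_align_query helper is hypothetical, and this section does not name the parameters that identify the media file or carry the transcript, so the caller must supply them (check the action reference for the real names).

```python
from urllib.parse import urlencode


def build_align_query(other_params):
    """Build a query string that requests alignment without automatic normalization.

    `other_params` holds whatever parameters the action needs for the media
    file and the pre-normalized transcript. Their names are not given in
    this section, so they are left to the caller.
    """
    query = {"action": "AlignAudioTranscript"}
    query.update(other_params)
    # The transcript was normalized manually, so turn off automatic normalization.
    query["normalize"] = "false"
    # urlencode percent-encodes values, so transcript text with spaces is safe to send.
    return urlencode(query)


# "exampleParam" is a placeholder name, not a real Media Server parameter.
example_query = build_align_query({"exampleParam": "hello world"})
```

Forcing normalize=false inside the helper means a caller cannot accidentally request automatic normalization for text that has already been normalized.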