If a transcript is available for an audio recording, you can use the TranscriptAlign task to determine the time location of each word in the transcript. Use this task to align subtitles to audio or video files.
The transcript does not need to match the speech data exactly. The transcript aligner can tolerate a small number of errors in the transcript, as well as adverse audio conditions such as background noise and music.
The transcript aligner can also place metadata tags in the transcript to allow you to easily identify sections. These metadata tags do not affect the alignment process. For more information, see Metadata Tag Syntax.
The alignment process works by using speech-to-text to identify words, sounds, or characters from the transcript within the audio, and assigning them a time location.
The accuracy of the speech-to-text process affects the accuracy of the final alignment. For best results, run speech-to-text using a custom language model built from the transcript text. The custom language model models the words in the transcript text, which makes speech-to-text much more likely to recognize them.
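The general idea of matching transcript words against timed speech-to-text output can be illustrated with a short Python sketch. This is not the TranscriptAlign implementation. It assumes a hypothetical speech-to-text result given as (word, start, end) tuples, and it uses the standard library's difflib.SequenceMatcher as a stand-in matching technique. The function name align_transcript is also an assumption made for illustration.

```python
from difflib import SequenceMatcher


def align_transcript(transcript_words, recognized):
    """Assign times to transcript words by matching them to recognized words.

    transcript_words: list of words from the transcript.
    recognized: list of (word, start_seconds, end_seconds) tuples from a
        hypothetical speech-to-text pass (an assumed format).
    Returns a list of (transcript_word, (start, end) or None).
    """
    ref_words = [w.lower() for w in transcript_words]
    rec_words = [w.lower() for w, _, _ in recognized]

    # Find runs of words that appear in the same order in both sequences.
    matcher = SequenceMatcher(a=ref_words, b=rec_words, autojunk=False)

    times = [None] * len(transcript_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = recognized[block.b + k]
            times[block.a + k] = (start, end)

    # Words with no match keep None; a real aligner would estimate a time.
    return list(zip(transcript_words, times))


transcript = ["the", "quick", "brown", "fox"]
# "brown" was misrecognized as "brow" in this made-up recognizer output.
recognized = [
    ("the", 0.0, 0.2),
    ("quick", 0.2, 0.5),
    ("brow", 0.5, 0.8),
    ("fox", 0.8, 1.1),
]
aligned = align_transcript(transcript, recognized)
```

In this sketch a misrecognized word does not prevent the surrounding words from receiving times, which mirrors how alignment can tolerate a small number of mismatches between the transcript and the audio.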
The following diagram describes the transcript alignment workflow.
The transcript alignment process includes the following steps:
Normalize the transcript so that you can identify numbers written in numeric form, and so on. See Normalize the Transcript.
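As a rough illustration of what normalizing numbers in a transcript can involve, the following Python sketch rewrites digit sequences as spoken words and strips punctuation. It is a simplified assumption, not the product's normalization procedure. The function names, the 0 to 999 range limit, and the American-style wording (no "and") are all choices made for this example.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999 (illustrative range only)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def normalize_transcript(text):
    """Return lowercase word tokens with digit runs written out as words."""
    def replace(match):
        value = int(match.group())
        # Leave numbers outside the supported range unchanged.
        return number_to_words(value) if value < 1000 else match.group()

    text = re.sub(r"\d+", replace, text)
    # Replace punctuation (except apostrophes) with spaces, then tokenize.
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


tokens = normalize_transcript("Flight 42 departs at 9, gate 105.")
```

This sketch treats every digit run as a whole number, so decimals, dates, and times would need additional handling.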