AlignAudioTranscript

Transcript alignment takes a transcript of the speech in a media file and, by processing the speech, assigns timestamps to all the words in the transcript. This is useful because it allows an application to provide search results from the transcript and open the media file at the correct position. You can also synchronize manually created subtitle text with the speech in a video file.
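As a rough illustration of the search use case, the sketch below looks up a phrase in a list of aligned words and returns the time at which playback could start. The `AlignedWord` tuple layout and the `find_phrase_start` helper are hypothetical names chosen for this example, and the word timings are illustrative values, not real output.

```python
from typing import List, Optional, Tuple

# Hypothetical layout for one aligned word: (text, start_seconds, end_seconds).
AlignedWord = Tuple[str, float, float]


def find_phrase_start(words: List[AlignedWord], phrase: str) -> Optional[float]:
    """Return the start time of the first occurrence of phrase, or None."""
    target = phrase.lower().split()
    if not target:
        return None
    tokens = [text.lower() for text, _, _ in words]
    # Slide a window of len(target) words over the transcript.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i][1]
    return None


# Illustrative timings only.
aligned = [
    ("The", 0.00, 0.59),
    ("news", 0.59, 1.50),
    ("at", 1.50, 1.72),
    ("ten", 1.72, 2.10),
]
print(find_phrase_start(aligned, "news at"))
```

An application could pass the returned value to its media player as the seek position. The match here is a simple case-insensitive word comparison; punctuation handling and fuzzy matching are left out.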

Type: asynchronous

Parameter | Description | Required
AudioData | The media file to align the transcript to. Files must be uploaded as multipart/form-data. For more information about sending data to Media Server, refer to the Media Server Administration Guide. | Set this or AudioPath
AudioPath | The path of the media file to align the transcript to. The path must be absolute, or relative to the Media Server executable file. | Set this or AudioData
LanguagePack | The speech-to-text language pack to use to process the audio. To obtain a list of language packs that have been installed with your Media Server, use the action ListSpeechLanguagePacks. | Yes
Normalize | A Boolean value (default true) that specifies whether to normalize the text. This is not supported for all languages. If normalization is not supported, normalize the text manually and set this parameter to false. | No
SampleFrequency | The sample frequency at which to process the audio (default 16000). | No
TextData | The text file that contains the transcript to align. Text files must be uploaded as multipart/form-data. For more information about sending data to Media Server, refer to the Media Server Administration Guide. | Set this or TextPath
TextPath | The path of the text file that contains the transcript to align. The path must be absolute, or relative to the Media Server executable file. | Set this or TextData
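The parameter rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of Media Server: `check_align_params` is a hypothetical helper, and it assumes that exactly one parameter of each data/path pair should be set and that parameter names can be compared without regard to case.

```python
from typing import Dict, List

# Pairs where the documentation says "Set this or ...".
PAIRS = (("AudioData", "AudioPath"), ("TextData", "TextPath"))


def check_align_params(params: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a set of request parameters."""
    # Only parameters with a non-empty value count as set.
    names = {name.lower() for name, value in params.items() if value}
    problems = []
    if "languagepack" not in names:
        problems.append("LanguagePack is required")
    for data, path in PAIRS:
        # Assumption: setting both members of a pair is treated as an error.
        count = (data.lower() in names) + (path.lower() in names)
        if count == 0:
            problems.append(f"set either {data} or {path}")
        elif count == 2:
            problems.append(f"set only one of {data} and {path}")
    return problems


print(check_align_params({
    "audiodata": "@audio.wav",
    "textdata": "@transcript.txt",
    "languagepack": "ENUS",
}))
```

An empty list means that the parameters satisfy the rules as read here. The check does not confirm that the files exist or that the language pack is installed.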

Example

curl http://localhost:14000 -F action=AlignAudioTranscript \
                            -F audiodata=@audio.wav \
                            -F textdata=@transcript.txt \
                            -F languagepack=ENUS \
                            -F samplefrequency=16000

Response

The AlignAudioTranscript action is asynchronous, so Media Server returns a token. You can use the token with the QueueInfo action to obtain the results. A sample response from the QueueInfo action appears below.

For each word, the response gives the start time, end time, and duration, all in seconds.

<autnresponse>
  <action>QUEUEINFO</action>
  <response>SUCCESS</response>
  <responsedata>
    <actions>
      <action>
        <status>Finished</status>
        <queued_time>2018-Aug-22 08:27:49</queued_time>
        <time_in_queue>0</time_in_queue>
        <process_start_time>2018-Aug-22 08:27:49</process_start_time>
        <time_processing>51</time_processing>
        <process_end_time>2018-Aug-22 08:28:40</process_end_time>
        <output>
          <TranscriptAlignResult>
            <duration>0.59</duration>
            <endTime>0.59</endTime>
            <startTime>0.00</startTime>
            <text>The</text>
          </TranscriptAlignResult>
          <TranscriptAlignResult>
            <duration>0.91</duration>
            <endTime>1.50</endTime>
            <startTime>0.59</startTime>
            <text>news</text>
          </TranscriptAlignResult>
          ...
        </output>
      </action>
    </actions>
  </responsedata>
</autnresponse>
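A response of this shape can be turned into a list of word timings with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and assumes that the element names match the sample above exactly, with no namespace prefix; a real response may differ. `AlignedWord` and `parse_aligned_words` are hypothetical names for this example, and the embedded XML is a shortened copy of the sample.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class AlignedWord(NamedTuple):
    text: str
    start: float     # seconds
    end: float       # seconds
    duration: float  # seconds


def parse_aligned_words(xml_text: str) -> List[AlignedWord]:
    """Collect every TranscriptAlignResult element in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for node in root.iter("TranscriptAlignResult"):
        words.append(AlignedWord(
            text=node.findtext("text", default=""),
            start=float(node.findtext("startTime", default="0")),
            end=float(node.findtext("endTime", default="0")),
            duration=float(node.findtext("duration", default="0")),
        ))
    return words


# Shortened copy of the sample response.
SAMPLE = """<autnresponse>
  <action>QUEUEINFO</action>
  <response>SUCCESS</response>
  <responsedata><actions><action>
    <status>Finished</status>
    <output>
      <TranscriptAlignResult>
        <duration>0.59</duration><endTime>0.59</endTime>
        <startTime>0.00</startTime><text>The</text>
      </TranscriptAlignResult>
      <TranscriptAlignResult>
        <duration>0.91</duration><endTime>1.50</endTime>
        <startTime>0.59</startTime><text>news</text>
      </TranscriptAlignResult>
    </output>
  </action></actions></responsedata>
</autnresponse>"""

for word in parse_aligned_words(SAMPLE):
    print(f"{word.start:6.2f} {word.end:6.2f} {word.text}")
```

The parser does not check the `status` element, so a caller would probably want to confirm that the action has finished before reading the output.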