AlignAudioTranscript

Transcript alignment takes a transcript of the speech in a media file and, by processing the speech, assigns timestamps to all the words in the transcript. This is useful because it allows an application to provide search results from the transcript and open the media file at the correct position. You can also synchronize manually created subtitle text with the speech in a video file.

Type: asynchronous

Parameter: AudioData
Description: The media file to align the transcript to. Files must be uploaded as multipart/form-data. For more information about sending data to Media Server, refer to the Media Server Administration Guide.
Required: Set this or AudioPath

Parameter: AudioPath
Description: The path of the media file to align the transcript to. The path must be absolute, or relative to the Media Server executable file.
Required: Set this or AudioData

Parameter: IngestDateTime
Description: The start time for the timestamps in the output. For example, set this parameter when you want the timestamps to match the time when the video was broadcast. Specify the date and time in one of the following ways:

  • YYYYMMDDThhmmssZ (ISO 8601 UTC)
  • YYYY-MM-DDThh:mm:ss.sssZ
  • YYYY-MM-DD hh:mm:ss.sss
  • YYYY/MM/DD hh:mm:ss.sss
  • YYYY-MMM-DD hh:mm:ss.sss (where MMM is the three-character month name, for example Jan or Feb)
  • Epoch milliseconds, which are the number of milliseconds that have passed since January 1st 1970, 00:00:00 UTC. For example, 1430214314000 specifies 09:45:14 UTC on 28 April 2015.

Required: No

Parameter: LanguagePack
Description: The speech-to-text language pack to use to process the audio. To obtain a list of language packs that have been installed with your Media Server, use the action ListSpeechLanguagePacks.
Required: Yes

Parameter: Normalize
Description: A Boolean value (default true) that specifies whether to normalize the text. Normalization is not supported for all languages. If it is not supported for your language, normalize the text manually and set this parameter to false.
Required: No

Parameter: SampleFrequency (Deprecated)
Description: The sample frequency at which to process the audio. This parameter is deprecated because Media Server can determine the correct sample frequency from the language pack. The parameter might be removed in a future release.
Required: No

Parameter: TextData
Description: The text file that contains the transcript to align. Text files must be uploaded as multipart/form-data. For more information about sending data to Media Server, refer to the Media Server Administration Guide.
Required: Set this or TextPath

Parameter: TextPath
Description: The path of the text file that contains the transcript to align. The path must be absolute, or relative to the Media Server executable file.
Required: Set this or TextData
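
The IngestDateTime formats listed above can be produced from a single instant. The following is a minimal sketch in Python, not part of Media Server: the function name ingest_datetime_values and the dictionary keys are hypothetical, and it assumes the input datetime is timezone-aware. It renders the same instant in the two ISO 8601 forms and as epoch milliseconds, using the example value from the parameter description.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ingest_datetime_values(moment: datetime) -> dict:
    """Render one timezone-aware instant in three IngestDateTime formats.

    Hypothetical helper for illustration only.
    """
    utc = moment.astimezone(timezone.utc)
    # Integer arithmetic avoids floating-point rounding of large timestamps.
    delta = utc - EPOCH
    epoch_ms = (
        delta.days * 86_400_000
        + delta.seconds * 1000
        + delta.microseconds // 1000
    )
    millis = f"{utc.microsecond // 1000:03d}"
    return {
        "iso_basic": utc.strftime("%Y%m%dT%H%M%SZ"),
        "iso_extended": utc.strftime("%Y-%m-%dT%H:%M:%S.") + millis + "Z",
        "epoch_ms": str(epoch_ms),
    }


# The instant used in the epoch milliseconds example above.
broadcast = datetime(2015, 4, 28, 9, 45, 14, tzinfo=timezone.utc)
values = ingest_datetime_values(broadcast)
print(values["epoch_ms"])  # 1430214314000
```

Any one of the three values could then be passed as the IngestDateTime parameter.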

Example

curl http://localhost:14000/action=AlignAudioTranscript \
                            -F audiodata=@audio.wav \
                            -F textdata=@transcript.txt \
                            -F languagepack=ENUS \
                            -F samplefrequency=16000
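
The curl example uploads both files. When the files are already on a path that Media Server can read, the AudioPath and TextPath parameters can be used instead. The following is a minimal sketch in Python that only builds the request URL. It assumes that parameters can be appended to the action URL as &name=value pairs, which this page does not confirm. The function name build_align_url, the host, the port, and the file paths are hypothetical.

```python
from urllib.parse import quote


def build_align_url(host, port, language_pack, audio_path, text_path,
                    ingest_datetime=None):
    """Build an AlignAudioTranscript URL that uses server-side file paths.

    Assumption: parameters follow the action name as &name=value pairs.
    """
    if not language_pack:
        # LanguagePack is the only parameter marked as always required.
        raise ValueError("languagepack is required")
    if not audio_path or not text_path:
        raise ValueError("this sketch needs both audiopath and textpath")

    params = [
        ("audiopath", audio_path),
        ("textpath", text_path),
        ("languagepack", language_pack),
    ]
    if ingest_datetime is not None:
        params.append(("ingestdatetime", ingest_datetime))

    # Percent-encode every value so that slashes and spaces survive.
    query = "&".join(f"{name}={quote(str(value), safe='')}"
                     for name, value in params)
    return f"http://{host}:{port}/action=AlignAudioTranscript&{query}"


# Hypothetical paths; replace them with files that Media Server can read.
url = build_align_url("localhost", 14000, "ENUS",
                      "/media/news.wav", "/media/news.txt")
print(url)
```

The deprecated SampleFrequency parameter is left out of this sketch because Media Server can determine the sample frequency from the language pack.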

Response

The AlignAudioTranscript action is asynchronous, so Media Server returns a token. You can use the token with the QueueInfo action to obtain the results. A sample response from the QueueInfo action appears below.

The response includes the start time, end time, and duration of each word. OpenText recommends using the timestamp element, which gives each value both in ISO 8601 format (the iso8601 attribute) and in microseconds (the element value, counted from the epoch for the start, peak, and end times). The startTime, duration, and endTime fields outside the timestamp element give times only in milliseconds from the beginning of the file, and are deprecated. In the example below, the first timestamp begins at 1970-01-01 (the beginning of the UNIX epoch). If you want the timestamps to match the broadcast time, set the action parameter IngestDateTime in the AlignAudioTranscript request.

The text field identifies the word corresponding to the timestamp.

<autnresponse>
  <action>QUEUEINFO</action>
  <response>SUCCESS</response>
  <responsedata>
    <actions>
      <action>
        <status>Finished</status>
        <queued_time>2023-Feb-13 11:16:24</queued_time>
        <time_in_queue>3</time_in_queue>
        <process_start_time>2023-Feb-13 11:16:27</process_start_time>
        <time_processing>21</time_processing>
        <process_end_time>2023-Feb-13 11:16:48</process_end_time>
        <output>
          <TranscriptAlignResult>
            <timestamp>
              <startTime iso8601="1970-01-01T00:00:00.000000Z">0</startTime>
              <duration iso8601="PT00H00M00.100000S">100000</duration>
              <peakTime iso8601="1970-01-01T00:00:00.000000Z">0</peakTime>
              <endTime iso8601="1970-01-01T00:00:00.100000Z">100000</endTime>
            </timestamp>
            <text>The</text>
            <startTime>0</startTime>
            <duration>100</duration>
            <endTime>100</endTime>
          </TranscriptAlignResult>
          <TranscriptAlignResult>
            <timestamp>
              <startTime iso8601="1970-01-01T00:00:00.100000Z">100000</startTime>
              <duration iso8601="PT00H00M00.460000S">460000</duration>
              <peakTime iso8601="1970-01-01T00:00:00.100000Z">100000</peakTime>
              <endTime iso8601="1970-01-01T00:00:00.560000Z">560000</endTime>
            </timestamp>
            <text>news</text>
            <startTime>100</startTime>
            <duration>460</duration>
            <endTime>560</endTime>
          </TranscriptAlignResult>
          ...
        </output>
      </action>
    </actions>
  </responsedata>
</autnresponse>
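
A response like the one above can be read with a standard XML parser. The following is a minimal sketch in Python, assuming the element names match the sample exactly and carry no XML namespace prefix. The function name aligned_words and the dictionary keys are hypothetical. It reads only the recommended timestamp element and ignores the deprecated millisecond fields.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample response, reduced to the output element.
SAMPLE = """
<output>
  <TranscriptAlignResult>
    <timestamp>
      <startTime iso8601="1970-01-01T00:00:00.000000Z">0</startTime>
      <duration iso8601="PT00H00M00.100000S">100000</duration>
      <peakTime iso8601="1970-01-01T00:00:00.000000Z">0</peakTime>
      <endTime iso8601="1970-01-01T00:00:00.100000Z">100000</endTime>
    </timestamp>
    <text>The</text>
    <startTime>0</startTime>
    <duration>100</duration>
    <endTime>100</endTime>
  </TranscriptAlignResult>
  <TranscriptAlignResult>
    <timestamp>
      <startTime iso8601="1970-01-01T00:00:00.100000Z">100000</startTime>
      <duration iso8601="PT00H00M00.460000S">460000</duration>
      <peakTime iso8601="1970-01-01T00:00:00.100000Z">100000</peakTime>
      <endTime iso8601="1970-01-01T00:00:00.560000Z">560000</endTime>
    </timestamp>
    <text>news</text>
    <startTime>100</startTime>
    <duration>460</duration>
    <endTime>560</endTime>
  </TranscriptAlignResult>
</output>
"""


def aligned_words(xml_text):
    """Return one dict per aligned word, with times in microseconds."""
    root = ET.fromstring(xml_text)
    words = []
    for result in root.iter("TranscriptAlignResult"):
        # Use the nested timestamp element, not the deprecated siblings.
        stamp = result.find("timestamp")
        start = stamp.find("startTime")
        end = stamp.find("endTime")
        words.append({
            "text": result.findtext("text"),
            "start_us": int(start.text),
            "end_us": int(end.text),
            "start_iso": start.get("iso8601"),
        })
    return words


words = aligned_words(SAMPLE)
for word in words:
    print(word["text"], word["start_us"], word["end_us"])
```

Because root.iter searches the whole tree, the same function should also accept a complete autnresponse document with this structure.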