SpeechToText

Runs speech-to-text on the file(s) associated with an IDOL document FlowFile, and adds the text to the IDOL document.

For information about the audio and video file formats that are supported, refer to the Media Server Administration Guide.

Properties

Name Default Value Description
IDOL License Service   An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.
Media Service   A MediaServiceImpl that manages media analysis resources.
Language Pack English (UK) The language pack to use for speech-to-text. NOTE: Language packs can contain hundreds of megabytes of data, so they are not included in the installation and must be downloaded separately. Extract language packs to the "Speech Language Pack Directory" specified in your Media Service.
Telephony False Specifies whether the audio is telephony audio with an 8 kHz sample rate.
Shared Custom Language Model   The identifier of an optional custom language model to use. Set this property to use a custom language model stored in the external database specified by the Media Service (see the "Media Service" property).
Shared Custom Word Database   The name of an optional custom word database to use. Set this property to use a custom word database stored in the external database specified by the Media Service (see the "Media Service" property).
Custom Language Model File   An optional custom language model to use. Specify the path of a file generated by the Media Server action ExportCustomSpeechLanguageModel.
Custom Word Database File   An optional custom word database to use. Specify the path of a file generated by the Media Server action ExportCustomSpeechWordDatabase.
Custom Language Model Weighting   The interpolation weight to use for the custom language model. Set this property only if you want to override the recommended weight, as returned by the Media Server action ListCustomSpeechLanguageModels.
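
When deciding whether to set the Telephony property, it can help to check the sample rate of the source audio first. The following is a minimal sketch that assumes the audio is an uncompressed WAV file readable by Python's standard wave module. The function name looks_like_telephony is hypothetical and is not part of the processor. An 8 kHz sample rate is only a hint: it does not prove that a recording came from a telephone line.

```python
import sys
import wave

# Sample rate that the Telephony property description refers to.
TELEPHONY_SAMPLE_RATE_HZ = 8000


def looks_like_telephony(wav_path):
    """Return True if the WAV file at wav_path has an 8 kHz sample rate.

    Hypothetical helper: it only inspects the WAV header and does not
    analyze the audio content.
    """
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate() == TELEPHONY_SAMPLE_RATE_HZ


if __name__ == "__main__":
    # Usage: python check_rate.py first.wav second.wav
    for path in sys.argv[1:]:
        print(path, looks_like_telephony(path))
```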

Relationships

Name Description
success Processing was successful.
failure Processing failed.

Example Output

The processor adds the transcribed speech to the document content. It also adds metadata to the document, as shown in the following example.

<idol_media>
  <speechtotext>
    <word duration="0.039" start="0">
      <text>&lt;SIL&gt;</text>
    </word>
    <word duration="0.22" start="0.039">
      <text>now</text>
    </word>
    <word duration="0.14" start="0.259">
      <text>the</text>
    </word>
    <word duration="0.38" start="0.399">
      <text>latest</text>
    </word>
    <word duration="0.32" start="0.779">
      <text>news</text>
    </word>
    ...
  </speechtotext>
</idol_media>

The metadata contains a word element for each word. The start attribute gives the time at which the word begins and the duration attribute gives how long it lasts, both in seconds.
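
The start and duration attributes can be combined to give an end time for each word. The following is a minimal sketch that assumes the metadata is already available as an XML string; how you obtain it from the document depends on your flow and is not shown. The function name word_timings is hypothetical. Decimal is used instead of float so that sums such as 0.039 + 0.22 stay exact.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# First three word elements from the example output above.
SAMPLE = """<idol_media>
  <speechtotext>
    <word duration="0.039" start="0">
      <text>&lt;SIL&gt;</text>
    </word>
    <word duration="0.22" start="0.039">
      <text>now</text>
    </word>
    <word duration="0.14" start="0.259">
      <text>the</text>
    </word>
  </speechtotext>
</idol_media>"""


def word_timings(xml_text):
    """Return a (text, start, end) tuple for each word element, in order."""
    root = ET.fromstring(xml_text)
    timings = []
    for word in root.iter("word"):
        start = Decimal(word.get("start"))
        # End time is derived; it is not stored in the metadata.
        end = start + Decimal(word.get("duration"))
        timings.append((word.findtext("text"), start, end))
    return timings


for text, start, end in word_timings(SAMPLE):
    print(f"{start}s to {end}s: {text}")
```

In this sample, each word ends at the moment the next one starts, so the derived end times line up with the following start attributes.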

The text element contains the word that was recognized. It can also contain <SIL> or <s>, both of which indicate a period of audio without speech, such as silence or background noise. <SIL> indicates silence that probably has no linguistic role. <s> is more likely to mark the end of a chain of words, for example when a speaker begins a new sentence.
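
One way to use the two markers is to drop <SIL> entries and treat <s> as a boundary between chains of words. The sketch below takes that approach; it is an illustration, not the processor's own behavior. The function name transcript_segments is hypothetical, and the sample is a hand-written fragment: the <s> entry and the words after it are invented, and the start and duration attributes are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; not real processor output.
SAMPLE = """<idol_media>
  <speechtotext>
    <word><text>&lt;SIL&gt;</text></word>
    <word><text>now</text></word>
    <word><text>the</text></word>
    <word><text>latest</text></word>
    <word><text>news</text></word>
    <word><text>&lt;s&gt;</text></word>
    <word><text>good</text></word>
    <word><text>evening</text></word>
  </speechtotext>
</idol_media>"""


def transcript_segments(xml_text):
    """Return a list of word chains, split at <s> and ignoring <SIL>."""
    root = ET.fromstring(xml_text)
    segments = []
    current = []
    for word in root.iter("word"):
        text = (word.findtext("text") or "").strip()
        if text == "<s>":
            # Treat <s> as the end of the current chain of words.
            if current:
                segments.append(" ".join(current))
                current = []
        elif text and text != "<SIL>":
            current.append(text)
    # Keep any words that follow the last <s> marker.
    if current:
        segments.append(" ".join(current))
    return segments


for segment in transcript_segments(SAMPLE):
    print(segment)
```

Because <s> is only described as more likely to end a chain of words, the resulting segments are approximate and should not be relied on as exact sentence boundaries.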