Control Speech-to-Text Speed

The speech-to-text analysis engine in IDOL Media Server provides options for configuring the speed of analysis.

In some cases, you might want to prioritize accuracy over processing speed (for example, if you are batch-processing a large library of audio files and accuracy is your only goal). In other cases, you might want to prioritize processing speed. When you process a live stream, you must configure speech-to-text so that it keeps up with the stream.

The following sections describe how to configure the speed of speech-to-text.

Process Live Streams

To process a live stream such as a television broadcast, you must set the parameter IngestRate to 1 (in the [Session] section of your session configuration file), and the parameter SpeedBias to Live (in the speech-to-text task). In this mode, Media Server maximizes accuracy while ensuring that speech-to-text processing keeps up with the live stream. There is always some audio data waiting to be processed, but not so much that the processing falls behind.

Example configuration

[Session]
IngestRate=1
Engine0=Ingest
Engine1=SpeechToText
...

[Ingest]
Type=Video

[SpeechToText]
Type=SpeechToText
LanguagePack=ENUK
SampleFrequency=16000
SpeedBias=Live
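To start a session that uses a configuration like this, you send a process action to Media Server. The Python sketch below only builds the request URL; the host, the port (14000 is shown for illustration), the stream source, and the configuration name are assumptions, so substitute the values for your own deployment and check the parameter names against your Media Server reference.

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually send the request

def build_process_url(host, port, source, config_name):
    """Build a process-action URL (parameter names assumed from the ACI
    convention; verify against your Media Server documentation)."""
    params = urlencode({"Source": source, "ConfigName": config_name})
    return f"http://{host}:{port}/action=Process&{params}"

# Hypothetical live-stream source and configuration name:
url = build_process_url("localhost", 14000, "udp://239.255.1.1:1234", "speech-live")
print(url)
# response = urlopen(url).read()  # would start the session
```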

Process Files

When you process a file, OpenText recommends that you set IngestRate=0 (in the [Session] section of your session configuration file). This allows Media Server to ingest the file as quickly or as slowly as your analysis tasks require.

In your speech-to-text task, use the SpeedBias parameter to prioritize either accuracy or processing speed. The parameter accepts an integer value from 1 to 6. To prioritize accuracy, set SpeedBias=1. For a balanced approach, set SpeedBias=3. To prioritize speed, set SpeedBias=6.

Example configuration

[Session]
IngestRate=0
Engine0=Ingest
Engine1=SpeechToText
...

[Ingest]
Type=Video

[SpeechToText]
Type=SpeechToText
LanguagePack=ENUK
SampleFrequency=16000
SpeedBias=2

Improve Speed for a Single File

To reduce the time it takes to process a single long audio file, you can share the processing across multiple CPU cores, rather than processing the entire file sequentially on a single core. For example, a one-hour audio file that would typically take one hour to transcribe can be split across four cores and processed in approximately fifteen minutes.
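As a back-of-the-envelope check on those figures, the processing time divides roughly by the number of cores. A minimal sketch (ignoring the overhead of splitting and merging, which in practice makes the saving somewhat smaller):

```python
def estimated_minutes(audio_minutes, num_parallel):
    """Rough lower bound on processing time when speech-to-text runs at
    about real-time speed and the file is split across num_parallel cores.
    Ignores per-chunk overhead, so real timings will be slightly longer."""
    return audio_minutes / num_parallel

# A one-hour file on four cores: roughly fifteen minutes.
print(estimated_minutes(60, 4))  # → 15.0
```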

To do this, set the configuration parameter NumParallel in your speech-to-text task. For example:

[Session]
IngestRate=0
Engine0=Ingest
Engine1=SpeechToText
...

[Ingest]
Type=Video

[SpeechToText]
Type=SpeechToText
LanguagePack=ENUK
SampleFrequency=16000
SpeedBias=2
NumParallel=4

NOTE: This approach provides no benefit when you process streams, when you process files shorter than 60 seconds, or when you set IngestRate=1.

TIP: If you are processing large numbers of files and want to increase throughput, it is more efficient to process many files in parallel than to use multiple CPU cores for each individual file. OpenText recommends that you increase the number of concurrent process actions (by setting the parameter MaximumThreads) instead of setting NumParallel. Set NumParallel only when you want to reduce the processing time for an individual file.
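For example, to transcribe a whole library you could submit one process action per file and let the server's thread limit govern concurrency, instead of setting NumParallel. The sketch below only builds the request URLs; the host, port, parameter names, and configuration name are illustrative assumptions, and the actual concurrency is controlled on the server by MaximumThreads.

```python
from urllib.parse import urlencode

def build_batch_urls(sources, host="localhost", port=14000,
                     config_name="speech-batch"):
    """Build one process-action URL per source file (parameter names
    assumed from the ACI convention; verify against your Media Server
    documentation)."""
    return [
        f"http://{host}:{port}/action=Process&"
        + urlencode({"Source": src, "ConfigName": config_name})
        for src in sources
    ]

# Hypothetical batch of audio files:
urls = build_batch_urls(["media/interview1.wav", "media/interview2.wav"])
for url in urls:
    print(url)
```

Each URL starts an independent process action, so the files are transcribed concurrently up to the server's configured thread limit.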