Speech To Text Requirements

This topic provides guidance on processing time and system requirements for speech-to-text, so that you can choose which model to use.

TIP: Performance figures vary depending on system specifications. The figures given below are based on one example system and are intended only as a guide. OpenText recommends running performance tests on your own system(s).

Accuracy

The new speech-to-text models introduced in IDOL 23.x (even the micro model) provide much better out-of-the-box accuracy than the legacy models.

The new models perform extremely well on English speech with good audio quality. Performance on English speech is so good that the medium model only provides a small accuracy improvement over the small model. We observe a greater increase in accuracy between the small and medium models when processing non-English speech.

Audio quality is one of the largest factors affecting speech-to-text accuracy. For clear English speech with high-quality audio, our testing shows that an F1 score of more than 98% is achievable. When we processed a clip with extremely challenging audio quality, the F1 score was much lower: 26% for the legacy speech models, 52% for the micro model, 60% for the small model, and 62% for the medium model. If you have control over the input, ensuring the best possible audio quality is the easiest way to improve speech-to-text performance.
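The F1 score combines word-level precision and recall. As a rough illustration of how such a score is calculated (a simplified sketch using a multiset word comparison; the exact alignment and scoring method used in our tests may differ):

```python
from collections import Counter

def word_f1(reference: str, hypothesis: str) -> float:
    """Word-level F1 between a reference transcript and a hypothesis.

    Simplified illustration: counts matched words via multiset
    intersection rather than a full time-aligned comparison.
    """
    ref_words = Counter(reference.lower().split())
    hyp_words = Counter(hypothesis.lower().split())
    matched = sum((ref_words & hyp_words).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp_words.values())
    recall = matched / sum(ref_words.values())
    return 2 * precision * recall / (precision + recall)

# One substituted word out of four lowers both precision and recall.
print(round(word_f1("the quick brown fox", "the quick brown box"), 2))  # 0.75
```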

Processing Time

You might need to choose a model based on processing time, especially if you need to process a live stream.

The table below shows that when processing on a CPU, one hour of speech can be processed in approximately 0.8 hours using the micro model but takes 2.2 hours using the small model. Therefore only the micro model can be used to process live video on this system. Processing with a GPU is much faster, and even the large multilingual model can keep up with a live stream using the GPU.
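The ratios in the table can be read as real-time factors: a value below 1.0 means the model transcribes audio faster than it arrives, so it can keep up with a live stream. A minimal sketch of that check, using the figures from the example system:

```python
# Approximate processing time as a proportion of source duration,
# from the example system in this topic; below 1.0 means live-capable.
rtf = {
    "micro":  {"cpu": 0.8,  "gpu": 0.13},
    "small":  {"cpu": 2.2,  "gpu": 0.18},
    "medium": {"cpu": 6.3,  "gpu": 0.32},
    "large":  {"cpu": 13.1, "gpu": 0.43},
}

def live_capable(model: str, device: str) -> bool:
    """True if the model processes audio faster than real time."""
    return rtf[model][device] < 1.0

print(live_capable("micro", "cpu"))   # True
print(live_capable("small", "cpu"))   # False
print(live_capable("large", "gpu"))   # True
```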

Model                        Approximate processing time
                             (as a proportion of the source duration)
                             CPU       GPU
Micro                        0.8       0.13
Small                        2.2       0.18
Medium                       6.3       0.32
Large (multilingual only)    13.1      0.43

CPU Usage

CPU usage is not significantly affected by the model that you choose. As an approximate guide, you should expect speech-to-text to fully load a single CPU core (even when processing with a GPU).

Memory Usage

Speech-to-text models can consume a large amount of memory. When processing on a CPU, you only need to consider system memory (RAM). When processing on a GPU you must consider both the system memory and the GPU memory.

There are several factors that affect memory use:

  • The speech-to-text model that you use (configured by setting the ModelVersion parameter).
  • The value of the SpeedBias parameter.

TIP: The new speech-to-text models that were introduced in IDOL 23.x use transformer-style neural networks. For these models, SpeedBias controls the number of decoders used to produce the text transcript. Lower SpeedBias values prioritize accuracy, so the number of decoders in use increases as SpeedBias decreases. Using more decoders gives the speech-to-text algorithm more candidate words to choose from, which might increase accuracy in some cases.

    When using any of the new models, OpenText recommends setting SpeedBias to either 6 or Live. Setting this parameter to other values can significantly increase memory use while having negligible impact on accuracy. Even if you have memory to spare and want to obtain the maximum possible accuracy, SpeedBias should be the last setting that you change.
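For example, a speech-to-text analysis task that follows this recommendation might look like the following sketch. Only the ModelVersion and SpeedBias parameter names come from this topic; the section name, the Type value, and the ModelVersion value shown here are illustrative assumptions, so check your Media Server reference documentation for the exact names and values.

```ini
[TranscribeSpeech]
Type=SpeechToText
ModelVersion=Small
SpeedBias=6
```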

The following table provides example memory use figures with SpeedBias=6, when processing on the CPU.

Model (CPU)                  Approximate Media Server memory use (GB)
                             System RAM (mean)    GPU memory (peak)
Micro                        3.8                  n/a
Small                        11.3                 n/a
Medium                       26.5                 n/a
Large (multilingual only)    44.7                 n/a

The following table provides example memory use figures with SpeedBias=6, when processing with a GPU.

Model (GPU)                  Approximate Media Server memory use (GB)
                             System RAM (mean)    GPU memory (peak)
Micro                        3.7                  3.1
Small                        7.0                  8.8
Medium                       15.0                 18.4
Large (multilingual only)    27.7                 34.0

Multiple Concurrent Actions

The figures presented above are for running a single speech-to-text task. Running multiple process actions concurrently requires more memory (running two actions requires almost twice as much). For example, if you want to use the micro model and run four concurrent process actions, you should expect to fully load four CPU cores and consume almost 16 GB of system RAM.

To run multiple process actions concurrently on a GPU, ensure the total amount of memory required does not exceed the amount available. Media Server can automatically switch between tasks, moving data in and out of GPU memory as required, but this significantly reduces performance. For example, if you have a GPU with 48 GB of memory, and want to process multiple streams concurrently, you could run at most five concurrent actions using the small model, or at most two concurrent actions using the medium model.
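The 48 GB example can be checked against the peak GPU memory figures from the table above. A minimal sketch of the calculation (real headroom also depends on anything else using the GPU):

```python
# Approximate peak GPU memory per speech-to-text action (GB),
# from the example figures in this topic (SpeedBias=6).
peak_gpu_gb = {"micro": 3.1, "small": 8.8, "medium": 18.4, "large": 34.0}

def max_concurrent_actions(model: str, gpu_memory_gb: float) -> int:
    """Largest number of concurrent actions whose combined peak GPU
    memory fits in the available GPU memory (no task switching)."""
    return int(gpu_memory_gb // peak_gpu_gb[model])

print(max_concurrent_actions("small", 48))   # 5
print(max_concurrent_actions("medium", 48))  # 2
```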