Speech-to-Text

The process flow for speech-to-text from the audio data point of view is:

  1. The audio signal stream is broken into short overlapping windows.
  2. The overlapping windows are analyzed to produce a vector time series of short-term feature vectors.
  3. The feature vectors are then range-normalized to eliminate or reduce systematic variations.
  4. The speech-to-text engine matches the feature vectors against the language pack contents to produce a series of word outputs.
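The first two steps can be sketched as follows (an illustrative Python sketch, not HPE IDOL Speech Server code; the 25 ms window and 10 ms step at 16 kHz are typical values, and toy_features is a stand-in for real acoustic feature extraction):

```python
# Sketch of steps 1-2: split a mono signal into short overlapping
# windows, then reduce each window to a short feature vector.
# Window/step sizes (25 ms / 10 ms at 16 kHz) are typical defaults,
# not necessarily what the frontend module uses.

def frame_signal(samples, window=400, step=160):
    """Break a list of samples into overlapping windows."""
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += step
    return frames

def toy_features(frame):
    """Stand-in for real acoustic features: mean and energy."""
    n = len(frame)
    mean = sum(frame) / n
    energy = sum(x * x for x in frame) / n
    return [mean, energy]

signal = [float(i % 50) for i in range(1600)]  # 0.1 s of fake 16 kHz audio
frames = frame_signal(signal)
features = [toy_features(f) for f in frames]
print(len(frames), len(features[0]))  # prints: 8 2
```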

HPE IDOL Speech Server contains a separate module for each processing step. The following module sequence shows how you can implement speech-to-text from the processing modules available in HPE IDOL Speech Server:

  1. The wav module reads the audio file and prepares windowed data. Its output, a, is the audio window series.
  2. The frontend module takes each window of samples and converts it to a feature vector. Its output, f, is the feature vector series.
  3. The normalizer module adjusts the feature vectors to produce normalized feature vectors. Its output, nf, is the normalized feature vector series.
  4. The stt (speech-to-text) module converts the feature time series into a sequence of recognized words. Its output, w, is the time-marked word series.
  5. The wout module prepares the output words for storage and result reporting.

You must create this processing sequence in the HPE IDOL Speech Server tasks configuration file to represent a single action. In this case, the sequence results in the following configuration section.

[MySpeechToText]
0 = a ← wav (MONO, input)
1 = f ← frontend (_, a)
2 = nf ← normalizer (_, f)
3 = w ← stt (_, nf)
4 = output ← wout (_, w)

The notation follows a functional programming style.

Line 0 specifies that the wav module operates in MONO mode, receiving an input file and producing a as the output data stream (a represents mono audio data; see Name a Data Stream Instance for data stream types). Line 1 specifies that the frontend module operates in default mode (indicated by _), receiving a as the input data stream and producing f as the output data stream.
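In general, each line in a task schema has the form shown below, where n is the line number, MODE selects the module mode (_ selects the default), and the arrow assigns the module output to a named data stream (the names here are placeholders):

n = outputStream ← module (MODE, inputStream, ...)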

You must also configure each of the processing modules in the sequence. The following example configures the wav module.

[wav]
SampleFrequency = 16000
WavFile = C:\Audio\Speech.wav

This example configures the wav module to process audio sampled at 16 kHz, and to read the audio data from the file C:\Audio\Speech.wav.

The following settings configure the frontend module to operate at 16 kHz.

[frontend]
SampleFrequency = 16000

The following settings configure the normalizer module to use a parameter file that is usually available with the language pack. For more information about the IanFile parameter, see the HPE IDOL Speech Server Reference.

[normalizer]
IanFile = $Stt.Lang.NormFile

The following settings configure the stt module to use the UK English language pack, turn on run-time diagnostics, and set the running mode to fixed, with a mode value of 4. You must also ensure the language pack section for ENUK is configured (see Language Configuration).

[stt]
Lang = ENUK
Diag = True
DiagFile = diag.log
Mode = fixed
ModeValue = 4 

Finally, the wout module is configured to write the results in CTM format to the output.ctm file.

[wout]
Format = ctm
Output = output.ctm 
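CTM is a simple whitespace-delimited format with one time-marked word per line. The following is a hedged sketch of a reader for such output, assuming the standard CTM column order (source, channel, start time, duration, word); the exact lines HPE IDOL Speech Server emits may carry additional fields such as a confidence score:

```python
# Minimal reader for CTM-style output: one time-marked word per line,
# whitespace-separated. Assumed column order (standard CTM):
#   source channel start duration word [extra fields ignored]
def parse_ctm(text):
    words = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        source, channel, start, duration, word = parts[:5]
        words.append({
            "source": source,
            "channel": channel,
            "start": float(start),
            "duration": float(duration),
            "word": word,
        })
    return words

sample = """Speech 1 0.39 0.29 hello
Speech 1 0.68 0.41 world
"""
for w in parse_ctm(sample):
    print(w["start"], w["word"])
```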

Language Configuration

You need to set up the language pack section only once; in some cases the installer sets it up for you. The entries in this section change only when a new language pack is installed.

The default configuration is:

[ENUK]
PackDir = U:\lang\ENUK
Pack = ENUK-5.0
SampleFrequency = 16000
DnnFile = $params.DnnFile

By default, HPE IDOL Speech Server picks up the value of the DnnFile parameter from Pack and PackDir, as it does for other parameters. Alternatively, you can specify another DnnFile to use, either at the command line or in the task configuration file. For example, in fixed mode you might want to use the *-fast.DNN file included in each language pack. In live or relative mode, where processing speed is critical, this faster version is generally necessary; in those modes it is selected automatically and does not need to be explicitly specified.
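For example, to point a language pack at an alternative network directly, you could set DnnFile in the language pack section. The file name below is hypothetical; use the actual *-fast.DNN file name shipped with your language pack:

[ENUK]
PackDir = U:\lang\ENUK
Pack = ENUK-5.0
SampleFrequency = 16000
DnnFile = ENUK-5.0-fast.DNN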

To use GMM acoustic modeling (as used in the 10.7 and earlier versions of HPE IDOL Speech Server) instead of DNN files, set DnnFile to None.

For information on how to configure the language pack section, see Configure Language Packs.

NOTE:

HPE recommends (and for 7.0+ versions of language packs, it is compulsory) that you include the following lines in the configuration file for the [frontend] and [normalizer] modules, so that HPE IDOL Speech Server can access the header to determine the quantity and nature of the extracted acoustic feature vectors:

DnnFile = $stt.lang.DnnFile
DnnFileStd = $stt.lang.DnnFileStd

For more information, see the HPE IDOL Speech Server Reference.

The complete configuration file section for the speech-to-text function is shown below. You must declare all schemas and language packs above this section in the tasks configuration file.

[TaskTypes]
0 = MySpeechToText
[Resources]
0 = ENUK
[MySpeechToText]
0 = a ← wav (MONO, input)
1 = f ← frontend (_, a)
2 = nf ← normalizer (_, f)
3 = w ← stt (_, nf)
4 = output ← wout (_, w)
[wav]
SampleFrequency = 16000
WavFile = Speech.wav
[frontend]
SampleFrequency = 16000
DnnFile = $stt.lang.DnnFile
DnnFileStd = $stt.lang.DnnFileStd
[normalizer]
IanFile = $stt.Lang.NormFile
DnnFile = $stt.lang.DnnFile
DnnFileStd = $stt.lang.DnnFileStd
[stt]
Lang = ENUK
Diag = True
DiagFile = diag.log
Mode = fixed
ModeValue = 4
[wout]
Format = ctm 
Output = output.ctm
[ENUK]
PackDir = U:\lang\ENUK
Pack = ENUK-5.0
SampleFrequency = 16000

The action command that runs this speech-to-text task is:

http://SpeechServerhost:ACIport/action=AddTask&Type=MySpeechToText
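From a client, the action is a plain HTTP GET. The following Python sketch assembles such a URL; the host name and port are placeholders (use your configured ACI port), and the commented request line is a hedged illustration — the server replies with an ACI XML response:

```python
def build_action_url(host, port, action, **params):
    """Assemble an ACI action URL in the form shown above."""
    extra = "".join("&%s=%s" % (k, v) for k, v in params.items())
    return "http://%s:%s/action=%s%s" % (host, port, action, extra)

# Example host and port; substitute your own Speech Server and ACI port.
url = build_action_url("SpeechServerhost", 4000, "AddTask", Type="MySpeechToText")
print(url)  # prints: http://SpeechServerhost:4000/action=AddTask&Type=MySpeechToText

# To submit the task, send the GET request, for example:
#   from urllib.request import urlopen
#   response = urlopen(url).read()  # ACI XML response
```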
