This section takes you through the speech-to-text process for telephony content in IDOL Speech Server, and provides you with basic information on the output of the speech-to-text algorithm, so that you know what to expect from your IDOL Speech Server installation. For more details, refer to the IDOL Speech Server Administration Guide and the IDOL Speech Server Reference.
To obtain the files that you need for this tutorial, go to the SoftSound Server area on the HPE Big Data Customer Support Site at https://customers.autonomy.com, click ZIP in the Format section, and then download IDOLSpeechtoTextTelephonyTutorial.zip.
The .zip file contains the following example audio files, and the corresponding speech-to-text output files:
voicecall1.mp3
voicecall1_out.ctm
voicecall2.mp3
voicecall2_out.ctm
Unzip the contents of the .zip file to the Speech Server data directory.
To use this tutorial, you need the following components:
ENUK-tel-6.4 language pack. You can download the language pack from the HPE Big Data Customer Support Site at https://customers.autonomy.com.
Note: You can perform the tutorial with a later language pack, although you may find slight differences between your output and the output files provided. In addition, results might vary depending on the operating system that you are using.
For more details on how to install and run IDOL Speech Server, refer to the IDOL Speech Server Administration Guide.
You must configure administrative access for your local machine by specifying the Access-Control-Allow-Origin parameter in the IDOL configuration file. For example:
Access-Control-Allow-Origin=http://localhost:15000
For more information, refer to the IDOL Server Reference.
After you install IDOL Speech Server, type http://localhost:15000/action=Admin in a Web browser to start IDOL Admin and check that IDOL Speech Server is running. You can also use the IDOL Admin interface to check that the language pack is installed and configured correctly, and to load the language pack.
To perform speech-to-text, type the following in your Web browser:
http://localhost:15000/action=AddTask&Type=WavToText&File=voicecall1.mp3&Out=voicecall1.ctm&lang=ENUK-tel
By default, Speech Server writes the output to the temp directory.
This action starts a WavToText task, which processes a media file and produces a speech-to-text transcript. The response should look something like this:
<autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">
  <action>ADDTASK</action>
  <response>SUCCESS</response>
  <responsedata>
    <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
  </responsedata>
</autnresponse>
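If you prefer to script this step rather than type the URL into a browser, the following is a minimal Python sketch. It assumes the same host and port as this tutorial (localhost:15000) and a response shaped like the sample above. The function names build_addtask_url and extract_token are illustrative only and are not part of IDOL Speech Server. The sketch only builds the URL and parses a response string; sending the request (for example, with urllib.request.urlopen) is left to you.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: Speech Server listens on the host and port used in this tutorial.
BASE_URL = "http://localhost:15000"


def build_addtask_url(media_file, out_file, lang=None, base_url=BASE_URL):
    """Build a WavToText AddTask URL in the same form as the tutorial example."""
    params = {"Type": "WavToText", "File": media_file, "Out": out_file}
    if lang:
        params["lang"] = lang
    # urlencode keeps the insertion order and escapes any special characters.
    return f"{base_url}/action=AddTask&{urlencode(params)}"


def extract_token(response_xml):
    """Return the text of the first <token> element, or None if there is none."""
    root = ET.fromstring(response_xml)
    token = root.find(".//token")
    return token.text if token is not None else None


# Sample response copied from this tutorial.
SAMPLE_RESPONSE = """
<autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">
  <action>ADDTASK</action>
  <response>SUCCESS</response>
  <responsedata>
    <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
  </responsedata>
</autnresponse>
"""

if __name__ == "__main__":
    print(build_addtask_url("voicecall1.mp3", "voicecall1.ctm", lang="ENUK-tel"))
    print(extract_token(SAMPLE_RESPONSE))
```

Keeping URL construction separate from the HTTP call makes it easy to check the generated URL against the one shown in this tutorial before you send anything to the server.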
You can use the task token in the <token> tag to check the status of the task. In your Web browser, type:
http://localhost:15000/action=GetStatus&Token=token
where token is the value of the <token> tag. For example:
http://localhost:15000/action=GetStatus&Token=MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=
You should see output similar to the following when the task is complete:
<tasks>
  <maxTasks>1</maxTasks>
  <task>
    <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
    <category>STANDARD</category>
    <type>wavToText</type>
    <params>file: c:/tutorialfiles/voicecall1.mp3, out: c:/tutorialfiles/voicecall1.ctm, overlap: 5, splitsize: 120, taskdata: none, taskmanagers: 1, type: wavToText</params>
    <nwarnings>0</nwarnings>
    <nsubtasks>-1</nsubtasks>
    <status>FINISHED</status>
    <lang>9429d571242252fe</lang>
    <secondsRun>0</secondsRun>
    <processingStart>0</processingStart>
    <processingEnd>0</processingEnd>
  </task>
</tasks>
The <status> tag indicates that the task is finished.
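The status check can also be scripted. The sketch below is a hedged Python example that reads the <status> value for each task in a GetStatus response and polls until a task reports FINISHED. It assumes a response shaped like the sample above. FINISHED is the only status value shown in this tutorial, so the code treats every other value as "not finished yet". The names task_statuses, wait_until_finished, and fetch_status_xml are hypothetical; fetch_status_xml stands for any function that you supply to return the GetStatus response body for a token.

```python
import time
import xml.etree.ElementTree as ET


def task_statuses(status_xml):
    """Map each task token to the text of its <status> element."""
    root = ET.fromstring(status_xml)
    # root.iter() also finds <task> elements when <tasks> is wrapped
    # inside a larger response document.
    return {
        task.findtext("token"): task.findtext("status")
        for task in root.iter("task")
    }


def wait_until_finished(fetch_status_xml, token, attempts=10, delay_seconds=1.0):
    """Poll until the task reports FINISHED.

    fetch_status_xml is a hypothetical callable supplied by the caller; it
    takes a token and returns the GetStatus response body as a string.
    Returns True if the task finished, False if the attempts ran out.
    """
    for _ in range(attempts):
        if task_statuses(fetch_status_xml(token)).get(token) == "FINISHED":
            return True
        time.sleep(delay_seconds)
    return False
```

Passing the fetch function in as an argument keeps the parsing logic independent of how you contact the server, so you can try it against a saved response before pointing it at a live installation.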
The output file voicecall1.ctm should be identical to the voicecall1_out.ctm file provided in IDOLSpeechtoTextTelephonyTutorial.zip.
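One way to check this is to compare the two files line by line. The following Python sketch uses the standard difflib module; the function name ctm_differences is illustrative, and the file labels in the diff header are taken from this tutorial. An empty result means that the files match. With a later language pack or a different operating system, a few differing lines are to be expected.

```python
import difflib


def ctm_differences(produced_lines, reference_lines):
    """Return unified-diff lines between two .ctm files; empty if identical."""
    return list(
        difflib.unified_diff(
            reference_lines,
            produced_lines,
            fromfile="voicecall1_out.ctm",  # reference file from the .zip
            tofile="voicecall1.ctm",        # file written by your task
            lineterm="",
        )
    )
```

To use it, read each file into a list of lines, for example with pathlib.Path(name).read_text().splitlines(), and pass your output first and the reference second.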
This .ctm file is a time-ordered list of the words recognized, in the following format:
1 A 0.000 1.230 <SIL> 0.000
1 A 1.230 0.360 <SIL> 0.000
1 A 1.590 0.740 innovation 0.000
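Each entry in the file consists of the following information, in order:
An identifier (in this example, 1).
A channel label (in this example, A).
The start time of the word.
The duration of the word.
The recognized word. Silence is labeled as <SIL> (this includes certain non-speech sounds such as lip smacking). Sentence boundaries are labeled as <s>.
A final numeric value (0.000 in this example).
The word output, written as a paragraph of words, for a speech-to-text task on the voicecall1.mp3 sample file should look something like this: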
the Conservative leader will pledge legislation within a hundred days of assuming office to ensure rates do not write in the next parliament but Labour said the case was a gimmick and expected the ATC right party leader Ed Miliband will say later that I plan to reduce the welfare budget would mean cuts to tax credits <s> elsewhere in the election campaign Liberal Democrat leader Nick Clegg will focus on his party's pledge to offer free school meals to all primary schoolchildren in England Ukip leader Nigel Farage is to break off a campaign to give a speech in the European Parliament attacking what he says is a common EU migration policy decision not to renew UK Trident nuclear submarines could threaten the survival of our nation twenty ex military officials say in a letter to The Times newspaper <s> both the Conservatives and Labour will focus on economic issues is just over a week to go until polling day and with opinion polls suggesting neither has been able to open up a decisive lead <s>
The text is fairly readable, which shows good recognition capabilities. If you listen to the file, you can hear why: there is one speaker, who speaks clearly and with a typical British accent, and there is no background noise or music.
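To turn a .ctm file into a paragraph of words like the one above, you can parse each entry and join the word column. The following Python sketch assumes six whitespace-separated fields per entry, as in the sample .ctm lines in this tutorial. The field names (identifier, channel, start, duration, word, score) are labels chosen for this example, not names defined by IDOL Speech Server. The sketch drops <SIL> entries and keeps <s> sentence boundaries, to match the paragraph shown above.

```python
from typing import NamedTuple


class CtmEntry(NamedTuple):
    # Field names are assumptions made for this sketch.
    identifier: str
    channel: str
    start: float
    duration: float
    word: str
    score: float


def parse_ctm(lines):
    """Parse .ctm lines into CtmEntry records, skipping malformed lines."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # blank line or unexpected layout
        ident, channel, start, duration, word, score = fields
        entries.append(
            CtmEntry(ident, channel, float(start), float(duration), word, float(score))
        )
    return entries


def words_as_paragraph(entries):
    """Join the recognized words, leaving out silence markers."""
    return " ".join(e.word for e in entries if e.word != "<SIL>")


if __name__ == "__main__":
    sample = [
        "1 A 0.000 1.230 <SIL> 0.000",
        "1 A 1.230 0.360 <SIL> 0.000",
        "1 A 1.590 0.740 innovation 0.000",
    ]
    print(words_as_paragraph(parse_ctm(sample)))
```

Because each record keeps its start time and duration, the same parsed entries can also be used to locate a word in the audio.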
Type the following in your Web browser to run a speech-to-text task on the second sample file:
http://localhost:15000/action=AddTask&Type=WavToText&File=path/voicecall2.mp3&Out=path/voicecall2_out.ctm
where path is the location on your machine of the audio file and its corresponding output file.
The word output, written as a paragraph of words, for a speech-to-text task on the voicecall2.mp3 sample file should look something like this:
oh hi there yet so I got some chairs a new website a And then I have messaged to say it is not in store <s> um so yeah I can't have them <s> we like to have something else <s> erm nothing probably probably basing stays just cancel the order in the air and yes that to my name is Helen Smith erm thing you can insert as it was number seven six five four three two one am nothing to address as sixty four separate Cheshire erm yeh sorry if you can just cancel that and there and you notice the difference until it was us re order something alright Okay thanks Barry <s>
In this case, a number of factors contribute to reduced overall word accuracy.
You can optimize the results of speech-to-text tasks in several ways. For example, if the content is about a particular subject, you can produce a custom language model for that subject to guide the recognition to words from that particular subject domain. You can also filter music and noise, and perform speech-to-text only on those parts of the file that are classified as speech. If you have content that might contain a number of languages or accents, you can use language identification to determine the most likely language or accent variant. For further information on language identification, refer to the IDOL Speech Server Administration Guide.
In particular, when processing voice calls from a call center, HP recommends that you create a custom language model using verbatim transcriptions of real voice calls (typically 20+ hours of calls). The more example text data that you can provide, the better the resulting recognition should be. For more details on how to create and tune custom language models, see the IDOL Speech-to-Text Language Modelling Tutorial.
IDOL Speech Server provides several other speech and audio processing capabilities, including audio classification, language identification, speaker segmentation and identification, audio fingerprinting, and phrase matching. IDOL Speech Server supports over 30 languages, with a number of those having telephony language packs, which enable speech-to-text over voice call content. Refer to the IDOL Speech Server Administration Guide for more details.