IDOL Telephony Speech-to-Text Tutorial

This section takes you through the speech-to-text process for telephony content in IDOL Speech Server, and provides you with basic information on the output of the speech-to-text algorithm, so that you know what to expect from your IDOL Speech Server installation. For more details, refer to the IDOL Speech Server Administration Guide and the IDOL Speech Server Reference.

Supporting Files

To obtain the files that you need for this tutorial, go to the SoftSound Server area on the HPE Big Data Customer Support Site at https://customers.autonomy.com, click ZIP in the Format section, and then download IDOLSpeechtoTextTelephonyTutorial.zip.

The .zip file contains the following example audio files, and the corresponding speech-to-text output files:

Unzip the contents of the .zip file to the Speech Server data directory.

Requirements

To use this tutorial, you need the following components:

NOTE:

You can perform the tutorial with a later language pack, although you may find slight differences between your output and the output files provided. In addition, results might vary depending on the operating system that you are using.

For more details on how to install and run IDOL Speech Server, refer to the IDOL Speech Server Administration Guide.

You must configure administrative access for your local machine by specifying the Access-Control-Allow-Origin parameter in the IDOL configuration file. For example: Access-Control-Allow-Origin=http://localhost:15000. For more information, refer to the IDOL Server Reference.

After you install IDOL Speech Server, type http://localhost:15000/action=Admin in a Web browser to start IDOL Admin and check that IDOL Speech Server is running. You can also use the IDOL Admin interface to check that the language pack is installed and configured correctly, and to load the language pack.

Perform Speech-to-Text

To perform speech-to-text, type the following in your Web browser:

      http://localhost:15000/action=AddTask&Type=WavToText&File=voicecall1.mp3&Out=voicecall1.ctm&lang=ENUK-tel

By default, Speech Server writes the output to the temp directory.

This action starts a WavToText task, which processes a media file and produces a speech-to-text transcript. The response should look something like this:

      <autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">   
      <action>ADDTASK</action>
      <response>SUCCESS</response>
      <responsedata>
      <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
      </responsedata>
      </autnresponse>
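If you prefer to script this step rather than use a Web browser, the following Python sketch shows one way to build the same AddTask URL and read the task token from the response. It uses only the standard library. The function names, the default host, and the timeout are illustrative assumptions and are not part of IDOL Speech Server; the sketch also assumes that the server accepts percent-encoded parameter values.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode


def build_addtask_url(file, out, lang=None, host="http://localhost:15000"):
    """Build a WavToText AddTask URL in the style used in this tutorial."""
    params = {"Type": "WavToText", "File": file, "Out": out}
    if lang is not None:
        params["lang"] = lang
    # Percent-encode values (for example, spaces in paths) but keep / and : readable.
    query = urlencode(params, safe="/:", quote_via=quote)
    return f"{host}/action=AddTask&{query}"


def extract_token(response_xml):
    """Return the text of the first <token> element, or None if there is none."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # Compare local names so that a namespace prefix does not matter.
        if element.tag.rsplit("}", 1)[-1] == "token":
            return (element.text or "").strip()
    return None


def submit_task(url, timeout=30):
    """Send the request to a running server and return the task token."""
    with urllib.request.urlopen(url, timeout=timeout) as reply:
        return extract_token(reply.read().decode("utf-8"))
```

For example, build_addtask_url("voicecall1.mp3", "voicecall1.ctm", lang="ENUK-tel") produces the URL shown at the start of this section, and submit_task sends it to a running server.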

You can use the task token in the <token> tag to check the status of the task. In your Web browser, type:

      http://localhost:15000/action=GetStatus&Token=token

For example, http://localhost:15000/action=GetStatus&Token=MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=. You should see output similar to the following when the task is complete:

      <tasks>
      <maxTasks>1</maxTasks>
         <task>
            <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
            <category>STANDARD</category>
            <type>wavToText</type>
            <params>file: c:/tutorialfiles/voicecall1.mp3, out: c:/tutorialfiles/voicecall1.ctm, overlap: 5, splitsize: 120, taskdata: none, taskmanagers: 1, type: wavToText</params>
            <nwarnings>0</nwarnings>
            <nsubtasks>-1</nsubtasks>
            <status>FINISHED</status>
            <lang>9429d571242252fe</lang>
            <secondsRun>0</secondsRun>
            <processingStart>0</processingStart>
            <processingEnd>0</processingEnd>
        </task>
      </tasks>

The <status> tag indicates that the task is finished.
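The status check can also be automated. The following Python sketch polls GetStatus until the <status> tag reads FINISHED. The helper names, polling interval, and retry limit are illustrative assumptions; the only status value taken from this tutorial is FINISHED, so the sketch treats every other value as "not finished yet".

```python
import time
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import quote


def status_url(token, host="http://localhost:15000"):
    """Build the GetStatus URL for a task token."""
    # Keep '=' readable, as in the tutorial example; encode anything else unsafe.
    return f"{host}/action=GetStatus&Token={quote(token, safe='=')}"


def extract_status(status_xml):
    """Return the text of the first <status> element, or None if there is none."""
    root = ET.fromstring(status_xml)
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == "status":
            return (element.text or "").strip()
    return None


def http_get(url):
    """Fetch a URL from a running server and return the body as text."""
    with urllib.request.urlopen(url, timeout=30) as reply:
        return reply.read().decode("utf-8")


def wait_until_finished(token, fetch=http_get, interval=2.0, max_polls=300):
    """Poll GetStatus; return True once the task reports FINISHED, else False."""
    url = status_url(token)
    for _ in range(max_polls):
        if extract_status(fetch(url)) == "FINISHED":
            return True
        time.sleep(interval)
    return False
```

The fetch argument exists so that you can substitute a stub when no server is available; by default the function contacts the server at the assumed host.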

The output file voicecall1.ctm should be identical to the voicecall1_out.ctm file provided in IDOLSpeechtoTextTelephonyTutorial.zip.

This .ctm file is a time-ordered list of the words recognized, in the following format:

      1 A 0.000 1.230 <SIL> 0.000
      1 A 1.230 0.360 <SIL> 0.000
      1 A 1.590 0.740 innovation 0.000

Each entry in the file consists of the following information:

Written out as a paragraph, the words recognized by a speech-to-text task on the voicecall1.mp3 sample file should look something like this:

the Conservative leader will pledge  legislation within a hundred days of assuming office to ensure rates do not write in the next parliament   but Labour said the case was a gimmick  and expected the ATC right  party leader Ed Miliband  will say later that I plan to reduce the welfare budget would mean cuts to tax credits <s> elsewhere in the election campaign  Liberal Democrat leader Nick Clegg  will focus on his party's pledge to offer free school meals to  all primary schoolchildren in England  Ukip leader  Nigel Farage is to break  off a campaign to give a speech in the European Parliament attacking what he says is a common  EU migration policy  decision not to renew UK Trident nuclear submarines could threaten  the survival of our nation  twenty ex military officials say in a letter to The Times newspaper <s> both the Conservatives and Labour will focus on  economic issues  is just over a week to go until polling day  and with opinion polls suggesting neither has been able to open up  a decisive lead   <s>
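To turn a .ctm file into a paragraph like the one above, you can join the word column of each entry. The following Python sketch is one way to do this. It assumes that each entry has whitespace-separated columns, with the start time third, the duration fourth, and the word fifth; this matches the sample entries shown earlier, where each start time equals the previous start time plus the previous duration. The names used are illustrative, and dropping only <SIL> entries is an assumption about how the paragraph form is produced.

```python
from typing import NamedTuple


class CtmEntry(NamedTuple):
    start: float     # start time in seconds (third column)
    duration: float  # duration in seconds (fourth column)
    word: str        # recognized word or marker (fifth column)


def parse_ctm(text):
    """Parse .ctm text into a list of CtmEntry, skipping blank or short lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        entries.append(CtmEntry(float(fields[2]), float(fields[3]), fields[4]))
    return entries


def words_as_paragraph(entries, skip=("<SIL>",)):
    """Join the recognized words in order, leaving out silence markers."""
    return " ".join(entry.word for entry in entries if entry.word not in skip)
```

To use it, read the output file into a string, for example with open("voicecall1.ctm").read(), and pass the result through parse_ctm and then words_as_paragraph.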

You can see that the text is fairly readable, which shows good recognition capabilities. If you listen to the file, you can hear why: there is one speaker, who speaks clearly and with a typical British accent, with no background noise or music.

A Tougher Example

Type the following in your Web browser to run a speech-to-text task:

      http://localhost:15000/action=AddTask&Type=WavToText&File=path/voicecall2.mp3&Out=path/voicecall2_out.ctm

where path is the location on your machine of the audio file and its corresponding output file.

Written out as a paragraph, the words recognized by a speech-to-text task on the voicecall2.mp3 sample file should look something like this:

oh hi there  yet so  I got some chairs  a new website  a  And then I have messaged to say it is not in store <s> um  so  yeah  I can't have them <s> we like to have something else <s> erm  nothing probably  probably basing stays just  cancel the order in the  air  and  yes that to my name is Helen  Smith  erm  thing you can  insert as it was number seven six  five four three  two one  am nothing to address as  sixty four separate Cheshire  erm  yeh sorry if you can just cancel that and there and  you notice the  difference until it was us re order something  alright  Okay thanks  Barry  <s>

In this case, a number of factors contribute to reduced overall word accuracy:

Improve Speech-to-Text Results

You can optimize the results of speech-to-text tasks in several ways. For example, if the content is about a particular subject, you can produce a custom language model for that subject to guide the recognition to words from that particular subject domain. You can also filter music and noise, and perform speech-to-text only on those parts of the file that are classified as speech. If you have content that might contain a number of languages or accents, you can use language identification to determine the most likely language or accent variant. For further information on language identification, refer to the IDOL Speech Server Administration Guide.

In particular, when processing voice calls from a call center, HP recommends that you create a custom language model using verbatim transcriptions of real voice calls (typically 20+ hours of calls). The more example text data that you can provide, the better the resulting recognition should be. For more details on how to create and tune custom language models, see the IDOL Speech-to-Text Language Modelling Tutorial.

IDOL Speech Server provides several other speech and audio processing capabilities, including audio classification, language identification, speaker segmentation and identification, audio fingerprinting, and phrase matching. IDOL Speech Server supports over 30 languages, with a number of those having telephony language packs, which enable speech-to-text over voice call content. Refer to the IDOL Speech Server Administration Guide for more details.

