IDOL Speech-to-Text Tutorial

This section takes you through the speech-to-text process in IDOL Speech Server, and provides you with basic information on the output of the speech-to-text algorithm, so that you know what to expect from your IDOL Speech Server installation. For more details, refer to the IDOL Speech Server Administration Guide and the IDOL Speech Server Reference.

Supporting Files

To obtain the files that you need for this tutorial, go to the SoftSound Server area on the HPE Big Data Customer Support Site at https://customers.autonomy.com, click ZIP in the Format section, and then download IDOLSpeechtoTextTutorial.zip.

The .zip file contains the following example video files, and the corresponding speech-to-text output files:

Unzip the contents of the .zip file to the Speech Server data directory.

Requirements

To use this tutorial, you need the following components:

Note: You can perform the tutorial with a later language pack, although you may find slight differences between your output and the output files provided. In addition, results might vary depending on the operating system that you are using.

For more details on how to install and run IDOL Speech Server, refer to the IDOL Speech Server Administration Guide.

You must configure administrative access for your local machine by specifying the Access-Control-Allow-Origin parameter in the IDOL configuration file. For example: Access-Control-Allow-Origin=http://localhost:15000. For more information, refer to the IDOL Server Reference.

After you install IDOL Speech Server, type http://localhost:15000/action=Admin in a Web browser to start IDOL Admin and check that IDOL Speech Server is running. You can also use the IDOL Admin interface to check that the language pack is installed and configured correctly, and to load the language pack.

Perform Speech-to-Text

To perform speech-to-text, type the following in your Web browser:

      http://localhost:15000/action=AddTask&Type=WavToText&File=moonshot101.mp4&Out=moonshot101.ctm&Lang=ENUS  

By default, Speech Server writes the output to the temp directory.

This action starts a WavToText task, which processes a media file and produces a speech-to-text transcript. The response should look something like this:

      <autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">   
      <action>ADDTASK</action>
      <response>SUCCESS</response>
      <responsedata>
      <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
      </responsedata>
      </autnresponse>
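If you prefer to script this step rather than use a Web browser, the following Python sketch shows one way to assemble the AddTask URL and read the token out of a response like the one above. The function names build_add_task_url and extract_token are hypothetical helpers written for this illustration and are not part of IDOL Speech Server. The sketch assumes that the server is at localhost:15000, as in this tutorial, and that the response has the shape shown.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_add_task_url(media_file, out_file, lang=None,
                       host="localhost", port=15000):
    """Assemble a WavToText AddTask URL in the form used in this tutorial.

    Hypothetical helper: only the parameters shown in the tutorial
    (Type, File, Out, Lang) are handled.
    """
    params = {"Type": "WavToText", "File": media_file, "Out": out_file}
    if lang:
        params["Lang"] = lang
    # Keep "/" and ":" unescaped so that file paths stay readable.
    query = urlencode(params, safe="/:")
    return f"http://{host}:{port}/action=AddTask&{query}"


def extract_token(response_xml):
    """Return the text of the first <token> element, or None if absent."""
    root = ET.fromstring(response_xml.strip())
    return root.findtext(".//token")
```

For example, build_add_task_url("moonshot101.mp4", "moonshot101.ctm", "ENUS") produces the same URL as the one typed into the browser above.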

You can use the task token in the <token> tag to check the status of the task. In your Web browser, type:

      http://localhost:15000/action=GetStatus&Token=token

For example, http://localhost:15000/action=GetStatus&Token=MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=. You should see output similar to the following when the task is complete:

      <tasks>
      <maxTasks>1</maxTasks>
         <task>
            <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
            <category>STANDARD</category>
            <type>wavToText</type>
            <params>file: c:/tutorialfiles/moonshot101.mp4, out: c:/tutorialfiles/moonshot101.ctm, overlap: 5, splitsize: 120, taskdata: none, taskmanagers: 1, type: wavToText</params>
            <nwarnings>0</nwarnings>
            <nsubtasks>-1</nsubtasks>
            <status>FINISHED</status>
            <lang>9429d571242252fe</lang>
            <secondsRun>0</secondsRun>
            <processingStart>0</processingStart>
            <processingEnd>0</processingEnd>
        </task>
      </tasks>

The <status> tag indicates that the task is finished.
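The status check can also be scripted. The following Python sketch polls GetStatus until the <status> tag for a given token reads FINISHED. The names task_status, fetch_status, and wait_until_finished are hypothetical helpers for this illustration. The sketch assumes the response contains <task> elements shaped like the output above, and it treats only FINISHED as completion, because this tutorial does not list the other status values.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET


def task_status(status_xml, token):
    """Return the <status> text of the task with the given token, or None."""
    root = ET.fromstring(status_xml.strip())
    for task in root.iter("task"):
        if task.findtext("token") == token:
            return task.findtext("status")
    return None


def fetch_status(token, host="localhost", port=15000):
    """Request GetStatus for a token and return the response body as text.

    The token is inserted as-is, mirroring the URL shown in this tutorial.
    """
    url = f"http://{host}:{port}/action=GetStatus&Token={token}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def wait_until_finished(token, fetch=fetch_status, interval=5.0, max_polls=120):
    """Poll until the task reports FINISHED; return False if it never does."""
    for _ in range(max_polls):
        if task_status(fetch(token), token) == "FINISHED":
            return True
        time.sleep(interval)
    return False
```

Passing the fetch function as a parameter keeps the polling logic separate from the HTTP request, so the parsing can be exercised against a saved response without a running server.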

The output file moonshot101.ctm should be identical to the moonshot101_out.ctm file provided in IDOLSpeechtoTextTutorial.zip.

This .ctm file is a time-ordered list of the words recognized, in the following format:

1 A 0.000 1.230 <SIL> 0.000
1 A 1.230 0.360 <SIL> 0.000
1 A 1.590 0.740 innovation 0.000

Each entry in the file consists of the following information:

Presented as a paragraph of words, the speech-to-text output for the moonshot101.mp4 sample file should look something like this:

innovation  is core  to our DNA HP   reviews and innovation  to create new classes of solutions  like System Pro   first  x eighty six or   a series of highly successful  flight systems  even  the first rack mount server  when HP moonshine <s> we are doing it again   where launching an  innovative solution  for the new style  of ninety <s> so why are we doing much  today  almost  anything can be connected to the Internet  millions of devices can be tracked  gather  and process information  or provide a service  all  while seamlessly  interacting  with other devices <s> we call this  the Internet of Things   twenty twenty  the number of devices  in data  are expected to grow  exponentially  reaching thirty billion devices   forty trillion gigabytes of data  when ten million applications  these new applications are doing things  that a few years ago  we didn't think were possible  and doing it  at huge  scale  when running these new applications that scale <s> we had to think differently  in the design of the solution <s> H B Moore shot  is a revolutionary new server  design for  at scale applications  applications used by millions of users  applications which support  the new style of I T  H Freeman shot uses  ultra low energy servers  derivatives and devices designed for notebooks  and cell phones   designed to run  all day  on a single battery charge  designed to be incredibly  energy efficient  and the end result  is a dramatic reduction  in cost  power  and space <s> one of the most  innovative things about moon shot  is the overall  architecture  <s> we have created  software to find servers  each one  targeted  at specific workloads  share  everything  provided discreetly  in a traditional server <s> we created a resourceful  the power supply  the power cords  the cooling fans <s>  the network a network of links  the management  interface  management network  the firmware  firmware updates   everything  shared across forty five  hot plausible  server 
cartridges  in essence  a community of servers  where  only components that need to be dedicating  our dedicated  in everything else  share  to provide savings in power  space  in costs  one shot is different from servers designed in the past <s>  for the targeted applications  we see a dramatic reduction  in the number of racks required  to do the same amount of work  saving purchase cost  operational cost  and data center space   typical server designs  take  eighteen months  from beginning  to end   one shot accelerate that pace  of innovation  by three times  to address the rate  of the explosion  of these new types  of that scale applications   and this can be achieved  through tailored  and tuned <s> software defined  server cartridges  <s> nobody  other than HP  can put together a comprehensive plan  from beginning  to  end  like we done with one shot   using the  HP Pathfinder  innovation  ecosystem  we bring together  leading technology partners   to provide the latest solutions  to address  the new style I tee   shot is a great example  of  HP's innovation   in shows how we're addressing what CIO  and CFO has made  lowering cost  reducing energy  and  saving valuable  data center space   we define the industry standard server market  and we've been a leader for years  with moonshine  we've redefined the market  and taking it  to the next level  the more shot journey has just started  and I'm excited  about where we'll take us   find out how much I can help you  accelerate innovation  while delivering  breakthrough efficiency  and scale  today  thanks to Ty

You can see that the text is fairly readable, which shows good recognition capabilities. If you listen to the file, you can hear why: there is one speaker, who speaks clearly and with a typical American accent, with no background noise or music.
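As a rough illustration of how the .ctm entries relate to the paragraph form, the following Python sketch reads .ctm lines and joins the recognized words. The function names parse_ctm and ctm_to_text are hypothetical. The column positions are an assumption inferred from the three sample lines shown earlier: the third and fourth fields look like a start time and a duration in seconds, and the fifth field is the word. The sketch drops <SIL> entries and keeps all other tokens unchanged.

```python
def parse_ctm(lines):
    """Parse .ctm lines into (start, duration, word) tuples.

    Assumes whitespace-separated fields with the start time, duration,
    and word in positions 3, 4, and 5, as in the tutorial sample.
    Lines with fewer than five fields are skipped.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        entries.append((float(fields[2]), float(fields[3]), fields[4]))
    return entries


def ctm_to_text(lines, silence="<SIL>"):
    """Join the recognized words into one paragraph, skipping silence."""
    return " ".join(word for _, _, word in parse_ctm(lines) if word != silence)
```

To apply it to a file, open the .ctm output and pass the file object to ctm_to_text, because iterating over an open text file yields its lines.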

A Tougher Example

Type the following in your Web browser to run a speech-to-text task:

      http://localhost:15000/action=AddTask&Type=WavToText&File=path/moonshotOverview.mp4&Out=path/moonshotOverview.ctm

where path is the location on your machine of the video file and its corresponding output file.

Presented as a paragraph of words, the speech-to-text output for the moonshotOverview.mp4 sample file should look something like this:

the  older per  unit is probably   the  buckling under its  only  the  forty three thousand searches in some  sixty million three thousand  nine hundred million dollars  online transactions <s> every now  and it's growing  exponentially   doubling in size  every thirty six miles <s> the  tax  cuts the state  to build a new car the server  for a new time  for a new kind word or a new kind of world  I'm  lazy exponentially fewer resources to do and  most people call that  changed  somewhat colder  we call it <s> H The moonshine Show  which I  ate spinach I  go to the Internet  the room and it's around  eighty percent smaller than traditional servers  use  eighty nine percent  less energy <s> possibly some  strong  but it's bigger than  over the next thirty six months  its two hundred  metric  tons of steel to  never being an  eight million square feet of forest  but  it  just  means  it's ten power  is  at  its the reason the internet just might have a future after all  so empowered has  allowed lives  our future <s> the flu shot  is from HP <s> it's time to build a better read  this  get  together  again in  Madison on  the  hp  shop <s> this  is going to pay  the  the  the  the  <s>

If you listen to the file, you can hear background music almost the entire way through, and the speech is very disjointed. The file mainly consists of short sentences separated by periods of music, and some sentences are even broken into two parts with music in between. The IDOL Speech Server standard Broadcast models are optimized for broadcast news content, and as such do not perform as well on this disjointed speech with background music.

The longer periods of music are commonly recognized as words in the example above. You can configure IDOL Speech Server to perform music and noise filtering, which attempts to recognize speech only on areas classified as speech, and excludes areas classified as music or noise.

Note: You might miss some segments of speech if you enable music and noise filtering.

Factors that Affect Speech-to-Text

A number of factors affect the recall rate (correct detection) of words or phrases. These include factors that affect the audio signal, such as signal bandwidth, background noise, and distortion caused by compression and storage. They also include factors that affect the speech itself, such as speech clarity (which can be affected by factors such as whether a speaker is native), speed of speech, presence of cross-talk (multiple speakers at the same time), and the format and breadth of language used.

 

Improve Speech-to-Text Results

You can optimize the results of speech-to-text tasks in several ways. For example, if the content is about a particular subject, you can produce a custom language model for that subject to guide the recognition to words from that particular subject domain. You can also filter music and noise, and perform speech-to-text only on those parts of the file that are classified as speech. If you have content that might contain a number of languages or accents, you can use language identification to determine the most likely language or accent variant. For further information on language identification, refer to the IDOL Speech Server Administration Guide.

IDOL Speech Server provides several other speech and audio processing capabilities, including audio classification, language identification, speaker segmentation and identification, audio fingerprinting, and phrase matching. IDOL Speech Server supports over 30 languages, with a number of those having telephony language packs, which enable speech-to-text over voice call content. Refer to the IDOL Speech Server Administration Guide for more details.

