This section takes you through the speech-to-text process in IDOL Speech Server, and provides you with basic information on the output of the speech-to-text algorithm, so that you know what to expect from your IDOL Speech Server installation. For more details, refer to the IDOL Speech Server Administration Guide and the IDOL Speech Server Reference.
To obtain the files that you need for this tutorial, go to the SoftSound Server area on the HPE Big Data Customer Support Site at https://customers.autonomy.com, click ZIP in the Format section, and then download IDOLSpeechtoTextTutorial.zip.
The .zip file contains the following example video files, and the corresponding speech-to-text output files:
Moonshot101.mp4
Moonshot101_out.ctm
MoonshotOverview.mp4
MoonshotOverview_out.ctm
Unzip the contents of the .zip file to the Speech Server data directory.
To use this tutorial, you need the following components:
ENUS-6.4 language pack. You can download the language pack from the HPE Big Data Customer Support Site at https://customers.autonomy.com. You can perform the tutorial with a later language pack, although you may find slight differences between your output and the output files provided. In addition, results might vary depending on the operating system that you are using.
For more details on how to install and run IDOL Speech Server, refer to the IDOL Speech Server Administration Guide.
You must configure administrative access for your local machine by specifying the Access-Control-Allow-Origin parameter in the IDOL configuration file. For example: Access-Control-Allow-Origin=http://localhost:15000. For more information, refer to the IDOL Server Reference.
After you install IDOL Speech Server, type http://localhost:15000/action=Admin in a Web browser to start IDOL Admin and check that IDOL Speech Server is running. You can also use the IDOL Admin interface to check that the language pack is installed and configured correctly, and to load the language pack.
To perform speech-to-text, type the following in your Web browser:
http://localhost:15000/action=AddTask&Type=WavToText&File=moonshot101.mp4&Out=moonshot101.ctm&Lang=ENUS
By default, Speech Server writes the output to the temp directory.
This action starts a WavToText task, which processes a media file and produces a speech-to-text transcript. The response should look something like this:
<autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">
  <action>ADDTASK</action>
  <response>SUCCESS</response>
  <responsedata>
    <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
  </responsedata>
</autnresponse>
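The same request can also be sent from a script instead of a Web browser. The following Python sketch builds the AddTask URL shown above and pulls the task token out of the response. The function names (build_add_task_url, extract_token, submit_task) and the default host and port are illustrative assumptions, not part of IDOL Speech Server, and the response is assumed to have the shape of the sample above.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode


def build_add_task_url(media_file, out_file, lang=None,
                       host="localhost", port=15000):
    """Build a WavToText AddTask URL in the form used in this tutorial."""
    params = {"Type": "WavToText", "File": media_file, "Out": out_file}
    if lang:
        params["Lang"] = lang
    # Keep "/" and ":" readable so that file paths look as they do in the tutorial.
    query = urlencode(params, quote_via=quote, safe="/:")
    return f"http://{host}:{port}/action=AddTask&{query}"


def extract_token(response_xml):
    """Return the text of the first <token> element in an AddTask response."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # Strip any namespace prefix before comparing the tag name.
        if element.tag.split("}")[-1] == "token":
            return (element.text or "").strip()
    raise ValueError("no <token> element found in the response")


def submit_task(media_file, out_file, lang=None):
    """Send the AddTask request to a running server and return the task token."""
    url = build_add_task_url(media_file, out_file, lang)
    with urllib.request.urlopen(url) as response:
        return extract_token(response.read())
```

For example, submit_task("moonshot101.mp4", "moonshot101.ctm", lang="ENUS") sends the request shown earlier, assuming that IDOL Speech Server is running locally on port 15000.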
You can use the task token in the <token> tag to check the status of the task. In your Web browser, type
http://localhost:15000/action=GetStatus&Token=token
For example, http://localhost:15000/action=GetStatus&Token=MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=. You should see output similar to the following when the task is complete:
<tasks>
  <maxTasks>1</maxTasks>
  <task>
    <token>MTkyLjE2OC4xLjkzOjE1MDAwOkFERFRBU0s6LTE3NjgyNTE3NjQ=</token>
    <category>STANDARD</category>
    <type>wavToText</type>
    <params>file: c:/tutorialfiles/moonshot101.mp4, out: c:/tutorialfiles/moonshot101.ctm, overlap: 5, splitsize: 120, taskdata: none, taskmanagers: 1, type: wavToText</params>
    <nwarnings>0</nwarnings>
    <nsubtasks>-1</nsubtasks>
    <status>FINISHED</status>
    <lang>9429d571242252fe</lang>
    <secondsRun>0</secondsRun>
    <processingStart>0</processingStart>
    <processingEnd>0</processingEnd>
  </task>
</tasks>
The <status> tag indicates that the task is finished.
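If you prefer to check the status from a script, the following Python sketch reads the <status> value for a given token and polls until the task reports FINISHED. The names task_status and wait_for_task, the polling interval, and the timeout are illustrative assumptions. The sketch also assumes that the <task> elements carry no XML namespace, as in the sample output above, and it handles only the FINISHED status because that is the only value shown in this tutorial.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET


def task_status(status_xml, token):
    """Return the <status> text of the task with this token, or None if absent."""
    root = ET.fromstring(status_xml)
    for task in root.iter("task"):
        if task.findtext("token", default="").strip() == token:
            return task.findtext("status", default="").strip()
    return None


def wait_for_task(token, host="localhost", port=15000,
                  interval=5.0, timeout=600.0):
    """Poll GetStatus until the task is FINISHED or the timeout expires."""
    # The token is passed unchanged, as in the example URL in this tutorial.
    url = f"http://{host}:{port}/action=GetStatus&Token={token}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(url) as response:
            status = task_status(response.read(), token)
        if status == "FINISHED":
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {token} did not finish within {timeout} seconds")
```

Polling with a fixed interval keeps the sketch simple; a production script would probably also want to stop early on whatever failure statuses the server reports.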
The output file moonshot101.ctm should be identical to the moonshot101_out.ctm file provided in IDOLSpeechtoTextTutorial.zip.
This .ctm file is a time-ordered list of the words recognized, in the following format:
1 A 0.000 1.230 <SIL> 0.000
1 A 1.230 0.360 <SIL> 0.000
1 A 1.590 0.740 innovation 0.000
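Each entry in the file consists of the following information, in order:
An identifier (1).
The channel (A).
The start time of the word, in seconds.
The duration of the word, in seconds. In the example above, each start time equals the previous start time plus the previous duration.
The recognized word. Silence is labeled as <SIL> (this includes certain non-speech sounds such as lip smacking). Sentence boundaries are labeled as <s>.
A score for the word.
The word output as a paragraph of words for a speech-to-text task on the Moonshot101.mp4 sample file should look something like this: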
innovation is core to our DNA HP reviews and innovation to create new classes of solutions like System Pro first x eighty six or a series of highly successful flight systems even the first rack mount server when HP moonshine <s> we are doing it again where launching an innovative solution for the new style of ninety <s> so why are we doing much today almost anything can be connected to the Internet millions of devices can be tracked gather and process information or provide a service all while seamlessly interacting with other devices <s> we call this the Internet of Things twenty twenty the number of devices in data are expected to grow exponentially reaching thirty billion devices forty trillion gigabytes of data when ten million applications these new applications are doing things that a few years ago we didn't think were possible and doing it at huge scale when running these new applications that scale <s> we had to think differently in the design of the solution <s> H B Moore shot is a revolutionary new server design for at scale applications applications used by millions of users applications which support the new style of I T H Freeman shot uses ultra low energy servers derivatives and devices designed for notebooks and cell phones designed to run all day on a single battery charge designed to be incredibly energy efficient and the end result is a dramatic reduction in cost power and space <s> one of the most innovative things about moon shot is the overall architecture <s> we have created software to find servers each one targeted at specific workloads share everything provided discreetly in a traditional server <s> we created a resourceful the power supply the power cords the cooling fans <s> the network a network of links the management interface management network the firmware firmware updates everything shared across forty five hot plausible server cartridges in essence a community of servers where only components that need to be dedicating our 
dedicated in everything else share to provide savings in power space in costs one shot is different from servers designed in the past <s> for the targeted applications we see a dramatic reduction in the number of racks required to do the same amount of work saving purchase cost operational cost and data center space typical server designs take eighteen months from beginning to end one shot accelerate that pace of innovation by three times to address the rate of the explosion of these new types of that scale applications and this can be achieved through tailored and tuned <s> software defined server cartridges <s> nobody other than HP can put together a comprehensive plan from beginning to end like we done with one shot using the HP Pathfinder innovation ecosystem we bring together leading technology partners to provide the latest solutions to address the new style I tee shot is a great example of HP's innovation in shows how we're addressing what CIO and CFO has made lowering cost reducing energy and saving valuable data center space we define the industry standard server market and we've been a leader for years with moonshine we've redefined the market and taking it to the next level the more shot journey has just started and I'm excited about where we'll take us find out how much I can help you accelerate innovation while delivering breakthrough efficiency and scale today thanks to Ty
You can see that the text is fairly readable, which shows good recognition capabilities. If you listen to the file, you can hear why: there is one speaker, who speaks clearly with a typical American accent, and there is no background noise or music.
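To relate the paragraph above to the .ctm file, the following Python sketch parses .ctm entries and joins the recognized words into a single line of text. It assumes one entry per line with six whitespace-separated fields, as in the sample entries shown earlier. The names CtmEntry, parse_ctm, and words_as_paragraph, and the field names, are illustrative and are not part of IDOL Speech Server.

```python
from typing import List, NamedTuple


class CtmEntry(NamedTuple):
    identifier: str   # first field, "1" in the sample entries
    channel: str      # second field, "A" in the sample entries
    start: float      # start time in seconds
    duration: float   # duration in seconds
    word: str         # recognized word, <SIL>, or <s>
    score: float      # last field, 0.000 in the sample entries


def parse_ctm(text: str) -> List[CtmEntry]:
    """Parse .ctm text into entries, skipping lines without six fields."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        entries.append(CtmEntry(fields[0], fields[1], float(fields[2]),
                                float(fields[3]), fields[4], float(fields[5])))
    return entries


def words_as_paragraph(entries: List[CtmEntry]) -> str:
    """Join the words in time order, dropping silence but keeping <s> markers."""
    return " ".join(entry.word for entry in entries if entry.word != "<SIL>")
```

Sentence-boundary markers are kept so that the result resembles the paragraph shown above; filter out <s> as well if you want plain running text.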
Type the following in your Web browser to run a speech-to-text task:
http://localhost:15000/action=AddTask&Type=WavToText&File=path/moonshotOverview.mp4&Out=path/moonshotOverview.ctm
where path is the location on your machine of the video file and its corresponding output file.
The word output as a paragraph of words for a speech-to-text task on the MoonshotOverview.mp4 sample file should look something like this:
the older per unit is probably the buckling under its only the forty three thousand searches in some sixty million three thousand nine hundred million dollars online transactions <s> every now and it's growing exponentially doubling in size every thirty six miles <s> the tax cuts the state to build a new car the server for a new time for a new kind word or a new kind of world I'm lazy exponentially fewer resources to do and most people call that changed somewhat colder we call it <s> H The moonshine Show which I ate spinach I go to the Internet the room and it's around eighty percent smaller than traditional servers use eighty nine percent less energy <s> possibly some strong but it's bigger than over the next thirty six months its two hundred metric tons of steel to never being an eight million square feet of forest but it just means it's ten power is at its the reason the internet just might have a future after all so empowered has allowed lives our future <s> the flu shot is from HP <s> it's time to build a better read this get together again in Madison on the hp shop <s> this is going to pay the the the the <s>
If you listen to the file, it has background music almost the entire way through, and also contains very disjointed speech. The file consists mainly of short sentences separated by periods of music, and some sentences are even broken into two parts with music in between. The IDOL Speech Server standard Broadcast models are optimized for broadcast news content, and so do not perform as well on this disjointed speech with background music.
The longer periods of music are commonly recognized as words in the example above. You can configure IDOL Speech Server to perform music and noise filtering, which attempts to recognize speech only on areas classified as speech, and excludes areas classified as music or noise.
You might miss some segments of speech if you enable music and noise filtering.
A number of factors affect the recall rate (correct detection) of words or phrases. These include signal bandwidth, background noise, and audio signal distortion because of compression and storage, as well as factors that affect the spoken speech, such as speech clarity (which can be affected by factors such as whether a speaker is native), speed of speech, presence of cross-talk (multiple speakers at the same time), and the format and breadth of language used.
You can optimize the results of speech-to-text tasks in several ways. For example, if the content is about a particular subject, you can produce a custom language model for that subject to guide the recognition to words from that particular subject domain. You can also filter music and noise, and perform speech-to-text only on those parts of the file that are classified as speech. If you have content that might contain a number of languages or accents, you can use language identification to determine the most likely language or accent variant. For further information on language identification, refer to the IDOL Speech Server Administration Guide.
IDOL Speech Server provides several other speech and audio processing capabilities, including audio classification, language identification, speaker segmentation and identification, audio fingerprinting, and phrase matching. IDOL Speech Server supports over 30 languages, with a number of those having telephony language packs, which enable speech-to-text over voice call content. Refer to the IDOL Speech Server Administration Guide for more details.