Train Speaker Identification
Speaker identification segments audio by speaker. Media Server can identify the gender of each speaker without training, but to recognize individual speakers you must train Media Server by providing audio samples for each person.
OpenText recommends that you provide at least five minutes of speech for each speaker. An audio sample must not contain speech from any other speaker. Ideally, you should use high-quality audio samples that contain only the speaker's voice and no background noise. However, you should also include samples from the range of environments (indoors, outdoors, noisy, and so on) that you expect to process. The audio sample can contain any vocabulary; the speaker does not need to say any specific phrase.
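The five-minute recommendation can be checked before you start training. The following is a minimal sketch, assuming the samples are uncompressed WAV files; the function names are hypothetical and are not part of Media Server. It measures file length only, so any silence in a file counts towards the total even though it is not speech.

```python
import wave

# Five minutes of speech per speaker, as recommended above.
RECOMMENDED_SPEAKER_SECONDS = 5 * 60


def wav_duration_seconds(path):
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_duration_seconds(paths):
    """Add up the length of every sample collected for one speaker."""
    return sum(wav_duration_seconds(path) for path in paths)


def meets_recommendation(paths, minimum_seconds=RECOMMENDED_SPEAKER_SECONDS):
    """Return True if the samples reach the recommended total length."""
    return total_duration_seconds(paths) >= minimum_seconds
```

For example, `meets_recommendation(["speaker1_indoor.wav", "speaker1_street.wav"])` would report whether those two (hypothetical) files together reach five minutes.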
If you want to process audio that includes unknown speakers (people whom you have not trained and who do not exist in the database), there are some additional training requirements:
- You must provide audio samples that represent unknown speakers, meaning any speakers you have not trained. These samples do not need to feature the same unknown speakers that appear in the audio you are going to process. OpenText recommends that you provide at least 60 minutes of audio containing unknown speakers. This audio must not contain any of the speakers that you have trained.
- You must provide additional audio samples for each of the speakers you have trained, to be used as development rather than training samples. The development samples must be different to the training samples. OpenText recommends that you provide at least five minutes of speech for each speaker.
These additional audio samples are used to generate the thresholds that Media Server uses to distinguish between a match to a known speaker and an unknown speaker.
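The requirements above can be collected into a pre-training checklist. The following is a minimal sketch, assuming you already know the duration of each sample and which voices it contains; `Sample`, `SpeakerPlan`, and `check_open_set_plan` are hypothetical names used for illustration and are not part of Media Server. The durations are recommendations rather than hard limits, so the function returns warnings instead of raising errors.

```python
from dataclasses import dataclass, field

MIN_SPEAKER_SECONDS = 5 * 60   # recommended per speaker, for training and for development
MIN_UNKNOWN_SECONDS = 60 * 60  # recommended total audio of unknown speakers


@dataclass
class Sample:
    path: str
    seconds: float
    voices: frozenset  # names of everyone who can be heard in the sample


@dataclass
class SpeakerPlan:
    name: str
    training: list = field(default_factory=list)
    development: list = field(default_factory=list)


def check_open_set_plan(speakers, unknown_samples):
    """Return a list of warnings; an empty list means the plan follows the guidance."""
    warnings = []
    trained_names = {speaker.name for speaker in speakers}

    for speaker in speakers:
        for label, samples in (("training", speaker.training),
                               ("development", speaker.development)):
            total = sum(sample.seconds for sample in samples)
            if total < MIN_SPEAKER_SECONDS:
                warnings.append(
                    f"{speaker.name}: {total:.0f}s of {label} audio is below "
                    f"the recommended {MIN_SPEAKER_SECONDS}s")
            for sample in samples:
                # A sample must contain the speaker and nobody else.
                if sample.voices != {speaker.name}:
                    warnings.append(
                        f"{sample.path}: {label} sample must contain only {speaker.name}")

        # Development samples must be different to the training samples.
        shared = ({s.path for s in speaker.training}
                  & {s.path for s in speaker.development})
        for path in sorted(shared):
            warnings.append(f"{path}: used for both training and development")

    unknown_total = sum(sample.seconds for sample in unknown_samples)
    if unknown_total < MIN_UNKNOWN_SECONDS:
        warnings.append(
            f"unknown speakers: {unknown_total:.0f}s is below "
            f"the recommended {MIN_UNKNOWN_SECONDS}s")
    for sample in unknown_samples:
        # Unknown-speaker audio must not contain any trained speaker.
        if sample.voices & trained_names:
            warnings.append(f"{sample.path}: unknown-speaker audio contains a trained speaker")

    return warnings
```

Comparing file paths is only a simple way to detect reused samples; it would not catch the same recording saved under two different names.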
The speakers that you train are organized into databases. When you run speaker identification, you provide the name of the database to use, and Media Server attempts to match the speakers in the audio against the speakers in that database. For example, you could create a database named "news" for processing news broadcasts and train various speakers (newsreaders, politicians, and so on) whom you expect to appear.
A television news broadcast is an example of audio that contains unknown speakers, because you cannot predict everyone who will speak or provide audio samples for every person. In this case, you would need to provide development audio samples for each speaker you train, and add audio samples that represent unknown speakers to the database.
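To illustrate why thresholds matter for audio such as a news broadcast, the following sketch shows a generic threshold decision between a known and an unknown speaker. It is not Media Server's algorithm: the similarity scores, the per-speaker thresholds, and the `identify` function are all assumptions made for illustration.

```python
def identify(scores, thresholds, unknown_label="unknown"):
    """Return the best-scoring trained speaker, or unknown_label if no score is high enough.

    scores:     hypothetical similarity score for each trained speaker
    thresholds: hypothetical minimum score needed to accept each speaker
    """
    if not scores:
        return unknown_label
    best = max(scores, key=scores.get)
    if scores[best] >= thresholds[best]:
        return best
    return unknown_label


# Hypothetical speakers in a "news" database.
thresholds = {"newsreader": 0.70, "politician": 0.75}

print(identify({"newsreader": 0.82, "politician": 0.41}, thresholds))
print(identify({"newsreader": 0.55, "politician": 0.60}, thresholds))
```

In the first call the newsreader's score clears its threshold, so the segment is attributed to the newsreader. In the second call no trained speaker scores highly enough, so the segment is reported as an unknown speaker, which is the expected outcome for an interviewee or member of the public who was never trained.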