Train Speaker Identification

Speaker identification divides audio into segments according to who is speaking. Media Server can identify the gender of each speaker without training, but to recognize individual speakers you must train Media Server by providing audio samples for each person.

OpenText recommends that you provide at least five minutes of speech for each speaker. An audio sample must not contain speech from any other speaker. Ideally, use high-quality audio samples that contain only the speaker's voice and no background noise. However, you should also include samples from the range of environments (indoors, outdoors, noisy, and so on) that you expect to process. The audio sample can contain any vocabulary; the speaker does not need to say any specific phrase.
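Before training, it can help to confirm that the samples you collected for a speaker add up to the recommended five minutes. The following is a minimal sketch in Python, not part of Media Server. It assumes the samples are WAV files, and the function names and example durations are hypothetical. File length is only an upper bound on the amount of speech, because silence and pauses also count towards it.

```python
import wave

# "At least five minutes of speech" for each speaker, in seconds.
RECOMMENDED_SECONDS = 5 * 60


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def has_enough_speech(durations_seconds, minimum_seconds=RECOMMENDED_SECONDS):
    """Return True if the combined sample length meets the recommendation."""
    return sum(durations_seconds) >= minimum_seconds


if __name__ == "__main__":
    # Hypothetical per-sample durations for one speaker, in seconds.
    # In practice, build this list with wav_duration_seconds(path).
    sample_durations = [95.0, 120.5, 110.0]
    print(has_enough_speech(sample_durations))
```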

The speakers that you train are organized into databases. When you run speaker identification, you provide the name of the database to use, and Media Server attempts to match the speakers in the audio against the speakers in that database. For example, you could create a database named "news" for processing news broadcasts and train various speakers (newsreaders, politicians, and so on) whom you expect to appear.

A television news broadcast is an example of audio that is likely to contain unknown speakers, because you cannot predict everyone who will speak or provide audio samples for every person. You can use the MatchThreshold parameter to set the threshold at which a speaker is recognized, rather than being output as an unknown speaker.
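The effect of a match threshold can be illustrated with a short sketch. This is not Media Server's implementation: the score scale, the "higher is better" comparison, the speaker names, and the identify_speaker function are all assumptions made for illustration. The idea is that the best-scoring trained speaker is reported only if the score reaches the threshold; otherwise the segment is output as an unknown speaker.

```python
UNKNOWN = "Unknown"


def identify_speaker(scores, match_threshold):
    """Pick the best-matching trained speaker, or UNKNOWN.

    scores: dict mapping each trained speaker's name to a similarity
    score for one audio segment (assumed: higher means a closer match).
    """
    if not scores:
        return UNKNOWN
    best_name = max(scores, key=scores.get)
    if scores[best_name] >= match_threshold:
        return best_name
    return UNKNOWN


if __name__ == "__main__":
    # Hypothetical scores against speakers in a "news" database.
    segment_scores = {"newsreader_a": 0.82, "politician_b": 0.40}
    print(identify_speaker(segment_scores, match_threshold=0.6))
    print(identify_speaker(segment_scores, match_threshold=0.9))
```

In this sketch, raising the threshold reduces false matches but causes more trained speakers to be reported as unknown; lowering it has the opposite effect.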