Media Server

Media Server is an ACI server. For details of changes that affect all ACI servers, see ACI Server Framework.

24.3.0

New Features

  • A new speaker ID algorithm provides improved accuracy and makes speaker ID easier to train. The new algorithm always operates in open-set mode: a speaker is classified as unknown when the confidence score fails to meet the value of the new parameter MatchThreshold. When training speaker ID, you no longer need to provide audio samples from "unknown" speakers.
  • The new speech-to-text models have been improved. For non-English speech, there is a new model (medium) that provides greater accuracy than the small model and better performance than the large model. The large English model has been renamed to medium for consistency, so that the medium and large names accurately indicate resource requirements.
  • The new speech-to-text models produce better output when transcribing Chinese and Japanese. Media Server now produces a separate record for each word. Earlier versions could output multiple words per record when processing these languages.
  • When you use OCR to read a news ticker, Media Server produces a result record for each headline. Earlier versions of Media Server would typically produce a single record for the entire video, containing all of the text.
  • In the OCR analysis engine, the accuracy of alphabet detection has improved.
  • The action ValidateProcessConfig has a new parameter, DatabaseChecks. This specifies whether to fail validation when a trained model (for example an image classifier or object class recognizer) is specified in the configuration but does not exist in the database. To preserve backwards compatibility, the default value of this parameter is FALSE.
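    For illustration, a speaker ID analysis task using the new threshold might be configured as in the sketch below. Only MatchThreshold is named in these notes; the section name, the Type value, the SpeakerDatabase parameter, and the threshold value are assumptions - check the Media Server Help for the exact names.

    ```ini
    [SpeakerID]
    // Hypothetical task section - Type and SpeakerDatabase are assumptions.
    Type = SpeakerID
    SpeakerDatabase = speakers
    // MatchThreshold (new in 24.3): a speaker whose confidence score fails to
    // meet this value is classified as unknown.
    MatchThreshold = 0.6
    ```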

Resolved Issues

  • (Security update) The third-party openjpeg library was updated to version 2.5.2.
  • (Security update) The third-party thrift library was updated to version 0.20.0.
  • (Security update) The third-party libde265 library was updated to version 1.0.15.
  • (Security update) The third-party libheif library was updated to version 1.17.6.
  • The new speech-to-text models would produce output in Traditional Chinese, rather than Simplified Chinese, when LanguagePack=ZHCN.
  • Non-legacy speech-to-text would immediately fail with the error "bad language code" when the chosen language was CYUK, SKSK, or SVSE.
  • Media Server could terminate unexpectedly when performing OCR on video, with SceneAlgorithmBias=accuracy.
  • The alphabet field in PageResult records, from the OCR analysis engine, could include alphabets that were not present in the image.
  • When a training action (for example TrainFace or TrainObject) did not specify an identifier, the automatically-generated identifier was not returned in the response (accessible through the QueueInfo action).
  • A training action that moved a training image (for example MoveFace or MoveObject) would fail if the training image had a status of failed.

Notes

  • There have been significant changes in speaker ID following the introduction of the new speaker ID algorithm.

    • The action CreateSpeakerDatabase no longer has a SampleFrequency parameter, because setting this is no longer required.
    • The action AddSpeakerAudio no longer has a Training parameter. In previous versions of Media Server, some audio samples were designated (by setting Training=False) for use in estimating speaker thresholds, but this is no longer required.
    • The response to the action ListSpeakers has changed.
    • The following training actions have been removed:

      • AddUnknownSpeakerAudio
      • EstimateAllSpeakerThresholds
      • EstimateSpeakerThreshold
      • GetUnknownSpeakerAudio
      • ListUnknownSpeakerAudio
      • NullUnknownSpeakerAudioData
      • RemoveUnknownSpeakerAudio
      • SetSpeakerThreshold
    • The ClosedSet parameter has been removed from the speaker ID analysis engine. You can remove this parameter from your session configurations.
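    As a sketch of the simplified training flow (the host, port, and the Database, Identifier, and Path parameter names are assumptions, not taken from these notes), training now reduces to creating a database and adding audio for known speakers only:

    ```
    http://localhost:14000/action=CreateSpeakerDatabase&Database=speakers
    http://localhost:14000/action=AddSpeakerAudio&Database=speakers&Identifier=speaker1&Path=speaker1_sample.wav
    ```

    Note that neither SampleFrequency nor Training appears in the requests, and no "unknown" speaker audio is added.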

24.2.0

New Features

  • Media Server has a new analysis engine (Type=ImageEmbedding) to generate embeddings. An embedding is a numerical representation of an image, in the form of a vector that you can index into your IDOL Content component. IDOL Content can compare embeddings to other embeddings, to determine whether two images are conceptually similar. If you have a pair of encoders - one trained to generate vectors from images and a corresponding one trained to generate vectors from text - you can use IDOL Content to search for images that are conceptually similar to keywords.
  • The OCR parameter Languages accepts the value ALL, for cases when you do not know which language(s) to expect.
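    As a sketch (the section names are arbitrary, and apart from Type=ImageEmbedding and Languages=ALL, which these notes name, the values shown are assumptions), the two features above might appear in a session configuration like this:

    ```ini
    [ImageEmbedding]
    // New in 24.2: generates an embedding (vector) for each image.
    Type = ImageEmbedding

    [OCR]
    Type = OCR
    // New in 24.2: use ALL when you do not know which language(s) to expect.
    Languages = ALL
    ```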

Resolved Issues

  • Media Server could terminate unexpectedly when speech-to-text was configured with LanguagePack=Input but was not configured to receive language ID results.
  • Speech-to-text accuracy was reduced when processing non-English audio with a duration greater than 30 seconds. This issue was introduced in Media Server 24.1 and only affects the new speech-to-text models (ModelVersion=micro/small/large, but not ModelVersion=legacy). The warning "Detected unsupported language" would appear in the log file engine.log.

24.1.0

New Features

  • Face Demographics has significantly improved accuracy when recognizing age, gender, and ethnicity.
  • Optical Character Recognition recognizes tables that are constructed from text elements in PDF files. In earlier versions of Media Server, OCR identified tabular data only in images.
  • Optical Character Recognition uses a new algorithm for finding text in scene mode. The new algorithm uses a neural net and is more accurate, but slower. The new algorithm is used by default, but if you prefer to prioritize processing speed, you can switch to the algorithm used in earlier versions by setting SceneAlgorithmBias=speed.
  • Optical Character Recognition is faster when processing video subtitles.
  • Language identification results can be routed to a speech-to-text task, so that you can detect the language of speech and then transcribe it in a single process request. To support this feature, the language ID analysis engine has a new output track named ResultWithSource, which includes the detected language and the audio. You can use this track as the input for your speech-to-text task - set LanguagePack=input rather than choosing a specific language. This feature is not available with legacy speech-to-text (ModelVersion=legacy).
  • The DescribeMedia action includes information about subtitle (closed-caption) streams. This information has also been added to the Proxy track of the video ingest engine.
  • You can run face recognition and object recognition against multiple databases of faces/objects. (In earlier versions of Media Server you could choose either one database, or all databases.)
  • The clip encoder can produce animated GIF thumbnails of events that occur in your video.
  • The example Lua script that is provided for drawing analysis results on video (configurations/lua/draw.lua) supports text labels and tracking functionality.
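    The language ID to speech-to-text routing described above might be sketched as follows. The ResultWithSource track name and LanguagePack=input come from these notes; the section names, the engine Type values, and the use of the Input parameter to connect the tasks are assumptions for illustration.

    ```ini
    [Session]
    Engine0 = Source
    Engine1 = LangID
    Engine2 = Transcribe

    [Source]
    Type = Video

    [LangID]
    // Hypothetical Type value for the language ID analysis engine.
    Type = LanguageID

    [Transcribe]
    Type = SpeechToText
    // Consume the new track that carries the detected language and the audio.
    Input = LangID.ResultWithSource
    // Use the detected language instead of naming a specific language pack.
    LanguagePack = input
    ```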

Resolved Issues

  • (Security update) The third-party libwebp library was updated to version 1.3.2.
  • (Security update) The third-party libpng library was updated to version 1.6.40.
  • (Security update) The third-party libjpeg library was updated to version 9e.
  • (Security update) The third-party libtiff library was updated to version 4.6.0.
  • (Security update) The third-party protobuf library was updated to version 24.3. (This update does not apply when GPU acceleration is enabled, in which case Media Server still uses version 2.6.1).
  • Following the improvements to the face detection algorithm in Media Server 23.4, face recognition (used with the same recognition threshold) could produce a different ratio of false positives to false negatives. Face recognition confidence scores have now been rescaled with the aim that a fixed recognition threshold should produce a consistent ratio of false positives to false negatives across Media Server versions.

Notes

  • The list of values that Face Demographics can return for a person's ethnicity has been updated:

    Media Server 23.4       Media Server 24.1
    African/Caribbean       Arab
    Arab                    Black
    Caucasian               East Asian
    East Asian              Latino
    Hispanic                South Asian
    Indian Subcontinent     Southeast Asian
                            White
  • The list of values that Face Demographics can return for a person's age has been updated:

    Label          Approximate Age
                   Media Server 23.4    Media Server 24.1
    Baby           < 2 years            < 2 years
    Child          2-15 years           2-15 years
    Young Adult    15-35 years          15-30 years
    Adult          35-55 years          30-60 years
    Elderly        > 60 years           > 60 years

23.4.0

New Features

  • Face detection has significantly improved recall (fewer genuine faces are missed). For any chosen detection threshold, the new detector produces approximately the same proportion of false positives but significantly fewer false negatives. If you prefer to prioritize precision and have fewer false positives, you could increase the detection threshold because the new detector can achieve equivalent recall at a higher threshold. This improvement benefits downstream analysis such as face recognition, because faces that were not detected in previous versions can now be recognized.
  • Face detection supports GPU acceleration and is much faster when you use a GPU for analysis. In earlier versions of Media Server, face detection did not use the GPU.
  • Media Server can perform a new type of analysis called visual clustering. You can add a selection of video clips to the training database, and then run the action ClusterVisualItems. Media Server divides the videos into clusters of similar items and returns the results.
  • Improved (reduced) memory usage when using a GPU with CUDAVersion=11.
  • Improved pre-trained recognizers are available in the MediaServerPretrainedModels package. The new recognizers are labeled "large" or "small" - the large models provide the best accuracy while the small models prioritize faster analysis.
  • Speech-to-text results now include ISO 8601 durations for alternate word offsets. In earlier versions of Media Server, the startOffset and endOffset values were provided only as a number of milliseconds.
  • Optical Character Recognition (OCR) accuracy has improved when OcrMode=Scene and HollowText=TRUE.
  • The source field in an OCRResult record can contain the value image table, which specifies that the text was extracted from a table in an image. In earlier versions of Media Server these records would have had the value image.

Resolved Issues

  • GPU acceleration did not work on Windows because some libraries were missing from the Media Server package.
  • Media Server would give the error "Unsupported model version" when speech-to-text was configured to use the micro speech-to-text model.

Notes

  • Media Server 23.4 requests as many channels as it requires from your IDOL License Server, up to the maximum number available in your license, unless you configure the number of channels to request. This is a change in the default behavior: earlier versions of Media Server did not request any channels by default, so the number of channels to use had to be set in the configuration file. You therefore no longer need to configure the number of channels when running a single Media Server. However, if you are running multiple instances of Media Server, OpenText recommends that you specify the number of channels to use in each Media Server configuration file, to prevent just one of the servers from consuming all of the available channels.
  • When training image classification or object class recognition, you can set the training option validation_proportion without setting snapshot_frequency. In this case Media Server uses the specified proportion of your training images to evaluate the performance of the final trained model. You can obtain the results of the evaluation through the action GetObjectClassRecognizerSnapshotStatistics.
  • Some of the pre-trained recognizers that were available in the MediaServerPretrainedModels package have been removed, because other recognizers offer better accuracy or performance. ObjectClassRecognizer_RoadScene, ObjectClassRecognizer_Person, and ObjectClassRecognizer_Gen3_PersonCar have been removed because OpenText recommends using one of the surveillance recognizers instead. You can filter the output of a recognizer by setting the ClassFilters parameter in your analysis task. For example, to use a surveillance recognizer to recognize only people and cars, set ClassFilters=person,car.
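    For example, a surveillance recognizer restricted to people and cars might be configured as in the sketch below. ClassFilters=person,car comes from these notes; the section name, the Type value, the Recognizer parameter, and the recognizer name shown are assumptions.

    ```ini
    [Surveillance]
    Type = ObjectClassRecognition
    // Hypothetical recognizer name - substitute one of the surveillance
    // recognizers from the MediaServerPretrainedModels package.
    Recognizer = SurveillanceRecognizer
    // Report only people and cars from the recognizer's full class list.
    ClassFilters = person,car
    ```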

23.3.0

New Features

  • Media Server supports a wider range of graphics cards for accelerating media analysis. Media Server now supports CUDA version 11 (compute capability version 3.5 to 8.6) in addition to the GPUs that were previously supported. For more information about using a GPU, refer to the Media Server Help.
  • Media Server has a new analysis engine (Type=PersonAnalysis). Person analysis reports information about a person - for example their gender, clothing style and color, hair style and color, and hat style and color.
  • The new speech-to-text algorithm (introduced in Media Server 23.2) now requires only the common speech-to-text resources. You no longer need to install legacy language packs such as ENUK or ENUS. The legacy language packs are still required for transcript alignment, speaker clustering, and for running speech-to-text if you set ModelVersion=Legacy.
  • The new speech-to-text algorithm (introduced in Media Server 23.2) can output alternative words.
  • The common resources for the new speech-to-text algorithm include a new model named micro. This model is faster - but less accurate - than the other models. It can be used on older or less powerful hardware that is unable to keep up with a live stream when using the small model. OpenText recommends that you try the small model first.
  • The confidence score for a face recognition result is no longer affected by the value of the parameter MaxRecognitionResults (in earlier versions of Media Server there could be a small variation in confidence score for different values of this parameter).

Resolved Issues

  • Media Server could terminate unexpectedly when ingesting images with a very large number of pixels.
  • An issue could occur when retraining Generation1 object class recognizers. This issue has been resolved with a new database schema (version 13). OpenText recommends that you upgrade to the latest schema and consider retraining any Generation1 recognizers.
  • The XSL transformations toIDX.xsl, toCSV.xsl, and toCFS.xsl did not include the clothing region type.

Notes

  • The Media Server database schema has changed. If you are using an internal database, the schema upgrade is performed automatically when you start the new version of Media Server. If you are using an external MySQL database you must run an upgrade script, which is included in the Media Server 23.3 installation. For more information about upgrading the database schema, refer to the Media Server Help.

23.2.0

New Features

  • Media Server supports new speech-to-text models that offer significantly better accuracy, especially for English speech. The speech-to-text analysis engine has a new configuration parameter, ModelVersion. The default behavior (ModelVersion=legacy) uses the same models as Media Server 12.x. To use one of the new speech-to-text models you must set this parameter to either small (the fastest of the new models) or large (which offers the best accuracy). Custom language models and custom word dictionaries are neither necessary nor supported with the new models, because the vocabulary of the new models is not limited by their training. Due to their size, the new models are not included in the Media Server package and must be downloaded separately.
  • Media Server can perform speaker clustering, which segments an audio recording into different speakers. There is a new analysis engine (Type=ClusterSpeech). Speaker clustering does not need training but does require that you install an appropriate speech-to-text language pack.
  • Transcript alignment (action=AlignAudioTranscript) has a new parameter named IngestDateTime. You can use this to configure the start time for the timestamps. For example, set this parameter when you want the timestamps to match the time when the video was broadcast.
  • Face recognition accuracy has been significantly improved.
  • OCR accuracy has been improved when processing high resolution images in scene mode.
  • The OCR WordResult track supports scrolling text (often seen below television news broadcasts).
  • Strict action validation has been extended so that it validates any session configuration that you pass to a process action. When strict action validation is used, the process action will fail immediately if the configuration includes unknown parameters, or includes a configuration section that is referenced but contains no parameters.
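    A speech-to-text task using one of the new models might look like the following sketch. The ModelVersion parameter and its values (legacy, small, large) come from these notes; the section name, the Type value, and the LanguagePack value are assumptions.

    ```ini
    [Transcribe]
    Type = SpeechToText
    LanguagePack = ENUS
    // Default is legacy (the Media Server 12.x models); small is the fastest
    // of the new models, large offers the best accuracy.
    ModelVersion = small
    ```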

Resolved Issues

  • Media Server could terminate unexpectedly when running Optical Character Recognition (OCR).
  • Some temporary files produced by KeyView were written to the system temporary directory, rather than the location specified by the TempDir parameter in the [Paths] section of the configuration file. Some temporary files were not deleted when no longer required.

Notes

  • The older session configuration format, which was replaced in Media Server 12.0, is no longer supported. Media Server 23.2 only reads the list of processing tasks from the [Session] section of the configuration file. The older section names, such as [Ingest], [Analysis], [Transform], [Encoding], and [Output], are now ignored.
  • Media Server no longer supports the alias Image_1 for the Default_Image track. In your session configuration files, replace Image_1 with Default_Image. This change does not affect fully-qualified track names such as TaskName.Image_1, where TaskName is the name of an ingest task.
  • The deprecated parameters Language and CustomLM have been removed from the speech-to-text analysis engine. Use the LanguagePack and CustomLanguageModel parameters instead.
  • Strict action validation has been enabled by default. If you prefer to disable this, set the configuration parameter StrictActionValidation=FALSE in the [Server] section of the configuration file.
  • The records in the OCR WordData and WordResult tracks now have unique UUIDs for each word. In earlier versions of Media Server all of the words on the same line had the same UUID. Now, there is a unique UUID for each word and the parentID field provides the UUID of the parent line.
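    To illustrate the supported configuration format (the task and section names here are arbitrary examples, and the Type values are assumptions), processing tasks are listed only in the [Session] section:

    ```ini
    [Session]
    Engine0 = MyIngest
    Engine1 = MyOCR

    [MyIngest]
    Type = Video

    [MyOCR]
    Type = OCR
    ```

    Because the older section names are now ignored, tasks listed only under sections such as [Ingest] or [Analysis] will not run.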

Deprecated Features

The following features are deprecated and might be removed in a future release.

  • Actions (deprecated since 24.3.0): The activity action. The graphical user interface (action=GraphicalUserInterface) provides similar functionality.
  • Speech processing (deprecated since 23.2.0): The SampleFrequency parameter, for speech-to-text and for the AlignAudioTranscript action. Media Server can now determine the correct sample frequency from the language pack.
  • Transcript alignment (deprecated since 23.2.0): In the output of the AlignAudioTranscript action, the startTime, duration, and endTime fields (which provide timestamps in milliseconds from the beginning of the file). There is now a timestamp field that provides timestamp information in both ISO 8601 format and in epoch microseconds.
  • Face detection (deprecated since 23.2.0): In the XML output from face detection there are fields named outOfPlaneAngleX, outOfPlaneAngleY, and percentageInImage. The macros and Lua table entries that were available to access these data were named outofplaneanglex, outofplaneangley, and percentageinimage (note the difference in case). Media Server has been updated so that the macros and Lua table entries are consistent with the XML output. The all-lowercase names have been deprecated and will be removed in a future release.
  • OCR (deprecated since 23.2.0): The Blacklist and Whitelist configuration parameters. Use the parameters DisabledCharacters and ExtraEnabledCharacters instead.
  • Actions (deprecated since 12.5.0): The GetLatestRecord action. The new actions KeepLatestRecords and GetLatestRecords provide more control over what to store and retrieve.