Media Server
Media Server is an ACI server. For details of changes that affect all ACI servers, see ACI Server Framework.
24.1.1
Media Server 24.1.1 resolves the following issues.
- Media Server could terminate unexpectedly when speech-to-text was configured with `LanguagePack=Input` but was not configured to receive language ID results.
- Speech-to-text accuracy was reduced when processing non-English audio with a duration greater than 30 seconds. This issue was introduced in Media Server 24.1 and only affects the new speech-to-text models (`ModelVersion=micro`, `small`, or `large`, but not `ModelVersion=legacy`). The warning "Detected unsupported language" would appear in the log file `engine.log`.
24.1.0
New Features
- Face Demographics has significantly improved accuracy when recognizing age, gender, and ethnicity.
- Optical Character Recognition recognizes tables that are constructed from text elements in PDF files. In earlier versions of Media Server, OCR identified tabular data only in images.
- Optical Character Recognition uses a new algorithm for finding text in scene mode. The new algorithm uses a neural net and is more accurate, but slower. It is used by default, but if you prefer to prioritize processing speed you can switch to the algorithm used in earlier versions by setting `SceneAlgorithmBias=speed`.
- Optical Character Recognition is faster when processing video subtitles.
- Language identification results can be routed to a speech-to-text task, so that you can detect the language of speech and then transcribe it in a single process request. To support this feature, the language ID analysis engine has a new output track named `ResultWithSource`, which includes the detected language and the audio. You can use this track as the input for your speech-to-text task - set `LanguagePack=Input` rather than choosing a specific language. This feature is not available with legacy speech-to-text (`ModelVersion=legacy`).
- The `DescribeMedia` action includes information about subtitle (closed-caption) streams. This information has also been added to the `Proxy` track of the video ingest engine.
- You can run face recognition and object recognition against multiple databases of faces/objects. (In earlier versions of Media Server you could choose either one database, or all databases.)
- The clip encoder can produce animated GIF thumbnails of events that occur in your video.
- The example Lua script that is provided for drawing analysis results on video (`configurations/lua/draw.lua`) supports text labels and tracking functionality.
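The language ID routing feature above can be sketched as a session configuration fragment. This is an illustrative sketch only: the task names are arbitrary, the `Engine0`/`Engine1` task-list style is an assumption, and a real session also needs an ingest task and any output tasks your deployment requires.

```ini
; Illustrative sketch - task names and the EngineN task list are assumptions
[Session]
Engine0=DetectLanguage
Engine1=Transcribe

[DetectLanguage]
Type=LanguageID

[Transcribe]
Type=SpeechToText
; Read the language ID engine's new output track, which carries
; both the detected language and the audio
Input=DetectLanguage.ResultWithSource
; Transcribe in the detected language instead of a fixed language pack
LanguagePack=Input
```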
Resolved Issues
- (Security update) The third-party libwebp library was updated to version 1.3.2.
- (Security update) The third-party libpng library was updated to version 1.6.40.
- (Security update) The third-party libjpeg library was updated to version 9e.
- (Security update) The third-party libtiff library was updated to version 4.6.0.
- (Security update) The third-party protobuf library was updated to version 24.3. (This update does not apply when GPU acceleration is enabled, in which case Media Server still uses version 2.6.1).
- Following the improvements to the face detection algorithm in Media Server 23.4, face recognition (used with the same recognition threshold) could produce a different ratio of false positives to true negatives. Face recognition confidence scores have now been rescaled with the aim that a fixed recognition threshold should produce a consistent ratio of false positives to true negatives across Media Server versions.
Notes
- The list of values that Face Demographics can return for a person's ethnicity has been updated:

  | Media Server 23.4 | Media Server 24.1 |
  |---|---|
  | African/Caribbean | Arab |
  | Arab | Black |
  | Caucasian | East Asian |
  | East Asian | Latino |
  | Hispanic | South Asian |
  | Indian Subcontinent | Southeast Asian |
  | | White |
- The list of values that Face Demographics can return for a person's age has been updated:

  | Label | Approximate Age (Media Server 23.4) | Approximate Age (Media Server 24.1) |
  |---|---|---|
  | Baby | < 2 years | < 2 years |
  | Child | 2-15 years | 2-15 years |
  | Young Adult | 15-35 years | 15-30 years |
  | Adult | 35-55 years | 30-60 years |
  | Elderly | > 60 years | > 60 years |
23.4.0
New Features
- Face detection has significantly improved recall (fewer genuine faces are missed). For any chosen detection threshold, the new detector produces approximately the same proportion of false positives but significantly fewer false negatives. If you prefer to prioritize precision and have fewer false positives, you could increase the detection threshold because the new detector can achieve equivalent recall at a higher threshold. This improvement benefits downstream analysis such as face recognition, because faces that were not detected in previous versions can now be recognized.
- Face detection supports GPU acceleration and is much faster when you use a GPU for analysis. In earlier versions of Media Server, face detection did not use the GPU.
- Media Server can perform a new type of analysis called visual clustering. You can add a selection of video clips to the training database, and then run the action `ClusterVisualItems`. Media Server divides the videos into clusters of similar items and returns the results.
- Improved (reduced) memory usage when using a GPU with `CUDAVersion=11`.
- Improved pre-trained recognizers are available in the `MediaServerPretrainedModels` package. The new recognizers are labeled "large" or "small" - the large models provide the best accuracy, while the small models prioritize faster analysis.
- Speech-to-text results now include ISO 8601 durations for alternate word offsets. In earlier versions of Media Server, the `startOffset` and `endOffset` values were provided only as a number of milliseconds.
- Optical Character Recognition (OCR) accuracy has improved when `OcrMode=Scene` and `HollowText=TRUE`.
- The `source` field in an `OCRResult` record can contain the value `image table`, which specifies that the text was extracted from a table in an image. In earlier versions of Media Server these records would have had the value `image`.
Resolved Issues
- GPU acceleration did not work on Windows because some libraries were missing from the Media Server package.
- Media Server would give the error "Unsupported model version" when speech-to-text was configured to use the `micro` speech-to-text model.
Notes
- Media Server 23.4 requests as many channels as it requires from your IDOL License Server, up to the maximum number available in your license, unless you configure the number of channels to request. (This is a change in the default behavior: earlier versions of Media Server did not request any channels by default and the number of channels you wanted to use had to be set in the configuration file). This means that you no longer need to configure the number of channels when running a single Media Server. However, if you are using multiple instances of Media Server, OpenText recommends that you specify the number of channels to use in each Media Server configuration file, to prevent just one of the servers from consuming all of the available channels.
- When training image classification or object class recognition, you can set the training option `validation_proportion` without setting `snapshot_frequency`. In this case Media Server uses the specified proportion of your training images to evaluate the performance of the final trained model. You can obtain the results of the evaluation through the action `GetObjectClassRecognizerSnapshotStatistics`.
- Some of the pre-trained recognizers that were available in the `MediaServerPretrainedModels` package have been removed, because other recognizers offer better accuracy or performance. `ObjectClassRecognizer_RoadScene`, `ObjectClassRecognizer_Person`, and `ObjectClassRecognizer_Gen3_PersonCar` have been removed because OpenText recommends using one of the surveillance recognizers instead. You can filter the output of a recognizer by setting the `ClassFilters` parameter in your analysis task. For example, to use a surveillance recognizer to recognize only people and cars, set `ClassFilters=person,car`.
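The `ClassFilters` example in the note above fits into an analysis task section along these lines. This is a hedged sketch: the section name is arbitrary, and both the recognizer name and the `Recognizer` parameter are illustrative placeholders rather than names confirmed by this release.

```ini
; Illustrative sketch - the section name, the Recognizer parameter,
; and the recognizer name are placeholders
[DetectPeopleAndCars]
Type=ObjectClassRecognition
Recognizer=MySurveillanceRecognizer
; Report only the classes of interest from the recognizer's full class list
ClassFilters=person,car
```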
23.3.0
New Features
- Media Server supports a wider range of graphics cards for accelerating media analysis. Media Server now supports CUDA version 11 (compute capability version 3.5 to 8.6) in addition to the GPUs that were previously supported. For more information about using a GPU, refer to the Media Server Help.
- Media Server has a new analysis engine (`Type=PersonAnalysis`). Person analysis reports information about a person - for example their gender, clothing style and color, hair style and color, and hat style and color.
- The new speech-to-text algorithm (introduced in Media Server 23.2) now requires only the common speech-to-text resources. You no longer need to install legacy language packs such as `ENUK` or `ENUS`. The legacy language packs are still required for transcript alignment, speaker clustering, and for running speech-to-text if you set `ModelVersion=Legacy`.
- The new speech-to-text algorithm (introduced in Media Server 23.2) can output alternative words.
- The common resources for the new speech-to-text algorithm include a new model named `micro`. This model is faster - but less accurate - than the other models. It can be used on older or less powerful hardware that is unable to keep up with a live stream when using the `small` model. OpenText recommends that you try the `small` model first.
- The confidence score for a face recognition result is no longer affected by the value of the parameter `MaxRecognitionResults` (in earlier versions of Media Server there could be a small variation in confidence score for different values of this parameter).
Resolved Issues
- Media Server could terminate unexpectedly when ingesting images with a very large number of pixels.
- An issue could occur when retraining `Generation1` object class recognizers. This issue has been resolved with a new database schema (version 13). OpenText recommends that you upgrade to the latest schema and consider retraining any `Generation1` recognizers.
- The XSL transformations `toIDX.xsl`, `toCSV.xsl`, and `toCFS.xsl` did not include the clothing region type.
Notes
- The Media Server database schema has changed. If you are using an internal database, the schema upgrade is performed automatically when you start the new version of Media Server. If you are using an external MySQL database you must run an upgrade script, which is included in the Media Server 23.3 installation. For more information about upgrading the database schema, refer to the Media Server Help.
23.2.0
New Features
- Media Server supports new speech-to-text models that offer significantly better accuracy, especially for English speech. The speech-to-text analysis engine has a new configuration parameter, `ModelVersion`. The default behavior (`ModelVersion=legacy`) uses the same models as Media Server 12.x. To use one of the new speech-to-text models you must set this parameter to either `small` (the fastest of the new models) or `large` (which offers the best accuracy). Custom language models and custom word dictionaries are neither necessary nor supported with the new models, because the vocabulary of the new models is not limited by their training. Due to their size, the new models are not included in the Media Server package and must be downloaded separately.
- Media Server can perform speaker clustering, which segments an audio recording into different speakers. There is a new analysis engine (`Type=ClusterSpeech`). Speaker clustering does not need training, but does require that you install an appropriate speech-to-text language pack.
- Transcript alignment (`action=AlignAudioTranscript`) has a new parameter named `IngestDateTime`. You can use this to configure the start time for the timestamps. For example, set this parameter when you want the timestamps to match the time when the video was broadcast.
- Face recognition accuracy has been significantly improved.
- OCR accuracy has been improved when processing high-resolution images in scene mode.
- The OCR `WordResult` track supports scrolling text (often seen below television news broadcasts).
- Strict action validation has been extended so that it validates any session configuration that you pass to a `process` action. When strict action validation is used, the `process` action fails immediately if the configuration includes unknown parameters, or includes a configuration section that is referenced but contains no parameters.
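Selecting one of the new speech-to-text models described above is a per-task configuration setting. A minimal sketch (the task name and the `LanguagePack` value are illustrative examples, not requirements):

```ini
; Illustrative sketch - task name and LanguagePack value are examples
[Transcribe]
Type=SpeechToText
LanguagePack=ENUS
; legacy (default) | small (fastest new model) | large (best accuracy)
ModelVersion=small
```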
Resolved Issues
- Media Server could terminate unexpectedly when running Optical Character Recognition (OCR).
- Some temporary files produced by KeyView were written to the system temporary directory, rather than the location specified by the `TempDir` parameter in the `[Paths]` section of the configuration file. Some temporary files were not deleted when no longer required.
Notes
- The older session configuration format, which was replaced in Media Server 12.0, is no longer supported. Media Server 23.2 only reads the list of processing tasks from the `[Session]` section of the configuration file. The older section names, such as `[Ingest]`, `[Analysis]`, `[Transform]`, `[Encoding]`, and `[Output]`, are now ignored.
- Media Server no longer supports the alias `Image_1` for the `Default_Image` track. In your session configuration files, replace `Image_1` with `Default_Image`. This change does not affect fully qualified track names such as `TaskName.Image_1`, where `TaskName` is the name of an ingest task.
- The deprecated parameters `Language` and `CustomLM` have been removed from the speech-to-text analysis engine. Use the `LanguagePack` and `CustomLanguageModel` parameters instead.
- Strict action validation has been enabled by default. If you prefer to disable this, set the configuration parameter `StrictActionValidation=FALSE` in the `[Server]` section of the configuration file.
- The records in the OCR `WordData` and `WordResult` tracks now have unique UUIDs for each word. In earlier versions of Media Server all of the words on the same line had the same UUID. Now there is a unique UUID for each word, and the `parentID` field provides the UUID of the parent line.
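The `Image_1` change in the notes above usually amounts to a one-line edit in each session configuration; for example (the task names below are illustrative):

```ini
; Illustrative sketch - task names are examples
[MyOcrTask]
Type=OCR
; Before (alias no longer supported): Input=Image_1
Input=Default_Image
; Fully qualified names such as MyIngestTask.Image_1 are unaffected
```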
Deprecated Features
The following features are deprecated and might be removed in a future release.
| Category | Deprecated Feature | Deprecated Since |
|---|---|---|
| Speech processing | The `SampleFrequency` parameter, for speech-to-text and for the `AlignAudioTranscript` action, has been deprecated. Media Server can now determine the correct sample frequency from the language pack. | 23.2.0 |
| Transcript alignment | The output of the `AlignAudioTranscript` action has been updated. The `startTime`, `duration`, and `endTime` fields (which provide timestamps in milliseconds from the beginning of the file) have been deprecated. There is now a `timestamp` field that provides timestamp information in both ISO 8601 format and in epoch microseconds. | 23.2.0 |
| Face detection | In the XML output from face detection there are fields named `outOfPlaneAngleX`, `outOfPlaneAngleY`, and `percentageInImage`. The macros and Lua table entries that were available to access these data were named `outofplaneanglex`, `outofplaneangley`, and `percentageinimage` (note the difference in case). Media Server has been updated so that the macros and Lua table entries are consistent with the XML output. The all-lowercase names have been deprecated and will be removed in a future release. | 23.2.0 |
| OCR | The `Blacklist` and `Whitelist` configuration parameters have been deprecated. Use the parameters `DisabledCharacters` and `ExtraEnabledCharacters` instead. | 23.2.0 |
| Actions | The `GetLatestRecord` action. The new actions `KeepLatestRecords` and `GetLatestRecords` provide more control over what to store and retrieve. | 12.5.0 |