Analysis Task Output Tracks

The events that occur in video usually span many frames. For example, a person, object, or logo might appear on screen and remain there for several minutes. Media Server analyzes video frame by frame, but many analysis engines track events across frames because analyzing multiple frames can improve accuracy.

Analysis tasks can produce many different output tracks, but regardless of which track they belong to, records that relate to the same event always have the same ID.

  • Result tracks contain records that summarize the analysis results for a complete event. Each record can span many video frames and has a start time, peak time, end time, duration, and ID. You can use the ID to find other records that relate to the same event. The purpose of a result track is to provide a summary of the analysis results that is suitable for output from Media Server. Because these records represent an entire event from beginning to end, Media Server does not generate a record in a result track until the event has finished.

    Example: A face detection result track contains a single record for each detected face. Each record has a different ID.

    Example: A face recognition result track contains zero or more records for each detected face (there can be multiple recognition results when there are several matches that exceed the recognition threshold). Face recognition results inherit their ID from the detected face, so all of the recognition results for the same detected face have the same ID.

  • ResultWithSource tracks are similar to result tracks because the records represent complete events. The records are the same as the records in the result track, except that each record also includes the video frame that produced the best analysis result. For example, when you run face recognition, the video frame with the highest confidence score is added to the record. This frame corresponds to the "peak" timestamp.
  • Data tracks contain records that correspond to a single analyzed frame. A data track can contain hundreds of records that relate to the same event. A data track can also contain multiple records that relate to the same video frame, because multiple events can occur at the same time.

    Example: A face detection data track contains at least one record for every analyzed frame in which a face appears. If a person remains in the scene for several seconds, this track could contain hundreds of records that identify the same face and have the same ID. If a video frame contains three faces, the face detection data track will contain three records with timestamps matching that frame, each with a different ID.

  • DataWithSource tracks are similar to data tracks because the records correspond to a single analyzed frame. The records are the same as the records in the data track, except that each record also includes the video frame that was analyzed.

TIP: Data and DataWithSource tracks contain a lot of information, usually more than you want to output from Media Server. These tracks are intended to provide data for subsequent analysis tasks. For example, you can use the DataWithSource track from face detection as the input for face recognition, so that face recognition can analyze each face across multiple video frames.

  • Start and End tracks contain records that describe the beginning or end of an event in the video.

    Example: With face detection, the start track contains a record when a face appears in the scene, and the end track contains a record when the face disappears.

    Example: Face recognition does not produce a start or end track, because information about events (detected faces) is provided by face detection.

  • SegmentedResult tracks are similar to result tracks, except that the maximum duration of a record is limited by a configuration parameter named SegmentDuration. When a record reaches the maximum duration, Media Server outputs the record and begins a new one with the same ID. This means that for every record in the result track that exceeds the maximum duration, there will be two or more records in the SegmentedResult track. Segmented results are useful when you need to obtain information about an event before it finishes.
  • SegmentedResultWithSource tracks are similar to SegmentedResult tracks. The records are the same, except that each record also includes the best source frame that was available at the time the record was generated.
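
As a rough illustration of how these tracks relate, the following sketch models a single detected face as one event: the data track receives one record per analyzed frame, while the result track receives a single record that spans the whole event. The record fields (ID, start, peak, end) follow the descriptions above, but the code itself is illustrative and is not a Media Server API.

```python
from dataclasses import dataclass

@dataclass
class Record:
    event_id: int  # the same for every record that belongs to one event
    start: float   # seconds into the video
    peak: float
    end: float

# One face (event_id=1) visible from t=2.0s to t=5.0s, analyzed at 1 frame/s.
frame_times = [2.0, 3.0, 4.0, 5.0]

# Data track: one record per analyzed frame in which the face appears.
data_track = [Record(1, t, t, t) for t in frame_times]

# Result track: a single record spanning the event, emitted when it ends.
result_track = [Record(1, start=2.0, peak=4.0, end=5.0)]

assert len(data_track) == 4
assert all(r.event_id == 1 for r in data_track + result_track)
```

Note that every record carries the same ID, which is what lets you join the per-frame data records back to the summarizing result record.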

The following diagram shows how face detection creates records (represented by rectangles) when a face appears in a video.

The following diagram shows how face detection creates records (represented by rectangles) when two faces appear in a video. All of the records related to the same detected face (the same event) have the same ID. So, in the following example, all of the blue records (1) would have the same ID and all of the green records (2) would have the same ID.

In both of the previous examples:

  • Media Server creates a single record in the Result and ResultWithSource tracks for each event (in this example a detected face). These records span the event and summarize the analysis results. When there are multiple people in the scene at the same time, the records overlap chronologically.
  • The records in the Data and DataWithSource tracks correspond to a single analyzed frame. This means that there can be many records for each event. When there are multiple people in the scene, there are multiple records with timestamps matching the same video frame.
  • Media Server creates a record in the Start track when a person appears in the scene.
  • Media Server creates a record in the End track when a person leaves the scene.
  • In these examples, each person remains in the scene longer than the configured SegmentDuration, so Media Server creates multiple records in the SegmentedResult and SegmentedResultWithSource tracks. Media Server starts a new record when the SegmentDuration is reached.
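
The segmentation behavior described above can be sketched as follows: given an event's start and end times and a maximum segment length, the SegmentedResult track contains one record per segment, and all of the segments share the event's ID. This is an illustrative sketch of the splitting logic only; the function name and record shape are assumptions, not Media Server code.

```python
def segment_event(event_id, start, end, segment_duration):
    """Split one event into fixed-length segments that share its ID."""
    segments = []
    t = start
    while t < end:
        # Each segment is capped at segment_duration; the last one may be shorter.
        segments.append((event_id, t, min(t + segment_duration, end)))
        t += segment_duration
    return segments

# A face visible for 25 seconds with SegmentDuration=10 yields three records.
records = segment_event(1, 0.0, 25.0, 10.0)
# → [(1, 0.0, 10.0), (1, 10.0, 20.0), (1, 20.0, 25.0)]
```

An event shorter than the segment duration produces a single record, which matches the behavior of the ordinary result track.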

Some analysis tasks process the output of other engines. Face recognition, for example, processes records that are produced by face detection. You can see from the examples above that the face detection DataWithSource track provides much more information than the ResultWithSource track. When you configure face recognition, you can choose which track to process. Processing the DataWithSource track can result in better accuracy, because face recognition processes multiple video frames for each detected face. However, processing all of these frames is more computationally intensive, and you should configure this only if your server has sufficient resources.
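
For example, a task configuration along the following lines would select a detection task's DataWithSource track as the input to a recognition task. The task and section names here are placeholders; refer to the Media Server Reference for the exact parameters that your version supports.

```
[FaceRecognitionTask]
Type=FaceRecognize
Input=FaceDetectionTask.DataWithSource
```

Selecting the ResultWithSource track instead would give recognition only the single best frame per detected face, which is cheaper but potentially less accurate.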

For information about the tracks that are produced by Media Server tasks, and the information contained in each track, refer to the Media Server Reference.