Exposing In-band Media Container Tracks in HTML5

Unofficial Draft 11 January 2012

Editor:
Bob Lund

Abstract

This specification proposes a mechanism for a user agent to expose the in-band tracks in MPEG-2 TS, Ogg, WebM and MPEG-4 File Format media containers so that script has access to the metadata in the media resource and can determine the role and type of the in-band tracks using this metadata. This mechanism is a generalization of the one described in [MPEG2HTML5].

Status of This Document

This document is merely a public working draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organisation.

Table of Contents

1. Introduction and Purpose
2. Track Description TextTrack
3. Video, Audio and TextTrack Creation
  3.1 Additional TextTrack Requirements
  3.2 MPEG-2 TS Closed Captioning
Acknowledgements
A. References
  A.1 Normative references
  A.2 Informative references

1. Introduction and Purpose

HTML5 UAs [HTML5] may play back media resources that contain a multiplex of in-band video, audio and text tracks. A consistent HTML presentation of in-band tracks across media resource container formats is desirable so that script can understand the specific type of service and recognize the type and role of the track data, independent of the media resource provider and container.

Some examples of how media resource metadata and tracks in media resources are used in media applications are:

Closed Captioning: Textual representation of the media resource audio dialogue intended for the hearing impaired.
Subtitles: Alternate language textual representation of the media resource audio dialogue.
Content Advisories: Content rating information used by parental control applications.
Synchronized Content: Signaling messages to control the execution of a client application in a manner synchronized with the media resource playback.
Client ad insertion: Signaling messages that convey advertisement insertion opportunities to a client application.
Audio translations: Alternate language representation of the primary audio track.
Audio descriptions: Audio descriptions of the video intended for the visually impaired.

This specification proposes a consistent representation of metadata and tracks in MPEG-2 TS [MPEG2TS], Ogg [OGG], WebM [WEBM] and MPEG-4 File Format [MPEG4FF] media formats in the equivalent HTML5 video, audio and text track elements. Preliminary investigation suggests that the approach will also work for MPEG DASH [DASH] media presentation descriptions (as well as other adaptive bit rate manifest file formats).

The elements of a common media resource representation of in-band tracks are:

  1. Definition of a TextTrack used to make available to script the media resource metadata that describes the tracks.
  2. Definition of how media resource video, audio and text tracks should be created so that the media resource track metadata can be associated with those tracks.

2. Track Description TextTrack

A TextTrack is used to make the media resource track metadata available to script. This TextTrack is called the track description TextTrack. The following requirements apply in creating the track description TextTrack.

The UA must create a TextTrack in the media resource TextTrackList to make the metadata available to script and set the TextTrack attributes using the following rules (a script-side sketch for locating this track follows the list):

  1. kind = "metadata"
  2. label = "MIME track-description", where MIME is the MIME type of the media resource, e.g. video/mp2t, video/mp4, video/webm, video/ogg or video/vnd.mpeg.dash.mpd
  3. language = "" (empty string)
  4. mode = "hidden"
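
For illustration, script might locate this track as follows (a minimal TypeScript sketch; the element id "player" and the function name are assumptions, not part of this proposal):

// Minimal sketch: locate the track description TextTrack on a media element.
function findTrackDescriptionTrack(video: HTMLVideoElement): TextTrack | null {
  for (let i = 0; i < video.textTracks.length; i++) {
    const track = video.textTracks[i];
    // Per the rules above: kind "metadata" and a label of the form
    // "MIME track-description", e.g. "video/mp2t track-description".
    if (track.kind === "metadata" && track.label.endsWith("track-description")) {
      return track;
    }
  }
  return null;
}

const video = document.getElementById("player") as HTMLVideoElement;
const descriptionTrack = findTrackDescriptionTrack(video);

Because the track's mode is "hidden", its cues are updated during playback without being rendered.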

The UA must create a TextTrackCue for each unique instance of the metadata that describes the media resource tracks. The MPEG-2 TS PMT containing track metadata, for example, is sent several times a second but rarely changes; this is the reason for the "unique instance" qualification.

The UA must set the TextTrackCue attributes as follows:

  1. startTime is set to the current time in the media resource timeline.
  2. endTime is set to Infinity.
  3. pauseOnExit is set to false.

The text attribute of the TextTrackCue must be set to a JSON [JSON] string representation of the track metadata. The format of the JSON string depends on the MIME type of the media resource:

MPEG-2 TS (video/mp2t)

JSON object is: '{"m2pt_track_description":[program_info_entry, ...]}'.

Each program_info_entry is {"stream_type":string representation of the PMT stream_type field, "pid":string representation of the PMT elementary_PID field, "es_descriptors":[es_descriptor, ...]}.

Each es_descriptor is {"tag":string representation of the PMT descriptor tag, "desc_contents":Base64 encoded representation of the PMT elementary stream descriptor}.

The following is an example of the TextTrackCue text attribute for an MPEG-2 TS program with 1 video, 1 English language audio and 1 data elementary stream containing subtitle data.

'{"mp2t_track_description": [
	{"stream_type":"0x01",   //video stream type
	 "pid":"141",            //video elementary stream PID
	 "es_descriptor" : []    //empty array of elementary stream descriptors
	},
	{"stream_type":"0x03",   //audio stream type
	 "pid":"150",            //audio elementary stream PID
	 "es_descriptor":[       //one ISO_639-2 language descriptor
		{"tag":"0x0a",        //tag identifying ISO_639-2 descriptor
		 "contents":"Base64 encoding of 0x656e670"}]  //descriptor is "eng", 0x0
	},
	{stream_type":"0x82",    //subtitle data stream type
	 "pid":"190",            //subtitle elementary stream PID
	 "es_descriptor":[]      //empty array of elementary stream descriptors
	}
]}'
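
As an illustration of how script might consume such a cue, the following TypeScript sketch parses the JSON above and returns the PID of the English audio stream via its ISO_639-2 language descriptor (the interface and function names are assumptions):

// Minimal sketch: parse an mp2t track description cue and find the PID of
// the English audio elementary stream. Property names follow the JSON
// format above.
interface EsDescriptor { tag: string; desc_contents: string; }
interface ProgramInfoEntry { stream_type: string; pid: string; es_descriptors: EsDescriptor[]; }

function findEnglishAudioPid(cueText: string): string | null {
  const description = JSON.parse(cueText) as { mp2t_track_description: ProgramInfoEntry[] };
  for (const entry of description.mp2t_track_description) {
    for (const d of entry.es_descriptors) {
      if (d.tag !== "0x0a") continue;      // ISO_639-2 language descriptor
      // atob() recovers the raw descriptor bytes; the first three bytes are
      // the ISO 639-2 language code ("eng" in the example above).
      if (atob(d.desc_contents).startsWith("eng")) return entry.pid;
    }
  }
  return null;
}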
WebM (video/webm)

JSON object is: '{"webm_track_description":[track_entry, ...]}'.

Each track_entry is {"TrackEntry-ElementName":"Element-contents", ...} where "TrackEntry-ElementName" is the name of one of the elements contained in the TrackEntry element and "Element-contents" is the string representation of that element's contents.

The following is an example of the TextTrackCue text attribute for a WebM program with 1 video, 1 English language audio and 1 data elementary stream containing subtitle data.

'{"webm_track_description": [
   {"TrackNumber":"1", "TrackType":"0x1", "CodecID":"V_MPEG2", ...},	 //only 3 track entry elements shown
   {"TrackNumber":"2", "TrackType":"0x2", "CodecID":"A_MPEG/L2", ...},   //only 3 track entry elements shown
   {"TrackNumber":"3", "TrackType":"0x11", "CodecID":"S_TEXT/UTF8", ...} //only 3 track entry elements shown
]}'
Ogg (video/ogg)

JSON object is: '{"ogg_track_description":[ogg_fishbone, ...]}'.

Each ogg_fishbone is {"stream_number":"string representation of the stream serial number", MessageHeader, ...}.

Each MessageHeader contains the contents of one of the message headers contained in the Ogg fishbone and is of the form "message header name":"string representation of message header contents".

The following is an example of the TextTrackCue text attribute for an Ogg program with 1 video, 1 English language audio and 1 data elementary stream containing subtitle data.

'{"ogg_track_description": [
   {"stream_number":"1", "Content-type":"video/theora", "Role":"video/main", ...},        //only 2 message headers shown
   {"stream_number":"2", "Content-type":"audio/vorbis", "Role":"audio/main", ...},        //only 2 message headers shown
   {"stream_number":"3", "Content-type":"application/kate", "Role":"text/subtitle", ...}, //only 2 message headers shown
]}'
MPEG-4 File Format (video/mp4)

JSON object is: '{"mp4_track_description":[mp4_track_metadata, ...]}'.

Each mp4_track_metadata is {"trak-tkhd-track_id":"string representation of the track track_id","trak-mdia-hdlr-hdlr_type":"string representation of the track hdlr_type", "trak":"Base64 representation of the trak box (and all child boxes)"}.

The following is an example of the TextTrackCue text attribute for an mp4 program with 1 video, 1 English language audio and 1 data elementary stream containing subtitle data.

'{"mp4_track_description": [
   {"trak-tkhd-track_id":"1", "trak-mdia-hdlr-hdlr_type":"vide", "trak":"Base64 representation of trak"},
   {"trak-tkhd-track_id":"2", "trak-mdia-hdlr-hdlr_type":"vide", "trak":"Base64 representation of trak"},
   {"trak-tkhd-track_id":"3", "trak-mdia-hdlr-hdlr_type":"meta", "trak":"Base64 representation of trak"}   //subtitle identified in "trak" property contents
]}'
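
Because each container format uses a distinct top-level property name, script can dispatch on the MIME type carried in the track description label. The following TypeScript sketch (names are illustrative assumptions) returns the per-track metadata entries for any of the four formats:

// Minimal sketch: extract the per-track metadata array from a track
// description cue, keyed by the container MIME type in the track label.
const DESCRIPTION_KEYS: { [mime: string]: string } = {
  "video/mp2t": "mp2t_track_description",
  "video/webm": "webm_track_description",
  "video/ogg": "ogg_track_description",
  "video/mp4": "mp4_track_description",
};

function trackMetadataEntries(trackLabel: string, cueText: string): object[] {
  const mime = trackLabel.split(" ")[0];   // label is "MIME track-description"
  const key = DESCRIPTION_KEYS[mime];
  return key ? (JSON.parse(cueText)[key] ?? []) : [];
}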

3. Video, Audio and TextTrack Creation

Script needs to be able to correlate the track metadata received in the track description TextTrackCues with tracks in the VideoTrackList, AudioTrackList and TextTrackList attributes of the media resource. Two ways this could be done are:

  1. Ensure that the order of metadata in the track description TextTrackCues is the same as the order of tracks in the VideoTrackList, AudioTrackList and TextTrackList attributes. [HTML5] requires that the UA create tracks in the order that they appear in the media resource, if there is an order. If the UA uses the order of the track metadata to order the tracks, then it can ensure that a track's metadata appears in the same order as the track, for a specific track type, e.g. video. This is the case with MPEG-2 TS, WebM and MPEG-4 FF. It need not be the case with Ogg, however, since there is no requirement that the logical bitstream headers and the fishbones containing metadata about those streams appear in the same order.
  2. Require that the track metadata contain a track id unique within the media resource and that VideoTrack, AudioTrack and TextTrack each have an attribute that contains its track id.

Since the second alternative is more precise and works with all media resource formats, it is the one proposed.

The UA must create a new VideoTrack and AudioTrack in the VideoTrackList and AudioTrackList, respectively, for each track in the media resource in the order it appears in the media resource, as defined in [HTML5].

The UA must create a new TextTrack in the TextTrackList for each track that is not recognized by the UA as a VideoTrack or AudioTrack.

The track order for each media resource type is defined as follows:

video/mp2t
The order the track elementary stream data appears in the PMT.
video/webm
The order the TrackEntry elements appear in the Tracks element.
video/ogg
The order the bit stream headers appear in the Ogg container.
video/mp4
The order the trak boxes appear in the mp4 container.

An attribute containing the track identifier used in the media resource must be set for each VideoTrack, AudioTrack and TextTrack so that script can associate metadata provided in the track description TextTrackCues with that VideoTrack, AudioTrack or TextTrack. This could be a new attribute (as has been proposed in [BUG13359]) or the label attribute could be reused. Reuse of the label attribute is not preferred because some media formats, e.g. Ogg, have a track metadata element that is intended to supply the value of the label attribute.

The attribute that contains the track identifier must be set as follows for each media container (a correlation sketch follows the list):

video/mp2t
"video/mp2t-pid" where pid is the PID in the PMT for that track.
video/webm
"video/webm-TrackNumber" where TrackNumber is the value of the TrackNumber element for that track.
video/ogg
"video/ogg-stream_number" where stream_number is the stream serial number for that track.
video/mp4
"video/mp4-track_id" where track_id is the contents of the track_id field in the tkhd box for that track.

3.1 Additional TextTrack Requirements

If the UA recognizes the kind of text track as one of the categories listed in [HTML5] it should:

  1. Set the TextTrack kind, language, label and mode attributes to the appropriate values for that text track.
  2. Create TextTrackCues as appropriate for the type of data in the text track.

If the UA cannot determine the kind of the TextTrack, it is presumed that the UA does not know the purpose or type of the text track. In this case, the UA must:

  1. Set kind to "metadata", language and label to the empty string and mode to "disabled"
  2. Create TextTrackCues as defined in [HTML5] with the following attributes set as specified (a consuming-script sketch follows the list):
    • startTime is set to the current time in the media resource timeline
    • endTime is set to Infinity
    • pauseOnExit and snapToLines are set to false
    • direction and alignment are set to "" (empty string)
    • linePosition and textPosition are set to 0
    • text is set depending on the type of the media resource:
      video/mp2t
      Base64 [BASE64] representation of the PES or private data packet in the program stream represented by the TextTrack as defined in [MPEG2TS].
      video/webm
      Base64 representation of the Track data represented by the TextTrack.
      video/ogg
      If the track contains text, as indicated by the Role message header, then the contents of the bitstream represented by the TextTrack. If the track does not contain text, then a Base64 representation of the bitstream data represented by the TextTrack.
      video/mp4
      If the track contains text or XML, as indicated by an hdlr_type of 'meta', then the contents of the track represented by the TextTrack. Otherwise, a Base64 representation of the track data represented by the TextTrack.
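
For example, script consuming such an unrecognized video/mp2t data track might decode each cue's Base64 payload as it becomes active (a minimal TypeScript sketch; the wiring and names are assumptions):

// Minimal sketch: decode the Base64 packet data carried in the cues of an
// unrecognized video/mp2t metadata TextTrack as each cue activates.
function watchDataTrack(track: TextTrack, onPacket: (bytes: Uint8Array) => void): void {
  track.mode = "hidden";                   // fire cue events without rendering
  track.addEventListener("cuechange", () => {
    for (const cue of Array.from(track.activeCues ?? [])) {
      const raw = atob((cue as any).text); // read the cue text attribute set above
      const bytes = new Uint8Array(raw.length);
      for (let i = 0; i < raw.length; i++) bytes[i] = raw.charCodeAt(i);
      onPacket(bytes);
    }
  });
}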

A UA may be presented with previously processed in-band TextTracks, for example when the viewer seeks back in the media resource within the seekable time ranges of the HTMLMediaElement. TextTrackCues are not removed from the TextTrack, so the user agent must not create duplicate TextTrackCues in this case. How the user agent accomplishes this is implementation specific.

3.2 MPEG-2 TS Closed Captioning

MPEG-2 TS closed captioning is delivered as part of the video stream, as opposed to in a separate text track as is the case for the other media resource formats.

If the UA recognizes MPEG-2 TS closed captioning it must:

  1. Create a new TextTrack as defined in [HTML5] with the track element attributes set as follows:
    1. kind = "captions"
    2. language set to a BCP-47 [BCP47] conformant representation of the caption data language
    3. The attribute used in Video, Audio and TextTrack Creation is set to the unique track id of the MPEG-2 video program stream containing the caption data
  2. Create a new TextTrackCue in the TextTrack as described in [HTML5] for each caption, with attributes set as follows:
    1. startTime is set to the presentation time set in the caption data converted into the equivalent time in seconds relative to the media resource timeline.
    2. endTime is set to the end of the presentation if specified in the caption data. If the end of the presentation is not specified endTime is set to Infinity.
    3. the text of the cue is set to the caption data. How the UA determines the type of the caption data when the cue text is rendered or when getCueAsHTML() is called is implementation specific.
    4. pauseOnExit is set to false
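
Script that wants to display these captions then only needs to locate the track and set its mode (a minimal TypeScript sketch):

// Minimal sketch: show the first in-band caption TextTrack, if any.
function showCaptions(video: HTMLVideoElement): boolean {
  for (let i = 0; i < video.textTracks.length; i++) {
    const track = video.textTracks[i];
    if (track.kind === "captions") {
      track.mode = "showing";              // the UA renders the caption cues
      return true;
    }
  }
  return false;
}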

Acknowledgements

The editor thanks the following individuals for their input to and feedback on this specification to date (in alphabetical order).

Mukta Kar, Giuseppe Pascale, Ed Shrum, George Sarosi, Clarke Stevens, Mark Vickers and Eric Winkelman.

A. References

A.1 Normative references

[BASE64]
The Base16, Base32, and Base64 Data Encodings
[BCP47]
Tags for Identifying Languages
[DASH]
ISO/IEC 23009-1 Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats
[HTML5]
Ian Hickson; David Hyatt. HTML5. 25 May 2011. W3C Working Draft. (Work in progress.) URL: http://www.w3.org/TR/html5
[MPEG2TS]
H.222.0 Infrastructure of audiovisual services - Transmission multiplexing and synchronization
[MPEG4FF]
Coding of audio-visual objects -- Part 12: ISO base media file format
[OGG]
Ogg Documentation
[WEBM]
WebM Container Guidelines

A.2 Informative references

[BUG13359]
HTML5 spec bug 13359
[JSON]
JSON in JavaScript
[MPEG2HTML5]
Mapping from MPEG-2 Transport to HTML5