Subtitles, Captions, Timed-text and Transcripts in CMML

Everyone asks for these, and we need to do them properly. This involves examining existing standards and common practice, and ensuring that CMML is capable of capturing the same information.

A principle that CMML will follow with respect to subtitling is to keep the structural information separate from the formatting information. This is good Web practice. CMML will therefore make use of style sheets and style tags.

Chris Chiu researched popular subtitle formats and developed initial, basic Python scripts for converting these formats into CMML. Here is his SubtitleReport.

See also http://wiki.whatwg.org/wiki/Video_accessibility

Overview of subtitling formats

There are two different types of subtitling formats that we need to cover:

  • Subtitles is a generic term referring to text overlaid on-screen, possibly rendered right into the imagery.
  • Timed-text is a generic term referring to text that is intended to be rendered with timing information; there may be no image display involved.

The two most common uses of subtitles in commercial TV and film are the provision of language translation and captioning for the deaf. These are subtly different -- for example, captioning for the deaf will usually include notation of sound effects and musical cues, which are not required for language translation.

Transcripts provide a full-text copy of spoken content, and may not necessarily be intended for simultaneous display on-screen -- for example, a transcript may contain more text than can be reasonably read in real-time, and may contain um's and ah's that are usually edited out of captioned summaries. Transcripts are most useful for metadata search and query, or for referencing back to an original script. Hence they are extremely important for CMML, even though they may not commonly be user-visible in existing TV and film content.

It is important to differentiate text spoken by different people, both in terms of semantic separation and rendering (colours, justification etc.). As an example, a karaoke system may require phrase-accurate timing of both the initial display of each word and the cue time of its colouration, and may require strict colouring of separate phrases (e.g. pink and blue text for a duet).

In the old TV terminology, subtitles can be one of the following two types:

  • Open, which means burnt into the video image so that it cannot be disabled.
  • Closed, which means sent out-of-band, e.g. in the vertical blanking interval.

Here we are interested in capturing the actual text. This is straightforward for closed captions, and is also possible for open captions if image analysis and OCR (optical character recognition) are used.

Captioning Standards

  • The European Broadcasting Union defines a crusty old binary format for interchange of subtitling information. It has a big general metadata header for describing the framerate and language, followed by a series of text blocks with in- and out-timecodes and very basic formatting information.

Timed Text Standards

  • The 3rd Generation Partnership Project (3GPP) timed text standard specifies a time-lined, decorated text media format with defined storage in a *.3GP file. The specification was designed for the 3G mobile phone system, for which an RTP payload format is specified for the transmission of 3GPP timed text, together with additional features for synchronisation with audio/video content, for use in captioning, titling and multimedia applications. Specifications are as follows: 3GPP Timed Text I-D

Anime Subtitle Standards

There are a lot of anime subtitle formats around for video players, with the popular formats being supported by open-source players including MPlayer, Media Player Classic and VideoLAN. As an example, MPlayer supports OGM, AQTitle, JACOsub, MicroDVD, MPsub, PJS, RT, SubRip, SSA, SubViewer, SAMI and VPlayer, while VideoLAN supports MicroDVD, SubRip, SSA, SubViewer, SAMI and VOBSub.

  • Note that OGM (née Ogg, née Xiph) doesn't have an official subtitle format for use with video, so even though SSA is quite common with the fansubbers, it's easily possible to download an OGM video file that has some weird subtitle format. See discussion between Arc and ChristianHJW (a "Matroska guy") for some insight into their attitudes toward this.
  • See the Scriptclub FAQ to get an idea of the scope of tools available which have to deal with the multitude of subtitle formats. Of these, Subtitle Workshop (Official Site) has the most comprehensive support of subtitle formats, including older formats fallen out of favour in the anime community. A list of other subtitling tools can be found at DivX Digest, with a comprehensive site of the transcoding process from DVD into DivX/XVid here.
  • Transcoding in Annodex is a beginner's guide to encoding Fansubs into Annodex. While transcoding is straightforward, the conversion and parsing of subtitles into CMML prior to annodexing continues to be a work in progress.

Syntactical Differentiation between Subtitle formats

Subtitle Overview describes the syntactical differences between the subtitle formats and how they fit into the scheme of parsing into CMML. It is a summary of the variation in subtitling structures, and of the potential compatibility issues arising from differences in syntax when parsing.
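
To make the syntactic variation concrete, here is a minimal Python sketch that parses one cue from a frame-based format (MicroDVD, `{start}{end}text`) and one from a time-based format (SubRip, `HH:MM:SS,mmm --> HH:MM:SS,mmm`). The function names are hypothetical and are not taken from the conversion scripts mentioned above; the sketch only handles well-formed input.

```python
import re

# MicroDVD: one cue per line, timed in frames: {start_frame}{end_frame}text
MICRODVD_LINE = re.compile(r"^\{(\d+)\}\{(\d+)\}(.*)$")

# SubRip: the timing line uses clock times: HH:MM:SS,mmm --> HH:MM:SS,mmm
SUBRIP_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def parse_microdvd_line(line):
    """Return (start_frame, end_frame, text) for one MicroDVD line."""
    match = MICRODVD_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a MicroDVD cue: {line!r}")
    # MicroDVD uses '|' to separate display lines within one cue.
    text = match.group(3).replace("|", "\n")
    return int(match.group(1)), int(match.group(2)), text


def parse_subrip_block(block):
    """Return (start_seconds, end_seconds, text) for one SubRip cue block."""
    lines = block.strip().splitlines()
    # lines[0] is the cue counter, lines[1] the timing, the rest is the text.
    if len(lines) < 2:
        raise ValueError(f"incomplete SubRip block: {block!r}")
    match = SUBRIP_TIMING.search(lines[1])
    if match is None:
        raise ValueError(f"no SubRip timing line in block: {block!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end, "\n".join(lines[2:])
```

The MicroDVD result still needs a frame rate before it can be placed on a timeline, whereas the SubRip result is already in seconds; this is the main compatibility issue a converter has to deal with.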

List of Subtitle Formats and Details

  • This section briefly covers the formats most popular for subtitling amongst anime fans. Older subtitle formats tend to have timings based on frames, while newer ones are based on the playback time of the video:
    • Of these formats, SubRip and SubViewer are the most prevalent, if only because of their simplicity compared with other formats.
    • XML-based subtitle scripts include RealText, SAMI and USF. Support for these is being implemented using the XML Parser Libraries for Python.
    • Note: For any comments or issues when using the subtitle scripts, place a message in the relevant wikipage below:

Note:

  • Visit Frame Conversion Concepts for the method used to convert frame numbers and frame rates into timings. Implementations based upon the Python code on that wiki page can be found via the aforementioned subtitle format links.
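
As a rough illustration of the frame-to-time conversion mentioned in the note, the sketch below divides a frame number by the frame rate and formats the result as a clock time. This is plain frame/fps arithmetic and is an assumption about the method; it is not copied from the Frame Conversion Concepts page, and the function names are hypothetical. The frame rate has to be known from elsewhere, since frame-based subtitle files do not always state it.

```python
def frame_to_seconds(frame, fps):
    """Convert a frame number to seconds, given the video frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame / fps


def format_clock_time(seconds):
    """Format a time in seconds as H:MM:SS.mmm (rounded to milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


if __name__ == "__main__":
    # Frame 1500 of a 25 fps video lies one minute into the video.
    print(format_clock_time(frame_to_seconds(1500, 25)))
```

Using the wrong frame rate (for example 25 instead of 23.976) makes the subtitles drift progressively out of sync, so a converter should take the rate as an explicit input rather than guess it.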

Common Features

To get a handle on how to support these things natively in CMML, we have to extract what they all have in common. Let's start by subdividing the features into presentation and semantic features.

Semantic Features:

  • text
  • text type (subtitle, caption, transcript, karaoke)
  • speaker

Presentation Features:

  • timing
  • fonts
  • colours
  • justification
  • scrolling text (stock market ticker)
  • dynamic colouring (karaoke)
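
The split above could be captured in an intermediate data structure that a converter fills in before writing CMML. The following is a minimal sketch under that assumption; the class and field names (`Cue`, `Presentation`, `colour_cue` and so on) are hypothetical and only mirror the two lists, with timing kept on the presentation side as listed.

```python
from dataclasses import dataclass
from typing import Optional

# The text types named in the list of semantic features.
TEXT_TYPES = ("subtitle", "caption", "transcript", "karaoke")


@dataclass
class Presentation:
    """Presentation features: when and how the text is shown."""
    start: float                         # display start, in seconds
    end: float                           # display end, in seconds
    font: Optional[str] = None
    colour: Optional[str] = None
    justification: Optional[str] = None
    scrolling: bool = False              # e.g. a stock market ticker
    colour_cue: Optional[float] = None   # karaoke: time the colouring starts

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("end must not precede start")


@dataclass
class Cue:
    """Semantic features of one piece of text, plus its presentation."""
    text: str
    text_type: str
    presentation: Presentation
    speaker: Optional[str] = None

    def __post_init__(self):
        if self.text_type not in TEXT_TYPES:
            raise ValueError(f"unknown text type: {self.text_type!r}")
```

Keeping `speaker` on the semantic side, rather than folding it into a colour or style, preserves the separation of structure and formatting that CMML aims for.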

Handling in CMML

Text can be introduced into CMML clips in the "title" attribute or the "desc" element - or a new attribute/element could be defined. Since an attribute cannot be formatted, the "title" attribute is not a viable alternative. Let's discuss the alternatives of using the "desc" element and introducing a new element.

Alternative 1

Different types of text should be handled in different CMML tracks. Thus, the text type should be specified in the CMML track attribute. Currently the only standard name for a CMML track is "default". We should also introduce "subtitle", "caption", "transcript", and "karaoke".

To introduce attributes such as colours, justification, fonts etc in CSS, a further subdivision of the desc tag is necessary. Helpful subdivisions that should be available inside the desc tag are:

  • span
  • div

Consequence:

  • When a speaker gets specified, that would happen through the name of a class and then the attributes can be specified in the stylesheet.
    • However, this turns the semantic information of "which speaker" into the name of a CSS class, making it a presentation feature. If we want to handle the speaker as meta information, we would probably prefer to keep details about the speaker in meta tags. How can we then link up the meta tags with the style class specifications?
  • Another problem is how to handle images that contain the subtitles (or logos etc.) and are rendered on top of the video/playback window.
    • We recommend that subtitles be added to CMML as text; for new content this makes sense because text can be searched. If this is not possible and there is a need to ship the binary image subtitles, they should be added as a logical bitstream within an Annodex bitstream (e.g. a VOB-SUBS track for DVDs).
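
A minimal sketch of what a clip under Alternative 1 might look like, built with Python's standard `xml.etree.ElementTree`. The "span" inside "desc", the "subtitle" track name and the speaker class are the proposals discussed above, not part of current CMML; the helper name `build_subtitle_clip` and all values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_subtitle_clip(clip_id, start, track, speaker_class, text):
    """Build a clip whose desc holds a span styled via a per-speaker class."""
    clip = ET.Element("clip", {"id": clip_id, "start": start, "track": track})
    desc = ET.SubElement(clip, "desc")
    # Proposed extension: a span inside desc so a stylesheet can target it.
    span = ET.SubElement(desc, "span", {"class": speaker_class})
    span.text = text
    return clip


if __name__ == "__main__":
    clip = build_subtitle_clip(
        "sub1", "npt:12.5", "subtitle", "speaker-alice", "Hello there."
    )
    print(ET.tostring(clip, encoding="unicode"))
```

The sketch makes the drawback noted above visible: the only trace of the speaker is the class name `speaker-alice`, which is presentation markup rather than metadata.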

Alternative 2

We could also add a "caption" element to the clip tag. That element could have a "type" attribute that allows for the distinction of "subtitle", "caption", "transcript", and "karaoke". The "caption" element would also contain a "span" or "div" element to allow formatting.

The advantages are that:

  • you can have a caption in parallel to a description element that allows for some comments on the caption itself.
  • you can have several captions by different authors in the same language (distinguished by the track attribute).
  • it distinguishes better between description and caption, since the desc element already has a location where it gets displayed in the AFE, whereas the caption would need to be overlaid on the video.
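
A minimal sketch of a clip under Alternative 2, again using Python's standard `xml.etree.ElementTree`. The "caption" element and its "type" attribute are the proposal described above and do not exist in current CMML; the helper name `build_clip_with_caption` and all values are hypothetical.

```python
import xml.etree.ElementTree as ET

# The caption types named in the proposal.
CAPTION_TYPES = ("subtitle", "caption", "transcript", "karaoke")


def build_clip_with_caption(clip_id, start, track, caption_type, text,
                            desc_text=None):
    """Build a clip with a proposed caption element next to an optional desc."""
    if caption_type not in CAPTION_TYPES:
        raise ValueError(f"unknown caption type: {caption_type!r}")
    clip = ET.Element("clip", {"id": clip_id, "start": start, "track": track})
    if desc_text is not None:
        # The desc stays free for comments about the caption itself.
        ET.SubElement(clip, "desc").text = desc_text
    caption = ET.SubElement(clip, "caption", {"type": caption_type})
    # A span inside the caption gives a stylesheet something to format.
    ET.SubElement(caption, "span").text = text
    return clip


if __name__ == "__main__":
    clip = build_clip_with_caption(
        "c1", "npt:12.5", "en-fansub-a", "subtitle", "Hello there.",
        desc_text="Literal translation; the pun is lost.",
    )
    print(ET.tostring(clip, encoding="unicode"))
```

Here the track value distinguishes captions by different authors, while the type attribute carries the text type, so neither piece of information has to be encoded in a style class.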

Alternative 3

We could also avoid dealing with CMML for subtitles completely. If we use the most generic subtitling format (e.g. SSA), we can represent all other subtitling formats in this format and just define a mapping & granulepos for this new caption track.