Subtitles, Captions, Timed-text and Transcripts in CMML
Everyone asks for these, and we need to do them properly. This involves examining existing standards and common practice, and ensuring that CMML is capable of capturing the same information.
A principle that CMML will follow with respect to subtitling is to keep the structural information separate from the formatting information. This is good Web practice. CMML will therefore make use of style sheets and style tags.
Chris Chiu researched popular subtitle formats and developed a first set of basic Python scripts for converting these formats into CMML. Here is his SubtitleReport.
See also http://wiki.whatwg.org/wiki/Video_accessibility
Overview of subtitling formats
There are two different types of subtitling formats that we need to cover:
- Subtitles is a generic term referring to text overlaid on-screen, possibly rendered right into the imagery.
- Timed-text is a generic term referring to text that is intended to be rendered with timing information; there may be no image display involved.
The two most common uses of subtitles in commercial TV and film are the provision of language translation and captioning for the deaf. These are subtly different -- for example, captioning for the deaf will usually include notation of sound effects and musical cues, which are not required for language translation.
Transcripts provide a full-text copy of spoken content, and may not necessarily be intended for simultaneous display on-screen -- for example, a transcript may contain more text than can be reasonably read in real-time, and may contain um's and ah's that are usually edited out of captioned summaries. Transcripts are most useful for metadata search and query, or for referencing back to an original script. Hence they are extremely important for CMML, even though they may not commonly be user-visible in existing TV and film content.
It is important to differentiate text spoken by different people, both in terms of semantic separation and rendering (colours, justification etc.). As an example, a karaoke system may require phrase-accurate timing of both the initial display of each word and the cue time of its colouration, and may require strict colouring of separate phrases (e.g. pink and blue text for a duet).
In the old TV terminology, subtitles can be one of the following two types:
- Open, which means burnt into the video image, so that it cannot be disabled.
- Closed, which means sent out-of-band, e.g. in the vertical blank.
Here we are interested in capturing the actual text. This is easy for closed captions, and is also possible for open captions if image analysis and OCR (optical character recognition) are used.
Captioning Standards
- The Silent Soundtrack is an excellent article at XML.com discussing captions in SMIL, as well as the proprietary XML formats Hi-Caption and MAGpie.
- The European Broadcasting Union defines a crusty old binary format for interchange of subtitling information. It has a big general metadata header for describing the framerate and language, followed by a series of text blocks with in- and out-timecodes and very basic formatting information.
- The Digital Video Broadcasting subtitle standard supports both picture-based subpictures and plain text. The formal specifications can be found here (EN-300-743).
Timed Text Standards
- The 3rd Generation Partnership Project (3GPP) timed text standard specifies a time-lined decorated text media format with defined storage in a *.3GP file. The specification was designed for the 3G mobile phone system. It specifies an RTP payload format for the transmission of 3GPP timed text, as well as additional synchronisation features with audio/video content for use in captioning, titling and multimedia applications. Specification: 3GPP Timed Text I-D
Anime Subtitle Standards
There are a lot of anime subtitle formats around for video players, with the popular formats being supported by open-source players including MPlayer, Media Player Classic and VideoLAN. As an example, MPlayer supports OGM, AQTitle, JACOsub, MicroDVD, MPsub, PJS, RT, SubRip, SSA, SubViewer, SAMI and VPlayer, while VideoLAN supports MicroDVD, SubRip, SSA, SubViewer, SAMI and VOBSub.
- Note that OGM (née Ogg, née Xiph) doesn't have an official subtitle format for use with video, so even though SSA is quite common with the fansubbers, it's easily possible to download an OGM video file that has some weird subtitle format. See discussion between Arc and ChristianHJW (a "Matroska guy") for some insight into their attitudes toward this.
- See the Scriptclub FAQ to get an idea of the scope of tools available which have to deal with the multitude of subtitle formats. Of these, Subtitle Workshop (Official Site) has the most comprehensive support of subtitle formats, including older formats that have fallen out of favour in the anime community. A list of other subtitling tools can be found at DivX Digest, with a comprehensive site on the transcoding process from DVD into DivX/XVid here.
- For general purpose time re-synchronisation, Time Adjuster can help to correct timing issues, and supports the subtitle formats SubRip, JACOSub, SubViewer, SubStation Alpha, and MicroDVD. However, this program does not support the Unicode character standard. Additional subtitle resources can be found at DivXMovies and DivX Digest.
- Transcoding in Annodex is a beginner's guide to encoding Fansubs into Annodex. While transcoding is straightforward, the conversion and parsing of subtitles into CMML prior to annodexing continues to be a work in progress.
Syntactical Differentiation between Subtitle formats
Subtitle Overview describes the syntactical differences between the subtitle formats and how they fit into the scheme of parsing into CMML. It is a summary of the variation in subtitling structures, and of the potential compatibility issues arising from differences in syntax when parsing.
List of Subtitle Formats and Details
- This section briefly covers the formats most popular for subtitling amongst anime fans. Older subtitle formats tend to have timings based on frame numbers, while the newer formats are based on the playback time of the video:
- Of these formats, SubRip and SubViewer are the most prevalent, if only because they are simpler than the other formats.
- XML-based subtitle scripts include RealText, SAMI and USF. Support for these is being implemented using the XML Parser Libraries for Python.
- Note: For any comments or issues when using the subtitle scripts, place a message in the relevant wikipage below:
- AQTitle Format (*.aqt)
- JACOSub Format (*.jss)
- MicroDVD Format (*.sub)
- MPSub Format (*.sub)
- Phoenix Japanimation Society (*.pjs)
- RealText Format (*.rt)
- SubRip Format (*.srt)
- SubStation Alpha Format (*.ssa)
- SubViewer Format (*.sub)
- Synchronised Accessible Media Interchange (*.smi)
- Universal Subtitle Format (*.usf)
- VOBSub Format (*.sub, *.idx)
- VPlayer Format (*.txt)
- Yet Another Subtitle Format: While there are many formats around, if it's popular enough it may be considered.
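As a rough illustration of what parsing one of these formats involves, the following Python sketch reads SubRip (*.srt) text into timed cues. It is not one of the conversion scripts mentioned above; the names Cue and parse_subrip are hypothetical, and it assumes the common SubRip layout of a counter line, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and one or more text lines per block.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One timed piece of text (hypothetical helper type)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


# SubRip timing line, e.g. "00:00:01,000 --> 00:00:04,500"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_subrip(data: str) -> List[Cue]:
    """Parse SubRip text into a list of cues, skipping malformed blocks."""
    cues = []
    normalised = data.replace("\r\n", "\n").strip()
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                groups = match.groups()
                # Everything after the timing line is the cue text;
                # the counter line before it is ignored.
                text = "\n".join(lines[i + 1:]).strip()
                cues.append(Cue(_seconds(*groups[:4]),
                                _seconds(*groups[4:]), text))
                break
    return cues
```

Each resulting cue carries the start time, end time and text that a CMML clip would need; formatting tags embedded in the text are left untouched by this sketch.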
Note:
- Visit Frame Conversion Concepts for the method used to convert frame numbers into timings. An implementation based upon the Python code in that wikipage can be found in the aforementioned subtitle format links.
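For orientation, a frame-based timing can be turned into a time-based one by dividing the frame number by the frame rate. The sketch below is a minimal version of that arithmetic and is not necessarily the method given on the Frame Conversion Concepts page; the function names are hypothetical, and the frame rate is assumed to be known and constant for the whole video.

```python
from fractions import Fraction


def frame_to_seconds(frame: int, fps: Fraction) -> Fraction:
    """Return the playback time of a frame, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("frame rate must be positive")
    return Fraction(frame) / fps


def format_timestamp(seconds: Fraction) -> str:
    """Format a time in seconds as H:MM:SS.mmm, rounded to milliseconds."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


# Frame 250 of a 25 fps video, and frame 1500 of an NTSC-rate video.
pal_time = format_timestamp(frame_to_seconds(250, Fraction(25)))
ntsc_time = format_timestamp(frame_to_seconds(1500, Fraction(30000, 1001)))
```

Using exact fractions rather than floats avoids accumulating rounding error with rates such as 30000/1001, where a subtitle late in a long video could otherwise drift by a frame.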
Common Features
To get a handle on how to support these things natively in CMML, we have to extract what they all have in common. Let's start by subdividing the features into presentation and semantic features.
Semantic Features:
- text
- text type (subtitle, caption, transcript, karaoke)
- speaker
Presentation Features:
- timing
- fonts
- colours
- justification
- scrolling text (stock market ticker)
- dynamic colouring (karaoke)
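One way to picture this split is as a record whose semantic fields can be read without touching the presentation fields. The following Python sketch is only an illustration of the two lists above; the class name TimedText and its field names are hypothetical and not part of CMML.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

TEXT_TYPES = ("subtitle", "caption", "transcript", "karaoke")


@dataclass
class TimedText:
    # Semantic features
    text: str
    text_type: str = "subtitle"
    speaker: Optional[str] = None
    # Presentation features (timing is listed under presentation above)
    start: float = 0.0
    end: float = 0.0
    style: Dict[str, str] = field(default_factory=dict)  # font, colour, ...

    def __post_init__(self) -> None:
        if self.text_type not in TEXT_TYPES:
            raise ValueError(f"unknown text type: {self.text_type!r}")

    def semantic(self) -> Dict[str, Optional[str]]:
        """Return only the information that matters for search and indexing."""
        return {"text": self.text, "type": self.text_type,
                "speaker": self.speaker}

    def presentation(self) -> Dict[str, object]:
        """Return only the information a renderer needs besides the text."""
        return {"start": self.start, "end": self.end, "style": dict(self.style)}
```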
Handling in CMML
Text can be introduced into CMML clips in the "title" attribute or the "desc" element - or a new attribute/element could be defined. Since an attribute cannot be formatted, the "title" attribute is not a viable alternative. Let's discuss the alternatives of using the "desc" element and introducing a new element.
Alternative 1
Different types of text should be handled in different CMML tracks. Thus, the text type should be specified in the CMML track attribute. Currently the only standard name for a CMML track is "default". We should also introduce "subtitle", "caption", "transcript", and "karaoke".
To introduce attributes such as colours, justification, fonts etc in CSS, a further subdivision of the desc tag is necessary. Helpful subdivisions that should be available inside the desc tag are:
- span
- div
Consequence:
- When a speaker gets specified, that would happen through the name of a class, and the attributes can then be specified in the stylesheet.
- However, this turns the semantic information of "which speaker" into the name of a CSS class, making it a presentation feature. If we want to handle the speaker as meta information, we would probably prefer to keep details about the speaker in meta tags. How can we then link up the meta tags with the style class specifications?
- Another problem is how to handle images that contain e.g. the subtitles (or logos etc.) and are rendered on top of the video/playback window.
- We recommend that subtitles be added to CMML as text. For new content this makes sense, as the text can then be searched. If this is not possible and there is a need to ship binary image subtitles, these should be added as a logical bitstream within an Annodex bitstream (e.g. a VOB-SUBS track for DVDs).
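To make Alternative 1 more concrete, the sketch below builds a clip on a "subtitle" track whose desc element holds a span carrying the speaker as a class name. This is an illustration of the proposal, not existing CMML: the span inside desc, the "subtitle" track name, the function name and the time values are all assumptions.

```python
import xml.etree.ElementTree as ET


def clip_with_desc(clip_id: str, track: str, start: str, end: str,
                   speaker_class: str, text: str) -> str:
    """Serialise a clip whose desc wraps the text in a styled span."""
    clip = ET.Element("clip", {"id": clip_id, "track": track,
                               "start": start, "end": end})
    desc = ET.SubElement(clip, "desc")
    # Proposed extension: span is not currently allowed inside desc.
    span = ET.SubElement(desc, "span", {"class": speaker_class})
    span.text = text
    return ET.tostring(clip, encoding="unicode")


xml_text = clip_with_desc("s1", "subtitle", "npt:1.0", "npt:4.5",
                          "alice", "Hello there.")
```

A stylesheet rule for the class (here "alice") would then supply the colour and justification, which is exactly where the speaker identity ends up being expressed only as presentation.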
Alternative 2
We could also add a "caption" element to the clip tag. That element could have a "type" attribute that allows for the distinction of "subtitle", "caption", "transcript", and "karaoke". The "caption" element would also contain a "span" or "div" element to allow formatting.
The advantages are that:
- you can have a caption in parallel to a description element, which allows for some comments on the caption itself.
- you can have several captions by different authors in the same language (distinguished by the track attribute).
- it distinguishes better between description and caption, since the desc element already has a location where it gets displayed in the AFE, whereas the caption would need to be overlaid on the video.
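A clip under Alternative 2 might look like the output of the following sketch, where the caption sits next to an ordinary desc. The "caption" element and its "type" attribute are the proposal described above and do not exist in CMML today; the function name, the track name "fansub-en" and the time value are hypothetical.

```python
import xml.etree.ElementTree as ET


def clip_with_caption(clip_id: str, track: str, start: str,
                      caption_type: str, caption_text: str,
                      description: str) -> ET.Element:
    """Build a clip carrying both a desc and a proposed caption element."""
    clip = ET.Element("clip", {"id": clip_id, "track": track, "start": start})
    # The desc stays free for comments about the caption.
    desc = ET.SubElement(clip, "desc")
    desc.text = description
    # Proposed new element; the type attribute names the kind of text.
    caption = ET.SubElement(clip, "caption", {"type": caption_type})
    span = ET.SubElement(caption, "span")
    span.text = caption_text
    return clip


clip = clip_with_caption("k1", "fansub-en", "npt:12.0", "karaoke",
                         "La la la", "Opening song, first line")
serialised = ET.tostring(clip, encoding="unicode")
```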
Alternative 3
We could also completely avoid dealing with CMML for subtitles. If we use the most generic subtitling format (e.g. SSA), we can represent all other subtitling formats in this format and just define a mapping and granulepos for this new caption track.
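As a sketch of what such a mapping could look like, the code below converts an SSA-style timestamp (H:MM:SS.cc, with centiseconds) into a granule position. The granule rate of 1000 per second and the purely linear mapping are assumptions made for illustration; no mapping has been defined for a caption track, and the function name is hypothetical.

```python
from fractions import Fraction


def ssa_time_to_granulepos(stamp: str,
                           granulerate: Fraction = Fraction(1000, 1)) -> int:
    """Map an SSA timestamp such as "0:01:05.25" to a granule position.

    Assumes granulepos = time in seconds * granulerate, truncated to an
    integer. This linear mapping is an assumption, not a defined standard.
    """
    hours, minutes, rest = stamp.strip().split(":")
    secs, centis = rest.split(".")
    seconds = (int(hours) * 3600 + int(minutes) * 60 + int(secs)
               + Fraction(int(centis), 100))
    return int(seconds * granulerate)


position = ssa_time_to_granulepos("0:01:05.25")
```

With one granule per millisecond, every centisecond timestamp that SSA can express maps to a distinct granule position, so no timing precision would be lost in such a track.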