Wednesday, December 08, 2010

Converting YouTube annotations into SRT subtitles

It has been a long time since my last blog post. Well, I'm a lazy guy, and English is not my native language. Besides, most things weren't exciting enough to deserve a long article, so I usually post short comments on my Buzz instead.

Anyway, let's cut to the chase.

These days, more and more people use annotations rather than captions to add "subtitles" to YouTube videos. There are already lots of online and offline "YouTube downloaders" that can download videos, the corresponding captions, or both at once, such as get_flash_videos, clive, youtube-dl, Google2SRT, and Youtube Subtitle Ripper. However, there is not much information available about how to download the annotations and convert them into SRT subtitles. Today, I found a solution.


First of all, I found this comment on the blog post about how to download the annotations in XML format. And yes, I did write a script that downloads the captions and annotations using wget, but it is too simple to be worth mentioning. After downloading the annotations as XML, the next step is converting them into some subtitle format.

Although there are many subtitle formats available, and a conversion algorithm probably already exists in the Google2SRT source code, I decided to write my own bash script that converts the XML into SRT, which is one of the simplest subtitle formats.
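
For readers who have not seen SRT before, here is a minimal sketch of what the format looks like. Each cue is an index, a time range written as HH:MM:SS,mmm --> HH:MM:SS,mmm, the text, and a blank line. The printCue helper and the sample cues are made up for illustration and are not part of ann2srt.

```shell
#!/bin/bash
# Hypothetical helper, for illustration only (not part of ann2srt).
# Prints one SRT cue: index, time range, text, then a blank line.
function printCue() {
    printf '%s\n%s --> %s\n%s\n\n' "$1" "$2" "$3" "$4"
}

# Two made-up cues; note the comma before the milliseconds.
printCue 1 '00:00:02,500' '00:00:04,000' 'First annotation'
printCue 2 '00:00:10,000' '00:00:12,000' 'Second annotation'
```

Redirecting the output of this sketch into a .srt file should give something a player can load as subtitles.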

The script I wrote, called ann2srt, uses XMLStarlet as the XML parsing tool. Other than that, it only uses bash built-ins and coreutils such as cut and tr. For now, the generated SRT may have compatibility problems with some players, because the annotations in the XML are not in chronological order. Adding sorting is possible, but since mplayer handles out-of-order subs correctly, I'll leave it this way for now. Here is the code of ann2srt:


#!/bin/bash
#
# Convert the youtube annotation into SRT subtitle
#
# By Shang-Feng Yang <storm_dot_sfyang_at_gmail_dot_com>
# Version: 0.1
# License: GPL v3

function usage() {
    echo -e "Usage:\n"
    echo -e "\t$(basename "$0") ANNOTATION_FILE\n"
}

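function parseXML() {
    # Print one line per annotation: TEXT,START,END,
    # (each rectRegion contributes one @t value; the main loop reads them as start and end)
    xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o ',' \
        -m 'segment/movingRegion/rectRegion' -v '@t' -o ',' -b -n "${ANN}"
}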

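function reformatTime() {
    # Split the H:MM:SS.s timestamp into its fields
    H=$(echo "$1" | cut -d ':' -f 1)
    M=$(echo "$1" | cut -d ':' -f 2)
    S=$(echo "$1" | cut -d ':' -f 3)
    # 10# forces base 10, so "08" and "09" are not rejected as invalid octal.
    # %06.3f pads the seconds to SS.mmm; SRT wants a comma as the decimal mark.
    printf '%02d:%02d:%06.3f' "$((10#${H}))" "$((10#${M}))" "${S}" | tr '.' ','
}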

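ANN=$1
SRT=$(basename "${ANN}" .xml).srt
IFS=$'\n'
I=0

# The quotes matter: with no argument, an unquoted [ -f ${ANN} ] becomes [ -f ], which is true
[ -f "${ANN}" ] || { usage; exit 1; }
[ -f "${SRT}" ] && rm "${SRT}"

# Each LINE looks like TEXT,START,END, so peel the times off the end;
# that way commas inside TEXT are kept
for LINE in $(parseXML); do
    (( I++ ))
    LINE=${LINE%,}      # drop the trailing comma
    E=${LINE##*,}       # last field: end time
    LINE=${LINE%,*}
    B=${LINE##*,}       # next-to-last field: start time
    C=${LINE%,*}        # everything before it: the text
    printf '%s\n%s --> %s\n%s\n\n' "${I}" "$(reformatTime "${B}")" "$(reformatTime "${E}")" "${C}" >> "${SRT}"
done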
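
As for the sorting mentioned earlier, here is one way it might be done. This is only a sketch, not part of ann2srt 0.1. It assumes the TEXT,START,END, line layout that parseXML prints and H:MM:SS.s timestamps. The names toSeconds and sortByStart are hypothetical, and it uses awk in addition to the tools the script already needs.

```shell
#!/bin/bash
# Hypothetical helpers, not part of ann2srt 0.1.

# Convert H:MM:SS.s into seconds, e.g. 0:01:05.5 -> 65.500
function toSeconds() {
    echo "$1" | awk -F ':' '{ printf "%.3f", $1 * 3600 + $2 * 60 + $3 }'
}

# Read "TEXT,START,END," lines on stdin and print them ordered by START.
function sortByStart() {
    local LINE REST START
    while IFS= read -r LINE; do
        REST=${LINE%,}       # drop the trailing comma
        REST=${REST%,*}      # drop END
        START=${REST##*,}    # START is now the last field
        # Prefix each line with a numeric sort key, separated by a tab
        printf '%s\t%s\n' "$(toSeconds "${START}")" "${LINE}"
    done | LC_ALL=C sort -t $'\t' -s -n -k 1,1 | cut -f 2-
}
```

With these defined, the main loop could iterate over $(parseXML | sortByStart) instead of $(parseXML), so that the cue numbers follow the start times.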


A side note for mplayer users: when playing videos with subs generated by this script, remember to turn on SSA/ASS support with the "-ass" option. Because of the nature of annotations, several of them may occupy the same time period. The built-in SRT parser of mplayer shows only one of them, while they are stacked when -ass is enabled.

SRT is quite a simple format that does not support the special effects that annotations have, such as position and color. The next version of the script will convert the annotations into SSA/ASS format, but only if I find the motivation to improve it...

7 comments:

Anonymous said...

Thanks! This had me confused before I realized that captions and annotations were different things.

Strubbl said...

Thanks for your initial work. It helped me a lot. I adapted the script to make it work for my purpose. Perhaps somebody else needs those changes, too. You can find it on GitHub at Strubbl/youtubeannotations2srt

direct link:
https://github.com/Strubbl/youtubeannotations2srt

Shang-Feng Yang said...

@Strubbl,

Thanks and happy hacking!

Jon Bayless said...

Some google code changed. Here is a web page with a nice way to extract annotations!

http://stefansundin.com/stuff/youtube/youtube-copy-annotations.html

Shang-Feng Yang said...

@Jon Bayless,

I've known for a while that YouTube changed their annotation URL, but thanks for the information, and happy new year!