Sunday, October 30, 2011

ann2srt v0.3

Although all the bug fixing, testing, and cleanup were finished several days ago, I was a little too lazy to write this up... Anyway, here is the "official release notice" of ann2srt version 0.3.

Thanks to the commenter L, who helped me test and debug the script on Cygwin, version 0.3 of ann2srt can now handle annotations in languages other than Traditional Chinese that contain newlines and commas, and it also runs correctly under the Cygwin environment on the Win32 platform.


Because the version 0.2 script uses CSV (Comma-Separated Values) as its intermediate format, it fails if an annotation contains a newline or a comma. To fix this, version 0.3 uses tr to turn the newlines in the annotation file into spaces. To address the "comma" problem, the delimiter for the intermediate stream is changed from a comma to "|".
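To make the delimiter problem concrete, here is a minimal sketch using a made-up annotation (the text and times below are hypothetical, not taken from a real annotation file). It only uses printf, cut, and tr, and shows which value ends up in field 2 for each delimiter.

```shell
#!/bin/bash
# Hypothetical annotation: the text itself contains a comma.
text='Hello, world'
begin='0:00:01.0'
end='0:00:03.5'

# Field 2 of a comma-delimited record is a fragment of the text.
csv_begin=$(printf '%s,%s,%s\n' "$text" "$begin" "$end" | cut -d ',' -f 2)

# Field 2 of a "|"-delimited record is the begin time, as intended.
pipe_begin=$(printf '%s|%s|%s\n' "$text" "$begin" "$end" | cut -d '|' -f 2)

# Embedded newlines are flattened to spaces, so one record stays on one line.
flat=$(printf 'line one\nline two\n' | tr '\n' ' ')

printf 'comma: [%s]\npipe:  [%s]\nflat:  [%s]\n' "$csv_begin" "$pipe_begin" "$flat"
```

Note that the same reasoning applies to the new delimiter: an annotation whose text contains a literal "|" would presumably still be split in the wrong place.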

The version 0.2 script, technically speaking, should be able to run under Cygwin without any modification. However, since Windows uses "DOS style" line endings that consist of CR+LF, the behavior of the script becomes unpredictable if any of the external programs it calls is a Win32 binary, or if the input annotation file is in DOS format. To fix this, tr is used again to strip the CR characters from both the annotation file and the output of the Win32 XMLStarlet, converting them from DOS format into UNIX format.
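The DOS-to-UNIX conversion can be tried in isolation. The sketch below builds a small sample with CR+LF line endings (the sample text is made up), removes the CR characters with tr -d '\r' as the script does, and counts the CR characters before and after.

```shell
#!/bin/bash
# Build a two-line sample with DOS (CR+LF) line endings.
dos_sample=$(printf 'first line\r\nsecond line\r\n')

# Delete every CR; the LF characters are left untouched.
unix_sample=$(printf '%s\n' "$dos_sample" | tr -d '\r')

# Count the CR characters by deleting everything that is not a CR.
cr_before=$(printf '%s' "$dos_sample" | tr -cd '\r' | wc -c | tr -d ' ')
cr_after=$(printf '%s' "$unix_sample" | tr -cd '\r' | wc -c | tr -d ' ')

printf 'CR count before: %s, after: %s\n' "$cr_before" "$cr_after"
```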

Let's cut to the chase. Here is the source of the version 0.3 script:

#!/bin/bash
#
# Convert the youtube annotation into SRT subtitle
#
# By Shang-Feng Yang
# Version: 0.3
# License: GPL v3
#
# Changelog:
# * v0.3 (Oct/19/2011):
# - Fix the parsing errors caused by comma and newline characters in
# some English annotations
# - Adding transparent dos2unix conversion for compatibility under Cygwin
# * v0.2 (Jan/19/2011):
# - Sort the annotations using the "begin" time as key
# - Minor bugs fixing
# * v0.1 (Dec/7/2010):
# - Initial release


ANN=$1
SRT=$(basename ${ANN} .xml).srt
IFS=$'\n'
I=0

function usage() {
echo -e "Usage:\n"
echo -e "\t$(basename $0) ANNOTATION_FILE\n"
}

function parseXML() {
cat ${ANN} | tr -d '\r' |tr '\n' ' ' | xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o '|' -m 'segment/movingRegion/rectRegion' -v '@t' -o '|' -b -n | tr -d '\r'
}

function reformatTime() {
local H=$(echo $1 |cut -d ':' -f 1)
local M=$(echo $1 |cut -d ':' -f 2)
local S=$(echo $1 |cut -d ':' -f 3)
printf '%02d:%02d:%06.3f' ${H} ${M} ${S} |tr '.' ','
}

function time2sod() {
# Convert time in HH:MM:SS.SSS format into second-of-the-day value
local SOD=$(echo $1 | awk -F ":" '{printf("%f\n", $1*3600+$2*60+$3);}')

echo ${SOD}
}

[ "x${ANN}" = "x" ] && { usage; exit 1; }
[ -f ${ANN} ] || { usage; exit 1; }
[ -f ${SRT} ] && rm ${SRT}
[ -f ${SRT}.tmp ] && rm ${SRT}.tmp

for LINE in $(parseXML); do
C=$(echo ${LINE} |cut -d '|' -f 1)
B=$(echo ${LINE} |cut -d '|' -f 2)
E=$(echo ${LINE} |cut -d '|' -f 3)
echo "$(time2sod ${B})#${B}#${E}#${C}" >> ${SRT}.tmp
done

grep "###" ${SRT}.tmp && {
echo "\"${ANN}\" has no valid annotation!" >&2
rm ${SRT}.tmp
exit 1
}

for LINE in $(cat ${SRT}.tmp|sort -n -t '#'); do
(( I++ ))
C=$(echo ${LINE} |cut -d '#' -f 4)
B=$(reformatTime $(echo ${LINE} |cut -d '#' -f 2))
E=$(reformatTime $(echo ${LINE} |cut -d '#' -f 3))
echo -e "${I}\n${B} --> ${E}\n${C}\n" >> ${SRT}
done

rm ${SRT}.tmp
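
As a standalone illustration of the timestamp conversion that reformatTime performs, the sketch below turns an H:MM:SS.s value into the HH:MM:SS,mmm form used by SRT. The helper name to_srt_time and the sample timestamp are made up for this example and are not part of ann2srt. It does the formatting in awk rather than with the shell's printf, since awk reads a field such as "08" as a decimal number.

```shell
#!/bin/bash
# to_srt_time is a hypothetical helper, not part of ann2srt.
# It converts H:MM:SS.s into HH:MM:SS,mmm.
to_srt_time() {
    # LC_ALL=C keeps "." as the decimal point so that tr can swap it for ",".
    printf '%s\n' "$1" |
        LC_ALL=C awk -F ':' '{ printf("%02d:%02d:%06.3f\n", $1, $2, $3) }' |
        tr '.' ','
}

to_srt_time '0:08:05.5'
```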


The version 0.3 script can also be downloaded from here to avoid typos caused by copy-and-paste:
http://dl.dropbox.com/u/1382119/tmp/ann2srt

In fact, I just found that the customized "code block" loses all indentation after the Blogger update. Please download the correct script from the link above.

8 comments:

Roel said...

The script works perfectly! I've never done anything like this before but I figured it out :)

Thank you so much!

Shang-Feng Yang said...

It's good to hear that. Have fun!

Lucas Malor said...
This comment has been removed by the author.
Lucas Malor said...

I had some problems with this video's annotations:

http://www.youtube.com/watch?v=hQfD-4i3yh4

because it also has annotations with the XML structure
segment/movingRegion/anchoredRegion

My (bad) workaround was simply to comment out

rm ${SRT}.tmp
exit 1

at lines 63 and 64.


Thank you for your script.

Shang-Feng Yang said...

@Lucas Malor:

Thanks for reporting this. This bug is known to me. I also ran into the "anchored" type of annotation about a month ago, but I haven't released a fix because I am still not quite sure how to handle it. I made a workaround similar to yours by commenting out that checking segment of the code, too. But the side effect was that some of the subtitles in the final SRT file had the wrong timing. I'll fix it when I learn more about that type of annotation. Thanks and happy hacking!

Shang-Feng Yang said...

For everyone who has run into the problem caused by annotations with the "anchored" type of movingRegion: I have updated the script to version 0.3.1 so that it no longer breaks when there are anchored annotations. It is in principle the same workaround mentioned by the commenter Lucas Malor, but I modified the checking segment to filter out the "invalid" annotations, which also avoids the wrong timing problem. This is not a fix but a minor workaround, so I don't think it is worth a new release notice post. Anyway, please download the updated script at the same location. Thanks for commenting, and have fun!

Ummu Ch. said...

where do i have to put this script?
sorry i really dont understand how to use this!!

Shang-Feng Yang said...

@Ummu Ch.:

Since I have no idea how familiar you are with shell scripting and command-line operation, I can only tell you that the script should be placed either in the current working directory or in some directory that is in your PATH.

In order to use this script, you must download the annotation file first. Assuming that you put the script in the current working directory and saved the annotation as "annotation.xml", the command would be:

./ann2srt annotation.xml

If everything goes right, the converted SRT subtitle file "annotation.srt" should appear in the current directory.