Thursday, January 20, 2011

ann2srt v0.2

Last time, in my post "Converting Youtube's annotation into SRT subtitle", I released a bash script called "ann2srt" v0.1. Version 0.1 was pretty crude: it did not sort the subtitles in the SRT file, which could be a problem for some SRT parsers. Yesterday I spent some time adding sorting to the script and fixing some minor bugs from v0.1.


The sorting is done by using awk/gawk to convert the "begin" time of each annotation into seconds and then passing the results to sort. Since sort is part of GNU coreutils and awk/gawk is installed on most distributions, this change should not be a big deal for most people.
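As a rough illustration of the idea (not the script itself), the sketch below prefixes each line with its begin time in seconds, sorts numerically on that prefix, and then strips the prefix again. The helper names `to_seconds` and `sort_by_begin` and the sample lines are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: convert H:MM:SS.SSS into seconds with awk.
to_seconds() {
    echo "$1" | awk -F ':' '{ printf("%f\n", $1*3600 + $2*60 + $3) }'
}

# Hypothetical helper: read "BEGIN#END#TEXT" lines on stdin and
# write them to stdout ordered by BEGIN.
sort_by_begin() {
    local line begin
    while IFS= read -r line; do
        begin=${line%%#*}                                   # text before the first '#'
        printf '%s#%s\n' "$(to_seconds "$begin")" "$line"   # prepend the sort key
    done | sort -n -t '#' -k 1,1 | cut -d '#' -f 2-         # sort, then drop the key
}

# The "first" line is printed before the "second" line.
printf '%s\n' '0:01:05.5#0:01:08.0#second' '0:00:03.2#0:00:06.0#first' | sort_by_begin
```

The numeric key is needed because a plain text sort of the time strings would not order them reliably once the hour or minute fields have different widths.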

Here is the code for v0.2:

#!/bin/bash
#
# Convert the youtube annotation into SRT subtitle
#
# By Shang-Feng Yang <storm_DOT_sfyang_AT_gmail_DOT_com>
# Version: 0.2
# License: GPL v3
#
# Changelog:
# * v0.2 (Jan/19/2011):
# - Sort the annotations using the "begin" time as key
# - Minor bug fixes

function usage() {
    echo -e "Usage:\n"
    echo -e "\t$(basename $0) ANNOTATION_FILE\n"
}

function parseXML() {
    cat ${ANN} |xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o ',' -m 'segment/movingRegion/rectRegion' -v '@t' -o ',' -b -n
}

function reformatTime() {
    local H=$(echo $1 |cut -d ':' -f 1)
    local M=$(echo $1 |cut -d ':' -f 2)
    local S=$(echo $1 |cut -d ':' -f 3)
    printf '%02d:%02d:%06.3f' ${H} ${M} ${S} |tr '.' ','
}

function time2sod() {
    # Convert time in HH:MM:SS.SSS format into second-of-the-day value
    local SOD=$(echo $1 | awk -F ":" '{printf("%f\n", $1*3600+$2*60+$3);}')

    echo ${SOD}
}

ANN=$1
SRT=$(basename ${ANN} .xml).srt
IFS=$'\n'
I=0

[ "x${ANN}" = "x" ] && { usage; exit 1; }
[ -f ${ANN} ] || { usage; exit 1; }
[ -f ${SRT} ] && rm ${SRT}
[ -f ${SRT}.tmp ] && rm ${SRT}.tmp

for LINE in $(parseXML); do
    C=$(echo ${LINE} |cut -d ',' -f 1)
    B=$(echo ${LINE} |cut -d ',' -f 2)
    E=$(echo ${LINE} |cut -d ',' -f 3)
    echo "$(time2sod ${B})#${B}#${E}#${C}" >> ${SRT}.tmp
done

grep "###" ${SRT}.tmp && {
    echo "\"${ANN}\" has no valid annotation!"
    rm ${SRT}.tmp
    exit 1
}

for LINE in $(cat ${SRT}.tmp|sort -n -t '#'); do
    (( I++ ))
    C=$(echo ${LINE} |cut -d '#' -f 4)
    B=$(reformatTime $(echo ${LINE} |cut -d '#' -f 2))
    E=$(reformatTime $(echo ${LINE} |cut -d '#' -f 3))
    echo -e "${I}\n${B} --> ${E}\n${C}\n" >> ${SRT}
done

rm ${SRT}.tmp


Usage is the same as for v0.1.

18 comments:

Anonymous said...

Could you please tell me how to use this? I'm not sure what I should be doing. I have the annotations from some Youtube videos saved as XML files, which I did by using the following site and copying the video ID to the end: http://www.google.com/reviews/y/read2?video_id=

What do I do, exactly, to turn them into SRT files with this bash script? Will it work in Windows?

Thanks.

Shang-Feng Yang said...

Hello niffiwan,

Since you have already downloaded the annotations in XML format, all you have to do to convert them into SRT subtitles is call my script and pass the name of the XML file you just downloaded as a parameter.

However, judging from your question, I guess that you are a Windows user and are not familiar with the UNIX shell, so I will try to explain as clearly and simply as possible. The "ann2srt" program is a Bash shell script. It is similar to a batch file for command.com or cmd.exe on DOS/Windows. To run it, you will need a Bash shell and the external utilities called in the script, which are, for v0.2, XMLStarlet, awk/gawk, basename, cut, sort, cat, tr, rm, and grep. Most of those utilities are standard on UNIX systems. To convert an annotation file in XML format called annotation.xml, execute the command "ann2srt annotation.xml" on the command line, and a file called "annotation.srt" should be generated.
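A quick way to see whether those utilities are available is to ask the shell for each one. This is only a sketch: `check_deps` is a hypothetical helper, not part of ann2srt, and it assumes the XMLStarlet binary is installed under the name `xmlstarlet` (some builds install it as `xml` instead).

```shell
#!/bin/bash
# Hypothetical helper: report which of the given commands cannot be found.
# Returns 0 when everything is available, 1 otherwise.
check_deps() {
    local cmd missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Commands called by the v0.2 script.
if check_deps xmlstarlet awk basename cut sort cat tr rm grep; then
    echo "all tools found; try: bash ann2srt annotation.xml"
fi
```

If anything is reported as missing, install it (or adjust the command name in the script) before running the conversion.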

To run the script on Windows, you have to find Windows ports of the necessary utilities. Most of them should be available in the Cygwin environment or from the GnuWin32 project. I recommend Cygwin, because there seems to be no Bash port in GnuWin32, so you would have to find one somewhere else. XMLStarlet also has a Windows port, which is available on its project site.

It is entirely possible to rewrite ann2srt as a Perl or Python script, or even as a command.com/cmd.exe batch file, but so far I have no plans to do that.

I hope my comments are helpful to you. Good luck!

Anonymous said...

Actually, concerning my previous (2nd) post: SubtitleEdit works well with some annotation files, but not with others - probably that's why it hasn't been publicly released yet. Maybe I'll get my Pandora out and try this script after all. :)

L said...

Hello,
I've tried to follow your instructions, but when I try to run the v0.1 script, the console prints:
ann2srt.sh: line 22: printf: [annotation text]: invalid number
for every annotation line in the xml file, and doesn't output an srt file.
When I run the v0.2 script, it prints:
0.000000###
"[filename].xml" has no valid annotation!

Am I doing something wrong? Or has YouTube changed its annotation xml format or something?
In any case, thank you for writing the script (:

Shang-Feng Yang said...

Hello L,

I'm not quite sure what the problem was. Would you please provide the YouTube video URL you used that generated the errors you mentioned? It would be easier for me to track the problem if I had the URL. Thanks!

L said...

I've had the same problem on more than one video, so I don't think it's a problem with a particular video. But here's a link anyway: http://www.youtube.com/watch?v=BAcMPit7YE0&feature=related

Shang-Feng Yang said...

Hello L,

After testing with the video you provided, I think I know what the problem was.

For that specific video, I found two reasons that broke the script:
1. The author put an unnecessary newline character at the end of some annotations.
2. There are commas (",") in some annotations.

The reason the extra newlines and commas break the script is that I use CSV (comma-separated values) as an intermediate format, and a newline or comma inside an annotation causes the CSV parser, which is just a simple "cut" command, to pick up the wrong field.
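A minimal sketch of the failure, using made-up annotation lines (`begin_field` is a hypothetical helper, not part of ann2srt; it extracts field 2 the same way the v0.2 loop does):

```shell
#!/bin/bash
# Hypothetical helper: extract the "begin" time (field 2) with cut.
# $1 = intermediate line, $2 = field delimiter
begin_field() {
    echo "$1" | cut -d "$2" -f 2
}

begin_field 'Hello world,0:00:01.0,0:00:04.0,' ','    # prints "0:00:01.0"
begin_field 'Hello, world,0:00:01.0,0:00:04.0,' ','   # prints " world": the comma in the text shifted every field
begin_field 'Hello, world|0:00:01.0|0:00:04.0|' '|'   # prints "0:00:01.0" again with a different delimiter
```

Switching the delimiter only moves the problem to annotations that contain the new delimiter, so this is a workaround rather than a real CSV parser.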

To fix this, some modifications have to be made:
1. For the newline, use "tr" to delete newline characters in the XML file.
2. For the comma that causes the problem, change the delimiter from comma to some other character, say, "|". The choice of using "|" as delimiter could be a problem in some other special cases, but it should be acceptable for now.

If you want to fix the version 0.2 script yourself, modify it as follows:
* At line 20, change it from

cat ${ANN} |xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o ',' -m 'segment/movingRegion/rectRegion' -v '@t' -o ',' -b -n

to

cat ${ANN} |tr -d '\n' | xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o '|' -m 'segment/movingRegion/rectRegion' -v '@t' -o '|' -b -n

* In lines 48 to 50, change the delimiter of "cut" from " -d ',' " to " -d '|' "

This should take care of the problem.

I probably will release a version 0.3 for this bug fix some time later.

Thanks for the bug report!

L said...

Thanks for your quick reply! I tried editing the bash script, but it still resulted in the same error. I tried modifying the annotation file (got rid of the newlines and changed , to |), but still got the same error. Perhaps there is still something else which breaks the script?

Shang-Feng Yang said...

Hello L,

Did you test the modified script on the annotation of the YouTube video you gave me? If so, then there could be some typo in your script. I made sure it works on that video before I posted my last reply, so it should be working, at least for that particular video's annotation.

To be sure that we have a common ground, please download my script from the following link and use it on that video first.

http://dl.dropbox.com/u/1382119/tmp/ann2srt

Please let me know if it is working or not. Thanks.

L said...

I've tried it again with that file and video, but it doesn't work. Perhaps it has something to do with cygwin? The only modification I made to the script was to replace "xmlstarlet" with the link to the application "/cygdrive/c/xmls/xml.exe". And my command in cygwin's prompt was "/cygdrive/c/xmls/ann2srt.sh /cygdrive/c/xmls/c.xml" Am I doing something wrong there? Thank you so much for all your help! (:

Shang-Feng Yang said...

Hello L,

It is kind of weird. The script, in theory, should work in Cygwin. However, since currently I don't have a Windows machine with Cygwin to do the test (well, technically I do, but it will require me to reboot my machine), it is hard to find the cause of the bug.

To monitor the execution of the script in your environment as closely as possible, I modified the script to keep all intermediate data. Please download the debugging script from the following link:

http://dl.dropbox.com/u/1382119/tmp/ann2srt-dbg

After downloading it, please use it on the previously mentioned video. Let's say that the original annotation in XML format is called "c.xml". Please run the script with stderr redirected into a file:

ann2srt-dbg c.xml 2>debug.log

After running the script on c.xml, there should be several additional files in the current working directory: c.csv, c.srt, c.srt.tmp, and debug.log. Please send me all the generated files plus the original annotation c.xml so that I can take a look at what was happening during each step of the conversion. Thanks.

L said...

Hi, here it is: http://www.mediafire.com/?uy0u9ve0s33ekve
c.srt was not created. It looks like the last for loop wasn't being executed at all, because of the previous "exit 1" command. When I removed the code:

grep "###" ${SRT}.tmp && {
echo "\"${ANN}\" has no valid annotation!"
#rm ${SRT}.tmp
exit 1
}

it produced a working srt file. So for some reason, that code is being executed on my computer but not on yours. Hope that helps! (:

Shang-Feng Yang said...

Hello L,

Although I have some idea, I still need more information before I can be absolutely sure what causes the script to break in your environment. Please tell me, what version is your XMLStarlet? Did you download the binary from the project page, or did you download the source and compile it under Cygwin?

Besides, please repeat the testing process with each of the following slightly modified scripts again:

http://dl.dropbox.com/u/1382119/tmp/ann2srt-dbg2

http://dl.dropbox.com/u/1382119/tmp/ann2srt-dbg3

http://dl.dropbox.com/u/1382119/tmp/ann2srt-dbg4

Please try to run the scripts without commenting out the 'grep "###"' block first. In addition to the generated files, please also let me know which one, if any, works in your Cygwin environment.

L said...

It's version 1.3.0, downloaded binary.
The results are all here: http://www.mediafire.com/?6r3gv43h2jgfybk

Shang-Feng Yang said...

Hello L,

The "c.srt.tmp" for ann2srt-dbg4 is missing in the archive. But, it's ok, I think I know where the problem is.

However, just to be sure, would you please tell me how you downloaded your annotation XML? Apparently the way you downloaded c.xml for the first dbg script was different from the way you downloaded it for the second batch of dbg scripts. I was expecting at least one of those scripts to work in the last test, but all three failed because the last c.xml was "different". My guess is that for the first dbg script you downloaded the XML with something from Cygwin, like wget or curl, and for the second batch you used something under Windows. But it's OK; a problem like this was to be expected under Cygwin.

Before I explain where the problem was, I would like to do an additional test. Please download the following dbg script and perform the test again.

http://dl.dropbox.com/u/1382119/tmp/ann2srt-dbg5

Thanks.

L said...

If it's missing, it was probably deleted by the script somehow.
I downloaded them all by saving in Firefox, but I did change the file format and encoding in a text editor before to see if that was causing the problem. That might have been for the first dbg script, sorry I forgot about that.
The latest dbg script works perfectly (: So part of the problem was the differences in Windows and Unix text files? And by bypassing the grep "###" lines, somehow those differences were forcibly ignored?
Sorry for the delay!

Shang-Feng Yang said...

Hello L,

Thanks for your testing result.

The 'grep "###"' block is an error check meant to catch annotations that do not function as "captions".
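To see why the check fires, here is a sketch of how one intermediate line is built (`make_line` is a hypothetical helper that mirrors the SECONDS#BEGIN#END#TEXT layout of the .tmp file; the sample values are made up):

```shell
#!/bin/bash
# Hypothetical helper: build one intermediate line, SECONDS#BEGIN#END#TEXT.
# $1 = annotation text, $2 = begin time, $3 = end time
make_line() {
    local text=$1 begin=$2 end=$3 sod
    sod=$(echo "$begin" | awk -F ':' '{ printf("%f\n", $1*3600 + $2*60 + $3) }')
    echo "${sod}#${begin}#${end}#${text}"
}

make_line 'A caption' '0:00:01.0' '0:00:04.0'   # prints 1.000000#0:00:01.0#0:00:04.0#A caption
make_line '' '' ''                              # prints 0.000000###

# Empty begin and end fields leave three '#' in a row, which the check looks for.
if make_line '' '' '' | grep -q '###'; then
    echo 'empty begin/end fields detected'
fi
```

When a line is split in the wrong place, the begin and end fields come out empty, awk turns the empty begin time into 0.000000, and the result is the "0.000000###" line from the error report.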

Simply put, the root cause of the script breaking in Cygwin is the "newline". In DOS/Windows a newline is CR+LF, while in UNIX/Cygwin it is LF. The problem would not occur in a "pure" environment. That is, the script with the "unnecessary newline and comma" fix should run smoothly under Linux, a "pure" Cygwin environment, or a "pure" Win32/MinGW environment. When you run the script in a "hybrid" environment that mixes Cygwin and Win32 programs, the behavior of commands like grep, tr, sed, or even the for loop becomes somewhat unpredictable. The first dbg script broke because of the XMLStarlet Win32 binary: the annotation file, which was in UNIX format, was "converted" into DOS format on output, and that broke the processing that followed. In the dbg2 and dbg4 scripts I used a tr command to convert the Win32 XMLStarlet output back to UNIX format, but they still broke because the annotation file itself had changed from UNIX to DOS format, which made the fix that removes newlines inside annotations fail. So in the last script, dbg5, I added another conversion to make sure the annotations are in UNIX format even when the annotation file is actually in DOS format. Fortunately, converting DOS newlines to UNIX newlines only means removing the CR character at the end of each line, so the script can still run under Linux without any side effects.
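The conversion itself can be sketched as below. This is an assumption about the general technique, not the exact code in the dbg5 script; `strip_cr` is a hypothetical helper, and for simplicity it deletes every CR byte rather than only those at the end of a line.

```shell
#!/bin/bash
# Hypothetical helper: turn CR+LF line endings into LF by deleting CR bytes.
strip_cr() {
    tr -d '\r'
}

# DOS-style input comes out with LF-only line endings;
# od -c shows the remaining bytes so the missing \r can be verified.
printf 'first line\r\nsecond line\r\n' | strip_cr | od -c
```

On input that is already in UNIX format there is no CR to delete, so the filter passes the data through unchanged.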

I will release a version 0.3 script that includes all the fixes some time later after some cleaning up.

Thanks for your reporting and testing!

L said...

Wow, those file formats sure mess everything up :P
No, thank you for all your bug fixing (: