S.-F. Yang's Blog in English: 2011

Tuesday, November 08, 2011

"Speaking Mandarin Chinese" in Hollywood

It's quite common in TV shows and movies to see characters speak something they claim to be "Mandarin Chinese". Some characters even claim to be very fluent in it. However, to a native Traditional Chinese speaker from Taiwan like me, the so-called Chinese on screen is, most of the time, barely understandable, if it can be understood at all.

This is quite strange, since there should be plenty of native Chinese speakers near the production locations of these shows and movies. Is it that hard to find a decent language consultant to ensure the proper pronunciation of a few lines? Or is Hollywood just too proud to admit that it can't get this right, being so self-centered and so used to laughing at those who don't speak proper English? It's quite painful to hear a character you love speak something that bears no resemblance to what it is claimed to be, if that "thing" can be called a language at all.


Saturday, November 05, 2011

Grabbing the vanity card of TBBT into an image

The producer of the TV show "The Big Bang Theory", Mr. Chuck Lorre, always shows a vanity card at the end of each episode. He also posts the same cards on his own website, along with those for the other shows he has produced.

Recently, for some reason, I wanted to attach the vanity card for a specific episode, taken from the website, to an e-mail as an image. I wanted the image to contain only the content of the card rather than the whole page. This could, of course, be done by taking a screenshot and cropping it with something like GIMP or ImageMagick. However, since I'm a lazy guy, and the chance that I will do this more than once is quite high, manual screen capturing and cropping is certainly not an option for me. Fortunately, I had some ideas on how to do this automatically.

There are lots of ways to render a web page into an image from the command line. My weapon of choice is the still-buggy-but-quite-useful wkhtmltoimage from the wkhtmltopdf project. wkhtmltoimage uses WebKit and Qt to render a given page directly into an image. The great thing about this tool is that it honors the page's CSS and JavaScript, while also letting you replace the CSS with your own version and run additional JavaScript before rendering happens.

At first, I tried rendering the page into an image and then passing the image to ImageMagick's convert to cut out only the "vanity card" block. However, this approach proved problematic, since it is hard to determine automatically the cropping parameters needed for convert's "-crop" option. After inspecting the HTML and CSS sources of the page, I decided to experiment with the "visibility" property in the CSS definitions. I downloaded the CSS file, set "visibility" to "hidden" for the topmost selector (the "#container" block in this case), turned visibility back on only for the "#content" block, and supplied the customized CSS to wkhtmltoimage. This gave me a rendered image that shows only the "card" block in the center of a white background. The white "border" can then be easily removed using convert's "-trim" option.
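For illustration, a lighter variant of the same idea is to layer a tiny override style sheet on top of the page instead of editing the whole downloaded CSS file. The sketch below is not the exact procedure described above: the helper names card_css and grab_with_css are hypothetical, the "!important" markers are an assumption to make the overrides win over the page's own rules, and it assumes that wkhtmltoimage accepts the --user-style-sheet option and that ImageMagick's convert is installed.

```shell
#!/bin/bash
# Hypothetical helper: print a small override style sheet that hides the
# page wrapper and re-shows only the card block. The selectors come from
# the post; "!important" is an assumption to win over the page's own CSS.
card_css() {
    cat <<'EOF'
#container { visibility: hidden !important; }
#content { visibility: visible !important; }
EOF
}

# Hypothetical wrapper: render the page at URL ($1) into a trimmed JPEG ($2).
# Assumes wkhtmltoimage supports --user-style-sheet and convert is available.
grab_with_css() {
    local url=$1 out=$2 css rc
    css=$(mktemp) || return 1
    card_css > "${css}"
    wkhtmltoimage --user-style-sheet "${css}" "${url}" - | convert - -trim "${out}"
    rc=$?
    rm -f "${css}"
    return "${rc}"
}
```

Calling grab_with_css with the page URL and an output file name would then produce the trimmed card, provided the assumptions above hold for the installed wkhtmltoimage build.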

Although the download-and-modify-CSS approach worked, supplying an entire modified CSS file to wkhtmltoimage is not elegant and could have side effects. A better approach is to take advantage of wkhtmltoimage's ability to run JavaScript and alter the "visibility" property of the appropriate elements after the page has finished loading. Here is my final "one-liner" solution to my problem:

$ wkhtmltoimage \
--run-script "document.getElementById('container').style.visibility='hidden';" \
--run-script "document.getElementById('content').style.visibility='visible';" \
"http://chucklorre.com/index-bbt.php?p=364" - \
| convert - -trim tbbt.jpg

The generated JPEG image, "tbbt.jpg", only contains the "card" I want.

The principle behind this can also be applied to other pages. As usual, I wrote a script to save myself some typing; it takes an optional production number argument to grab the card for a specific episode. However, since it is a very simple script, I won't bother to post the code here...
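For readers who want a starting point anyway, a wrapper along these lines might look like the following sketch. It is not the original script: the function names card_url and grab_card, the DEFAULT_CARD fallback, and the output file naming are all assumptions made for illustration. Only the URL pattern and the wkhtmltoimage/convert pipeline come from the one-liner above.

```shell
#!/bin/bash
# Hypothetical wrapper around the one-liner above; not the author's script.
DEFAULT_CARD=364   # assumption: fall back to the number used in the post

# Build the page URL for a given card number (the value of the "p" parameter).
card_url() {
    local num=${1:-${DEFAULT_CARD}}
    printf 'http://chucklorre.com/index-bbt.php?p=%s\n' "${num}"
}

# Render the card page and trim the white border, writing tbbt-NUMBER.jpg.
grab_card() {
    local num=${1:-${DEFAULT_CARD}}
    wkhtmltoimage \
        --run-script "document.getElementById('container').style.visibility='hidden';" \
        --run-script "document.getElementById('content').style.visibility='visible';" \
        "$(card_url "${num}")" - \
        | convert - -trim "tbbt-${num}.jpg"
}
```

With these definitions loaded, running grab_card with a number would fetch that card, and running it without arguments would fall back to the default.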


Sunday, October 30, 2011

ann2srt v0.3

Although all the bug fixing, testing, and cleaning up were done several days ago, I was a little too lazy to write... Anyway, here is the "official release notice" of ann2srt version 0.3.

Thanks to the commenter L, who helped me test and debug the script on Cygwin, version 0.3 of ann2srt can now handle annotations in languages other than Traditional Chinese that contain newlines and commas, and it also runs correctly under the Cygwin environment on the Win32 platform.

Because the version 0.2 script uses CSV (Comma-Separated Values) as an intermediate format, it fails if an annotation contains a newline or a comma. To fix this, version 0.3 uses tr to replace the newlines in the annotation with spaces. To address the "comma" problem, the delimiter for the intermediate stream is changed from a comma to "|".
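The effect of the delimiter change can be seen in a small, self-contained sketch. The helper name field and the sample record are made up for illustration; only the use of cut and tr mirrors the script.

```shell
#!/bin/bash
# Return field N ($2) of a record ($1) split on the delimiter ($3).
field() {
    printf '%s\n' "$1" | cut -d "$3" -f "$2"
}

# Sample record: annotation text followed by begin and end times.
# The text itself contains a comma.
text='Hello, world'
csv_record="${text},0:00:01.0,0:00:03.5"
pipe_record="${text}|0:00:01.0|0:00:03.5"

field "${csv_record}" 2 ','    # prints " world" (the text was split in two)
field "${pipe_record}" 2 '|'   # prints "0:00:01.0" (the begin time, as intended)

# Flatten a newline inside an annotation, as the tr step in v0.3 does.
printf 'two\nlines' | tr '\n' ' '; echo   # prints "two lines"
```

Note that an annotation containing "|" would still confuse the split; the new delimiter is simply much less likely to appear in subtitle text than a comma.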

Technically speaking, the version 0.2 script should run correctly under Cygwin without any modification. However, since Windows uses "DOS style" line endings that consist of CR+LF, the behavior of the script becomes unpredictable if any of the external programs it uses is a Win32 binary, or if the input annotation file is in DOS format. To fix this, tr is used again to convert both the annotation file and the output of the Win32 XMLStarlet from DOS format to UNIX format.
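To see why a stray CR matters, consider the minimal sketch below. The function name dos2unix_stream and the sample line are assumptions for illustration; the tr invocation is the same one the script uses.

```shell
#!/bin/bash
# Strip carriage returns so CR+LF input behaves like LF-only input.
dos2unix_stream() {
    tr -d '\r'
}

# A DOS-style line: the trailing CR silently becomes part of the last field.
dos_line=$'0:00:01.0|0:00:03.5\r'

raw_end=$(printf '%s\n' "${dos_line}" | cut -d '|' -f 2)
clean_end=$(printf '%s\n' "${dos_line}" | dos2unix_stream | cut -d '|' -f 2)

echo "${#raw_end}"     # 10: nine visible characters plus the hidden CR
echo "${#clean_end}"   # 9
```

The hidden character is invisible when printed, which is what makes this kind of failure look unpredictable.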

Let's cut to the chase. Here is the source of the version 0.3 script:

#!/bin/bash
#
# Convert the youtube annotation into SRT subtitle
#
# By Shang-Feng Yang
# Version: 0.3
# License: GPL v3
#
# Changelog:
#   * v0.3 (Oct/19/2011):
#     - Fix the parsing errors caused by comma and newline characters in
#       some English annotations
#     - Add transparent dos2unix conversion for compatibility under Cygwin
#   * v0.2 (Jan/19/2011):
#     - Sort the annotations using the "begin" time as key
#     - Minor bug fixes
#   * v0.1 (Dec/7/2010):
#     - Initial release

ANN=$1
SRT=$(basename "${ANN}" .xml).srt
IFS=$'\n'
I=0

function usage() {
    echo -e "Usage:\n"
    echo -e "\t$(basename "$0") ANNOTATION_FILE\n"
}

function parseXML() {
    # Strip CRs, flatten newlines, then emit one "TEXT|begin|end|" line
    # per annotation
    tr -d '\r' < "${ANN}" | tr '\n' ' ' \
        | xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o '|' \
            -m 'segment/movingRegion/rectRegion' -v '@t' -o '|' -b -n \
        | tr -d '\r'
}

function reformatTime() {
    # Convert H:MM:SS.s into the HH:MM:SS,mmm form used by SRT
    local H=$(echo "$1" | cut -d ':' -f 1)
    local M=$(echo "$1" | cut -d ':' -f 2)
    local S=$(echo "$1" | cut -d ':' -f 3)
    # 10# forces base 10 so that values such as "08" are not read as octal
    printf '%02d:%02d:%06.3f' "$((10#${H:-0}))" "$((10#${M:-0}))" "${S}" | tr '.' ','
}

function time2sod() {
    # Convert time in HH:MM:SS.SSS format into second-of-the-day value
    local SOD=$(echo "$1" | awk -F ":" '{printf("%f\n", $1*3600+$2*60+$3);}')

    echo "${SOD}"
}

[ "x${ANN}" = "x" ] && { usage; exit 1; }
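[ -f "${ANN}" ] || { usage; exit 1; }
[ -f "${SRT}" ] && rm "${SRT}"
[ -f "${SRT}.tmp" ] && rm "${SRT}.tmp"

# First pass: prefix each annotation with its begin time in seconds
for LINE in $(parseXML); do
    C=$(echo "${LINE}" | cut -d '|' -f 1)
    B=$(echo "${LINE}" | cut -d '|' -f 2)
    E=$(echo "${LINE}" | cut -d '|' -f 3)
    echo "$(time2sod "${B}")#${B}#${E}#${C}" >> "${SRT}.tmp"
done

# A record with empty begin, end, and text fields means nothing was parsed
grep -q "###" "${SRT}.tmp" && {
    echo "\"${ANN}\" has no valid annotation!" >&2
    rm "${SRT}.tmp"
    exit 1
}

# Second pass: sort by begin time and write the numbered SRT entries
for LINE in $(sort -n -t '#' "${SRT}.tmp"); do
    (( I++ ))
    # "-f 4-" keeps any "#" characters that appear in the text itself
    C=$(echo "${LINE}" | cut -d '#' -f 4-)
    B=$(reformatTime "$(echo "${LINE}" | cut -d '#' -f 2)")
    E=$(reformatTime "$(echo "${LINE}" | cut -d '#' -f 3)")
    echo -e "${I}\n${B} --> ${E}\n${C}\n" >> "${SRT}"
done

rm "${SRT}.tmp"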

The version 0.3 script can also be downloaded from here to avoid typos caused by copy-and-paste:
http://dl.dropbox.com/u/1382119/tmp/ann2srt

Note that the code block above may lose its indentation after Blogger updates. If it looks wrong, please download the correct script from the link above.


Thursday, January 20, 2011

ann2srt v0.2

Last time, in my post "Converting Youtube's annotation into SRT subtitle", I released a bash script called "ann2srt" v0.1. Version 0.1 was pretty crude: it did not sort the subtitles in the SRT file, which could be problematic for some SRT parsers. Yesterday, I spent some time adding the sorting functionality to the script, and also fixed some minor bugs in v0.1.

The sorting is done by using awk/gawk to convert the "begin" time of each annotation into seconds and then passing the results to sort. Since sort is part of GNU coreutils, and awk/gawk is installed on most distributions, this change should not be a big deal for most people.
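The idea can be tried in isolation with the sketch below. The function names to_seconds and sort_times and the sample times are assumptions for illustration; the awk arithmetic is the same one the script's time2sod function uses.

```shell
#!/bin/bash
# Convert H:MM:SS.s into seconds, using the same arithmetic as time2sod.
to_seconds() {
    printf '%s\n' "$1" | awk -F ':' '{ printf("%f\n", $1 * 3600 + $2 * 60 + $3) }'
}

# Prefix each time with its value in seconds, sort numerically on that
# prefix, then drop the prefix again.
sort_times() {
    local t
    while IFS= read -r t; do
        printf '%s#%s\n' "$(to_seconds "${t}")" "${t}"
    done | sort -t '#' -k 1,1n | cut -d '#' -f 2
}

# Prints the three times in chronological order:
# 0:00:09.5, 0:01:05.0, 0:10:00.0
printf '%s\n' '0:01:05.0' '0:00:09.5' '0:10:00.0' | sort_times
```

Sorting on a plain number of seconds avoids depending on how the hour, minute, and second fields happen to be padded in the annotation file.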

Here is the code for v0.2:

#!/bin/bash
#
# Convert the youtube annotation into SRT subtitle
#
# By Shang-Feng Yang <storm_DOT_sfyang_AT_gmail_DOT_com>
# Version: 0.2
# License: GPL v3
#
# Changelog:
#   * v0.2 (Jan/19/2011):
#     - Sort the annotations using the "begin" time as key
#     - Minor bug fixes

function usage() {
    echo -e "Usage:\n"
    echo -e "\t$(basename "$0") ANNOTATION_FILE\n"
}

function parseXML() {
    # Emit one "TEXT,begin,end," line per annotation
    xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o ',' \
        -m 'segment/movingRegion/rectRegion' -v '@t' -o ',' -b -n < "${ANN}"
}

function reformatTime() {
    # Convert H:MM:SS.s into the HH:MM:SS,mmm form used by SRT
    local H=$(echo "$1" | cut -d ':' -f 1)
    local M=$(echo "$1" | cut -d ':' -f 2)
    local S=$(echo "$1" | cut -d ':' -f 3)
    # 10# forces base 10 so that values such as "08" are not read as octal
    printf '%02d:%02d:%06.3f' "$((10#${H:-0}))" "$((10#${M:-0}))" "${S}" | tr '.' ','
}

function time2sod() {
    # Convert time in HH:MM:SS.SSS format into second-of-the-day value
    local SOD=$(echo "$1" | awk -F ":" '{printf("%f\n", $1*3600+$2*60+$3);}')

    echo "${SOD}"
}

ANN=$1
SRT=$(basename "${ANN}" .xml).srt
IFS=$'\n'
I=0

[ "x${ANN}" = "x" ] && { usage; exit 1; }
[ -f "${ANN}" ] || { usage; exit 1; }
[ -f "${SRT}" ] && rm "${SRT}"
[ -f "${SRT}.tmp" ] && rm "${SRT}.tmp"

# First pass: prefix each annotation with its begin time in seconds
for LINE in $(parseXML); do
    C=$(echo "${LINE}" | cut -d ',' -f 1)
    B=$(echo "${LINE}" | cut -d ',' -f 2)
    E=$(echo "${LINE}" | cut -d ',' -f 3)
    echo "$(time2sod "${B}")#${B}#${E}#${C}" >> "${SRT}.tmp"
done

# A record with empty begin, end, and text fields means nothing was parsed
grep -q "###" "${SRT}.tmp" && {
    echo "\"${ANN}\" has no valid annotation!"
    rm "${SRT}.tmp"
    exit 1
}

# Second pass: sort by begin time and write the numbered SRT entries
for LINE in $(sort -n -t '#' "${SRT}.tmp"); do
    (( I++ ))
    C=$(echo "${LINE}" | cut -d '#' -f 4)
    B=$(reformatTime "$(echo "${LINE}" | cut -d '#' -f 2)")
    E=$(reformatTime "$(echo "${LINE}" | cut -d '#' -f 3)")
    echo -e "${I}\n${B} --> ${E}\n${C}\n" >> "${SRT}"
done

rm "${SRT}.tmp"

The usage is the same as in v0.1.
