A while back, I talked about my efforts to get a full, liturgical version of the Amitabha Sutra, one of my favorite Buddhist texts, online with both Chinese characters and Japanese-romanized readings. Because the sutra is so long, copying and pasting it and writing all the HTML by hand just isn’t practical. So, I wrote a Perl script that would parse the romanized text and add all the necessary HTML tags.
Trouble is, I couldn’t make it parse the Chinese characters because they’re UTF-8 encoded, not ASCII text. UTF-8 characters can be multiple bytes long, and using simple tools like split() in Perl on undecoded text can cause a single Chinese character to get split into separate, unusable bytes of gibberish. Perl can process Unicode, but it doesn’t come naturally, so I tried to copy/paste the Chinese characters by hand for a while, and eventually gave up on that too. The text was just too long.
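To make the byte problem concrete, here is a minimal Ruby sketch of the difference between cutting text by bytes and splitting it by characters. The helper names `cut_by_bytes` and `split_by_chars` and the two-character sample are made up for illustration; they are not part of the original Perl script.

```ruby
# encoding: UTF-8

# Cut a string after a given number of bytes, ignoring character boundaries.
# This mimics what a byte-oriented tool does to multi-byte text.
# (Hypothetical helper, for illustration only.)
def cut_by_bytes(text, byte_count)
  text.byteslice(0, byte_count)
end

# Split a string into whole characters. (Hypothetical helper.)
def split_by_chars(text)
  text.chars
end

sample = "如是" # two characters, three bytes each in UTF-8

puts "bytes: #{sample.bytesize}, characters: #{sample.length}"
puts "cut after 2 bytes is valid text? #{cut_by_bytes(sample, 2).valid_encoding?}"
puts "split by character: #{split_by_chars(sample).inspect}"
```

Cutting after two bytes lands in the middle of the first character, which is why the fragment is no longer valid UTF-8, while splitting by character keeps each one whole.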
But lately, after exploring the Python language, I tried to revive this old project, and got much closer. However, Python’s Japanese-language text processing requires modules I couldn’t use on my Linux distribution (Linux Mint), so I decided to try a different language again: Ruby.
Ruby, ironically, was designed by a Japanese developer. Its syntax is English-based, but it still handles UTF-8 a lot more easily, and is a pretty nice language to learn in general. So, after poking around on the Web for a couple of nights, I came up with this amateur script:
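# encoding: UTF-8
file = File.new(ARGV[0], "r:UTF-8")    # ARGV is an array; take the first argument
while (line = file.gets)
  word = line.chomp.split(//u)         # split into characters, not bytes
  for i in (0...word.length)
    if i % 5 == 0 and i % 10 != 0
      print "<td> </td>"               # spacer cell before the 6th, 16th, ... character
    elsif i % 10 == 0 and i > 0
      # The listing is cut off from here on; the rest is an assumption.
      print "</tr>\n<tr>"              # assumed: start a new row every 10 characters
    end
    print "<td>#{word[i]}</td>"        # assumed: one table cell per character
  end
end
file.close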
If I take the Amitabha Sutra text from Japanese Wikipedia, copy it into a text file, and remove all spaces and unwanted characters, I have a plain-text file with one long, long string of Chinese characters. Using the script above, I could parse that and wrap the characters in HTML tags.
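The clean-up step described above (stripping spaces and unwanted characters before tagging) could be sketched in Ruby roughly as follows. This is only a sketch under assumptions, not the script I actually ran: the helper names `clean_text` and `to_table_rows`, the rule of keeping only Han characters, the made-up sample input, and the plain one-cell-per-character row layout (no spacer cell) are all hypothetical.

```ruby
# encoding: UTF-8

# Keep only Han (Chinese) characters, dropping spaces, punctuation,
# digits, line breaks and anything else. (Hypothetical helper.)
def clean_text(raw)
  raw.scan(/\p{Han}/).join
end

# Wrap cleaned text in table rows: one <td> per character and
# `per_row` characters per <tr>. (Hypothetical helper; layout is assumed.)
def to_table_rows(text, per_row = 10)
  text.chars.each_slice(per_row).map do |row|
    cells = row.map { |ch| "<td>#{ch}</td>" }.join
    "<tr>#{cells}</tr>"
  end
end

# Made-up sample with stray spaces, punctuation and a footnote marker.
raw = " 如是我聞。\n[1] 一時 "
puts to_table_rows(clean_text(raw), 4)
```

Doing the cleaning in code rather than by hand means the same steps can be rerun on another sutra later without repeating the manual editing.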
Then, it’s simply a matter of copying and pasting each line into the Amitabha Sutra I am writing for the blog! This approach took more work up front, but saved me weeks, probably months, of copying and pasting each character by hand! At some point, I hope to move on to other sutras as well and get them “stamped out” for liturgical use by other people, but first I want to revise the script so that the Chinese and romanized text are both organized into HTML correctly the first time. Then it’s a simple copy-paste right into the blog! :)
I haven’t finished copying this one yet, but already I’ve made a lot more progress than before. As my old boss used to say: work smarter, not harder. He was right. :)
Namu Amida Butsu