Japanese Language, Buddhist Sutras and Ruby Programming

A while back, I talked about my efforts to get a full, liturgical version of the Amitabha Sutra, one of my favorite Buddhist texts, online with both Chinese characters and romanized Japanese readings. The sutra is so long that copying and pasting the text and writing all the HTML by hand is not practical. So, I wrote a Perl script that would parse the romanized text and insert all the necessary HTML tags.

Trouble is, I couldn’t make it parse the Chinese characters because they’re encoded in UTF-8, not plain ASCII. UTF-8 characters can be multiple bytes long (Chinese characters usually take three), and using simple tools like split() in Perl on the raw bytes can cut a single Chinese character into separate, unusable bytes of gibberish. Perl can process Unicode, but it doesn’t come naturally. I eventually gave up and tried to copy/paste the Chinese characters by hand for a while, but gave up on that too. It was just too long.
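To make the byte-versus-character problem concrete, here is a minimal sketch in Ruby. The sample string is just a short run of Chinese characters picked for illustration, and the constant name SAMPLE is made up for this example; it is not part of the original Perl script.

```ruby
# encoding: UTF-8

# A short sample string of four Chinese characters (chosen only for illustration).
SAMPLE = "如是我聞"

# The same string measured two ways.
puts SAMPLE.bytesize       # 12: each of these characters takes 3 bytes in UTF-8
puts SAMPLE.chars.length   # 4: Ruby counts whole characters here

# Cutting at a byte boundary leaves a fragment that is not valid UTF-8.
fragment = SAMPLE.byteslice(0, 1)
puts fragment.valid_encoding?   # false

# Splitting by character keeps each one intact.
puts SAMPLE.chars.join(" ")
```

Any tool that slices the text one byte at a time ends up with fragments like the one above, which is why the split has to work on characters rather than bytes.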

But lately, after exploring the Python language, I tried to revive this old project, and got much closer. However, Python’s Japanese-language text processing requires modules I couldn’t use on my Linux distribution (Linux Mint), so I decided to try a different language again: Ruby.

Ruby, ironically, was designed by a Japanese developer. Its keywords are in English, but it still handles UTF-8 a lot more easily than my Perl attempt did, and it is a pretty nice language to learn in general. So, after poking around on the Web for a couple of nights, I came up with this amateur script:

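# encoding: UTF-8

# Open the text file named as the first command-line argument.
file = File.new(ARGV[0], "r")

while (line = file.gets)
  # Drop the trailing newline, then split the line into characters (not bytes).
  word = line.chomp.split(//u)
  for i in (0...word.length)
    print "<td>#{word[i]}</td>"
    # i is zero-based, so the first spacer follows the 6th character
    # and the first line break follows the 11th.
    if i % 5 == 0 and i % 10 != 0
      print "<td>&nbsp;</td>"
    elsif i % 10 == 0 and i > 0
      print "\n"
    end
  end
end

file.close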


If I take the Amitabha Sutra text from Japanese Wikipedia, copy it into a text file, and remove all spaces and unwanted characters, I have a plain-text file with one long, long string of Chinese characters. Using the script above, I can parse that file and wrap each character in an HTML <td> tag, with a spacer cell roughly every five characters and a line break roughly every ten.


Then, it’s simply a matter of copying and pasting each line into the Amitabha Sutra page I am writing for the blog! This approach took more work up-front, but saved me weeks, probably months of copying and pasting each character by hand! At some point, I hope to move on to other sutras as well and get them “stamped out” for liturgical use by other people, but first I want to revise the script to get the Chinese and romanized text all organized into HTML correctly the first time. Then it’s a simple copy-paste right into the blog! :)
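For what it’s worth, one possible shape for the Chinese-character half of that revision is sketched below. It is only a sketch under a few assumptions: the helper name table_rows, the <tr> wrapper, and the defaults of ten characters per row and five per group are all made up for illustration, and the romanized text is not handled at all. It uses Ruby’s each_slice to cut the characters into rows and groups instead of doing index arithmetic.

```ruby
# encoding: UTF-8

# Hypothetical helper: turn a string of characters into HTML table rows.
# per_row and per_group are illustrative defaults, not taken from the script above.
def table_rows(text, per_row = 10, per_group = 5)
  text.chars.each_slice(per_row).map do |row|
    # Wrap every character in its own cell, one group at a time.
    groups = row.each_slice(per_group).map do |group|
      group.map { |ch| "<td>#{ch}</td>" }.join
    end
    # A spacer cell goes between groups, never at the end of a row.
    "<tr>" + groups.join("<td>&nbsp;</td>") + "</tr>"
  end
end

# Sample input: eleven characters, so the last one spills onto a second row.
puts table_rows("如是我聞一時佛在舍衛國")
```

Because each_slice does the counting, every full row gets exactly ten characters with the spacer after the fifth, and a short final row simply comes out shorter.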

I haven’t finished copying this one yet, but already I’ve made a lot more progress than before. As my old boss used to say: work smarter, not harder. He was right. :)

Namu Amida Butsu

About Doug

A fellow who dwells upon the Pale Blue Dot who spends his days obsessing over things like Buddhism, KPop music, foreign languages, BSD UNIX and science fiction.

4 thoughts on “Japanese Language, Buddhist Sutras and Ruby Programming”

  1. Oh! I love it! I liked how you collapsed the ARGV[0] value right into the File.readlines part. I am new to Ruby, so I definitely appreciate the example and I’ll have to test it out. I’ll update and let you know.

    Thank you and happy hacking to you too!

  2. Hello and welcome to the JLR. I would love to post in other languages, but I don’t have the time or language resources to do it at the moment, though I will definitely consider this in the future.
