Python 2 uses ASCII as its default encoding, which causes a lot of headaches when you need to work with CJK characters. On top of that, languages written without spaces, like Chinese and Japanese, make string handling a pain for the budding Pythonista (like myself). Here, I go over how to set up Python 2.7 for UTF-8 on Ubuntu (11.10 64-bit), along with a crude way to break a spaceless Japanese string into individual characters and toss them into a list.
First things first, we need to edit our site.py file to change the default encoding to UTF-8. On my machine this file lives in /usr/lib/python2.7/, and the change is to set the encoding inside setencoding() from "ascii" to "utf-8":
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
Now if we fire up Python and check sys.getdefaultencoding(), we should see this:
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
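Editing site.py changes the default for every script that interpreter runs, so here is a minimal sketch of the other route: leave the default alone and name the codec explicitly at each conversion. The variable names (raw, text, round_trip) are just mine for illustration, and the __future__ imports are an assumption so that the same snippet runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import sys

# Whatever the interpreter reports here, the conversions below do not
# depend on it, because the codec is named explicitly each time.
print(sys.getdefaultencoding())

raw = b'\xe4\xb8\x80\xe5\x8f\xb3'   # UTF-8 bytes for the first two kanji of the list
text = raw.decode('utf-8')          # bytes -> Unicode text
round_trip = text.encode('utf-8')   # Unicode text -> bytes

print(len(text))            # 2
print(round_trip == raw)    # True
```

The trade-off is a bit more typing in exchange for code that behaves the same on machines where site.py has not been touched.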
Great, now let’s start working with some Japanese. Let’s grab some Grade 1 Kanji from http://www.saiga-jp.com/language/kanji_list.html and see if we can change the string of Kanji into a list.
>>> grade1 = '一右雨円王音下火花貝学気九休玉金空月犬見五口校左三山子四糸字耳七車手十出女小上森人水正生青夕石赤千川先早草足村大男竹中虫町天田土二日入年白八百文木本名目立力林六'
Now let’s decode it from UTF-8 into a unicode string and put a space between each Kanji with the handy join function!
>>> grade1_dec = grade1.decode('utf-8')
>>> print grade1_dec
一右雨円王音下火花貝学気九休玉金空月犬見五口校左三山子四糸字耳七車手十出女小上森人水正生青夕石赤千川先早草足村大男竹中虫町天田土二日入年白八百文木本名目立力林六
>>> grade1_spaced = ' '.join(grade1_dec)
>>> print grade1_spaced
一 右 雨 円 王 音 下 火 花 貝 学 気 九 休 玉 金 空 月 犬 見 五 口 校 左 三 山 子 四 糸 字 耳 七 車 手 十 出 女 小 上 森 人 水 正 生 青 夕 石 赤 千 川 先 早 草 足 村 大 男 竹 中 虫 町 天 田 土 二 日 入 年 白 八 百 文 木 本 名 目 立 力 林 六
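Why bother decoding before the join? Here is a small sketch, assuming a three-kanji sample instead of the full string: the byte string is three times as long as the text, so working on the raw bytes would cut each Kanji into pieces. The names sample, raw, and spaced are hypothetical, and the __future__ imports are only there so the snippet behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

sample = '一右雨'                # three kanji as Unicode text
raw = sample.encode('utf-8')     # the same three kanji as UTF-8 bytes

print(len(raw))      # 9, since each of these kanji takes three bytes in UTF-8
print(len(sample))   # 3, one item per kanji

# Joining over the decoded text puts a space between whole characters.
spaced = ' '.join(raw.decode('utf-8'))
print(len(spaced))   # 5: three kanji plus two spaces
```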
Whoa, this is getting quite nifty, and it’s barely taken any effort. This is why I love Python! Let’s now make a list out of this bad boy with the split function and check a few indexes.
>>> grade1_list = grade1_spaced.split(' ')
>>> print grade1_list
[u'\u4e00', u'\u53f3', u'\u96e8', u'\u5186', u'\u738b', u'\u97f3', u'\u4e0b', u'\u706b', u'\u82b1', u'\u8c9d', u'\u5b66', u'\u6c17', u'\u4e5d', u'\u4f11', u'\u7389', u'\u91d1', u'\u7a7a', u'\u6708', u'\u72ac', u'\u898b', u'\u4e94', u'\u53e3', u'\u6821', u'\u5de6', u'\u4e09', u'\u5c71', u'\u5b50', u'\u56db', u'\u7cf8', u'\u5b57', u'\u8033', u'\u4e03', u'\u8eca', u'\u624b', u'\u5341', u'\u51fa', u'\u5973', u'\u5c0f', u'\u4e0a', u'\u68ee', u'\u4eba', u'\u6c34', u'\u6b63', u'\u751f', u'\u9752', u'\u5915', u'\u77f3', u'\u8d64', u'\u5343', u'\u5ddd', u'\u5148', u'\u65e9', u'\u8349', u'\u8db3', u'\u6751', u'\u5927', u'\u7537', u'\u7af9', u'\u4e2d', u'\u866b', u'\u753a', u'\u5929', u'\u7530', u'\u571f', u'\u4e8c', u'\u65e5', u'\u5165', u'\u5e74', u'\u767d', u'\u516b', u'\u767e', u'\u6587', u'\u6728', u'\u672c', u'\u540d', u'\u76ee', u'\u7acb', u'\u529b', u'\u6797', u'\u516d']
>>> print grade1_list[3]
円
>>> print grade1_list[60]
町
>>> print grade1_list[79]
六
>>> for i in grade1_list[34:37]:
...     print i
...
十
出
女
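As a side note, the join-then-split round trip can probably be skipped: calling list() on a unicode string already yields one item per character. The sketch below uses a short hypothetical sample rather than the full grade1 string, and the __future__ imports are an assumption so that it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

sample = '十出女小上'

via_join_split = ' '.join(sample).split(' ')   # the approach used above
via_list = list(sample)                        # one step, same result

print(via_join_split == via_list)   # True
print(len(via_list))                # 5

# The two approaches differ once the text itself contains a space:
with_space = '十 出'
print(len(' '.join(with_space).split(' ')))   # 4, including two empty strings
print(len(list(with_space)))                  # 3, the space is kept as an item
```

For a string of nothing but Kanji the two give the same list, so which one to use is mostly a matter of taste.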
Now that we have a nice list going, let’s give this a final hurrah by checking whether certain Kanji are in the list.
>>> '見' in grade1_list
True
>>> '絵' in grade1_list
False
>>> '花' in grade1_list
True
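If you end up doing a lot of these checks, a set is a natural fit. Below is a rough sketch, assuming a four-kanji sample in place of the full list; unknown_kanji is a hypothetical helper name of mine, not something from a library. The unicode_literals import makes every literal a unicode string, so the comparison does not lean on the default encoding at all.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

grade1_sample = '見花山川'     # a few kanji taken from the Grade 1 string
known = set(grade1_sample)     # a set gives fast membership checks


def unknown_kanji(text, known_chars):
    """Return the characters of text that are not in known_chars, in order."""
    return [ch for ch in text if ch not in known_chars]


print('見' in known)   # True
print('絵' in known)   # False

missing = unknown_kanji('花絵山', known)
print(len(missing))    # 1, only 絵 is outside the sample
```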
The possibilities are endless from here, and I hope this excites people enough to want to start messing around with other languages in Python (or in programming in general). There is a certain stigma associated with encodings, so I hope this clears up some of the Python confusion.
