HSK东西 Scripts: a site for learning Chinese characters - or, "Handling Chinese characters with Python Unicode strings is less hassle than I thought it would be."

Alan Davies is learning Chinese and couldn’t find a site that would work out what level of difficulty a text or a vocabulary list would be. So he built a site to do that on PythonAnywhere, our Python-focused PaaS and browser-based programming environment. Bravely enough, he did it in Python 2, which is not renowned for its Unicode support. While he says that “there are a few little things you have to be aware of” Unicode-wise, it turned out to be entirely doable and is now used by people learning Chinese all over the world.

Screenshot of the HSK Scripts front page

What does your site do?

There is a standardised test of Mandarin Chinese for foreigners called the HSK. The HSK has six levels, with HSK 1 requiring the ability to work with a vocabulary of 150 Chinese words, and HSK 6 requiring about 5,000 words. It’s useful to be able to ask questions about all these word lists:

  • How much of the HSK level 3 vocabulary do I know?
  • Which words do I need to know to get to HSK 5?
  • Given the vocabulary list for a textbook, what level will it get me to?
  • Which words were added to or removed from a level when the vocabulary was revised?
  • I have passed HSK 4, will I be able to read this short story?

I wanted something that could answer these questions, so I wrote a quick script for myself to use, taking some Chinese text or a vocabulary list, and breaking it down according to the 6 levels of the standardised Chinese HSK exams.

What first gave you the idea to build a site like this?

I started learning Mandarin Chinese a couple of years ago. Once I got started it was addictive; I think it really appeals to the brain of a programmer because of the highly logical and consistent grammatical structure, with every character pronounced as exactly one syllable. There are many other interesting features, such as how the characters are constructed from smaller components which often give hints as to their meaning or pronunciation, a bit like a cryptic crossword clue in each character. One quick example: 马 (pronounced ‘ma’, ignoring the tones) means horse, and 女 means woman. 妈, which means ‘mother’, is also pronounced ‘ma’. The left side of 妈 gives a hint to its meaning, and the right side to its pronunciation.

How does it work?

My script has a few text files as inputs: the HSK words at each level, word and character frequencies, a free Chinese-English dictionary, and a file that gives character composition information (see above). The words and characters at each HSK level are stored in Python set() objects, and once a user's vocabulary list has been input, the various questions about HSK words and characters can be answered with combinations of the very powerful set operations. The resulting lists are sorted by character or word frequency as appropriate, to keep the more interesting and frequent characters near the top. The in-memory data structures (which are all sets, dictionaries, and lists) are just pickled to disk, to save re-parsing the source files.
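The set arithmetic described above maps almost one-to-one onto the questions in the earlier list. A minimal sketch, using tiny hypothetical word lists rather than the real HSK data:

```python
# Hypothetical mini word lists standing in for the real HSK data files.
hsk3_words = {"爱", "吃", "水", "学习", "可以"}
hsk5_words = hsk3_words | {"经济", "文化", "历史"}   # union: HSK 5 includes HSK 3
known_words = {"爱", "水", "经济", "你好"}

# How much of the HSK 3 vocabulary do I know? (intersection)
known_hsk3 = known_words & hsk3_words
coverage = len(known_hsk3) / len(hsk3_words)

# Which words do I still need to learn for HSK 5? (difference)
still_needed = hsk5_words - known_words

# Which words were added or removed when a level was revised?
revised_hsk3 = {"爱", "吃", "水", "学习", "喝"}
added = revised_hsk3 - hsk3_words
removed = hsk3_words - revised_hsk3
```

Each answer falls out of a single set operation, which is why the sets only ever need to be built (and pickled) once.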

I also added the ability to take a block of text and try to turn it into a vocabulary list. Parsing Chinese sentences into words is quite a difficult problem. For one thing, it is sometimes hard in Chinese to distinguish between ‘words’ and ‘phrases’ (editor’s note: Section 5 of this blog post gives you some idea of how hard this is). For another, Chinese speakers run ‘words’ together without pauses between them, just as we do in English, but when these words are written down there are no spaces between them. There generally isn’t too much ambiguity for a fluent Chinese reader, but for a computer program this is a nightmare. Tokenising Chinese sentences has been the subject of many research projects, and all sorts of statistical, grammatical, and AI techniques have been tried.

Another example: the characters 中 (middle), 国 (country), and 人 (person) are each words in their own right. Put them together and you can create the words 中国 (China, the ‘Middle Kingdom’), 国人 (compatriot), and 中国人 (Chinese person). All this can of course be in the middle of a sentence, with other potential word combinations formed from the first and last character. I chose a simple and crude way to resolve this ambiguity by looking at the frequency of the possible word combinations. It could probably use some more work, though; maybe next time I am delayed at an airport!
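A crude frequency-based segmenter in the spirit described can be sketched in a few lines. The word-frequency table and the three-character lookahead limit here are hypothetical simplifications, not the site's actual data or code:

```python
# Toy word-frequency table (made-up values; the real site uses
# corpus-derived frequencies and a full dictionary).
FREQ = {
    "中": 100, "国": 90, "人": 120,
    "中国": 500, "国人": 30, "中国人": 600,
}

def segment(text):
    """Greedy left-to-right segmentation: at each position, take the
    candidate word (up to 3 characters here) with the highest frequency."""
    words = []
    i = 0
    while i < len(text):
        best = text[i]  # fall back to the single character
        for length in (2, 3):
            candidate = text[i:i + length]
            if FREQ.get(candidate, 0) > FREQ.get(best, 0):
                best = candidate
        words.append(best)
        i += len(best)
    return words
```

With these frequencies, segment("中国人") returns ["中国人"], because the three-character word outscores the shorter splits at that position; a greedy pass like this is fast but can of course commit to a locally attractive split that a statistical segmenter would avoid.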

Who uses the site?

Judging by the feedback I’ve had, people learning Chinese all over the world. The top countries are the US, China, Thailand, Germany, and Taiwan.

How busy is the site (hits/day etc)?

At the moment about 100 unique users per day, as it’s really still just in early development. I’ve shared it with a few people to test it, but if it becomes popular I’ll polish it up and host it under my own domain name.

What frameworks/Python modules are you using?

The only non-standard-library module this project uses is Flask; I just wanted the simplest possible framework that would let objects persist on the server between requests, so I could keep some of the data around without re-parsing the data files for every request. The HTML and CSS are as simple as possible at the moment; making the site look pretty is not the part that interests me much.

Were there any interesting problems/challenges with getting it working? We’re guessing Unicode…

Handling Chinese characters with Python 2 Unicode strings is less hassle than I thought it would be, but there are a few little things you have to be aware of. You end up being paranoid and prefixing all of your string literals with ‘u’, a problem that I believe is fixed in Python 3, where all strings are Unicode by default. Missing the ‘u’ prefix causes some nasty behaviour in the examples below: 1 will output a bit of line noise, 2 will output “\u4e2d” rather than a properly encoded character, and, worst of all, 3 and 4 will throw a run-time exception. 5 will usually work correctly, depending on whether your editor has saved the file with the correct encoding, although for safety I stick with the style of 6 and keep my source files in ASCII:

print "1: {}".format("中")        # raw bytes; may appear as line noise
print "2: {}".format("\u4e2d")    # prints the literal text \u4e2d
print "3: {}".format(u"中")       # run-time exception
print "4: {}".format(u"\u4e2d")   # run-time exception
print u"5: {}".format(u"中")      # usually works, if the file's encoding is right
print u"6: {}".format(u"\u4e2d")  # safe: pure-ASCII source

It is also important to specify the encoding of files when reading and writing non-ASCII text, otherwise you will end up with garbled characters. Depending on the tools that will consume your UTF-8 files, you may also want to include a byte order mark. The following code opens files for reading and writing:

import codecs

# codecs.open decodes and encodes transparently, so the program only
# ever deals in unicode strings, never in raw bytes.
infile = codecs.open(infilename, 'r', 'utf-8')
outfile = codecs.open(outfilename, 'w', 'utf-8')
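If you do want a byte order mark on output, the standard library's "utf-8-sig" codec (not used in the snippet above; an alternative worth knowing) writes the BOM automatically and strips it when reading. A small round-trip sketch through a temporary file:

```python
import codecs
import os
import tempfile

# Round-trip some Chinese text through a file using the "utf-8-sig"
# codec, which writes a UTF-8 BOM on output and strips it on input.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with codecs.open(path, "w", "utf-8-sig") as outfile:
    outfile.write(u"\u4e2d\u56fd")  # 中国

with codecs.open(path, "r", "utf-8-sig") as infile:
    text = infile.read()  # BOM already stripped

# The raw bytes start with the BOM (EF BB BF), but the decoded text
# does not contain it.
with open(path, "rb") as raw:
    first_bytes = raw.read(3)
```

This keeps BOM handling out of your own code entirely, which is one less place for garbled characters to sneak in.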

What’s your background? How long have you been programming?

I messed about a bit with BASIC on the BBC Micro and ZX Spectrum when I was young, and then Assembler on the Atari ST. I did a degree in Computer Engineering and started work as a software developer, working in C, then C++, and more recently various .NET languages. I’m now doing a PhD in Finance, so any programming I do tends to be econometrics in R, SAS, and Stata. For about 12 years, though, I have tended to use Python for my own small projects, as I find it very quick and intuitive to write, and very readable.

How did PythonAnywhere help?

With PythonAnywhere I can log in from anywhere when I have a few minutes and fix a bug or add or improve a feature. And the processing power available is blinding.

Got a PythonAnywhere story you’d like to share next month?

Drop us a line at support@pythonanywhere.com and we’ll have a chat!
