Thursday, August 6, 2009

Iterating over Unicode strings in Python

I ran across an interesting code sample for reversing a unicode string in python the other day. I wasn’t aware of (or had forgotten about) the standard library’s unicodedata module, so it was good to learn (or be reminded) about it. This was also an opportunity for me to try out Python 3, and I enjoyed experiencing the enhancements to Unicode support first-hand.

While string reversal isn’t necessarily a problem you are likely to run into outside of a job interview, it does illustrate some of the issues that lie in wait if you try to process unicode strings as a simple collection of characters. These issues will occur in other contexts, such as breaking a large document up into chunks for streaming or processing.

Composed Characters


The issue the author noticed is that the answer to the problem isn’t just:

"".join(reversed(input_unicode)))

This is because what users think of as a “character” in a unicode string can be the composition of a base character and zero or more combining characters (accents, although in non-Western European scripts these can be vowels and other things that I don’t understand). A simple example is:

>>> string = 'e\u0300x'
>>> print(string, "".join(reversed(string)))
èx x̀e
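
The unicodedata module I mentioned at the top makes it easy to see what this string is actually made of:

>>> import unicodedata
>>> [unicodedata.name(c) for c in string]
['LATIN SMALL LETTER E', 'COMBINING GRAVE ACCENT', 'LATIN SMALL LETTER X']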

So, naïvely reversing this string puts the accent before the character it was modifying, which isn’t correct. Also, if the accent were the last character in the string, it ends up not having a preceding character to modify, and is left hanging out there in ghostly fashion. I think most display systems just ignore it, but you can end up with unwelcome surprises later if you concatenate with another Unicode string.
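
The ureverse() function from the post I linked to handles this properly; just to keep this page self-contained, here is my own rough sketch of the idea (not the original author’s code), using unicodedata.combining() to keep each base character glued to the combining characters that follow it:

>>> def grapheme_chunks(s):
...     chunk = ""
...     for ch in s:
...         if chunk and not unicodedata.combining(ch):
...             # ch is not a combining character, so it starts a new chunk
...             yield chunk
...             chunk = ""
...         chunk += ch
...     if chunk:
...         yield chunk
...
>>> "".join(reversed(list(grapheme_chunks(string))))
'xè'

This still does nothing about the surrogate problem described next, of course.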

UTF-16 and Surrogates


The solution proposed in the article deals excellently with the problem of composed character sequences, but runs into problems when Python is storing the elements of a str as 2-byte values.

If you’re familiar with the details of Unicode, UTF-16 and UTF-32, feel free to skip to the next section.

Strings in Unicode are made up of sequences of values in the range 0x0 to 0x10FFFF. Each individual value is known as a “code point”, and the range implies that we need at least 3 bytes per code point. In practice, since modern processors like even numbers of bytes, 4 are used. The encoding where 4 bytes (with some specified endianness) are used for each code point is called UTF-32.
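
For example, even a plain ASCII character takes four bytes in this encoding:

>>> 'A'.encode("UTF-32-BE")
b'\x00\x00\x00A'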

For reasons of history and memory usage, though, it is common to use a 2-byte
encoding, known as UTF-16. In this case, you represent code points whose value is ≥ 0x10000 via what is called a surrogate pair. Here’s how it works on my Mac OS X 10.5 system, with a stock build of Python 3.1:

>>> u = chr(0x10000)

Here’s one thing to like about Python 3 already: chr is happy to accept anything in the above-mentioned range (in 2.x, chr only accepted 0 to 255, and you had to use unichr for anything bigger). Also, repr() uses '\U', the escape for eight hex digits (four bytes) of Unicode:

>>> u
'\U00010000'

which is of course handy if you want to create Unicode strings without resorting to a function call. Now, even though u is a single “character” (code point in Unicode), it contains 2 values:

>>> len(u)
2
>>> list(u)
['\ud800', '\udc00']

Python is representing the single Unicode character U+10000 as two 16-bit values. These are the surrogate pair: the most significant bits of the code point are extracted from the first value, called the “high surrogate”, which lies in the range U+D800 – U+DBFF, inclusive. The least significant bits are extracted from the second value, the “low surrogate”, in the range U+DC00 – U+DFFF. The UTF-16 Wikipedia article I cited earlier has the specifics.
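
As a sanity check, here is that arithmetic written out by hand for U+10000; it reproduces the two values that list(u) showed above:

>>> code_point = 0x10000
>>> v = code_point - 0x10000
>>> hex(0xD800 + (v >> 10)), hex(0xDC00 + (v & 0x3FF))
('0xd800', '0xdc00')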

For the whole scheme to work, values in the whole range U+D800 – U+DFFF are not valid Unicode characters on their own; they are reserved for the surrogate mechanism and may only appear in pairs in UTF-16 encoded text. The rules for surrogate pairs are actually rather strict, and in particular reversing a surrogate pair doesn’t give you a valid surrogate pair.

So, we can go ahead and apply the ureverse() function defined in the string reversal blog post:

>>> reversed_string = ureverse('\U00010000')

This string is valid from the point of view of being a sequence of 2-byte values:

>>> reversed_string
'\udc00\ud800'

However, it is in fact not a valid UTF-16 string, and can blow up in nasty ways:

>>> reversed_string.encode("UTF-8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00'
in position 0: surrogates not allowed

You would almost certainly get a similar error trying to print() this string, since writing to a terminal involves picking some output encoding (often based on your locale). Clearly, we need to deal with these surrogates properly if we don’t want crazy things to happen with our output.
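
For contrast, the original, un-reversed string encodes without complaint:

>>> u.encode("UTF-8")
b'\xf0\x90\x80\x80'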

Iterating over surrogate pairs


If you know the rules for surrogate pairs, you can write a simple generator to iterate over a (unicode) string in a way that doesn’t chop up these pairs. Then, we could update ureverse() to use this generator instead of just using list(ustring). Here’s one attempt:

>>> def iter_utf_32(s):
...     saved_surrogate = ""
...     for char in s:
...         if '\ud800' <= char <= '\udfff':
...             if saved_surrogate == "":
...                 saved_surrogate = char
...             else:
...                 yield saved_surrogate + char
...                 saved_surrogate = ""
...         else:
...             if saved_surrogate:
...                 # unpaired surrogate. tsk, tsk.
...                 yield saved_surrogate
...                 saved_surrogate = ""
...             yield char
...


The above code doesn’t do all the checks it could (in particular, that surrogates only occur in pairs, that the high surrogate precedes the low one, and that a lone surrogate isn’t left dangling at the very end of the string), but it’s enough for correct iteration over valid input.
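
As a quick check (reverse_code_points is just a name I’m making up here, and it ignores the combining character issue from the first section), feeding this generator to reversed() gives us something that encodes cleanly again:

>>> def reverse_code_points(s):
...     return "".join(reversed(list(iter_utf_32(s))))
...
>>> reverse_code_points('x\U00010000').encode("UTF-8")
b'\xf0\x90\x80\x80x'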

A better way: UTF-32?


Anyway, I found the above, with its hard-coding of the range of surrogates, somewhat ugly. Surely this information must be contained in the standard library or the Python interpreter somewhere? It turns out that it is, via the UTF-32 string encoding. Here’s one way to use this:

>>> from functools import reduce
>>> def iter_utf_32(s):
...     bytes = s.encode("UTF-32-BE")
...     while bytes:
...         four_bytes = bytes[:4]
...         yield chr(reduce(lambda x, y: x * 256 + y, four_bytes))
...         bytes = bytes[4:]
...

Basically, encode the string as "UTF-32-BE", which turns each Unicode value in the string into four bytes, most significant byte first (that’s what the "-BE" in the encoding name is for). Then we turn each group of 4 output bytes into an integer code point that we can call chr() on.
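
To see the ingredients concretely, here is what the encoding step produces for a small string, with the reduce() trick applied by hand to the first four bytes:

>>> '\U00010000x'.encode("UTF-32-BE")
b'\x00\x01\x00\x00\x00\x00\x00x'
>>> hex(reduce(lambda x, y: x * 256 + y, b'\x00\x01\x00\x00'))
'0x10000'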

Notes



  1. The reduce() call is just a funky way of converting from four base 256 “digits” into a single value in the range 0 - 0xffffffff.

  2. We explicitly ask for big-endian ("-BE" in the encoding string) because that’s what we want for our little reduce function. If we asked for just "UTF-32", the bytes would be ordered according to the native byte order of the machine the code runs on, and the output would start with a four-byte “byte order mark”.

  3. For large strings, you’ll use up a lot of temporary memory this way, of course. This can be avoided using iterencode() from the codecs module.


Wide, wide unicode


Earlier, I said that the proposed solution runs into problems when Python is storing the elements of the str type as 2-byte values, and then rambled on about that case. In the 4-byte case, none of this is necessary, i.e. my iter_utf_32 function can just be replaced with iter. (If iter_utf_32 is correct, then it will be doing a fair amount of work to implement the identity function, but the point is that it will work regardless of which representation is being used).
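
(If you want to check which kind of build you are running, sys.maxunicode will tell you; on my narrow, 2-byte build it looks like this, and on a wide build it would report 0x10ffff.)

>>> import sys
>>> hex(sys.maxunicode)
'0xffff'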

It turns out that when you build Python, there’s a configure option to enable 4-byte Unicode:


bash$ ./configure --help
...
--with-wide-unicode Use 4-byte Unicode characters (default is 2 bytes)

Christopher Lenz makes a case for making 4-byte the default here. That sounds pretty reasonable to me. Alternatively, maybe there should be different ways to iterate over the str type, to avoid splitting surrogate pairs and/or composed character sequences.

Postscript


In the above, I picked U+10000 as the first code point that can’t be stored in 2 bytes, without having the slightest idea what it was. Here, unicodedata can help out:

>>> import unicodedata
>>> unicodedata.name('\U00010000')
'LINEAR B SYLLABLE B008 A'

It turns out to be a symbol (SVG file) from the long-disused early Greek script Linear B. Somehow I had expected “Linear B” to be a computer language ...

Friday, July 31, 2009

Obligatory First Post Introducing This Blog

This blog will be about stuff I run into in my software development life, which mainly happens in Python and sometimes Objective C. The rest of this post is some random meanderings on the blog name.


For a long time, I have enjoyed using the word “defrobnication”, just because I’ve liked the way it sounds, in a Lewis Carroll kind of a way. I have no idea where I first heard it, but a Googly search reveals that it’s in relatively common usage, especially in software.


In fact, there are definitions online for “frobnicate”, for example:


/frob'ni-kayt/ (Possibly from frobnitz, and usually abbreviated to frob, but "frobnicate" is recognised as the official full form). To manipulate or adjust, to tweak.


...


Usage: frob, twiddle, and tweak sometimes connote points along a continuum. "Frob" connotes aimless manipulation, ...


OK, so “defrobnication” must mean the act of countering aimless manipulation, which sounds like a good thing, especially in coding, where the Siren songs of over-generalization and premature optimizations can be hard to resist! On the other hand, I found the following entry for “frobnication”:


Frobnication, or frobnificationisation, when applied to language is the unecessessary elongification and mispronounciatification or general sensibilitification of words

Hm, so actually “defrobnication” is already a frobnication. Maybe I should have called this blog “antidisenfrobnicationism” for full ironic effect?