Wednesday, July 8, 2015

UTF-8, Unicode, and International Character Sets (Oh My!)

I'm a little embarrassed to admit that in my entire 35-year career, I've never really learned anything about international character sets (unicode, etc).  And I still know very little.  But recently I had to learn enough to at least be able to talk somewhat intelligently, so I thought I would speed up somebody else's learnings a bit.

A more in-depth page that I like is:

Unicode - a standard which, among other things, specifies a unique code point (numeric value) to correspond with a printed glyph (character form).  The standard defines a code point space from 1 to 1,114,111 (1 to 0x10FFFF), but as of June 2015 it only assigns 120,737 of those code points to actual glyphs.

UTF-8 - a standard which is basically a means to encode a stream of numeric values with a range from 1 to 2,147,483,647 (1 to 0x7FFFFFFF), using a variable number of bytes (1-6) to represent each value such that numerically smaller values are represented with fewer bytes.  For example, the number 17 requires a single byte to represent, whereas the number 2,000,000,000 requires 6 bytes.  Notice that the largest Unicode code point only requires 4 bytes to represent.

So the Unicode standard specifies that a particular numeric value corresponds to a specific printed character form, and UTF-8 specifies a way to encode those numeric values.  Although the UTF-8 encoding scheme could theoretically be used to encode numbers for any purpose, it was designed to encode Unicode characters reasonably efficiently.  The efficiency derives from the fact that Unicode biases the most-frequently used characters to smaller numeric values.

UTF-8 and Unicode were designed to be backward compatible with 7-bit ASCII, in that the "printable" ASCII characters have numeric values equal to the first 127 Unicode characters (1-127), and UTF-8 can represent those values with a single byte.  Thus the string "ABC" is represented in ASCII as 3 bytes with values 0x41, 0x42, 0x43, and those same three bytes are the valid UTF-8 encoding of the same string in Unicode.  Thus, an application designed to read UTF-8 input is able to read plain ASCII text.  And an application which only understands 7-bit ASCII can read a UTF-8 file which restricts itself to the first 127 Unicode characters.

Another nice thing about UTF-8 is that the bytes of a multi-byte character cannot be confused with normal ASCII characters; every byte of a multi-byte character has the most-significant bit set.  For example, the tilda-n character "ñ" has a Unicode code point 241, and the UTF-8 encoding of 241 is the two-byte sequence  0xC3, 0xB1.  It is also easy to differentiate the first byte of a multi-byte character from subsequent bytes.  Thus, if you pick a random byte in a UTF-8 buffer, it is easy to detect whether the byte is part of a mutli-byte character, and easy to find that character's first byte, or to move past it to the next character.

One thing to notice about UTF-8 is that it is trivially easy to contrive input which is illegal and will not properly parse.  For example, the bytes 0xC3, 0x41 is an illegal sequence in UTF-8 (0xC3 introduces a 2-byte character, and all bytes in a multi-byte character *must* have the most-significant byte set).

Other Encoding Schemes

There are other Unicode encoding schemes, such as UCS-2 and UTF-16, but their usage is declining.  UCS-2 cannot represent all characters of the current Unicode standard, and UTF-16 suffers from problems of ambiguous endianness (byte ordering).  Neither is backward compatible with ASCII text.

Another common non-Unicode-based encoding scheme is ISO-8859-1.  It's advantage is that all characters are represented in a single byte.  It's disadvantage is that it only covers a small fraction of the worlds languages.  It is backward compatible with ASCII text, but it is *not* compatible with UTF-8.  For example, the byte sequence 0xC3, 0x41 is a perfectly valid ISO-8859-1 sequence ("ÃA") but is illegal in UTF-8.  According to Wikipedia, ISO-8859-1 usage has been declining since 2006 while UTF-8 has been increasing.

There are a bunch of other encoding schemes, most of which are variations on the ISO-8859-1 standard, but they represent a small installed base and are not growing nearly as fast as UTF-8.

Unfortunately, there is not a reliable way to detect the encoding scheme being used simply by examining the input.  The input data either needs to be paired with metadata which identifies the encoding (like a mime header), or the user simply has to know what he has and what his software expects.

Most Unixes have available a program named "iconv" which will convert files of pretty much any encoding scheme to any other.  The user is responsible for telling "iconv" what format the input file has.

Programming with Unicode Data

Java and C# have significant features which allow them to process Unicode strings fairly well, but not perfectly.  The Java "char" type is 16 bits, which at the time Java was being defined, was adequate to hold Unicode.  But Unicode evolved to cover more writing systems and 16 bits is no longer adequate, so Java now supports UTF-16 which encodes a Unicode character in either 1 or 2 of those 16-bit chars.  Not being much of a Java or C# programmer, I can't say much more about them.

In C, Unicode is not handled natively at all.

A programmer needs to decide an encoding scheme for in-memory storage of text.  One disadvantage to using something like UTF-8 is that the number of bytes per character varies, making it difficult to randomly access characters by their offset.  If you want the 600th character, you have to start at the beginning and parse your way forward.  Thus, random access is O(n) time instead of O(constant) time that usually accompanies arrays.

One approach that evolved a while ago was the use of a "wide character", with the type "wchar_t".  This would allow you to declare an array of "wchar_t" and be able to randomly access it in O(constant) time.  In earlier days, it was thought that Unicode could live within the range 1-65535, so the original "wchar_t" was 16 bits.  Some compilers still have "wchar_t" as 16 bits (most notably Microsoft Visual Studio).  Other compilers have "wchar_t" as 32 bits (most notably gcc), which makes them a candidate for use with full unicode.

Most recent advice I've seen tells programmers to avoid the use of "wchar_t" due to its portability problems and instead use a fixed 32-bit type, like "uint32_t", which sadly did not exist in Windows until Visual Studio 2010, so you still need annoying conditional compiles to make your code truly portable.  Also, an advantage of wchar_t over uint32_t is the availability of wide flavors of many standard C string handling functions (standardized in C99).

Other opinionated programmers have advised against the use of wide characters altogether, claiming that constant time lookup is vastly overrated since most text processing software spends most of its time stepping through text one character at a time.  The use of UTF-8 allows easy movement across multi-byte characters just by looking at the upper bits of each byte.  Also, the library libiconv provides an API to do the conversions that the "iconv" command does.

And yet, I can understand the attraction of wide (32-bit) characters.  Imagine I have a large code base which does string manipulation.  The hope is that by changing the type from char to wchar_t (or uint32_t), my for loops and comparisons will "just work".  However, I've seen tons of code which assumes that a character is 1 byte (e.g. it will malloc only the number of characters without multiplying by the sizeof the type), so the chances seem small of any significant code "just working" after changing char to wchar_t or uint32_t.

Finally, note that UTF-8 is compatible with many standard C string functions because null can be safely used to indicate the end of string (some other encoding schemes can have null bytes sprinkled throughout the text).  However, note that the function strchr() is *not* generally UTF-8 compatible since it assumes that every character is a char.  But the function strstr() *is* compatible with UTF-8 (so long as *both* string parameters are encoded with UTF-8).

Bottom Line: No Free (or even cheap) Lunch

Unfortunately, there is no easy path.  If I am ever tasked with writing international software, I suspect I will bite the bullet and choose UTF-8 as my internal format.

No comments: