Text encoding is a convoluted mess

Modern text encoding is a convoluted mess where costs can easily exceed benefits. I admit we are in a world that has moved beyond ASCII (which at best served only English, and even then without full punctuation). But modern text encoding standards (utf-x, Unicode) have metastasized to the point you spend more time working around them than benefiting from them.

“ASCII Code Chart – Quick ref card” by Namazu-tron. Licensed under Public Domain via Wikimedia Commons.

Usually when I am working with text, my goals are a bit loftier than messing with individual characters. However, it is a step you have to get right. So you would like working correctly with arbitrary characters to be as easy as possible, as any problems here are mere distractions from your actual goals.

My actual task

The other day I thought it would be nice to get a list of all the article titles and URLs from the Win-Vector blog. It is a low ambition task that should be easy to do. At some point I thought: I’ll just scan the XML export file, it has all of the information in a structured form. And the obvious Python program to do this fails out with:

   not well-formed (invalid token): line 27758, column 487

Why is that? The reason is WordPress wrote a document with a suffix “.xml” and a header directive of “<?xml version="1.0" encoding="UTF-8" ?>” that is not in fact valid utf-8 encoded XML. Oh, it looks like modern XML (bloated beyond belief and full of complicated namespaces referring to URIs that get mis-used as concrete URLs). But unless your reader is bug-for-bug compatible with the one WordPress uses, you can’t read the file. Heck, I am not even sure WordPress can read the file back in; I’ve never tried it and confirmed such a result. This is the world you get with “fit to finish,” or code written in the expectation of downstream fixes due to mis-readings of Postel’s law.

So the encoding is not in fact XML over utf-8, but some variation of wtf-8. Clearly something downstream can’t handle some character or character encoding. We would like to at least process what we have (and not abort or truncate).

The solution

Luckily there is a Python library called unidecode which will let us map exotic (at least to Americans) characters to Latin analogues (allowing us to render Erdős as Erdos instead of the even worse Erds). The Python3 code is here:

# Python3: read wv.xml (WordPress export),
# write a title/url line for each post to stdout
import codecs
import string
import unidecode
import xml.etree.ElementTree

# read WordPress export
with codecs.open('wv.xml', 'r', 'utf-8') as f:
    dat = f.read()

# WordPress export is full of bad characters:
# demote to Latin analogues, then drop anything non-printable
dat = unidecode.unidecode(dat)
dat = ''.join([char for char in dat if char in string.printable])

# namespace for the WordPress export ("WXR") elements; the version
# suffix varies with the WordPress release that wrote the file
namespaces = {'wp': 'http://wordpress.org/export/1.2/'}
root = xml.etree.ElementTree.fromstring(dat.encode('utf-8'))

posts = [item.find('title').text + " " + item.find('link').text
         for item in root.iter('item')
         if item.find('wp:post_type', namespaces).text == 'post']

for post in posts:
    print(post)

It only took two extra lines to work around the parse problem (the unidecode.unidecode() followed by the filter down to string.printable). But such a simple work around depends on not actually having to represent your data in a completely faithful and reversible manner (often the case for analysis, hence my strong TSV proposal; but almost never the case in storage and presentation).
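For readers without the unidecode package at hand, a rough standard-library-only sketch of the same demotion idea (decompose, then drop what is not printable ASCII) looks like this. Note unidecode covers far more scripts than this accent-only approximation:

```python
# A stdlib approximation of what unidecode does for accented Latin
# characters: decompose, then drop combining marks and non-printables.
# (unidecode handles far more scripts; this sketch covers accents only.)
import string
import unicodedata

def ascii_demote(text):
    # NFKD splits a precomposed character such as "o with double acute"
    # into a plain "o" plus a combining accent code point
    decomposed = unicodedata.normalize('NFKD', text)
    # the combining accents are not printable ASCII, so they drop out
    return ''.join(ch for ch in decomposed if ch in string.printable)

print(ascii_demote("Paul Erd\u0151s"))  # Paul Erdos
```

This is why we get the tolerable "Erdos" instead of the mangled "Erds" that a bare printable-only filter would produce.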

Also, it takes a bit of searching to find even this workaround; and it is distracting to have to worry about this when you are in the middle of doing something else. The fix was only quick because we used a pragmatic language like Python, where somebody supplied a library to demote characters to something usable (not exactly ideologically pure). Imagine having to find which framework in Java (as mere libraries tend to be beneath Java architects) might actually supply a function simply performing a useful task.

Why is this so complicated?

How did text get so complicated? There are some essential difficulties, but many of the problems are inessential stumbling blocks due to architecting without use cases, and the usual consequence of committees.

A good design works from a number of explicitly stated use cases. Historically strings have been used to:

  • store text
  • compare text
  • sort text
  • search text
  • compute over text
  • manipulate text
  • present text

Unicode/UTF tends to be amazingly weak at all of these. Search by regular expressions is notoriously weak over Unicode (try to even define a subset of Unicode that is an alphabet, versus other graphemes). With so many alternative ways to represent things, just forget human-readable collating/sorting, or having normal forms strong enough to support any reasonable notion of comparison/equivalence. It appears to be a research question (or even a political science question) whether you can even convert a Unicode string reliably to upper case.
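The upper-case point is easy to demonstrate in Python 3, whose str.upper() applies the full Unicode case mapping: some characters change length or identity under it.

```python
# Upper-casing is not a per-character substitution in Unicode:
# some characters expand when case-mapped.
s = "stra\u00dfe"                  # "straße", German sharp s
print(s.upper())                   # STRASSE -- one character becomes two
print(len(s), len(s.upper()))      # 6 7 -- the string changed length

lig = "\ufb01le"                   # "file" written with an fi ligature
print(lig.upper())                 # FILE -- the ligature expands
```

And this is before locale enters the picture (Turkish dotted/dotless i, for example, cannot be handled by any locale-independent mapping at all).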

I accept: a concept of characters and character encoding rich enough to support non-Latin languages is going to be more effort than ASCII was. Manipulating strings may no longer be as simple as working over individual bytes.

We would expect the standards to come with useful advice and reference implementations.

But what actually happens is:

  • Perverse “distinctions without differences.” Computer science has a history of these bad ideas (such as Algol 58, which “specified three different syntaxes: a reference syntax, a publication syntax, and an implementation syntax” largely so one member wouldn’t be reduced to using “apostrophe for quote,” even though no computer at the time supported both characters). Continuing the tradition, Unicode supports all of: characters with accents, accents applied to characters, invisible characters, and much more.
  • Time wasted with text support for clearly graphical elements such as emoji. We already have container media for mixing text and other elements (HTML being one example), so we do not need to crudely repeat this functionality in the character set.
  • Time wasted fending off troll proposals to incorporate fictional languages (such as Klingon) into base Unicode.

An immodest proposal

Perhaps if we could break Unicode’s back with enough complexity it would die and something else could evolve to occupy the niche. My own trollish proposal would be along the following lines.

Pile on one bad idea too many and make string comparison as hard as a general instance of the Post correspondence problem and thus undecidable. Unicode/utf-8 is not there yet (due to unambiguity, reading direction, and bounded length), but I keep hoping.

The idea is many characters in Unicode have more than one equivalent representation (and these can have different lengths, examples include use of combining characters versus precomposed characters). So, roughly, checking sequences of code points represent the same string becomes the following grouping problem:
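The multiple-representation point is concrete and observable in Python with the standard unicodedata module:

```python
# Two different code point sequences that denote the same text:
# precomposed U+00E9 versus "e" followed by combining acute U+0301.
import unicodedata

precomposed = "caf\u00e9"     # 4 code points
combining = "cafe\u0301"      # 5 code points
print(precomposed == combining)          # False -- raw comparison fails
print(len(precomposed), len(combining))  # 4 5 -- different lengths

# Equality only holds after normalizing both to a common form.
nfc = unicodedata.normalize('NFC', combining)
print(nfc == precomposed)                # True
```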

For a sequence of integers 1…t define an “ordered sequence partition” P as a sequence of contiguous sets of integers (P1,…,Ph) such that:

  1. Each Pi is a contiguous sequence of integers from the range 1 … t.
  2. Union_i Pi = {1,…,t}
  3. max(Pi) < min(Pj) for all 1 ≤ i < j ≤ h.

For two sequences of code points a1,a2,…,am and b1,b2,…,bn checking “string equivalence” is therefore checking if the sequences of integers 1…m and 1…n can be ordered sequence partitioned into A=(A1,…,Au) and B=(B1,…,Bu) such that: for i=1…u the sequences of code points a_{Ai} and b_{Bi} are all valid and equivalent Unicode characters.

What stops this from encoding generally hard problems is the lack of ambiguity in the code-point to character dictionaries: there is only one partition of each code point sequence such that every group is a valid character. Thus we can find the unique partition of each code point sequence using a left to right read, and then we just check if the partitions match.
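A toy sketch of that left-to-right read, using an invented three-entry code dictionary (real Unicode equivalence data is assumed away here):

```python
# Sketch of the unambiguous case: at each position only one prefix of
# the remaining code points is a valid group, so a greedy left-to-right
# scan recovers the unique partition. Toy dictionary, invented entries.
GROUPS = {
    ('e',): 'e',
    ('e', '\u0301'): '\u00e9',   # e + combining acute = precomposed e-acute
    ('\u00e9',): '\u00e9',
}

def partition(seq):
    """Greedily split seq into valid groups, longest match first."""
    out, i = [], 0
    while i < len(seq):
        for width in (2, 1):          # try the longer group first
            if i + width > len(seq):
                continue
            group = tuple(seq[i:i + width])
            if group in GROUPS:
                out.append(GROUPS[group])
                i += width
                break
        else:
            raise ValueError("no valid group at position %d" % i)
    return out

def equivalent(a, b):
    """Two code point sequences are equal iff their partitions map
    to the same canonical characters."""
    return partition(a) == partition(b)

print(equivalent("e\u0301", "\u00e9"))  # True
```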

So all we have to do to successfully encode hard problems is trick the standards committee into introducing a sufficient number of ambiguous groupings (things like code-points “a1 a2 a3 a4” such that both “(a1 a2 a3)” and “(a1 a2)” are valid groupings). This will kill the obvious comparison algorithms, and with some luck we get a code dictionary that will allow us to encode NP hard problems as string equivalence.
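To see why such ambiguity kills the greedy scan, here is a toy (entirely invented) dictionary where one code-point sequence has two valid readings, forcing an enumeration of partitions that is exponential in the worst case:

```python
# Toy illustration of ambiguous groupings: with this invented
# dictionary, "a b c" can be read as (a b c) or as (a b)(c),
# and the two readings denote different character sequences.
GROUPS = {
    ('a', 'b', 'c'): 'X',
    ('a', 'b'): 'Y',
    ('c',): 'Z',
}

def all_partitions(seq, i=0):
    """Enumerate every valid reading of seq[i:] -- no longer a single
    greedy pass; equivalence now means comparing sets of readings."""
    if i == len(seq):
        yield []
        return
    for width in (1, 2, 3):
        if i + width > len(seq):
            break
        group = tuple(seq[i:i + width])
        if group in GROUPS:
            for rest in all_partitions(seq, i + width):
                yield [GROUPS[group]] + rest

print(list(all_partitions("abc")))  # [['Y', 'Z'], ['X']]
```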

To get undecidable problems we just have to trick the committee into introducing a bad idea I’ll call “fix insertions.” We will say “a1 a2 a3 a4” can be grouped into “(a1 x1 a2) (a3 a4 x2 x3)” by the insertion of the implied or “fix” code-points x1, x2, x3. Then, with some luck, we could build a code dictionary that could encode general instances of Post correspondence problems and make Unicode string comparison Turing complete (and thus undecidable).

So I think all we need is some clever design (to actually get a dangerously expressive encoding, not just the suspicion there is one) and to stand up a stooge linguist or anthropologist to claim a few additional “harmless lexical equivalences and elisions” (such as leaving vowels out of written script) are needed to faithfully represent a few more languages.

What I would wish for

Okay, the last section was a joke (and not even a good joke). Let’s look at what I would really want if text encoding were still on the table.

Unicode is attempting to put everything in one container and thus has become essentially a multimedia format (like HTML). There is no “Unicode light” where you only need to solve the processing problems of one or two languages to get your work done. Unicode is all or nothing: you have to be able to represent everything to represent anything. Frankly I’d like to see a more modular approach where nesting and containment are separate from string/character encoding. A text could be represented as a container of multiple string segments, where each segment is encoded in a single named limited-capability codebook. Things like including a true Hungarian name in English text would be done at the container level, and not at the string/character level.
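As a hypothetical sketch of what I mean (all names and codebook labels here are invented for illustration), such a container might look like:

```python
# A sketch of the "container of segments" idea: nesting and language
# tagging live in the container, while each segment is encoded under
# one small named codebook. Invented structure, not a real standard.
from dataclasses import dataclass

@dataclass
class Segment:
    codebook: str   # e.g. "latin-basic", "hungarian"
    data: bytes     # encoded under that codebook alone

@dataclass
class Text:
    segments: list  # ordered list of Segment

# A Hungarian name inside English text: the switch happens at the
# container level, not inside one all-encompassing character set.
doc = Text(segments=[
    Segment("latin-basic", b"The mathematician "),
    Segment("hungarian", "Erd\u0151s P\u00e1l".encode("iso-8859-2")),
    Segment("latin-basic", b" proved it."),
])
```

Each segment can then be processed (sorted, searched, case-mapped) with the simple rules of its own small codebook, instead of the rules of every writing system at once.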

My actual point

We have to, as computer scientists, show more discipline in what we do not allow into standards and designs. As Prof. Dr. Edsger W. Dijkstra wrote, we must:

… educate a generation of programmers with a much lower threshold for their tolerance of complexity …

Complexity tends to synergize multiplicatively (not merely additively). And real world systems already have enough essential difficulty and complexity, so we cannot afford much additional unnecessary complexity.

Punch card (source: Wikipedia).

Update 10-24-2015: cool (evil) project: mimic. Closest things I know to being able to undo this are: TextWrangler‘s Text->ZapGremlins and the command line tool “iconv.”

Categories: Computer Science Opinion

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. Oh, lordy. Your points are well taken, but I used to have to write multilingual applications prior to Unicode’s prevalence. All the different encodings for Japanese alone are frightening.

    So the world is a better place now than it was before.

    But you’re still right. The problems you specify still exist. Those problems were around before Unicode, though. The problem that Unicode did solve was to move those problems from multiple realms (e.g.: solve sorting for Hebrew; solve sorting for Japanese; solve sorting for Arabic) into just one realm, so that we have the globe’s engineers all working on more or less the same problem.

    Doesn’t mean that any of them are solved, though. And the guys who are working on them are unfortunately easily distracted by Klingon.

  2. Klingon undoubtedly has more users than several of the languages currently represented in Unicode. The result of excluding it will be a parallel standard.

    Anybody who uses uncomposed accents should be shot.
