
JavaScript and UTF-8 … Kennan, Russia and the West … Чекист 16th of October, 2007 ANTE·MERIDIEM 10:34

Google™ing yesterday, I noticed that the first results for searches on decoding and encoding UTF-8 in JavaScript transcode the string one character at a time, at the JavaScript level. This is unnecessarily slow and memory-intensive, since modern JavaScript and ECMAScript interpreters already do the conversion in native code: encodeURIComponent produces percent-escaped UTF-8, and decodeURIComponent reads it back.
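As a quick illustration of what those two built-ins do, here is a small sketch using the euro sign; the variable names are mine, chosen only for the example.

```javascript
// encodeURIComponent emits the UTF-8 bytes of its argument as %XX escapes.
var escaped = encodeURIComponent("\u20AC");   // the euro sign, U+20AC
// escaped is now "%E2%82%AC": the three UTF-8 bytes E2 82 AC.

// decodeURIComponent reverses this, reading the %XX sequences as UTF-8.
var restored = decodeURIComponent(escaped);
// restored is the single character U+20AC again.
```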

So, hopefully for the benefit of future searches on the same thing, here’s a reimplementation of the library object those results offer:

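var Utf8 = {
    // Returns a string with one character per UTF-8 byte of the input.
    encode : function (string) {
        return unescape (encodeURIComponent (string));
    },
    // Reads a one-character-per-byte string as UTF-8 and returns the text.
    decode : function (string) {
        return decodeURIComponent (escape (string));
    }
};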
As a usage example, if you have some text whose UTF-8 bytes have each been read as a separate character, so that “€ 40.00” shows up as something like “â‚¬ 40.00”, you would stick the above code somewhere on your page and then call Utf8.decode(string), where string holds your text. I can’t think of a good reason to go in the opposite direction on the web, but I include Utf8.encode for completeness; it is something I need to do with the JavaScript I write.
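For anyone who wants to try it, here is a minimal round-trip sketch. The Utf8 object is repeated so the snippet runs on its own, and the variable names original, bytes and roundTrip are just for illustration.

```javascript
// Same object as above, repeated so this snippet is self-contained.
var Utf8 = {
    encode: function (string) {
        return unescape(encodeURIComponent(string));
    },
    decode: function (string) {
        return decodeURIComponent(escape(string));
    }
};

var original = "\u20AC 40.00";        // "€ 40.00", seven characters

// Encoding yields one character per UTF-8 byte, so the euro sign
// turns into three characters: U+00E2, U+0082 and U+00AC.
var bytes = Utf8.encode(original);    // nine characters

// Decoding reads those byte-characters as UTF-8 and restores the text.
var roundTrip = Utf8.decode(bytes);   // "€ 40.00" again
```

Note that decode expects every character in its input to be below U+0100, that is, one character per byte; handed anything else that does not form valid UTF-8, decodeURIComponent throws a URIError.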

(I was Google™ing the terms because an unrelated bug in the garbage collector—well, as far as I can work out it’s there, which is part of the problem—of our local ECMAScript interpreter segfaulted after using the above code. But it won’t provoke anything of the sort in Mozilla or IE.)

In other news, I’m currently reading Russia and the West Under Lenin and Stalin by George F. Kennan, an educated and articulate scholar of Russia and Europe and a one-time US ambassador to the Soviet Union, back when it was important not to send any random campaign donor there. I’ve been planning to get to it for ages, and it’s as good as I expected. Which is nice. See Russil Wvong’s pages for some serious Kennan fandom.

Word of the day: чекист is Russian for a member of the ЧК, one of the predecessors of the KGB; it was also used after the dissolution of the former to describe members of the latter, reasonably enough given the overlap in personnel.