
MySQL & Latin1 woes … E4X … Pirahã … Alcabala
22nd of June, 2007 POST·MERIDIEM 10:53

Okay, for that eager plethora (heh) of my readers who have heavily used non-ASCII UTF-8, who stored that UTF-8 in a MySQL 4.1.11 database with metadata saying that it was latin1, and who need to move to MySQL 5.0.41, where such thoughtless trust in the program not to corrupt your data no longer works, I have enlightenment on how to do that migration.

  1. Dump your existing MySQL database, and specify latin1 as the default character set. This last is important, since otherwise MySQL assumes that Windows 1252 is the default character set, and that Windows 1252 cannot encode U+009D, for example. Tell me, MySQL folk, what exactly should the octet with value #x9D mean as a character if not U+009D? Hmm, what’s that? You’ve no good answer? Interesting. Command:
    mysqldump -u user-name --password=password --default-character-set=latin1 database-name
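  2. Edit the database dump file and replace latin1 with utf8 everywhere it appears outside your data, which mostly means the SET NAMES statement and the DEFAULT CHARSET clauses of the table definitions.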
  3. Copy it over to the new machine and load this dump file into the new server:
    mysql -u user -p database-name-not-password,-sorry-to-confuse-you < modified-dump-file-name
    and give the password for the database server (which you have configured, right? Right?).
After that, you should have your UTF-8 passed to you as UTF-8 once more.
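
Step 2 is easy to get wrong by hand in a large dump, so here is a minimal Python sketch of it. It assumes the dump has mysqldump's usual layout, where every row of data sits on a line starting with INSERT INTO; it leaves those lines alone and replaces latin1 with utf8 everywhere else. The function names and file names are hypothetical, and the files are handled as raw bytes so the stored UTF-8 is never decoded or re-encoded.

```python
def relabel_dump(lines):
    """Yield dump lines with latin1 relabelled as utf8 outside the data.

    Assumption: data rows only ever appear on lines that start with
    INSERT INTO, which is how mysqldump normally writes them.
    """
    for line in lines:
        if line.lstrip().startswith(b"INSERT INTO"):
            # Row data: pass the bytes through untouched.
            yield line
        else:
            # Schema, SET NAMES, comments: swap the charset label.
            yield line.replace(b"latin1", b"utf8")


def relabel_file(src_path, dst_path):
    """Copy a dump from src_path to dst_path, relabelling as above."""
    # Binary mode on both sides, so no character set conversion happens here.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.writelines(relabel_dump(src))
```

Calling relabel_file("dump.sql", "modified-dump.sql") would then produce the file to load in step 3. It is worth diffing the two files afterwards to check that only the character set declarations changed.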

ECMAScript for XML is the unwieldy name of a recent standard for processing XML data with JavaScript, and after a couple of days working with it, I’m totally impressed. For me, the initial stumbling blocks were the namespaces and xmlns; but a

var jdfns = new Namespace("http://www.CIP4.org/JDFSchema_1");
and a subsequent addressing of all elements (attributes, etc.) with jdfns::element-or-attribute-name resolved that quickly enough, yay. As an example of what came after that: I commented one evening that, were I particularly perverse, I could parse a configuration file to find an integer ordering of some set of attributes; the next day, I found this translated to ten lines of code. Thoroughly recommended if you use XML and JavaScript on a regular basis, though perhaps irrelevant if your code needs to function in Internet Explorer.

And thirdly, in energetic contrast to Emma and Simon’s take on it, I found this New Yorker article on Daniel Everett’s work with the Pirahã really, really good. It’s a magazine article, and as such it doesn’t try to treat the linguistics in detail, any more than the recent article on Григорий Перельман treated the mathematics of the Poincaré conjecture, but it deals well with communicating the sociology of the disagreements the Pirahã provoke; it quotes Michael Tomasello in a critical but diplomatic tone, and gives a vivid picture of the occasional hellishness of tropical field work.

The impression I get from it (and it is to my discredit that I hadn’t read the relevant papers already, but in my defense they were on lingbuzz, which to anyone not interested in generativism is as interesting as the theological debates of the Seventh-day Adventists) is of the Pirahã as the apogee of anti-intellectualism; when other language communities have had number systems that lacked the fine differentiation of most western languages, they have been happy to pick up those distinctions, but for the Pirahã the difference seems to have been a social pressure not to.

Word of the day: قبالة qabālat (v.n. of قبل), in Persian qabāla, qubāla: surety, contract (especially of bargain and sale); in Spanish as la alcabala, an historical sales tax; in Hebrew as kabala קבלה, meaning invoice/receipt.

So do I get my money back?

You’ll need to show a receipt.

I had this very problem last week with a bog-standard MySQL installation (latin1) to which I have no direct access, which was doing horrible things to smart quotes, apostrophes and pound and euro signs. If it’s still there on Monday, I may well have to resort to such a course of action as you described.

Oh yeah, so is that what that little red bracelet thing is that they wear - some kind of receipt? :)

Probably not; AIUI the double consonant ([bb]) makes the difference. Though if you sign up, be sure to tell! :-)

Hmmm, the scenario you mention sounds a little too familiar. Might I have directly assisted in finding this? :)

Conall, yes, and thank you! :-) Btw, are you relying on daedalus for your spam filtering these days, or are you living in Gmail?

Also, I note that the file /usr/ports/databases/php5-dba/work/php-5.2.3/ext/dba/.libs/dba_db4.o exists, and that I’m getting errors in PHP when I try to use the DB4 handler in dba_open :-/ . Have you installed that port, or was the install_db4 option not turned on?

