"Words co-occur in sentences." … Google & Och … Melville's poetry

Via Jóska, a paper [PDF] on trying to model words in a language as a network-oriented system (network-oriented systems are his bread and butter). From its first page:

“Words interact in many ways. Some words co-occur with certain words at a higher probability than with others and co-occurrence is not trivial, i.e. it is not a straightforward implication of the known frequency distribution of words. If a text is scrambled, the frequency distribution is maintained but its content will not make sense.”
They then go on to describe various repercussions of modelling language this way, and admit that what they are modelling realistically has very little to do with language as human beings use it, because the model is so limited. While the admission itself is a breath of fresh air if you’re used to everyone’s friend from MIT, it depresses me that people are essentially playing with toys, ignoring the world as it exists, and that this is being sold as academic research.
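
The quoted point, that scrambling a text keeps its word frequencies but destroys which words sit next to which, is easy to see on a toy example. What follows is a minimal Python sketch and not anything taken from the paper: the sentence, the function names and the use of adjacent word pairs as a stand-in for "co-occurrence" are all my own assumptions, and an alphabetical sort stands in for random scrambling so the result is reproducible.

```python
from collections import Counter


def unigram_counts(words):
    """Count how often each word appears, ignoring order."""
    return Counter(words)


def bigram_counts(words):
    """Count adjacent word pairs, a crude proxy for co-occurrence."""
    return Counter(zip(words, words[1:]))


# Toy text; purely illustrative, not data from the paper.
text = ("the cat sat on the mat and the dog sat on the rug "
        "while the cat watched the dog").split()

# Deterministic "scramble": same words, original order thrown away.
scrambled = sorted(text)

# Word frequencies survive the scramble...
same_frequencies = unigram_counts(text) == unigram_counts(scrambled)
# ...but the adjacent-pair counts do not.
same_bigrams = bigram_counts(text) == bigram_counts(scrambled)

print("same word frequencies:", same_frequencies)
print("same adjacent pairs:  ", same_bigrams)
print("'sat on' in original: ", bigram_counts(text)[("sat", "on")])
print("'sat on' in scrambled:", bigram_counts(scrambled)[("sat", "on")])
```

The frequency table alone cannot tell the two word lists apart; only the pair counts can, which is the sense in which co-occurrence is "not trivial."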

On the bright side, I learn from Language Log that Franz Och and Google Labs have made their statistical machine translation engine available for Arabic to English, and it’s really good. (Machine-translation “really good,” that is, not hand-translation “really good”: lots of the translated text comes out as actual sentences, but the word choice isn’t great.) I wonder whether, when the superior alien civilisation invades us, stylistically awkward English (or German, or whatever the local vernacular is) will become a status symbol, because that’s what their translation machines produce.

Funny thing: from t’Wikipedia, Herman Melville published an epic poem in 1876, with a print run of 350 copies, with this result:

‘The critic Lewis Mumford found a copy of the poem in the New York Public Library in 1925 “with its pages uncut.” Essentially, it had sat there unread for 50 years.’

Phrase of the day: дашт кашидан is Tajik for “to stop doing something;” the separable verb „aufhören“ is German for the same, and ‘terminar’ is the Spanish.

Unlawful sex and other testy matters … билети бозгашт 25th of April, 2006 POST·MERIDIEM 07:35

Two interesting court cases today; the first via Steve on Livejournal. A nineteen-year-old was prosecuted last July for unlawful carnal knowledge in Galway after having consensual sex with his fifteen-year-old girlfriend (whom he thought sixteen); in the interval between the charges being pressed (which the girlfriend’s mother did on learning she was pregnant) and the case coming to court, she had his baby, he moved in with her, and their respective mothers came round to the idea of them as a couple. Coverage here and here.

So far, so reasonable. The most interesting detail to me was that the judge proposed, out loud, prosecuting the girl for conspiracy to commit a crime (that is, unlawful carnal knowledge), which was something she very clearly had done. But where such a charge can be seriously considered, it seems to me that any other interested party is handed an inordinate amount of power to fuck up the lives of a couple.

And, via Margaret Marks, this PDF from which the essential details can be taken pretty quickly:

One evening the appellant had had a good deal to drink and was desirous of having sexual intercourse. Passing the complainant’s house he saw a light on in an upstairs room which he knew was the complainant’s bedroom. He fetched a ladder, put it up against the window and climbed up. He saw the complainant lying on her bed, which was just under the window, naked and asleep. He descended the ladder, stripped off his clothes, climbed back up and pulled himself on to the window sill. As he did so the complainant awoke and saw a naked male form outlined against the window. She jumped to the conclusion that it was her boyfriend, with whom she was on terms of regular and frequent sexual intimacy. Assuming that he had come to pay her an ardent nocturnal visit she beckoned him in. …

Now, the whole story is really funny, but this scarcely believing aside from the judge stands out to me:

So he descended the ladder and stripped off all his clothes, with the exception of his socks, because apparently he took the view that if the girl’s mother entered the bedroom it would be easier to effect a rapid escape if he had his socks on than if he was in his bare feet. That is a matter about which we are not called on to express any view, and would in any event find ourselves unable to express one.

Word of the day: Истгоҳ is one Tajik word for train station; станция is another. I would be mildly shocked if the latter wasn’t from Russian.

Handü … You’d think I’d know better … Не, дур нест 24th of April, 2006 ANTE·MERIDIEM 12:56

I really enjoy spending time with my current flatmate. Jóska is a physicist, Hungarian, lived in Vienna for eight years, speaks good German, English and Hungarian (naturally), and takes a keen interest in the whole world beyond physics. And this is how he writes emails (translated here from his German, with the two best words left exactly as he spelled them):

“Sorry that I didn't write to you again after you helped me so much - I hate ezumailen in Hungary, because I prefer to communicate in person.. anyway, the very next morning someone found all my things - except for my laptop and.. my handü...:-( ”

“Ezumailen!” Oh, that is just great (well, apart from his having been robbed). And the thing with “Handü”: I once said that I don't yet have enough confidence in German to say “Handü.” “Handy,” meaning a mobile phone, is not an English word but a German one, and so it ought to have the German pronunciation, with <y> like <ü> and not like <ie>. And he does have the confidence in German to pronounce it that way.

I spent a big chunk of the weekend getting VMware running on my NetBSD machine, which was ridiculously complicated. I had forgotten just how. fucking. needlessly. aggravating. getting commercial software to do its thing can be; first, I needed to update the NetBSD kernel module to work with -current (and after my FYP I hate kernel-space hacking, believe me), then I needed to install various libraries from SuSE 7.3 beside those of the 9.whatever that NetBSD’s Linux emulation is based on (glibc incompatibility sucks), add links to them distinct from the current version, hex edit the binary to change the library version linked against to the older one, start it, realise it requires a licence file, but they don’t give out trial licences for 2.0.4 any more, trawl the warez sites for a licence file (the keygens for this Linux program are all Windows binaries, of course, we all love warez-merchants), get one, it gets past the initial error but gives another one later, type in an extract from the licence file into Google, find a licence from a full version on a non-warez site otherwise in Chinese. And now it runs. But it won’t install Windows XP, which is why I wanted it in the first place: I have three Windows-specific Pop-up Oxford CD-ROM multilingual dictionaries that cost $3 each, and want to extract the data into a form I can use. I am probably masochistic enough to try Windows 98, since the VMware is old enough that it was released before Windows XP.
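
For anyone wondering what "hex edit the binary to change the library version" amounts to, here is a rough Python sketch of the general idea: swap one library name embedded in a file for another of exactly the same length, so that nothing else in the file moves. Everything in it is hypothetical, including the function name `patch_once`, the fake file contents and both library names; it is an illustration of the technique, not the actual edit described above.

```python
def patch_once(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace a single occurrence of `old` with a same-length `new`."""
    if len(old) != len(new):
        # A different length would shift every later offset in the file.
        raise ValueError("replacement must be the same length as the original")
    if data.count(old) != 1:
        # Refuse to guess if the string is missing or appears more than once.
        raise ValueError("expected exactly one occurrence of the original string")
    return data.replace(old, new)


# Stand-in for the bytes of a binary; the library names are made up.
blob = b"\x7fELF\x00\x00header\x00libexample.so.6\x00more data"
patched = patch_once(blob, b"libexample.so.6", b"libexample.so.5")

print(len(blob) == len(patched))
print(patched)
```

On a real file you would read the bytes in, patch a copy, and keep the original around, since a wrong edit leaves you with a binary that will not start at all.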

Word of the day: таксӣ is Tajik for “taxi,” to no-one’s surprise. It’s interesting that the word became universal so quickly; it only appeared in the late 19th century, and hasn’t had the centuries to spread that, say, “tea” has.

Stumble? Nah, I have always pronounced it like “Handie” :-) . “Email” as a word in French, now that one I always stumbled over; you need the bad accent of a Frenchman who barely speaks English to say it in a way that is clearly understood.

