“Words co-occur in sentences.” … Google & Och … Melville’s poetry

30th of April, 2006 ANTE·MERIDIEM 12:23

Via Jóska, a paper [PDF] on trying to model words in a language as a network-oriented system (network-oriented systems are his bread and butter). From its first page:

“Words interact in many ways. Some words co-occur with certain words at a higher probability than with others and co-occurrence is not trivial, i.e. it is not a straightforward implication of the known frequency distribution of words. If a text is scrambled, the frequency distribution is maintained but its content will not make sense.”
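The scrambling point in that passage can be checked on a toy example. What follows is only a minimal sketch, not the paper's method: the sample text is made up, the helper names word_frequencies and adjacent_pairs are hypothetical, and counting adjacent word pairs is a crude stand-in for whatever notion of co-occurrence the authors actually use.

```python
import random
from collections import Counter


def word_frequencies(words):
    """Count how often each word occurs, ignoring order."""
    return Counter(words)


def adjacent_pairs(words):
    """Count ordered pairs of neighbouring words (a crude co-occurrence measure)."""
    return Counter(zip(words, words[1:]))


# Made-up sample text, purely for illustration.
text = ("the cat sat on the mat . the dog sat on the rug . "
        "the cat saw the dog .")
words = text.split()

# Scramble a copy of the text; the seed is arbitrary and only makes reruns repeatable.
rng = random.Random(0)
scrambled = words[:]
rng.shuffle(scrambled)

# Shuffling only reorders the words, so the frequency distribution is unchanged.
print("same word frequencies:", word_frequencies(words) == word_frequencies(scrambled))

# The neighbouring-word counts, by contrast, will generally differ after scrambling.
print("same adjacent pairs:  ", adjacent_pairs(words) == adjacent_pairs(scrambled))
```

The first comparison is true for any shuffle, since a shuffle is just a reordering. The second is where the information about which words keep company with which lives, and that is the part scrambling destroys.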

They then go on to describe various repercussions of modelling it like this, and admit that what they are modelling has realistically very little to do with language as it is used by human beings, because the model is so limited. While the very admission of this is a breath of fresh air if you’re used to everyone’s friend from MIT, it depresses me that people are essentially playing with toys, ignoring the world as it exists, and this is being sold as academic research.

On the bright side, I learn from Language Log that Franz Och and Google Labs have made their statistical machine translation engine available for Arabic to English, and it’s really good. (Machine translation “really good” that is, not hand-translation “really good.” So lots of the translated text is in the form of sentences, but the word choice isn’t great.) I wonder whether, when the superior alien civilisation invades us, stylistically awkward English (or German, or whatever the local vernacular is) will become a status symbol, because that’s what their translation machines produce.

Funny thing: from t’Wikipedia, Herman Melville published an epic poem in 1876, with a print run of 350 copies, with this result:

‘The critic Lewis Mumford found a copy of the poem in the New York Public Library in 1925 “with its pages uncut.” Essentially, it had sat there unread for 50 years.’

Phrase of the day: дашт кашидан is Tajik for “to stop doing something;” the separable verb „aufhören“ is German for the same, and ‘terminar’ is the Spanish.

From Sheila on the 30th of April at 15:06

I’ve bookmarked the paper to read later, but I wonder what goals you have in mind compared to those of toyish approaches? (Maybe I should read the paper before asking this, but it does seem that you are asking this in general, not in specific.) I’d expect that people doing these things are getting at different ideas, but this is only intuition based on things I’ve done and learned in school (I’ll go flesh out my intuition and send you an email later, if you like). I think I’ve mentioned continental vs. US history of the study of animal cognition to you. That’s one possible analogy.

From Aidan Kehoe on the 30th of April at 16:11

Well, for example, this guy is on the sort of track I would like to see more of: investigate the biology of the brain, how thinking manifests itself there, put forward cognitive models of thinking and language that are in accordance with that. (Note that he’s got very few people in the area listening to him. I suspect his tactic of calling what everyone else of any note is saying ‘bullshit’ has a lot to do with that.)

Now, Ferrer i Cancho and Solé are not about to do that, I accept—they are network systems people, not neurologists. But maybe if there were good numbers of people taking a real interest in language and how one could relate it to the structure of the brain, there wouldn’t be the space to publish that sort of all-I’ve-got-is-a-hammer conclusion, and the network systems people would find another area to amuse themselves with.

From Aidan Kehoe on the 3rd of May at 15:12

I suspect this is worth making clear here as well as by email. Sheila wrote:

> … The guy you linked to writes like a crank!

I replied: Yes, he does. I do not assert that he is anything other than a crank; being able to say that would involve knowing a lot more about neurology than I do, for example. (Though, of the areas I know enough about to judge in what he writes, what he says is more antagonistically put than is constructive, but not actually wrong, IMO.) But when he’s not manifesting crankdom I like the direction he’s going in more than most directions.

From on the 9th of May at 19:18

it depresses me that people are essentially playing with toys, ignoring the world as it exists, and this is being sold as academic research. So, I’ve finally started on the paper. I haven’t finished it yet, but my first thoughts are that it isn’t merely academic. If you can find a structure in a bitstream that reveals a similarity to language, then you can try to detect language in animals, ETs, and ciphered messages.

From on the 9th of May at 19:21

Doh. I didn’t format that properly.

Anyway, final thought (ha, as if), one would also need to check against random, pseudo-random, &c. noise.

And I’m sure these are well-studied problems, so it is presumptuous of me to even try to start a discussion on them.

(I know for example, from my ex doing Monte Carlo simulations in order to do simulations for high energy collisions, that there can be structure in pseudo-random numbers ...from crypto, from an article I once read, &c. la la la &c.)
