Wednesday, 27 February 2019

ultra spicy peanuts, rich complex morphologies

There's this one thing that follows me around in my daily life, and it is extra annoying because it makes communication really hard: both I and the people I'm talking with end up not understanding each other at all. What I'm talking about is the mistake people make when they don't know the extent of the scale they are measuring on, so they assume that anything slightly more/higher/worse than what they are most used to must be the biggest thing in the world. Funnily enough this applies to two of my favourite pastimes: linguistics and food. In linguistics I have studied this stuff called morphology, or word-formation, which roughly means that we can now model with computers how shoes is formed from shoe + s to mean more than one shoe, and unfortunately is formed from un, fortune, ate and ly to mean a negated fortune that modifies a verb like an adverb. In food we have the concept of spiciness or hotness, generally associated with chilies, but it also covers peppers, horseradish and so on. But as you might guess, most people are not aware that languages more complex than English range from slightly more complex to a million billion times more complex, and that spices more potent than black pepper and salt likewise range from slightly spicy to pepper spray.
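Since I claim we can model this word-formation with computers, here is a minimal sketch of what a toy segmenter might look like in Python. The two stems and three affixes are made up just for this post; a real analyser would of course sit on a full lexicon or a finite-state transducer.

```python
# A toy morphological segmenter over a made-up three-word lexicon.
# Real analysers use full lexicons and finite-state transducers;
# this only illustrates the shoe+s / un+fortune+ate+ly idea.

PREFIXES = ["un"]
SUFFIXES = ["ly", "ate", "s"]
STEMS = {"shoe", "fortune"}

def lookup_stem(form):
    """Match a stem, allowing for the final 'e' dropped before -ate."""
    if form in STEMS:
        return form
    if form + "e" in STEMS:
        return form + "e"
    return None

def segment(word):
    """Return a list of morphemes, or None if no analysis is found."""
    stem = lookup_stem(word)
    if stem is not None:
        return [stem]
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = segment(word[len(prefix):])
            if rest is not None:
                return [prefix] + rest
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            rest = segment(word[:-len(suffix)])
            if rest is not None:
                return rest + [suffix]
    return None

print(segment("shoes"))          # ['shoe', 's']
print(segment("unfortunately"))  # ['un', 'fortune', 'ate', 'ly']
```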

Monday, 14 November 2016

Learn yourself a new language like a computational linguist

Introduction: Deutschland, 10 months in

When you move to a country where your language isn't spoken natively in everyday life, the most important thing is to learn the local one. Take it from an academic most of whose friends are constantly moving, because that's how you post-doc: no matter how fluent the locals might be in English, it makes a huge difference to your long-term quality of life to be fluent enough to understand what people are saying and what signs and texts mean. Of course, if you, like most of us, are only allowed to stay in one country a few years at a time ("researcher mobility"), it may not be worth the effort, but on the other hand, learning new stuff is fun and keeps your mind working. Even though Indo-European languages are not very high on my list of academic interests, learning German is fun and enlightening at times.

Monday, 5 January 2015

Compounds are too hard for people

English is a nice international language in most of its features (apart from spelling vs. phonology). One of the genuinely nice things is that when you need to combine existing nouns into words meaning new things, you just chain them together with spaces: printers become laser printers and later on 3d printers without any trouble for machines processing these words (I wish; actually Google Blogger does draw a red squiggle under my 3d). This is notably not so for German, which smashes such words together without spaces, maybe adding some letters in between or a hyphen, who knows, and the result is a word that is unknown to a computer. Finnish is similar to German in this respect.

Now one might guess that it would be easy to just add spaces and be done with it. That is not always so, as the thing that prompted me to write this post reminded me. In fact, whether you write the words with or without a space in Finnish is the difference between a lexicalised, generic term and a specific phrase: talon mies (house+gen man+nom) is the man of the house, but talonmies is just a janitor. Most literate Finnish users will of course get this distinction right, but it gets harder for a whole lot of word combinations: is salad's dressing a single term for salad dressing or just a dressing for salad (it usually is a single term), does it change if you add the dressing to ice cream instead (it does), and why is salad dressing not a dressing made of salad? In those cases the distinction doesn't really matter: if you pick the less probable variant, the meaning stays the same.

However, in cases where it does matter, even good writers get it wrong, as exemplified by an email in my inbox about a trip to city X, written with a space. X:n matka (X+gen trip+nom) is not semantically plausible, since it can only refer to a trip made by the city X itself, as in Y:n matka where Y is a person; it must be X:n-matka, a construct that will surely trip up most writers, so I am not surprised that a fellow linguist wrote it that way. (It is noteworthy that autocorrect, as usual, will only let you write the wrong forms, so it may well be the culprit.)
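To make the talonmies / talon mies point concrete, here is a minimal toy sketch in Python of the kind of lookup a proofing or analysis tool could do; the mini-lexicon and its glosses are invented for illustration, not taken from any real resource.

```python
# A toy illustration of why the space matters in Finnish: the same two
# stems give a lexicalised compound when written together and a free
# genitive phrase when written apart. The mini-lexicon is invented.

COMPOUND_LEXICON = {
    "talonmies": "janitor (a lexicalised compound)",
}

WORD_LEXICON = {
    "talon": "house, genitive singular",
    "mies": "man, nominative singular",
}

def analyse(text):
    """Look the input up as a compound first, then as a word sequence."""
    tokens = text.split()
    if len(tokens) == 1 and tokens[0] in COMPOUND_LEXICON:
        return COMPOUND_LEXICON[tokens[0]]
    if all(t in WORD_LEXICON for t in tokens):
        return " + ".join(WORD_LEXICON[t] for t in tokens) + " (a free phrase)"
    return "unknown"

print(analyse("talonmies"))   # janitor (a lexicalised compound)
print(analyse("talon mies"))  # house, genitive singular + man, nominative singular (a free phrase)
```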

Friday, 12 December 2014

Making mistakes as an interesting research question in computational linguistics

There's probably no other field of science where making mistakes and doing stupid things can be as profitable as in computational linguistics. Bad tagging in your gold corpora? "It is an interesting research topic how to make use of this data": don't fix it, just use it. Disambiguation always throws away the forms you need? Research question! Pre-processing butchers your data beyond repair? You know what to do: call it future research and publish the rubbish results, never mind that you did the pre-processing yourself and could easily have made it more sensible.

Let's start with what machine translation is about right now. Before a text can be translated it has to be mangled through at least two atrocious processors, just because, and we call it state of the art too: truecasing and tokenising. You know, instead of having access to the original text when translating, you deal with text where seemingly random letters have been lowercased and uppercased, and no, you don't get to know which ones; that information has already been lost. But wait, there's more: we have also added spaces to quite random spots in the text without telling you. That's the statistical side; then, after the remains of the string have been translated, moved around, removed and so on, you get to guess where the uppercase letters should go, whether the punctuation is still in the right position, and which spaces to remove. Or you could tokenise rule-based, maybe ignoring a whole lot of spaces as unimportant, who knows.
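To make concrete what gets lost, here is a minimal sketch of the sort of lowercasing and punctuation splitting these pipelines do; it is not the actual code of any toolkit, just the general idea.

```python
import re

def naive_tokenise(text):
    """Split punctuation off with spaces, then squeeze the whitespace."""
    spaced = re.sub(r"([.,!?-])", r" \1 ", text)
    return " ".join(spaced.split())

def naive_truecase(text):
    """Lowercase everything; the original casing is simply gone."""
    return text.lower()

original = "This string with Name and e.g., full-stops in it."
mangled = naive_truecase(naive_tokenise(original))
print(mangled)
# this string with name and e . g . , full - stops in it .
# Nothing in the mangled string records that "Name" was capitalised,
# that "e.g." had no internal spaces, or that "full-stops" was hyphenated
# without spaces, so detruecasing and detokenising are guesswork.
```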

This, like so many problems, is caused by the fact that most systems want to transport data in text files or pipes, so we either have to use an ad hoc mess of ASCII symbols to mark up all these casing operations and splits, or discard the information. What one would really want is to have the data in a sensible data structure that retains it: a sensible tokenisation and casing of "This string with Name and e.g., full-stops in it." is not "this string with name and e . g . full - stops in it .", it's a Python structure like [(This:this), ' ', string, ' ', with, ' ', Name, ' ', and, 'e.g.', ',', ' ', full-stops, ' ', in, ' ', it, '.'], with spaces retained where they are supposed to be between tokens and not where they aren't, and the original casing kept alongside the mangled one. Should be simple, but isn't.
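A minimal sketch of what such a structure could look like, using nothing but the standard library; the Token class and its field names are mine, not any existing toolkit's API.

```python
from dataclasses import dataclass
import re

@dataclass
class Token:
    surface: str      # the form exactly as written, casing and all
    lowered: str      # the view the statistical models work on
    space_after: str  # whatever whitespace followed the token, possibly ""

def tokenise(text):
    """Split on whitespace but keep it, so the input can be rebuilt exactly."""
    tokens = []
    for match in re.finditer(r"(\S+)(\s*)", text):
        surface, space = match.group(1), match.group(2)
        tokens.append(Token(surface, surface.lower(), space))
    return tokens

def detokenise(tokens):
    return "".join(t.surface + t.space_after for t in tokens)

sentence = "This string with Name and e.g., full-stops in it."
tokens = tokenise(sentence)
assert detokenise(tokens) == sentence  # nothing was lost on the way
print([t.lowered for t in tokens])     # what the models get to see
# A real tokeniser would also split punctuation into Tokens of its own,
# each recording an empty space_after, so the split stays reversible.
```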

Disambiguation is just another pet peeve of mine: if time flies like arrows and butterflies like flowers, you cannot have time flies liking arrows anymore, since you decided earlier that flies is a verb and like a preposition, and there's no changing decisions in this game.
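A made-up miniature of the same problem: once a pipeline commits to one tag per word up front, the readings it threw away cannot be brought back, no matter what a later component figures out. The tag inventory here is invented for the example.

```python
# Possible analyses per word; invented tag sets for illustration only.
ANALYSES = {
    "time":   ["NOUN"],
    "flies":  ["VERB", "NOUN"],  # to fly, or the insects
    "like":   ["ADP",  "VERB"],  # preposition, or to like
    "arrows": ["NOUN"],
}

sentence = ["time", "flies", "like", "arrows"]

# Early hard disambiguation: keep only the first (say, most frequent) tag.
pruned = {word: ANALYSES[word][:1] for word in sentence}
# The "time flies (insects) like (enjoy) arrows" reading is now gone for good.

# Keeping every analysis costs a little bookkeeping but loses nothing;
# a later component can still pick the right reading in context.
kept = {word: list(ANALYSES[word]) for word in sentence}

print(pruned)
print(kept)
```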

Throwing away information because you cannot come up with an encoding scheme for it is not a good idea. If you want to throw away information because making informed decisions is too slow, at least measure the slowness to go along with it. None of the scenarios presented above require discarding good data on today's computers. Yet we are riddled with it. Oh well.

Friday, 31 October 2014

The folly of reproducing bugs, recreating errors and coming up with something from nothing

The more I dive into the task of high-quality machine translation, the more I get annoyed about the standards by which machine translations are measured. For those who don't know, the quality of machine translation is measured solely by how well the system can recreate a translation made by human translators. Fair enough: if the machine comes up with the same translation as a professional translator, we can say it has done a marvelous job indeed. Except if the translator happened to make a mistake.

But that's not all. The job of a translator, at least a professional one who makes high-quality translations, is to translate content for the target audience so they can read it. This can often mean adding information to fill the audience in on facts that the audience of the source-language text perhaps knows better; a machine coming up with that would be a smart machine indeed, but I don't see that happening. Human translators also drop a lot of words when the source is too wordy: certain ways of saying things in English will almost sound like you're explaining things to a child in Finnish if you translate too literally, but again, a machine smart enough to realise that is not what I foresee in the near future. And then there's a lot of rewording: humans just know when you should turn verbs into nouns and reword the whole sentence because the original way of expressing things is odd for the target language. A machine that realises this may be plausible; in fact, if we throw enough data at a statistical system it may notice that the sentence is odd, and may even have seen the rewording before.

The reason I'm writing this is that I finally, for the first time, took a serious look at the data that is used to measure the quality of machine translation, that is, the europarl stuff. The vast majority of it is horrifying, to the extent that I as a human cannot even begin to explain how, given this English sentence, you could come up with anything distantly resembling that Finnish sentence, or even matching, say, half of the words in it. If I cannot explain the translation, it is at least obvious that we cannot build a rule-based system to map between the two. But even with statistics: say you are talking about tv channels shown in the hotels of members of the European parliament, and specifically about a channel named FOO, and the English text gives no details of the channel, yet you are expected to translate FOO as "FOO, a Dutch tv channel broadcasting mainly news". What kind of statistics would really give you good evidence for doing that is rather a mystery, but not getting it right will probably reduce your score for that translation to zero! While that specific example is hopefully not in europarl, there is a session about tv channels and there are cases like that all over, and machine translation is indeed tasked with finding algorithms that would faithfully reproduce that kind of rewordings and additions. It's a whole lot of nonsense how the systems are evaluated, really.

So, machine translation IMO is never going to be particularly suited to paraphrasing, rewording, adding information or that sort of task; we should really concentrate on making systems that a) faithfully carry the information across to the reader as it is, and only then b) make it sound grammatically correct and colloquial in the target language. Trying to optimise systems to produce all these high-quality rephrasings is a foolish goal, compared with just making sure that the systems are good at not losing any information or inverting any meanings, which as I see it is the biggest problem with current systems, mainly caused by the fact that they try to solve everything at once. Like, who cares whether English is more likely to say "don't forget to frobble" where Finnish speakers would go "remember to frobble"? But with the current scoring system we get penalised hugely for not getting that right, and the common statistical mistranslation "don't remember to frobble", oh, that of course gives us more points! So that's what we're optimising our systems for. Sweet.
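To make the frobble example concrete, here is a crude toy version of the kind of n-gram matching that reference-based metrics like BLEU reward; it is not the real BLEU formula, just unigram and bigram precision with a brevity penalty.

```python
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference):
    """Geometric mean of 1- and 2-gram precision with a brevity penalty.
    A crude stand-in for BLEU, just to show what gets rewarded."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in (1, 2):
        c, r = ngrams(cand, n), ngrams(ref, n)
        hits = sum(1 for g in c if g in r)
        precisions.append(hits / len(c) if c else 0.0)
    if 0.0 in precisions:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

reference = "don't forget to frobble"
print(toy_bleu("remember to frobble", reference))        # ~0.41: right meaning, "wrong" wording
print(toy_bleu("don't remember to frobble", reference))  # ~0.50: inverted meaning, better overlap
```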

By the way, did I mention that if we scored professional translators by the same measures we use for machine translation, they would usually get scores we deem so low they wouldn't be worth publishing? Yeah, ain't that a good measure.

Yeah yeah, so this is not specific to machine translation but to all of computational linguistics in its glory; this madness follows us anywhere we go, it does. It is a good thing that we want to measure systematically how well we're doing, instead of just throwing random things at random ad hoc implementations and writing five-page essays for conferences describing why things got better, but it does become an exercise in futility when the goal drifts towards recreating bugs or mistakes just because. This is the case for most things like so-called morphological analysis: there are no good standards or metrics for it, so somebody writes the first "gold standard", which in morphology either means a dump of some system's output, systematic errors and all, or, the slightly better choice, having human annotators build the standard. Unfortunately the big mistake there is that the people doing the linguistics don't really understand how things should work, e.g. how to actually prove that an analysis is correct with some measurable evidence; human annotators just work by intuition. And so we often enough end up with the enjoyable task of reproducing either the bugs of another system or someone's linguistic intuition.

In conclusion, of all the millions that are spent on computational linguistics, most goes into engineering aimed at faithfully reproducing bugs and mistakes. Your money at work, isn't it.

Monday, 13 October 2014

Tense and time-travelling

Today's episode of The Big Bang Theory reminded me of something I wrote years ago about time-travelling and tenses in human languages (I wonder if it's still available somewhere; I had the beginnings of a list of a thousand and one useful verb forms for time-travellers), which in turn relates to how misguided even most school books about human languages are. One popular manifestation of this is people debating whether your language "has a future tense" or not; as readers of Language Log are aware, even some respectable publications fell for that fake story linking whether a language has a future tense to how well its speakers do economically. The scene on TBBT is about explaining the time-travelling paradox of Back to the Future with reference to alternate timelines, and it is perfectly obvious that the English language has the necessary tense structures to express combinations like past future perfects (or whatever it's called; I'll look it up once the script or subtitles of the episode are online) as understandably as the present, future, past or pluperfect. So what's the point of school grammar teaching that English has a future tense and Finnish hasn't? There isn't one: Finnish has plenty of auxiliary verbs as capable as English will of being explicit about the future, and not calling them a future tense is only good for silly false anecdotes about languages. And, by the way, the discussion of tenses useful for time travelling is ripped from The Hitchhiker's Guide to the Galaxy.

Wednesday, 17 September 2014

Translation machines


In machine translation there is usually some sort of ongoing competition or disagreement between statistical and knowledge-based approaches. Statistics is based on the logic that if you have a lot of good-quality material and you push it through the machine, it creates even better material from the best bits and makes it more usable. Like a meat grinder. However, it rarely happens that the source material is actually as good as everyone assumes. The primary source for machine translation is europarl, a corpus of European parliament sessions transcribed and translated, and a system trained on it can merely recall what it has seen there; statistical machine translation, you see, cannot make very good creative decisions, it can in fact only remake the same decisions it has seen in the translations fed to it. So when you give such a system text that is not a European parliament session, it usually fails to make good connections. Much like this McDonald's meat substitute here:

In knowledge-based, linguistic or rule-based machine translation there is the opposite problem. The people working on the knowledge are mainly interested in looking at a very small amount of data; the gold, the crown jewels of the language, are what is interesting, while the boring words and structures that make up 95 % of good machine translation get neglected. A further problem is that their jewels aren't truly precious but fakes: the interestingness is based on old misclassifications that they hold on to in order to keep the problems interesting, such as fake ambiguity created by wrong classifications. So they end up working like these guys from South Park.