Monday, December 22, 2008

Textbook - Section 5

We're almost there! Only one more section after this. Believe it or not, the whole chapter's only 6,000 words.

Anyway, here we finish up some of the issues involved in getting machine translation to work, and then move into the next section, which introduces the very, very basics of natural language processing (of sorts), which is just the term for getting computers to handle natural language, such as English.

And I just realized that FairyHedgeHog stated in the comments that I was suggesting that there's no Babel Fish. I had assumed she'd read my little section on Babel Fish when saying this, but I now realize that that section wasn't posted yet. But it is now about 3-4 paragraphs down. Off we go:

...


There are further problems with word-to-word translation: Many words in one language simply do not exist in another language. This might occur because a word’s meaning is nuanced and complex. Examples of this in English might be smarmy or punk’d. Translators enjoy making lists of such hard-to-translate words. An example from one recent such list is mamihlapinatapei, a word in Yagan. Apparently, it means something like, “implying a wordless yet meaningful look shared by two people who both desire to initiate something but are both reluctant to start.” That’s quite a complex idea, and it is not surprising that many languages of the world do not have a word for it. Other difficult-to-translate words seem fairly clear in meaning; there just happens not to be a single word for them in English. An example might be iktsuarpok in Inuit, which means “to go outside to check if anyone is coming.” Or perhaps cafuné in Brazilian Portuguese, meaning, “to tenderly run one’s fingers through someone’s hair.” Both of these words appear to be rather useful, and it might be nice if English had them. But it doesn’t.

In such cases, what we need is not a mapping from one dictionary to another, but simply someone to understand the meaning of the difficult word and tell us in English. However, this is asking a great deal indeed from a computer. The ideal computer translator needs to know not just the list of words in each language and how they correspond to one another but also the grammar of all languages we are interested in, the pragmatics of each society’s relationships, and the very meaning of what is being said. In fact, we are getting very close to asking the computer to be human, to know what we know about the world.

How to Get a Computer to Talk to Us

The reader who has just finished the section on how difficult translation is might be thinking, “but don’t we have some machine translation already?” Indeed, we do. Perhaps the best-known is Babel Fish, as found on AltaVista and Yahoo!. Babel Fish is a free machine translation service offered on the Web, and it has an extremely tough task. The user can paste any text she chooses into its text window, in any of a number of languages, and ask for a translation into another language. It’s critical to hammer on the point that anything could be put into that text window. Therefore, to expect such a site to always do spectacularly is indeed like expecting it to be almost human, and it’s clearly not.

Typing a sentence into Yahoo! Babel Fish, translating it to another language, and then translating it back reveals this. If you take the English sentence the boy kicked the ball, translate it to Chinese and then translate it back, you get the boy played with the ball. This is similar in meaning but not the same, as the action of kicking has been lost. Going to Japanese and back gives us the boy kicked sphere. This is the right meaning, just never something an English speaker would say. However, Babel Fish completely jumps the shark when you go to Korean and back, returning the boy l where kicks hard public affairs.
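The round-trip test above can be mimicked with a toy word-for-word translator. The tiny dictionaries below are invented for illustration (a real service translates whole sentences statistically, not word by word), but the sketch shows how meaning leaks out on the way back:

```python
# Toy round-trip ("back") translation. The mappings are made up and
# deliberately lossy, the way real word-level correspondences are:
# Chinese drops the English articles, and the verb comes back changed.
EN_TO_ZH = {"the": "", "boy": "男孩", "kicked": "玩", "ball": "球"}
ZH_TO_EN = {"男孩": "boy", "玩": "played with", "球": "ball"}

def round_trip(sentence):
    """Translate English -> Chinese -> English, one word at a time."""
    zh = [EN_TO_ZH.get(w, w) for w in sentence.split()]
    zh = [w for w in zh if w]  # words with no counterpart simply vanish
    return " ".join(ZH_TO_EN.get(w, w) for w in zh)

print(round_trip("the boy kicked the ball"))
# -> boy played with ball
```

The determiners disappear and "kicked" returns as "played with": similar in meaning, but not the same, just as in the Babel Fish example.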

Machine translation can do much better, however, if you give it a far simpler task. Poor Babel Fish has to handle anything anyone could ever say in its languages, like a human does. If you simplify the problem by only handling certain types of communication, the computer can become truly helpful. Examples might be: just scheduling meetings between people, or just translating technical reports on aeronautics. Domain Analysis is the process of formally defining a precise problem to focus on. The idea is straightforward. You focus on a tiny part of the whole world, defining exact criteria for what is within the domain you will handle. If you do this, you have a shot at actually giving the computer much of the knowledge it will need to perform reliable translation.

Let’s say, for instance, that you want the computer to be able to handle language about bill payments. Bill payment is such a restricted domain that a team of linguists can teach the computer about how the world of bill payments works. Part of this process is providing the computer with an Ontology. In language engineering, an ontology is a formal structured list of the things that exist in that domain. In the domain of bill payment, that would include things such as bank accounts, bills, due dates, transaction dates, amounts, currency, and so on. The ontologist defines critical concepts in the domain, often called classes. These classes in turn will have various features with restrictions on those features. For instance, a credit card payment could be a class in the ontology with a feature such as the payment amount.
If we were to build an ontology of the Child Play domain, it would likely prevent many mistranslations, such as we saw in the English–Korean–English example above. Virtually no ontology of child play would be concerned with public affairs, and so any ambiguities in words that lead Yahoo! Babel Fish to wander into public-affairs land would be ruled out.
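To make the idea concrete, here is a minimal sketch of what one ontology class with restricted features might look like in code. The class name, features, and restrictions are invented for illustration; real language-engineering ontologies use richer formalisms than a Python class:

```python
# A sketch of an ontology "class" for the bill-payment domain.
# Its features carry restrictions, which become runtime checks here.
from dataclasses import dataclass
from datetime import date

@dataclass
class CreditCardPayment:
    amount: float    # feature: the payment amount, must be positive
    currency: str    # feature: restricted to currencies the domain knows
    due_date: date   # feature: when the payment is due

    ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative restriction

    def __post_init__(self):
        if self.amount <= 0:
            raise ValueError("payment amount must be positive")
        if self.currency not in self.ALLOWED_CURRENCIES:
            raise ValueError(f"unknown currency: {self.currency}")

payment = CreditCardPayment(amount=42.50, currency="USD",
                            due_date=date(2009, 1, 15))
print(payment.amount)  # -> 42.5
```

Because the domain only knows about amounts, currencies, and due dates, a sentence wandering off into public affairs simply has nowhere to land.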

These formal statements concerning language are needed across the board to handle deep language translation – for word recognition, for language meaning, for grammar, etc. For instance, we discussed the need to distinguish subject from object when translating. Among other reasons, this was needed so that the verb could be made to agree with the subject (but often not the object) and so that one could re-order words depending on the language (the object comes after the verb in English but before it in Japanese, German, or Korean). How could you tell a computer how to find the subject in an English sentence?

One initial approach might revolve around defining noun phrases, and then telling the computer where in a sentence the subject noun phrase is located. First, look up each word in the sentence and find which words are nouns. Then, instruct the computer that nouns are often part of noun phrases. Let’s take the boy kicked the ball again as our example. If the computer looks up each word, it will find that at least boy and ball are listed as nouns. Kick will be listed as well, as in he gave the ball a swift kick, but in this case kick has the –ed ending, marking it as a verb, not a noun. Next, the computer has a grammatical rule telling it that nouns often occur with determiners (articles) such as the and a. And so it pairs the boy and the ball into noun phrases. Finally, the computer has to decide which of these noun phrases the verb should agree with. The computer’s programmer might put in a rule saying that the noun phrase before the verb in English is most likely to be the subject, while the noun phrase after the verb is most likely to be the object.
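The steps just described can be sketched directly. The tiny lexicon and the rule itself are only illustrative, and, as the next paragraph notes, this rule fails quickly on real English:

```python
# A toy version of the rule above: tag each word, treat a determiner
# followed by a noun as a noun phrase, and call the noun phrase that
# precedes the verb the subject.
LEXICON = {
    "the": "DET", "a": "DET",
    "boy": "NOUN", "ball": "NOUN",
    "kick": "NOUN",  # noun reading, as in "a swift kick"
}

def tag(word):
    # The -ed ending marks the verb reading, as in "kicked".
    if word.endswith("ed") and word[:-2] in LEXICON:
        return "VERB"
    return LEXICON.get(word, "UNK")

def find_subject(sentence):
    words = sentence.split()
    tags = [tag(w) for w in words]
    verb_at = tags.index("VERB")
    # Pair DET + NOUN into noun phrases; the one before the verb wins.
    for i in range(verb_at - 1):
        if tags[i] == "DET" and tags[i + 1] == "NOUN":
            return f"{words[i]} {words[i + 1]}"
    return None

print(find_subject("the boy kicked the ball"))  # -> the boy
```

Even here you can see the fragility: a passive sentence like the ball was kicked by the boy would fool the rule immediately.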

This particular rule would fail rather quickly, but the necessity of specifying precise patterns should be clear. The more restricted the domain that the computer must cope with, the more superficial the computer’s idea of grammar and meaning can be. Contemporary computers are quite successful in dealing with human language in highly restricted domains. As the domain grows, however, to be more and more like the everyday world humans live in, the more human-like the computer must become as well. Such a machine translator approximates being an android instead of a desktop computer. To talk to a computer, to really talk to it like we talk to each other, the computer might need to be a silicon version of ourselves. JAKE NOTE: I HAD AN ENTIRE SPEECH RECOGNITION SECTION HERE THAT I THOUGHT WOULD BE COOL, BUT SPACE IS ALMOST GONE, SO I’M KILLING IT. LET ME KNOW IF YOU WANT TO PUT IT BACK IN (WHICH MEANS TELL ME IF YOU WANT ME TO WRITE IT).

1 comment:

fairyhedgehog said...

I found this really interesting and I enjoyed the examples you gave.

If I've understood correctly, domain analysis involves giving a computer the context to understand the language it's translating, thus avoiding some basic errors.

Presumably, if you had enough domain analyses to cover the whole of human experience a computer could translate anything, provided it also had a way of determining which domain was appropriate at any given moment.

I wonder how people make that determination.

About Babel Fish: I was talking about the original Hitchhiker's Guide fish, not the online translator. I don't see how a computer could ever be as effective as that literary device. For a start, I don't see how it could learn an unknown language: there isn't always a Rosetta Stone.