The task of machine translation (MT) is to take input from a source language, and put out a translation in a target language.
Machine translation has not reached the point where it can be run as a batch and produce flawless translation. When run as a batch, there is generally a provision for post-editing by a human. Other approaches are interactive, and refer to the judgements of humans throughout the process of translation.
Some MT systems are concerned with only a single language pair; others are multilingual.
The illustration shows a diagram that is commonly drawn when discussing the tradeoffs involved in MT.
The 'S' and 'T' stand for source and target, respectively.
Early versions of MT took what is called the 'direct' approach, wherein words in the source language were translated directly into their presumed counterparts in the target language, with a little massaging of word order. Needless to say, the results were less than spectacular.
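As a rough illustration of the direct approach, the sketch below translates word for word and then applies a single reordering pass. The tiny Spanish-to-English lexicon, the `ADJECTIVES` set, and the function name `direct_translate` are all assumptions made up for this example, not part of any historical system.

```python
# Toy bilingual dictionary (Spanish -> English); entries are illustrative only.
LEXICON = {"la": "the", "casa": "house", "blanca": "white"}

# Hypothetical list of target-language adjectives used by the reordering step.
ADJECTIVES = {"white"}


def direct_translate(sentence: str) -> str:
    """Translate word for word, then swap each adjective with the word before it."""
    # Lexical step: look up every word; unknown words pass through unchanged.
    words = [LEXICON.get(w, w) for w in sentence.lower().split()]

    # "Massaging" of word order: move an adjective in front of the word it follows.
    i = 1
    while i < len(words):
        if words[i] in ADJECTIVES and words[i - 1] not in ADJECTIVES:
            words[i - 1], words[i] = words[i], words[i - 1]
            i += 2
        else:
            i += 1
    return " ".join(words)


print(direct_translate("la casa blanca"))
```

The reordering step knows nothing about syntax: it would just as readily move an adjective in front of an article or a verb, which hints at why this style of translation breaks down on anything but the simplest input.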
Ideally, we would like to be able to translate the source into some language-neutral representation called an interlingua, represented by 'I' in the illustration. Then translation would involve 'simply' rendering the source into the interlingua, then 'simply' generating the target. Using an interlingua, we could 'easily' create a multilingual MT system by adding, for each language, a language-to-interlingua analyzer and an interlingua-to-language generator.
Problems with interlinguas:
In practice, MT systems are forced to take a transfer approach, which follows a course somewhere through the mid-altitudes of the triangle, making generalizations where possible and relying on language-pair-specific knowledge where necessary.
One of the disadvantages of using a transfer approach is that translation among N languages requires N(N-1) language pairs to be defined, one for each source-to-target direction. An interlingua would require only 2N modules: one analyzer and one generator per language.
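The arithmetic can be checked with a few lines of Python. The function names are hypothetical, and the counts assume one transfer module per ordered language pair and one analyzer plus one generator per language.

```python
def transfer_modules(n: int) -> int:
    """Number of directed language pairs among n languages: n * (n - 1)."""
    return n * (n - 1)


def interlingua_modules(n: int) -> int:
    """One analyzer and one generator per language: 2 * n."""
    return 2 * n


for n in (2, 3, 5, 10):
    print(n, transfer_modules(n), interlingua_modules(n))
```

For two languages the transfer approach needs fewer modules, the two counts are equal at three languages, and from four languages on the transfer count grows quadratically while the interlingua count grows linearly.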
One aspect of translation involves choosing words and phrases in the target language which most closely correspond to those in the source.
Transfer ambiguities. Aside from the kinds of ambiguities that make life so interesting for NLU researchers, translation also involves transfer ambiguities. The target language may recognize distinctions which are not encoded directly in the source language. For example, 'wear' is best translated into Mandarin Chinese as chuan if the article being worn is a shirt or pants, but dai if the article being worn is a hat or watch. Conversely, in translating the words dai and chuan from Mandarin to English, we must decide whether the matter referred to is a case of 'wearing' or 'putting on'. In translating the word 'thing' into Mandarin, we must decide whether we're talking about a physical object (dongxi) or a state of affairs (shiqing).
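One simple way to handle the 'wear' case is to make the lexical choice depend on a semantic class of the object. The sketch below is a minimal illustration: the class labels 'garment' and 'accessory', the two tables, and the function `translate_wear` are assumptions introduced here, and only the four objects mentioned above are covered.

```python
# Hypothetical semantic classes for the object of 'wear'.
OBJECT_CLASS = {
    "shirt": "garment",
    "pants": "garment",
    "hat": "accessory",
    "watch": "accessory",
}

# Target verb chosen per object class, following the examples in the text.
WEAR_BY_CLASS = {"garment": "chuan", "accessory": "dai"}


def translate_wear(obj: str) -> str:
    """Pick the Mandarin verb for 'wear' based on what is being worn."""
    cls = OBJECT_CLASS.get(obj)
    if cls is None:
        # The source sentence alone does not tell us which verb to use.
        raise KeyError(f"no semantic class known for {obj!r}")
    return WEAR_BY_CLASS[cls]


print(translate_wear("shirt"), translate_wear("hat"))
```

The lookup fails for any object not listed, which reflects the real difficulty: the distinction has to come from knowledge about the object, since English 'wear' does not encode it.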
Lexical gaps. There are many cases where no word in the target corresponds to a word in the source. Once when asked in Mandarin if I liked squid, I wanted to say 'It tastes great, but I don't like the texture.' I have yet to find someone who can tell me an appropriate word for 'texture' in this context. You pretty much have to fall back on saying 'it chews like rubber'.
Idioms. Languages are full of idioms such as 'bite the bullet' which do not obey the principle of compositionality, which is to say the meaning of the whole is not built up from the meanings of its parts. Where transfer is concerned, there is no way around dealing with each idiom as a special case.
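Treating each idiom as a special case can be sketched as a lookup that runs before ordinary word-by-word transfer and collapses the idiom into a single unit. The table `IDIOMS`, the unit label, and the function `mark_idioms` are hypothetical, and the sketch assumes exact token matches (so it would miss inflected forms such as 'bit the bullet').

```python
# Hypothetical idiom table: token sequence -> single transfer unit.
IDIOMS = {("bite", "the", "bullet"): "IDIOM:bite_the_bullet"}


def mark_idioms(tokens):
    """Replace each known idiom with one unit so later rules treat it as a whole."""
    out = []
    i = 0
    while i < len(tokens):
        for pattern, unit in IDIOMS.items():
            if tuple(tokens[i:i + len(pattern)]) == pattern:
                out.append(unit)
                i += len(pattern)
                break
        else:
            # No idiom starts here; keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out


print(mark_idioms("we must bite the bullet now".split()))
```

Each marked unit would then need its own entry in the transfer lexicon for every target language, which is exactly the per-idiom effort described above.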
Structural transfer applies to syntactic differences between languages. In languages like Spanish, adjectives tend to follow nouns; in Japanese, the main verb comes after its complements.
If you are using phrase structure rules to model the syntax of source and target, you will have to express transfer rules which rearrange the trees in addition to lexical transfer.
Example (from Hutchins & Somers):
In light of this we might encode a rule like:
[s:[np:X] [tv:likes] [np:Y]] => [s:[np:Y'] [v:gefällt] [pp:dem [X']]]
...where X' and Y' indicate that other rules would be employed to translate X and Y.
Bear in mind that, for reasons of simplicity, a number of features such as gender, number, etc. are left out of this example, but they are crucial to the task of translation.
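A transfer rule of this kind can be sketched as a function over trees written as nested tuples. The tuple encoding, the function names, and the pass-through `translate_np` (standing in for the other rules that would produce X' and Y') are assumptions made for illustration, and agreement features are ignored just as in the rule itself.

```python
def translate_np(content):
    """Stand-in for the rules that translate X and Y; here content passes through."""
    return content


def transfer_likes(tree):
    """Apply the like -> gefallen rule to ("s", ("np", X), ("tv", "likes"), ("np", Y)).

    Returns the rearranged target tree, or None if the rule does not match.
    """
    if len(tree) != 4 or tree[0] != "s":
        return None
    _, subj, verb, obj = tree
    if subj[0] != "np" or verb != ("tv", "likes") or obj[0] != "np":
        return None
    x_prime = translate_np(subj[1])
    y_prime = translate_np(obj[1])
    # Source subject X ends up inside the pp; source object Y becomes the subject.
    return ("s", ("np", y_prime), ("v", "gefällt"), ("pp", "dem", x_prime))


source = ("s", ("np", "X"), ("tv", "likes"), ("np", "Y"))
print(transfer_likes(source))
```

A full system would hold many such rules and try them in turn, falling back to default structure-preserving transfer when none matches.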
Target languages often present a number of constraints which don't hold in the source language. Japanese has something called the animate subject constraint, which forbids direct translation of sentences like 'The wind opened the door.' English and other European languages generally require that each sentence have a subject, but other languages such as Japanese and Mandarin use a 'topic-comment' paradigm, where the topic does not have to be stated if it has not changed. This means that in translating from such a language, we must supply a subject. Translation into these languages can also be awkward if the topic is needlessly restated in every sentence.
As mentioned earlier, a true interlingua is language neutral, but falling short of that, pushing the language model more toward the direction of semantics can capture a number of generalizations which may aid in the translation process. We can do this with a unification grammar and thematic roles. Referring back to the like/gefallen example, suppose we adopted this approach:
The German equivalent might be:
...and...
An approach like this would make it easier to express the fact that in English one says 'I like to swim', but in German the expression is more like 'I swim gladly'.
Other approaches which are gaining currency are those which rely more heavily on the existence of bilingual corpora. The existence of these corpora allows us to use statistics and refer to specific examples.
A bilingual corpus contains two sections, each a fluent translation of the other. In order to be useful, the corpus must be aligned, i.e. to as great a degree as possible, each phrase in one section must be indexed to its counterpart in the other.
An example-based approach, for instance, might encounter a phrase which it needs to translate. It could retrieve examples by applying some metric of similarity to expressions already in the corpus, map the new expression to the retrieved one, determine the corresponding parts of the aligned translation, and use those parts as the basis for further processing.
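The retrieval step can be sketched with Python's standard `difflib` module. The three-entry English-German corpus, the word-level similarity metric, and the function names are assumptions chosen for illustration. A real system would use a much larger aligned corpus and a more informed metric, and would go on to adapt the retrieved translation rather than return it unchanged.

```python
from difflib import SequenceMatcher

# Toy aligned corpus: each source phrase is indexed to its translation.
ALIGNED = [
    ("I like to swim", "Ich schwimme gern"),
    ("good morning", "guten Morgen"),
    ("thank you", "danke"),
]


def similarity(a: str, b: str) -> float:
    """Word-level similarity between 0 and 1 (an assumed, very simple metric)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def closest_example(query: str):
    """Return the aligned pair whose source side is most similar to the query."""
    return max(ALIGNED, key=lambda pair: similarity(query, pair[0]))


source, target = closest_example("I like to dance")
print(source, "->", target)
```

Here the query shares three of four words with 'I like to swim', so that pair is retrieved; the remaining work is to find which part of the German side corresponds to 'swim' and replace it.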