POSTS 24 AUGUST, 2009

Machine Translation – I still do not get it

I recently received an email from a company I am trying to introduce to the advantages of machine translation. They work mostly in a closed environment, and the source language is English (even if the English is sometimes poor). They are a perfect candidate for automation, as they deal mostly with user manuals and controlled documentation.

The comment in question was:

“I understand […] to share TMs for translation, however, I still do not have an understanding on “Machine Translation” which quality still cannot apply to real job yet in Japanese related translation. Even though the technology developed among European languages, the people who do not know European languages like us still worry about the quality, because we cannot judge them. This is my honest impression.
It sounds simple question to you, but it’s my primitive question!”

Creating a solution that is good for everything is out of the question (for now). Many have tried to climb that mountain, only to die in the attempt.

Recently, Google has attempted such a solution with its Google Translate tool (GT), and it works more or less well. It is particularly useful for general information and gisting. I know it has become a reference tool for many linguists (novices who lack knowledge and experts whose brains are too full of information or who just can’t remember). It is much quicker to ask GT than to check terminology in IATE, the EU’s official terminology database, for example.

GT works well if you try to translate an EULA, for example. The quality is rather good, as it has plenty of aligned material from companies’ websites. For many other areas, the results range from “good enough” (i.e. usable with some post-editing) through “gisting” to plainly bad. GT has been extremely valuable to me when I needed to know what was being said in certain documents in Polish, Chinese, Russian or Japanese. It wasn’t a professional translation but hey! It was free and, most importantly, it was there when I needed it, and it served an invaluable gisting purpose. It was the difference between not knowing and knowing, even if badly or mechanically put.

Serious machine translation is a rather different concept. I favor statistical over rule-based approaches for many reasons. SMT (statistical machine translation) is grounded in logic and maths: it estimates from data how likely one piece of text is to be the translation of another. Given that a language normally has between 10,000 basic words (as in German, where the rest are compound words) and around 30,000 (the vast majority of languages), one might guess that with a 2M-word corpus almost everything that language has to say has been said. This is not so, as there are numerous repetitions, changes of meaning, technical words, set expressions, etc. One may reach 2M words without every verb having been conjugated in all its forms. But a large chunk has been covered, at least the forms we use for 90% of our daily communication. You can build on this to create a model with which a machine can predict and compute matches that nowadays are made by hand and in translators’ heads.
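To make the coverage idea concrete, here is a minimal Python sketch that measures what share of the running words in a corpus is accounted for by its most frequent word types. The function name `coverage` and the toy sentence are invented for illustration; a real check would run over a corpus of millions of words.

```python
from collections import Counter

def coverage(tokens, vocab_size):
    """Share of running words covered by the `vocab_size` most frequent word types."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(vocab_size))
    return covered / len(tokens)

# Toy corpus, invented for illustration only.
corpus = "the printer is on the desk and the cable is in the box".split()

# Share of the text covered by the two most frequent words ("the" and "is").
print(coverage(corpus, 2))
```

Run on a large corpus with a growing `vocab_size`, a measure like this would show how quickly a limited vocabulary covers most of what is actually written, and how long the tail of rare forms is.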

If you reduce the scope of your expectations (I only want the electronics/automotive/legal/agricultural/physics domains), then you can be more precise. You need texts that deal with each domain, and you can, for example, disregard words and texts that deal with “butter”, “international relations”, “motorbike instructions”, “coffee” or “fishing rights” when constructing a model for electronics. “Motorbike instructions” would be fine for a model dealing with engineering, for example.
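As a rough illustration of picking training material by domain, the sketch below keeps only the sentence pairs whose source side mentions at least one term from a domain word list. Everything here is an assumption made for the example: the function `select_in_domain`, the three-word term list and the two sentence pairs. Real data selection is usually more sophisticated than keyword matching.

```python
def select_in_domain(sentence_pairs, domain_terms, min_hits=1):
    """Keep pairs whose source sentence contains at least `min_hits` domain terms."""
    selected = []
    for source, target in sentence_pairs:
        words = set(source.lower().split())
        if len(words & domain_terms) >= min_hits:
            selected.append((source, target))
    return selected

# Hypothetical electronics term list and toy English-Spanish pairs.
electronics_terms = {"resistor", "circuit", "voltage"}
pairs = [
    ("Connect the resistor to the circuit board", "Conecte la resistencia a la placa de circuito"),
    ("Spread the butter on the bread", "Unte la mantequilla en el pan"),
]

electronics_pairs = select_in_domain(pairs, electronics_terms)
print(electronics_pairs)
```

Only the first pair survives the filter, so the “butter” sentence never reaches the electronics model.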

Moses, the best state-of-the-art open-source engine, estimates how likely a given source word X is to be translated as target word Y by applying a set of equations that count word co-occurrence (the number of times target word Y appears whenever source word X appears). It works wonderfully. And the more source material used during training, the better.
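The co-occurrence idea can be sketched in a few lines of Python. This is not the actual Moses training pipeline, which relies on separate word-alignment tools and works with phrases; it is a simplified version of the classic IBM Model 1 estimation, with the usual empty (NULL) word left out, run on three invented English-German sentence pairs. The names `train_word_table` and `best_translation` are made up for this example.

```python
from collections import defaultdict

def train_word_table(sentence_pairs, iterations=10):
    """Estimate p(target word | source word) with a simplified IBM Model 1 (EM)."""
    target_vocab = {w for _, tgt in sentence_pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in sentence_pairs:
            for t_word in tgt:
                norm = sum(prob[(t_word, s_word)] for s_word in src)
                for s_word in src:
                    share = prob[(t_word, s_word)] / norm
                    count[(t_word, s_word)] += share
                    total[s_word] += share
        # M-step: turn the expected counts back into probabilities.
        for (t_word, s_word), c in count.items():
            prob[(t_word, s_word)] = c / total[s_word]
    return prob

def best_translation(prob, source_word):
    """Most probable target word for `source_word` among those seen with it."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)

# Toy parallel corpus, invented for illustration only.
pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]
table = train_word_table(pairs)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(table, word))
```

Even on three sentences the estimates sort themselves out: “the” appears with “das” twice, so “das” is explained by “the”, which in turn leaves “haus” to be explained by “house”. With more training material, the same mechanism has far more evidence to work with.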

My colleague (not new to translation but fairly uncomfortable with the concept of machine translation) mentioned regulating (controlling) the way the source material is written. This can work well, and it holds true for rule-based models (Systran), which work on the basis that A is always 1, B is always 2 and C is always 3, and therefore AB must be 12, CB must be 32, etc. Controlling the input also helps with statistical models, but the need is not as great, since millions of words are computed every time, for every sentence, i.e. the “correspondence” equations are applied to each sentence throughout the 2M, 3M, 4M or 5M-word corpus. This is how Google Translate behaves and how our machine PangeaMatic is learning to behave within specific language domains.
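The rule-based principle can be caricatured in Python as a fixed lookup table. This is only a toy rendering of the A, B, C analogy, not a description of how Systran or any real rule-based engine is built; the names `RULES` and `rule_based_translate` are invented for the example.

```python
# Fixed, hand-written rules: each source symbol always maps to the same output.
RULES = {"A": "1", "B": "2", "C": "3"}

def rule_based_translate(source):
    """Translate symbol by symbol; raises KeyError for input the rules do not cover."""
    return "".join(RULES[symbol] for symbol in source)

print(rule_based_translate("AB"))
print(rule_based_translate("CB"))
```

Any input outside the rules (say, a “D”) makes the lookup fail, which is why controlling the source text matters so much more for this kind of system than for a statistical one.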
