by Manuel Herranz
There has been some recent controversy on LinkedIn and in blogs about claims to higher technical levels of engine customization: what machine translation engine customization actually is, what DIY machine translation customization means, and how some people understand it.
PangeaMT specializes in custom-built systems which users (typically LSPs and in-house translation departments) can later re-train in two different ways.
K. Vashee states that “The reality is that running an open source MT solution or using a “upload and pray” solution like that of many DIY MT vendors has become very easy.” This is a gross misunderstanding of what DIY MT is. DIY is about empowering the MT user to take control of the system, or at least part of the process, rather than being a passive recipient of MT output that must be quickly post-edited.
Building an MT engine has become quite popular (which is not the same as easy) and widespread in 2013. Systems are getting better as more and more data becomes available. Yet data is not everything. One of our largest engines at PangeaMT holds more than 190 million words, and other engines contain five or six TMX files with over 300 MB of text data each. Some small engines with under 5M words perform very well for the documentation task they were built for (see our joint presentation with Sybase at Localization World 2011 below).
I do not know of any MT system builder who claims that using unclean data will not affect the output, or who leaves such freedom to untrained MT users. That is a key differentiator for PangeaMT: we train users so they can influence how their MT will evolve and develop. An initial revision of (at least) part of the material, or of typical chunks of text within the domain, is the first step in MT engine customization. Below I summarize some key steps for a good DIY SMT implementation, whether on-site or off-site (SaaS):
1. Gather relevant, in-domain material.
Your own material is key for the best engine performance. The material you have translated in the past is likely to be similar to the material you will translate in the future. Those expressions, terminology lists, translation memories, HTML files, parallel data, even monolingual texts, will form the basis of your customized engine.
However, there may be times when you cannot share all your data. Do not despair: this is where PangeaMT's approach helps. Any general, related data will serve the purpose for the initial engine setup. We will train you and show you the potential pitfalls of training sets and cleaning.
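As an illustration of how in-domain material like translation memories becomes training data, the sketch below extracts parallel segment pairs from a TMX file. This is a hedged example, not PangeaMT's actual tooling: the element and attribute names follow the TMX 1.4 specification, and real-world exports can vary.

```python
import xml.etree.ElementTree as ET

def extract_pairs(tmx_path, src_lang="en", tgt_lang="de"):
    """Extract (source, target) segment pairs from a TMX file."""
    pairs = []
    tree = ET.parse(tmx_path)
    for tu in tree.getroot().iter("tu"):  # one translation unit per segment
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; some older exports use a plain lang attribute
            lang = tuv.get("{http://www.w3.org/XML/1998/namespace}lang") or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None and seg.text:
                segs[lang.lower()[:2]] = seg.text.strip()  # "en-US" -> "en"
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

The resulting list of pairs is what the cleaning and training stages described next would operate on.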
2. Ask your vendor to analyze the data provided and run cleaning procedures.
Your MT vendor should be transparent about “dirty data”: which segments were discarded, together with an analysis of the troublesome segments or datasets that should not be used for machine learning. Dirty data does not necessarily mean “bad translation”; very often it is “noise” introduced by the translation management tool itself, rendering a segment unusable for machine learning. Segments that explain rather than translate, or that offer bilingual versions, will confuse the learning patterns. So will profuse punctuation (dashes, quotation marks, semicolons, colons) where it should not be, bad alignments, or segments where the source is identical to the target.
Data cleaning is a key step in the process. We recommend deleting segments rather than trying to “repair” them; most of the time it is not worth the effort, unless your data is really dirty.
A lot of cleaning can be done prior to the material entering the system (see below).
There are more complicated “cleaning” routines which fall outside the scope of this article and involve revising alignments in phrase tables. We will leave that for keen system users.
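The simpler filters described above (identical source and target, bad alignments, punctuation noise) can be sketched as a per-segment check. The thresholds below are illustrative assumptions, not PangeaMT's actual cleaning pipeline:

```python
import re

def keep_segment(src, tgt, max_ratio=3.0):
    """Heuristic filters for noisy training pairs; thresholds are illustrative."""
    if not src.strip() or not tgt.strip():
        return False  # empty side
    if src.strip() == tgt.strip():
        return False  # untranslated: source same as target
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) / max(1, min(n_src, n_tgt)) > max_ratio:
        return False  # length mismatch suggests a bad alignment
    if len(re.findall(r'[“”";:\-–]', src)) > 0.3 * max(1, n_src):
        return False  # punctuation noise, often left behind by the TM tool
    return True
```

Running every extracted pair through such a filter, and simply dropping the rejects, follows the "delete rather than repair" recommendation above.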
3. Perform initial tests (first engines) together with your vendor.
Your vendor may do this and simply present you with the final “good” engine, or with a variety of engines depending on your specialization. A common training method is to hold out 2,000 segments from the training material and then ask the engine to translate those segments, thus obtaining a BLEU score (i.e. a measure of how good the system thinks it is). However, this is neither the only way nor the most efficient, and BLEU scores cannot be compared across languages, nor even within the same language for different domains. An engine scoring 55% BLEU is no good when asked to translate out-of-purpose material, whereas PangeaMT systems have been reported to provide productivity increases from 50% to 300% in German with small engines scoring 38% BLEU but built for very specific purposes such as software documentation or automotive manuals.
Then put the engine to the test with previous translations you did not provide for training, or with similar material.
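To make the held-out evaluation concrete, here is a minimal reimplementation of corpus-level BLEU (uniform 4-gram weights plus brevity penalty). It is a sketch for understanding what the score measures; real evaluations should use a standard tool (e.g. Moses' multi-bleu script) so that numbers are at least comparable within one setup.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of 1..max_n-gram precisions
    times a brevity penalty. Single reference per hypothesis."""
    match, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())  # clipped counts
            total[n - 1] += max(0, len(h) - n + 1)
    if min(match) == 0:
        return 0.0  # some n-gram order had no matches at all
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(1, hyp_len))
    return 100 * bp * math.exp(log_prec)
```

Feeding the engine's translations of the 2,000 held-out segments as `hypotheses` and the human translations as `references` yields the score the vendor reports. Note how the formula explains why scores do not transfer across languages or domains: they depend entirely on how the reference set is built.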
4. Learn about engine re-training and the impact of post-edited material.
How big is your engine? How many words does it contain? What are its BLEU, METEOR and other scores? How many words do you need to retrain your engine? Does your vendor ask for 5% or 10% of the engine size, or does it promise on-the-fly re-training with just one sentence? Even though the latter sounds pretty good, a 20-word sentence will have little impact on any engine, particularly considering that even “small” MT engines may contain 5 million words.
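The arithmetic behind that point is simple enough to check yourself. The figures below reuse the numbers from the paragraph above; the helper function is just an illustration:

```python
def retrain_share(engine_words, new_words):
    """Fraction of the combined training data contributed by new material."""
    return new_words / (engine_words + new_words)

# One post-edited 20-word sentence against a "small" 5M-word engine
# contributes a vanishingly small share of the training data:
tiny = retrain_share(5_000_000, 20)

# A 5% batch (250,000 words) is a meaningfully larger share:
batch = retrain_share(5_000_000, 250_000)
```

In other words, on-the-fly re-training with a single sentence changes almost nothing statistically, which is why vendors asking for 5-10% of the engine size are being realistic rather than demanding.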
We recommend a route whereby your post-edited material can enter the re-training cycle at any time, and a system where you are in control of both cleaning and re-training. PangeaMT offers both: you can upload new material any time after you complete a translation or finish a post-editing job. The latter is extremely good material, and several papers point to the benefits of feeding post-edited material back into MT engines. You can also schedule re-training or trigger it immediately.