TAUS has published on its website a new guide for anyone interested in working with MT and using their own data to create machine translation engines.
The technical guide to SMT Training Data is intended for users and organizations keen to train engines with their own data. It deals with the preparation of translation training data for statistical machine translation (SMT). It examines the data preparation processes (typically applied to bilingual TMX files) that enable data and algorithms to work together.
The TAUS report, written by Tom Hoar, also explores how to define an organization's training data strategy to match the overall system design, and how to identify potential sources of bilingual, well-aligned TMX. It discusses the challenges of merging corpora from multiple sources into large but stable data sets, and explores several methods for turning translation memories from different sources into SMT training data.
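As a rough illustration of what merging translation memories from several sources can involve, here is a minimal Python sketch. It is not taken from the TAUS guide: the function names `read_tmx_pairs` and `merge_tmx` are hypothetical, and the cleaning rules (whitespace normalization, skipping empty segments, dropping exact duplicate pairs) are assumptions chosen for the example, not the methods the report recommends.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from one TMX document string."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() joins all text inside <seg>, including text
                # nested in inline elements; split/join collapses whitespace.
                text = " ".join("".join(seg.itertext()).split())
                # Assumption: compare primary language subtags only (en-US -> en).
                segs[lang.split("-")[0]] = text
        src, tgt = segs.get(src_lang), segs.get(tgt_lang)
        if src and tgt:  # skip units with a missing or empty side
            yield src, tgt


def merge_tmx(documents, src_lang, tgt_lang):
    """Merge several TMX documents, dropping exact duplicate pairs."""
    seen = set()
    merged = []
    for doc in documents:
        for pair in read_tmx_pairs(doc, src_lang, tgt_lang):
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged
```

The merged list keeps the order in which pairs were first seen, so it can be written out as two line-aligned text files, one per language, which is a common input format for SMT training toolkits.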
Finally, it looks into the roots of SMT in speech technology and introduces the concept of exception management as a context for preparing SMT training data.
Pangeanic has used bilingual data sets from several organizations, including the EU and the UN, to mix data and customize machine translation engines for some of its clients.