PARALLEL CORPORA FOR MT SYSTEMS
Build and improve Machine Translation Systems with human-crafted, high-quality parallel corpora
Because of our translation roots, we have a long history of creating parallel corpora for MT systems. Our stock, translation and post-editing services have been used to train some of the world’s best-known online machine translation systems.
What are Parallel Corpora?
Parallel corpora are collections of translations, typically in two languages, that are aligned at the sentence or phrase level.
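As a rough illustration of sentence-level alignment, the sketch below pairs line i of a source text with line i of its translation. The `SentencePair` class and the `read_aligned` helper are hypothetical names chosen for this example; they do not describe any particular corpus format or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned unit: a source sentence and its translation (hypothetical type)."""
    source: str
    target: str
    source_lang: str
    target_lang: str


def read_aligned(source_lines, target_lines, source_lang, target_lang):
    """Pair line i of the source with line i of the target.

    Assumes both inputs are already aligned one sentence per line.
    """
    source_lines = list(source_lines)
    target_lines = list(target_lines)
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    return [
        SentencePair(s.strip(), t.strip(), source_lang, target_lang)
        for s, t in zip(source_lines, target_lines)
    ]


pairs = read_aligned(
    ["The house is small.", "Thank you."],
    ["La casa es pequeña.", "Gracias."],
    "en",
    "es",
)
```

Each element of `pairs` then holds one sentence together with its translation, which is the unit a machine translation system learns from.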
Parallel corpora became highly sought-after data with the advent of statistical and, later, neural machine translation systems. They are essential for training any machine translation system, and they are also valuable for other Artificial Intelligence (AI) applications that require multilingual data, as is the case in many NLP disciplines, such as Natural Language Generation (NLG).
Parallel corpora are a valuable tool for modern research into Machine Translation. They provide insights into the nature of translation, and can be used to train probabilistic machine translation systems. Examples of parallel corpora include the Canadian Hansard proceedings in English and French, the multilingual United Nations Parallel Corpus, and of course the largest collection of multilingual corpora, DGT-Translation Memory (europa.eu). However, please be aware that this is public data tied to the domains of those administrations, and that this origin should be acknowledged if the data is used. Most of these data sets will require extensive cleansing.
How important are parallel corpora for AI?
Many AI systems, such as machine translation systems, require bilingual or multilingual data. This data needs to be highly curated, in-domain, and cleansed before delivery. Pangeanic collects data that is free from intellectual property (IP) restrictions and generates its own monolingual data, which is translated at the client’s request to build or fine-tune machine translation systems.
Our parallel corpora enable systems to learn how words and phrases are translated between two or more languages and to recognize the subtle differences in meaning and usage that exist between them.
+200 language combinations!
We help create truly multilingual systems, with parallel corpora not only from or into English but in challenging language pairs such as Chinese, Japanese, Korean, Hindi, Hokkien or Arabic into Spanish, French, German, Italian, Brazilian Portuguese, Russian or Polish!
Some of our specialist domains
No language domain is beyond our experienced team when it comes to translation services for data-for-AI. Here’s a short list of the domains in which we can provide parallel corpora:
- Conversations and dialogs
- Engineering disciplines
- Social Media
- Advertising
- Marketing
- Web
- Comments
- Financial
- e-commerce
- Tax Issues
- Medical
- Arts
- History
- Politics
- Public Administration
- User-generated content
- Healthcare
- Legal
- Software
- Videogames
- Instruction manuals
- Automotive
- Education and School
- Economics
- Idioms, set expressions and sayings
Confidentiality and Ownership (IP rights)
Many of our clients order from our stock or request translation services for data they have provided. Pangeanic’s reputation is built on confidentiality and privacy. We collect, create and manufacture monolingual corpora and translate parts of them upon request. Many of our clients value unique data that can differentiate them from other technology companies, and do not wish to share rights or re-use commonly available open-access corpora. Our clients can choose to have full rights over the data, which then becomes their property. Pangeanic tags it with a unique identifier and separates it from the general database so that it is never resold.
Where are parallel corpora used?
Machine translation algorithms need high-quality parallel corpora to learn from. This means corpora that are carefully selected, verified by humans and domain-specific. Pangeanic can create corpora with specific terminology and expressions in order to reinforce particular aspects of a system. Apart from machine translation systems, parallel corpora are also used for training and evaluating other multilingual AI applications, like speech recognition or sentiment analysis systems.
How does Pangeanic overcome challenges in building parallel corpora?
Everybody involved in AI knows that building and using any type of training data set is not without its challenges, and parallel corpora are no exception. Quality is required during the source language collection stage and, of course, during the human translation and verification stage.
Data quality
Parallel corpora must be high-quality, which means that the translations must be accurate and aligned at the sentence or phrase level. This can be difficult to achieve, especially when dealing with low-resource languages. Post-editing has become a strategy to reinforce the style of a particular MT engine, and Pangeanic is no stranger to it.
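Some of these quality problems can be caught automatically before human verification. The following is a minimal sketch of common cleansing heuristics (empty segments, extreme length ratios, untranslated copies, duplicates). The function names and thresholds are assumptions made for illustration, not a description of Pangeanic's pipeline, and the word-based length ratio only makes sense for languages that separate words with spaces.

```python
def keep_pair(source, target, max_ratio=3.0, max_words=200):
    """Return True if a sentence pair passes basic sanity checks.

    max_ratio and max_words are illustrative thresholds, not recommendations.
    """
    src_words = source.split()
    tgt_words = target.split()
    if not src_words or not tgt_words:
        return False  # one side is empty
    if len(src_words) > max_words or len(tgt_words) > max_words:
        return False  # overly long segments are often misaligned
    ratio = max(len(src_words), len(tgt_words)) / min(len(src_words), len(tgt_words))
    if ratio > max_ratio:
        return False  # lengths differ too much to be a plausible translation
    if source.strip().lower() == target.strip().lower():
        return False  # target is probably an untranslated copy (heuristic)
    return True


def clean(pairs):
    """Drop duplicate pairs and pairs that fail keep_pair, preserving order."""
    seen = set()
    kept = []
    for source, target in pairs:
        # Normalise whitespace and case so near-identical duplicates collapse.
        key = (" ".join(source.split()).lower(), " ".join(target.split()).lower())
        if key in seen or not keep_pair(source, target):
            continue
        seen.add(key)
        kept.append((source, target))
    return kept


raw = [
    ("The house is small.", "La casa es pequeña."),
    ("The house is small.", "La casa es pequeña."),  # duplicate
    ("Thank you.", ""),  # empty target
    ("Yes.", "Sí, por supuesto que estaremos allí mañana por la tarde."),  # length mismatch
    ("Click here", "Click here"),  # untranslated copy
]
cleaned = clean(raw)
```

Filters like these only remove obvious noise; they cannot judge whether a translation is accurate, which is why human verification remains part of the process.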
Data quantity
Parallel corpora must be large enough to be useful, which means that they must contain a sufficient number of sentences to enable effective training of machine translation systems. Low-resource languages with a low number of speakers or a limited digital presence (such as African languages or languages of the Indian subcontinent or Southeast Asia) have historically represented a challenge for data collection, and Pangeanic works daily to build stock. Finding specific data in “narrow” domains, such as conversations and dialogs or tax issues, is also a challenge.
Data diversity
Parallel corpora must be diverse enough to capture the full range of variation and nuance in the languages being studied. This can be a challenge, especially when working with languages that have multiple dialects or regional variations, such as Arabic, German (from Austria, Germany or Switzerland) or Spanish.
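One simple way to check regional coverage is to look at how segments are distributed across language variants. The sketch below assumes each segment carries a locale tag such as `de-DE`, `de-AT` or `de-CH`; the `variant_shares` helper and the 30% threshold are hypothetical choices for this example.

```python
from collections import Counter


def variant_shares(locale_tags):
    """Return the fraction of segments per locale tag (empty dict if no tags)."""
    counts = Counter(locale_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical tags for four German segments.
tags = ["de-DE", "de-DE", "de-AT", "de-CH"]
shares = variant_shares(tags)

# Flag variants that fall below an illustrative 30% share.
underrepresented = sorted(tag for tag, share in shares.items() if share < 0.3)
```

A report like this only shows where a corpus is thin; deciding which variants matter depends on the target users of the system.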
Examples of using parallel corpora in AI
Parallel corpora are essential resources for AI applications that require multilingual data, such as machine translation, speech recognition, and sentiment analysis. Here is one example:
Speech Recognition
Parallel corpora are used to train and evaluate speech recognition systems that can recognize speech in multiple languages.
Pangeanic has a long pedigree in building and using parallel corpora. We understand that it can be challenging, but our high-quality corpora for machine translation, speech recognition, and sentiment analysis have delivered proven improvements in accuracy, depth and coverage across many domains for many MT companies. We know what works, and you can take our word for it as developers!
Languages evolve and AI continues to advance
Our corpora collection teams are busy gathering new words and expressions that arise from new circumstances, such as COVID-19, remote working or staycation, in many languages to meet the demand for high-quality parallel corpora and machine translation engine updates.