NTEU Neural Translation for the EU
Near-Human Quality Neural Machine Translation for Public Administrations
NTEU Project: Pioneering Pan-European Neural Machine Translation and its Lasting Impact
The Neural Translation for the European Union (NTEU) project, led by Pangeanic in consortium with KantanMT and Tilde, marked a significant milestone in breaking down language barriers across the continent. It was also a major data-for-AI collection effort by the partners. The resulting datasets were later used to train large language models such as the Barcelona Supercomputing Center's Salamandra.
Funded by the Connecting Europe Facility (CEF) program and supported by the Spanish Government's Secretariat for Digital Advancement (SEAD), this ambitious initiative delivered the largest neural machine translation engine farm to date, enabling direct translation across all official EU languages.
The project, which ran from September 2019 to August 2021, achieved its core mission: to empower European Public Administrations with cutting-edge automatic translation capabilities, fostering seamless data flow and communication irrespective of the source language.
Key Achievements and Successes:
The NTEU project successfully delivered on its ambitious goals, creating a paradigm shift in multilingual communication within the EU:
- Creation of a Comprehensive Neural Engine Farm: The consortium developed 506 distinct neural machine translation (NMT) engines, providing direct, near-human quality translation across all official EU language combinations. This was a significant leap from previous systems, which often relied on English as a pivot.
- Elimination of Pivot Language Dependency: A cornerstone success was the ability to translate directly between language pairs such as Estonian and Portuguese or Maltese and Greek without first translating into English. This drastically improved translation accuracy, nuance, and speed for less common language pairings.
- Massive Data Collation and Curation: The project gathered, processed, and used vast quantities of bilingual and monolingual data. Sources included the consortium members' extensive repositories, the EU's own substantial data holdings (such as EURAMIS and DGT material), and data from other European-sponsored projects. For many language pairs, the target of at least 15 million high-quality sentences was met or exceeded.
- Advanced Solutions for Under-Resourced Languages: The consortium addressed data scarcity for less common language combinations by employing state-of-the-art multilayer neural networks and automatic text generation techniques to create or augment the necessary bilingual corpora.
- Robust Evaluation and Quality Assurance: The project implemented rigorous quality evaluation processes, coordinated by the General Technical Office of the Spanish Language Technology Plan. Results were validated by various universities, ensuring the engines met high industry and academic standards (for example, WMT evaluation practices).
- Enhanced eTranslation Capabilities: NTEU significantly extended the coverage and capabilities of the European Commission's existing eTranslation system, moving beyond its initial focus on English and a few major European languages to encompass all official EU languages.
- Facilitated Integration for Public Administrations: The project laid the groundwork for European Public Administrations to integrate these machine translation services into their national infrastructures, thereby enhancing cross-border digital services and collaboration.
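As a quick check on scale: 506 engines is exactly the number of ordered source-target pairs among 23 languages (23 × 22), the language count cited for the project's data collection. The short Python sketch below computes that figure and contrasts it with the 44 engines a single-pivot design would need for the same languages. The function names are illustrative assumptions, not part of any NTEU tooling.

```python
from itertools import permutations


def direct_engine_count(num_languages: int) -> int:
    """One engine per ordered (source, target) pair of distinct languages."""
    return num_languages * (num_languages - 1)


def pivot_engine_count(num_languages: int) -> int:
    """Engines needed when every translation is routed through one pivot
    language: one engine into the pivot and one out of it for each of the
    other languages."""
    return 2 * (num_languages - 1)


# Assumption: 23 languages, the figure given in the project's outcomes.
NUM_LANGUAGES = 23
print(direct_engine_count(NUM_LANGUAGES))  # 506
print(pivot_engine_count(NUM_LANGUAGES))   # 44

# Small illustration using languages named in the text (ISO 639-1 codes).
sample = ["et", "pt", "mt", "el", "en"]
pairs = list(permutations(sample, 2))
print(len(pairs))             # 20
print(("et", "pt") in pairs)  # True
```

The gap between 44 and 506 is the engineering cost of removing the pivot: each direct pair such as Estonian to Portuguese needs its own trained engine and its own parallel data.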
Impact and Legacy:
The NTEU project was not just a time-bound initiative but a catalyst for ongoing innovation in language technology within Europe.
- Strengthening the Digital Single Market: By enabling multilingual communication, NTEU directly contributed to the European strategy of creating a true digital single market, removing language as a barrier to commerce, administration, and citizen interaction.
- Paving the Way for Future Developments: The expertise, data, and technological advancements from NTEU have had a lasting impact. Notably, the Spanish government agency SEDIA (formerly SEAD), which was involved in NTEU, subsequently drew on the project's foundations and insights for the creation of the Salamandra language models and SalamandraTA translation models by the Barcelona Supercomputing Center (BSC-CNS) in 2024/2025. This demonstrates a clear lineage from NTEU to next-generation AI and translation initiatives in Spain and Europe.
- Broader Application Potential: While initially focused on Public Administrations, the high-quality engines developed have the potential to serve a wider range of non-public bodies, institutions, and governments globally, showcasing European leadership in language technology.
- Increased Visibility and Recognition: The project received coverage in the national press and in prominent technology publications, including Slator, Blog RuralVía, Fundación Madri+d, and El Español, highlighting its importance and innovative nature.
The NTEU project stands as a resounding success. It not only achieved its ambitious technical objectives but also provided a critical infrastructure component for a more integrated and linguistically diverse Europe. Its legacy continues through subsequent national projects like Salamandra, underscoring the strategic importance and long-term value of investing in advanced language technologies.
Outcomes
Scalability and Re-use: Training Data
- Release of Large Parallel Corpora: NTEU was designed to deliver a comprehensive parallel corpora release containing all training segments across all language pairs, ensuring scalability and reusability beyond the project's lifespan.
- Future-Proof Data Sets: The Action's datasets were intended to let future technologies scale up their machine learning algorithms, maximizing the impact of NTEU's contributions.
- Data Collection Across EU Languages: Building on ongoing efforts, NTEU planned to collect data in 23 official EU languages by crawling, aligning, and processing Word documents and PDFs, and by re-using resources from initiatives such as ParaCrawl, NEC-TM, and DGT material.
- Creation of Synthetic Material: Highly scalable synthetic data was to be generated to support machine learning, complementing the collected datasets.
- Focus on CEF-AT and Public Administrations: The project aimed to create the largest parallel corpus between EU official languages, prioritizing the needs of CEF-AT and European Public Administrations.
The NTEU project set out to create the largest-ever combination of direct language-to-language engines.