European multilingual AI infrastructure

NTEU: direct neural translation infrastructure for every official EU language

Neural Translation for the European Union (NTEU) built a large scale farm of specialized neural models for direct translation between European languages, supported by multilingual data collection, corpus preparation, model training and systematic evaluation.

506 directions

Direct neural translation between non English EU languages

24 EU languages

Infrastructure designed for complete official language coverage

€1.9 million

European funding for multilingual digital infrastructure

2019 to 2021

Two year data, training and deployment programme

Pangeanic led

Consortium with KantanMT, Tilde and SEDIA

The European language challenge

A multilingual Europe cannot depend on one language as the route between all others

Earlier machine translation infrastructure commonly relied on English as an intermediate language. A translation from Estonian into Portuguese, for example, could be generated first from Estonian into English and then from English into Portuguese.

This pivot model increased the risk of information loss, accumulated errors and reduced performance for languages with fewer digital resources. It also limited the ability of European public administrations to exchange information directly across national language boundaries.

NTEU addressed the problem by building models for direct translation between official EU languages, supported by the collection, cleaning and preparation of large multilingual corpora.
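
The difference between the pivot route and the direct route can be sketched in a few lines. This is an illustration only, not NTEU code: `translate` is a hypothetical callable standing in for any translation engine, and `chained_fidelity` is a deliberately simple toy model that assumes each step preserves meaning independently with the same probability.

```python
from typing import Callable

# Hypothetical engine interface (an assumption for this sketch):
# (text, source_language, target_language) -> translated text
Translate = Callable[[str, str, str], str]


def translate_via_pivot(text: str, src: str, tgt: str,
                        translate: Translate, pivot: str = "en") -> str:
    """Pivot route: two engine calls, src -> pivot and then pivot -> tgt."""
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, tgt)


def translate_direct(text: str, src: str, tgt: str,
                     translate: Translate) -> str:
    """Direct route: a single engine call, src -> tgt."""
    return translate(text, src, tgt)


def chained_fidelity(per_step: float, steps: int) -> float:
    """Toy model: if every step independently preserves meaning with
    probability `per_step`, a chain of `steps` preserves it with
    probability per_step ** steps."""
    return per_step ** steps


# Stub engine that only records which directions were requested.
requested = []


def recording_engine(text: str, src: str, tgt: str) -> str:
    requested.append((src, tgt))
    return text


translate_via_pivot("tere", "et", "pt", recording_engine)
print(requested)  # [('et', 'en'), ('en', 'pt')]
```

Under the toy model, any per-step value below 1 gives a lower figure for two chained steps than for one, which is the accumulation effect described above. Real engines do not fail independently, so the numbers are indicative at best.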

English pivot dependency

Indirect translation introduced an additional processing step and another opportunity for meaning to be altered.

Uneven language resources

High resource languages had significantly more parallel and monolingual data available than smaller European languages.

Fragmented public data

Valuable multilingual material existed across institutions, repositories and previous projects but required extensive preparation.

Cross border services

European administrations needed reliable language infrastructure for digital services, documents and institutional communication.

The NTEU approach

Build each translation direction as a specific data and model problem

NTEU combined large scale data operations with model specialization. Each language direction required suitable corpora, preparation procedures, training decisions and evaluation criteria.

Discover multilingual data

Identify parallel and monolingual resources from European institutions, consortium repositories, public sources and previous research programmes.

Clean and normalize corpora

Remove noise, duplicates, misaligned segments, encoding problems and unsuitable content before training.

Align language pairs

Convert multilingual documents into usable sentence and segment alignments for each required translation direction.

Train specialized models

Build neural translation systems adapted to the data conditions and linguistic characteristics of individual pairs.

Evaluate performance

Measure output systematically through automated metrics, comparative testing and professional linguistic review.

Deploy as infrastructure

Prepare direct language services for integration into European public administration and eTranslation environments.
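
As a rough illustration of the cleaning step in the list above, the sketch below filters a list of (source, target) sentence pairs. It is not the NTEU pipeline: the function name, the length ratio threshold and the choice of heuristics are assumptions made for the example, and a production pipeline would add language identification and content filters.

```python
import unicodedata


def clean_pairs(pairs, max_len_ratio=3.0):
    """Return cleaned (source, target) pairs in first-seen order.

    Heuristics (all assumed for illustration):
    - normalize Unicode to NFC and collapse whitespace
    - drop pairs with an empty side
    - drop pairs containing U+FFFD, a sign of an encoding problem
    - drop pairs whose character lengths differ by more than max_len_ratio
    - drop case-insensitive duplicates
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
        if not src or not tgt:
            continue
        if "\ufffd" in src or "\ufffd" in tgt:
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # very different lengths suggest a misaligned segment
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned


raw = [
    ("Tere  hommikust", "Bom dia"),     # extra whitespace, otherwise fine
    ("Tere hommikust", "Bom dia"),      # duplicate after normalization
    ("", "Obrigado"),                   # empty source
    ("Jah", "Sim, claro que sim"),      # length ratio above the threshold
    ("Ait\ufffdh", "Obrigado"),         # encoding damage
]
print(clean_pairs(raw))  # [('Tere hommikust', 'Bom dia')]
```

A character length ratio is a crude misalignment signal and the right threshold differs by language pair, which is one reason each direction is treated as its own data problem.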

Pangeanic in NTEU

Consortium leadership, multilingual data operations and neural model development

Pangeanic coordinated the project and connected corpus acquisition, data engineering, neural training, evaluation and European institutional requirements into a single delivery programme.

LEAD

Project coordination

Pangeanic led the consortium, technical planning, delivery coordination and alignment with the objectives of European multilingual digital services.

DATA

Corpus collection and preparation

Large volumes of bilingual and monolingual material were gathered, cleaned, filtered, aligned and prepared for model training.

NMT

Neural model training

Translation systems were trained for direct language directions rather than relying systematically on English as an intermediary.

LOW

Support for lower resource pairs

Data augmentation, multilingual transfer and synthetic data techniques helped address language directions with limited parallel material.

Evaluation and quality control

Automated evaluation and expert linguistic review were used to compare systems and identify models suitable for operational use.

Public sector integration

The resulting engine farm was designed to provide secure multilingual translation capacity for European public administrations.

A major Data for AI programme

The engines were the visible output. The multilingual data operation made them possible.

Training hundreds of direct models required far more than running neural architectures. The central challenge was assembling useful, representative and technically consistent training material for language combinations with very different levels of digital availability.

NTEU therefore operated as a large multilingual data programme: sourcing corpora, validating rights and provenance, removing unsuitable material, detecting language, aligning segments, filtering noise and constructing training and evaluation sets.

This work anticipated the current demand for task specific models. Model quality depends on the suitability of the data, the intended domain, the language direction and the evaluation method.

Corpus acquisition

Institutional repositories, translation memories, public corpora and project data.

Data cleaning

Deduplication, language identification, format normalization and noise removal.

Alignment and filtering

Sentence pairing, confidence scoring and rejection of weak or misleading alignments.

Training set construction

Balanced corpora prepared according to the requirements of each language direction.

Evaluation assets

Separate test material for automated benchmarking and professional human assessment.
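
Two of the steps above, rejecting weak alignments and keeping evaluation material separate from training material, can be sketched as follows. The function names, the 0.7 threshold and the hash based split are assumptions for illustration; the confidence scores are assumed to come from an external aligner, and the source does not say how NTEU built its splits.

```python
import hashlib


def filter_by_confidence(scored_pairs, threshold=0.7):
    """Keep (source, target) pairs whose alignment score meets the threshold.

    `scored_pairs` is an iterable of (source, target, score) tuples.
    """
    return [(src, tgt) for src, tgt, score in scored_pairs if score >= threshold]


def split_by_hash(pairs, test_percent=1):
    """Assign each pair to train or test from a hash of its text.

    The assignment depends only on the pair itself, so an identical pair
    always lands on the same side and cannot leak across the split.
    """
    train, test = [], []
    for src, tgt in pairs:
        digest = hashlib.sha256(f"{src}\t{tgt}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        (test if bucket < test_percent else train).append((src, tgt))
    return train, test


scored = [
    ("Tere", "Olá", 0.93),
    ("Aitäh", "Bom dia", 0.41),   # weak alignment, rejected
    ("Jah", "Sim", 0.70),
]
kept = filter_by_confidence(scored)
train, test = split_by_hash(kept, test_percent=10)
print(len(kept), len(train) + len(test))  # 2 2
```

A content based split is reproducible across reruns of the pipeline, which keeps benchmark results comparable when the training corpus is rebuilt.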

Project results

Direct translation capacity across the official languages of the European Union

The project complemented language infrastructure that had traditionally concentrated on directions involving English.

506

Direct translation directions

Twenty three non English official EU languages multiplied by the twenty two other possible target languages.

23 × 22

Complete non English matrix

Each language could be translated directly into every other non English official EU language.

No required English pivot

Direct models reduced the need to pass European language content through English.

15M

Target sentence scale

Where data availability permitted, the programme targeted at least fifteen million high quality training sentences for a language pair.

Direct language architecture

European languages communicating without an obligatory intermediary

The objective was a network of specialized directions rather than a system in which every path passed through the same central language.

Estonian → Portuguese

Maltese → Greek

Croatian → Italian

Latvian → Spanish

Finnish → Bulgarian

Slovenian → Danish

Romanian → Lithuanian

Slovak → Dutch

These examples illustrate the architecture. NTEU addressed the full matrix of direct directions between the 23 official EU languages other than English.
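
The 506 figure follows directly from counting ordered pairs of the 23 non English official languages. The short sketch below enumerates them; the two letter ISO 639-1 codes are used only as convenient labels for this example and are not taken from the project documentation.

```python
from itertools import permutations

# The 24 official EU languages, as ISO 639-1 codes.
EU_OFFICIAL = [
    "bg", "hr", "cs", "da", "nl", "en", "et", "fi",
    "fr", "de", "el", "hu", "ga", "it", "lv", "lt",
    "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
]

non_english = [code for code in EU_OFFICIAL if code != "en"]

# Ordered pairs: (source, target) with source != target.
directions = list(permutations(non_english, 2))

print(len(EU_OFFICIAL))          # 24
print(len(non_english))          # 23
print(len(directions))           # 506
print(("et", "pt") in directions)  # True
```

Each direction is counted separately from its reverse, so Estonian into Portuguese and Portuguese into Estonian are two of the 506.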

European consortium

Language technology companies and Spanish public sector coordination

The consortium combined multilingual data resources, neural machine translation platforms, experience with lower resource languages and institutional coordination.

Coordinator

Pangeanic

Project leadership, multilingual data operations, neural model development, evaluation and European delivery.

Technology partner

KantanMT

Neural machine translation technology, training infrastructure and model development.

Technology partner

Tilde

Multilingual data, neural models and expertise in Baltic and lower resource European languages.

Public sector partner

SEDIA

Spanish public sector participation and alignment with national language technology strategy.

NTEU in 2026

An early demonstration of task specific multilingual AI

NTEU was designed before small, specialized models became central to the enterprise AI discussion. Its architecture already reflected a principle that remains relevant: different tasks, languages and operating contexts benefit from models trained and evaluated for their specific purpose.

A general model may offer broad linguistic coverage. Operational quality, however, still depends on suitable domain data, representative evaluation sets, controlled deployment and measurable performance in the intended workflow.

NTEU also reinforced a capability that continues through Pangeanic’s current work: transforming multilingual documents and corpora into governed data assets for training, adapting, aligning and evaluating AI systems.

SLM

Task specific models

Model architecture and data can be adapted to a defined language direction, domain and service requirement.

DATA

Multilingual training assets

Clean parallel corpora remain valuable for machine translation, language models, evaluation and model alignment.

EVAL

Language specific evaluation

Aggregate benchmarks cannot replace evaluation across individual languages, domains and operational conditions.

SOV

European control

Public language infrastructure can be deployed under European governance, security and data protection requirements.
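
The point about language specific evaluation can be made concrete with a small sketch. The scores below are made-up placeholder numbers on an arbitrary 0 to 100 scale, not NTEU results, and the function name and threshold are assumptions for the example.

```python
from statistics import mean


def weak_directions(scores, floor):
    """Return (direction, score) items scoring below `floor`, worst first."""
    below = [(direction, score) for direction, score in scores.items()
             if score < floor]
    return sorted(below, key=lambda item: item[1])


# Placeholder scores invented purely for illustration.
scores = {
    ("es", "it"): 62.0,
    ("fr", "pt"): 60.0,
    ("de", "nl"): 58.0,
    ("mt", "el"): 20.0,
}

average = mean(scores.values())
flagged = weak_directions(scores, floor=40.0)

print(average)  # 50.0
print(flagged)  # [(('mt', 'el'), 20.0)]
```

An average of 50 looks acceptable on its own, yet one direction sits far below the others. Reporting per direction makes that visible before a model is put into service.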

From research to production

What NTEU demonstrates for enterprises and public institutions

The project provides a practical blueprint for developing multilingual AI systems where data quality, specialization, evaluation and deployment control matter.

For public administrations

Build secure multilingual services for documents, portals, cross border procedures and institutional communication.

For enterprise model teams

Prepare multilingual training and evaluation data for task specific language models and controlled applications.

For language technology providers

Source, clean, align and evaluate parallel corpora for adaptation, fine tuning and quality benchmarking.

For sovereign AI programmes

Develop models and datasets under defined infrastructure, provenance, security and data governance conditions.

Public evidence

Project documentation, academic publication and European references

These sources document the project objective, consortium, engine matrix, dates and technical approach.

Academic publication

Neural Translation for the European Union Project

EAMT paper describing the direct language architecture, consortium and project period.

Read publication →

European project profile

NTEU project facts

Public project profile documenting the consortium, coordination, funding and principal objectives.

View project profile →

Technical presentation

The 506 engine architecture

Presentation explaining the 23 by 22 matrix and the objective of eliminating English pivoting.

View presentation →

Pangeanic project update

NTEU at META Forum

Project information published at the time, covering the engine farm, language combinations and public administration use.

Read project update →

European Commission

European AI translation services

Current European Commission information on secure multilingual AI tools and official language coverage.

Explore EU language tools →

Pangeanic project portfolio

European AI and language technology projects

NTEU within Pangeanic’s wider record in multilingual data, translation, privacy and European AI infrastructure.

Explore all projects →

Multilingual data and models

Build language AI around your languages, data and operational requirements

Pangeanic supports enterprises, public institutions and model developers with multilingual data collection, corpus alignment, model adaptation, evaluation, human validation and sovereign deployment.
