Automatically detect the language of a sequence, text or document

Identify the language and character encoding of documents

Our language detector can successfully be used to:


Process text before machine translation


Pre-process text to improve the quality of the input data used when training algorithms


Organize data (speech to text, documents, etc.) prior to other processes


Extract bilingual texts from online resources for machine translation


Retrieve, group, and understand relevant information (user texts, e-mails, etc.) in a multilingual environment

Pangeanic's language detector accurately determines both the language of the entire document and the language of each fragment, paragraph or section.

Our language detector combines statistical and neural technologies to obtain the best recognition results. Our own algorithm is based on a mathematically sound vector space model.


We represent the content of each document as a vector in a multidimensional space, using n-gram frequencies as the vector components. The algorithm then compares the positions of these vectors in that space to determine how similar they are.

Finally, the combined results of the algorithm are corrected using special linguistic rules developed by our team of expert linguists.
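
To make the n-gram and vector similarity idea more concrete, here is a minimal sketch in Python. It is not Pangeanic's implementation: the function names (ngram_profile, cosine_similarity, detect_language), the choice of character trigrams, the cosine measure and the tiny reference texts are all assumptions made for illustration. The neural component and the linguistic correction rules are not covered.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count character n-grams in text (hypothetical helper, trigrams by default)."""
    # Lowercase, collapse whitespace, and pad with spaces so word edges count too.
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram frequency vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy reference texts (assumption): a real system would build its profiles
# from large corpora for each supported language.
REFERENCE_TEXTS = {
    "en": "the quick brown fox jumps over the lazy dog and this is a simple "
          "english sentence with the usual words",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es una "
          "frase sencilla en español con las palabras habituales",
    "fr": "le rapide renard brun saute par-dessus le chien paresseux et ceci "
          "est une phrase simple en français avec les mots habituels",
}
PROFILES = {lang: ngram_profile(text) for lang, text in REFERENCE_TEXTS.items()}


def detect_language(text):
    """Return the best-matching language code and the score for every language."""
    query = ngram_profile(text)
    scores = {lang: cosine_similarity(query, prof) for lang, prof in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    lang, scores = detect_language("el perro perezoso salta sobre el zorro")
    print(lang, scores)
```

The same function could be applied to a whole document or to each paragraph separately, which is one simple way to obtain both a document-level and a fragment-level result.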

For evaluation purposes, we created a demo page to detect the most popular languages with a language identification accuracy of 95% to 99% (typical competitor results: 86% to 96%). The average processing speed was over 8000 KB/s.

