Most Large Language Models (such as GPT) are built on generic, freely available data from Common Crawl, Wikipedia, books, GitHub and other sources. Pangeanic has spent decades accumulating bilingual datasets for training statistical and neural machine translation systems, as well as monolingual datasets for language models. As LLMs become increasingly popular across a variety of applications, our domain-specific stock is becoming a valuable asset for building generative text solutions.
Increase the language coverage and accuracy of generative systems in more than 100 languages
Large Language Models (LLMs) are changing the world. They have proven to be a very powerful and exciting tool for NLP and related applications. Generative Pre-trained Transformer (GPT) models are revolutionizing the field of Artificial Intelligence thanks to their ability to comprehend natural language inputs with remarkable accuracy and to generate in-context answers, summaries or comparisons at a level that is generally indistinguishable from that of humans.
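As a concrete illustration of this kind of in-context generation, here is a minimal sketch using the open-source Hugging Face transformers library and the small, publicly available gpt2 checkpoint. Both are assumptions chosen purely for illustration; they are not the OpenAI models discussed in this article.

```python
# Minimal sketch: prompt a small, publicly available GPT-style model and
# print its continuation. Assumes the Hugging Face "transformers" library
# and the open "gpt2" checkpoint, used here only as an illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Machine translation quality has improved in recent years because"
result = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.7)

print(result[0]["generated_text"])
```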
At Pangeanic, we understand that GPT models are not yet perfect and that there is plenty of work to do towards the final goal of AGI (Artificial General Intelligence). As often happens when training AI algorithms, the availability of good training data is one of the key limitations. For example, the first model, GPT-1, was released in 2018 after training on just 4.5 GB of text from BookCorpus. GPT-2, released in full in November 2019, was trained on more than 40 GB of text data, and GPT-3 was released in 2020 after training on some 570 GB of filtered text data. ChatGPT, built on the GPT-3.5 series (which includes models such as "text-davinci-003" and "code-davinci-002"), is rumored to have been trained on just over 720 GB of multilingual text data.