
MONOLINGUAL DATASETS FOR LARGE LANGUAGE MODELS

Fine-tune Large Language Models and Generative Pre-trained Transformers with our domain-specific monolingual datasets

We understand that one of the challenges of Generative Pre-trained Transformers (GPT) is keeping them up to date and supplying them with domain-specific information. Our decades-long experience in building Parallel Corpora helps us enhance the human-like capabilities of your Large Language Models with a strong focus on Ethical AI, from collection to cleansing and delivery.

Talk to an expert

 

Pangeanic for Monolingual Datasets

Most Large Language Models (such as GPT) are built on generic, freely available data from CommonCrawl, Wikipedia, books, GitHub and other sources. Pangeanic has spent decades accumulating bilingual datasets for the training of statistical and neural machine translation systems, as well as monolingual datasets for language models. As LLMs become more and more popular for a variety of applications, our domain-specific stock is becoming a valuable tool for the creation of generative text solutions.

 

Increase the language coverage and accuracy of generative systems in more than 100 languages

Large Language Models (LLMs) are changing the world. They have proven to be a very powerful and exciting tool for NLP and related applications. The different Generative Pre-trained Transformers are revolutionizing the field of Artificial Intelligence, thanks to their ability to comprehend natural language inputs with remarkable accuracy and to generate in-context answers, summaries or comparisons at a level that is often indistinguishable from human output.


At Pangeanic, we understand that GPT models are not yet perfect and that there is plenty of work to do towards the final goal of AGI (Artificial General Intelligence). As often happens when training AI algorithms, the availability of good training data is one of the key limitations. For example, the first model, GPT-1, was trained on just 4.5 GB of text from BookCorpus and released in 2018. GPT-2, released in full in November 2019, was trained on more than 40 GB of text data, and GPT-3 was released in 2020 after training on about 570 GB of text data. ChatGPT, a prototype built on the GPT-3.5 series (which includes "text-davinci-003" and "code-davinci-002"), is rumored to have been trained on just over 720 GB of multilingual text data.

Architecture, parameters and training dataset per model:

GPT-1
Architecture: 12-level, 12-headed Transformer decoder (no encoder), followed by linear-softmax.
Parameters: 0.12 billion
Training dataset: BookCorpus, 4.5 GB of text from 7,000 unpublished books of various genres.

GPT-2
Architecture: GPT-1, but with modified normalization.
Parameters: 1.5 billion
Training dataset: WebText, 40 GB of text (8 million documents) from 45 million webpages upvoted on Reddit.

GPT-3
Architecture: GPT-2, but with modifications to allow larger scale.
Parameters: 175 billion
Training dataset: 570 GB of plain text, about 0.4 trillion tokens: mostly CommonCrawl (410 billion tokens, 60%) and WebText2 (19 billion tokens, 22%), plus English Wikipedia (3 billion tokens, 3%) and two books corpora (Books1, with 12 billion tokens, and Books2).

What makes Pangeanic Monolingual Datasets so special?

When it comes to training language models, the quality of the monolingual dataset is crucial. Not all of the petabytes of data in CommonCrawl have been used to train models like Kosmos, BLOOM or ChatGPT; the data has been heavily curated. Pangeanic's team helps you keep your data up to date and extend your knowledge-domain coverage, filling in the gaps where your team cannot mine or create enough data.
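As a rough, hedged illustration of what this kind of curation involves, the minimal Python sketch below filters out very short or symbol-heavy lines and removes exact duplicates. The thresholds and file names (raw_lines.txt, clean_lines.txt) are assumptions for the example; production pipelines are considerably more elaborate.

```python
import hashlib

def is_reasonable(line: str, min_chars: int = 30, max_symbol_ratio: float = 0.3) -> bool:
    """Heuristic quality gate: drop very short lines and lines dominated by non-letters."""
    text = line.strip()
    if len(text) < min_chars:
        return False
    letters = sum(ch.isalpha() for ch in text)
    return (1 - letters / len(text)) <= max_symbol_ratio

def dedupe_and_clean(in_path: str, out_path: str) -> None:
    """Write each acceptable line once, using a hash set for exact deduplication."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not is_reasonable(line):
                continue
            digest = hashlib.sha1(line.strip().lower().encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            dst.write(line.strip() + "\n")

if __name__ == "__main__":
    dedupe_and_clean("raw_lines.txt", "clean_lines.txt")  # hypothetical file names
```

Real curation adds many more stages (near-duplicate detection, language identification, PII handling, domain filtering), but the principle is the same: only a fraction of the raw crawl survives into the training set.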


Monolingual datasets can come in many different forms, from books to articles, web pages and more. More personal material, such as conversations, interviews, emails, instruction manuals and podcasts, is hard to come by or is covered by intellectual property rights. Our team creates and stocks IP-free, human-quality text, just as it has done for parallel corpora for Machine Translation systems. And we do so in multiple languages!

Trust an extended, expert language team

Our team comprises Machine Learning engineers, computational linguists, and translation professionals. Together, they provide the best possible data curation, selection and verification for each monolingual dataset for language modeling.

Talk to an expert 


Tips for a Good Monolingual Dataset

There are several points to bear in mind when building a dataset for a Large Language Model. Pangeanic has been building language models and their datasets since 2009 for its statistical Machine Translation systems, and since 2017 for its neural MT systems.

We recommend bearing in mind the following six points when training an AI model. Datasets must be accurate, scalable and relevant to your use case(s) to guarantee that the models understand the request, extract the relevant information and respond with results that are informed, accurate and acceptable to your target users.


SIZE

The size of the dataset is critical when training LLMs, as larger datasets can lead to better language modeling performance. Ideally, the dataset should be large enough to capture the diversity of the language and the text domain.


DIVERSITY

The dataset should contain a diverse range of text types, such as news articles, social media posts, and scientific papers. This helps the model to learn the nuances of the language and perform better on a variety of tasks.


QUALITY

The dataset should be of high quality, free of errors, and well-structured. A poorly structured or error-filled dataset can lead to a lower-quality model.


DOMAIN-SPECIFIC

If you are training an LLM for a specific domain, such as automotive, electrical engineering, legal or medical, it is essential to have a dataset that is specific to that domain. This will help the model learn the specific language and terminology of that field, leading to better performance on domain-specific tasks.


CONTINUITY

A dataset that spans a long period of time is ideal, as it can help capture language changes and evolution over time.


BALANCED

A good dataset should have a balance of text from different sources and domains to prevent the model from learning biases or overfitting to a specific type of text.
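To make the diversity and balance points above concrete, here is a minimal, hedged Python sketch that profiles how a corpus is spread across domains and sources before training. The JSON Lines layout and the domain/source/text field names are assumptions for the example, not a Pangeanic delivery format.

```python
import json
from collections import Counter

def corpus_profile(jsonl_path: str) -> dict:
    """Count documents and approximate tokens per domain and per source."""
    domains, sources = Counter(), Counter()
    tokens_per_domain = Counter()
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            domain = record.get("domain", "unknown")
            source = record.get("source", "unknown")
            n_tokens = len(record.get("text", "").split())  # crude whitespace token count
            domains[domain] += 1
            sources[source] += 1
            tokens_per_domain[domain] += n_tokens
    return {"docs_by_domain": dict(domains),
            "docs_by_source": dict(sources),
            "tokens_by_domain": dict(tokens_per_domain)}

if __name__ == "__main__":
    profile = corpus_profile("corpus.jsonl")  # hypothetical path
    for key, counts in profile.items():
        print(key, counts)
```

A heavily skewed profile is an early warning sign that the model may overfit to one text type or pick up source-specific biases.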

We can build monolingual datasets for specific projects with complete ownership rights

We help you build a good monolingual dataset for LLM training with large, diverse, high-quality, domain-specific, continuous, and balanced monolingual text.
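For readers who want to see how such a dataset is typically consumed, here is a hedged sketch of continued pretraining of a causal language model on a monolingual text file using the Hugging Face transformers and datasets libraries. The base model ("gpt2"), the file name ("domain_corpus.txt") and the hyperparameters are placeholders for the example, not a description of any specific customer pipeline.

```python
# Minimal sketch: continued pretraining of a causal LM on domain-specific monolingual text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})  # hypothetical file

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(output_dir="domain-gpt2", per_device_train_batch_size=2,
                         num_train_epochs=1, learning_rate=5e-5)
trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"],
                  data_collator=collator)
trainer.train()
```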

Contact us 


How are Pangeanic Monolingual Data Sets collected?

As a language services and NLP company, we have been stocking domain-specific monolingual datasets for 20 years. We continuously gather and create in-domain, high-quality pre-candidates for translation. This makes our monolingual datasets unique.

We collect data for AI training in several ways. Mostly, we use our ECO platform and apps to recruit individuals to write on given topics.


Our language team also curates non-crawlable data and cleans freely available open data, updating it and/or creating versions to make it IP-free.

Finally, we enter into agreements to buy small sections of pre-existing text files that have not been made publicly available, particularly in low-resource languages.

We are here to help you with

· Stock Monolingual Datasets

· +100 languages

· Build-to-order

Request a demo

Looking for Specific Monolingual Datasets?

With Pangeanic, not only can you define your needs to request, manage and verify datasets hassle-free online with an easy drag-and-drop UI, but you can also monitor how our team and taskers are performing and how dataset collection is progressing.

All monolingual datasets will be exhaustively verified, cleansed and annotated with the relevant metadata (age, domain, source, etc.) as per your requirements so that your algorithms grow as accurately, as strongly and as wisely as you want!
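As an illustration of what a metadata-annotated delivery can look like, the short sketch below shows one hypothetical per-document record and a light validation pass. The JSON Lines layout and the field names are assumptions for the example, not Pangeanic's actual delivery schema.

```python
import json

# Hypothetical per-document record; field names are illustrative, not a contractual schema.
EXAMPLE_RECORD = {
    "text": "Preventive engine maintenance should be performed every 10,000 km.",
    "language": "en",
    "domain": "automotive",
    "source": "commissioned writing",
    "author_age_bracket": "35-44",
    "collected": "2024-03",
}

REQUIRED_FIELDS = ("text", "language", "domain", "source")

def validate(record: dict) -> bool:
    """Check that every required metadata field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

if __name__ == "__main__":
    print(json.dumps(EXAMPLE_RECORD, ensure_ascii=False))
    print("valid:", validate(EXAMPLE_RECORD))
```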


Intellectual Property Rights on Monolingual Datasets

Our datasets offer you full ownership and full copyright, whether for monolingual text, for collected audio (speech data) or for the transcriptions used in ML training. Pangeanic follows processes so that Ethical AI is built into every step, and you can be sure it is passed on to your products. Our customers enjoy a smooth relationship with a trustworthy vendor of data services supplying Training Datasets to improve LLM, MT or ASR performance, freeing you from the hassle of generating, collecting, and processing text or audio, whilst adding valuable metadata.


Sentiment Annotation for Speech Datasets

Sentiment analysis provides very valuable insights that can often drive business decisions. Pangeanic has extensive experience building its own sentiment analysis tools. Extracting sentiment requires native command of the language and an understanding of its nuances, intricacies and, at times, irony. Our recordings can be annotated as positive, negative or neutral to provide more data for training an ML model that can understand and offer its own insights on sentiment. Our PECAT text annotation tool speeds up all sentiment annotation tasks.
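As a small, hedged illustration of how positive/negative/neutral annotations can be used downstream, the snippet below aggregates the labels of several annotators into a single training label per utterance by majority vote. The record layout and the voting rule are assumptions for the sketch, not the PECAT format.

```python
from collections import Counter

# Each utterance carries labels from several human annotators; layout is illustrative only.
annotated = [
    {"utterance": "The delivery arrived two days early, great service.",
     "labels": ["positive", "positive", "positive"]},
    {"utterance": "Sure, waiting forty minutes on hold was exactly what I needed.",
     "labels": ["negative", "negative", "neutral"]},  # irony: humans catch it, models must learn it
]

def majority_label(labels: list[str]) -> str:
    """Resolve annotator disagreement with a simple majority vote."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

for item in annotated:
    print(majority_label(item["labels"]), "-", item["utterance"])
```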

Discover Pangeanic's Monolingual Datasets for AI

We want to help you grow your business. We combine AI and human ingenuity to provide you with a custom service. Let technology do the work for you.

Talk to an expert 
