ARTIFICIAL INTELLIGENCE
Barcelona Supercomputing Center: Data-for-AI, RLHF, LLM testing and R&D
Spain
 
                
              
Overview
Pangeanic works with the Barcelona Supercomputing Center (BSC) in the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI). The collaboration runs mainly through BSC's Language Technologies Unit, which focuses on advancing NLP technologies, including large language model building and machine translation projects. Pangeanic contributes to BSC's initiatives with its expertise in AI-driven translation services and by developing tools that enhance language processing capabilities.
Task
Data Annotation for AI: Pangeanic provides data annotation services that support AI training, which can involve working with BSC to ensure high-quality datasets are available for NLP applications. Our PECAT tool supports human-in-the-loop quality control so that annotated data meets the standards needed for effective AI training.
Reinforcement Learning from Human Feedback (RLHF): Pangeanic gathers human feedback by having people rank or score the outputs generated by a model. This data is then used to create a reward model.
Bias detection mechanisms: Pangeanic works to prevent bias in language models and NLP applications by developing tools and using human annotators to create datasets for detecting hate speech in multiple languages.
R&D Projects in Machine Translation: Pangeanic has taken part in several research and development initiatives to enhance machine translation technologies. These include projects that use BSC's supercomputing resources to improve the efficiency and accuracy of translation engines, particularly by integrating AI-based machine translation techniques.
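The RLHF task above, in which humans rank model outputs to build a reward model, can be illustrated with a minimal sketch. It is not Pangeanic's or BSC's implementation. It assumes the common pairwise (Bradley-Terry style) formulation, and the function names `ranking_to_pairs` and `pairwise_loss` are hypothetical names chosen for illustration.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn one human ranking (best output first) into (preferred, rejected) pairs.

    Hypothetical helper: combinations() keeps the input order, so the first
    element of every pair is the one the annotator ranked higher.
    """
    return list(combinations(ranked_outputs, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Written in a numerically stable form. The loss is small when the reward
    model scores the preferred output higher than the rejected one.
    """
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# One annotator ranked three candidate outputs from best to worst.
pairs = ranking_to_pairs(["output_a", "output_b", "output_c"])

# Assumed reward-model scores, invented for this example only.
scores = {"output_a": 1.5, "output_b": 0.2, "output_c": -0.7}
total_loss = sum(pairwise_loss(scores[p], scores[r]) for p, r in pairs)
```

In a real training loop the scores would come from the reward model itself, and the summed loss would be minimized so that the model learns to agree with the human rankings.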
Pangeanic's solution
The Pangeanic and BSC collaboration centers on bilingual data collection: creating customized data sets of bilingual segments classified by domain and style. The project responds to the growing demand for high-quality customized data across industries. The process covers selecting domains and text styles, identifying and obtaining data sources, data crawling, data cleaning and processing, data validation and labeling, and preparing the data set for natural language processing applications. The resulting data set is a valuable resource, especially for low-resource languages like Catalan. By offering customized data sets, the project aims to raise the standard for data quality and relevance, supporting more efficient and accurate data-driven solutions.
One of the LLMs is available on Hugging Face.
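The cleaning, validation and labeling steps described in this section can be sketched as follows. This is an illustrative example under stated assumptions, not the pipeline used in the project: the function names, the length-ratio threshold of 3.0 and the record layout are all hypothetical.

```python
import unicodedata


def normalize(text):
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def clean_segments(pairs, domain, max_length_ratio=3.0):
    """Filter bilingual (source, target) pairs and label them with a domain.

    Assumed rules, for illustration only:
      - drop pairs with an empty side
      - drop pairs whose source and target are identical (likely untranslated)
      - drop pairs whose character-length ratio exceeds max_length_ratio
      - drop exact duplicates after normalization
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue
        if source == target:
            continue
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue
        if (source, target) in seen:
            continue
        seen.add((source, target))
        cleaned.append({"source": source, "target": target, "domain": domain})
    return cleaned


# Catalan-English sample pairs invented for this example.
raw_pairs = [
    ("Bon  dia", "Good morning"),
    ("Bon dia", "Good morning"),   # duplicate once whitespace is normalized
    ("", "Missing source"),        # empty side
    ("Hola", "Hola"),              # untranslated
]
segments = clean_segments(raw_pairs, domain="general")
```

A production pipeline would typically add language identification and human validation on top of rule-based filters like these.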
Results
- Joint Research and Development: Pangeanic collaborates with BSC on European R&D projects to improve NLP technologies. This includes work on machine translation, where Pangeanic's automated translation engines complement BSC's high-performance computing resources and LLMs in Spanish, English, Catalan and other languages.
- Focus on Under-Resourced Languages: Both organizations are committed to promoting inclusivity in AI by developing resources for under-resourced languages. Their collaboration aims to create open-source tools and datasets that facilitate the integration of these languages into AI models, thereby addressing gaps in digital representation.
- Technological Infrastructure: The partnership also emphasizes building a robust technological infrastructure for language technologies. This includes hosting benchmarking platforms that help define the state-of-the-art in NLP, which is crucial for academic research and industry applications.
- Ethical AI Development: Pangeanic is dedicated to ethical practices in AI development, focusing on data privacy and responsible use of AI technologies. This aligns with BSC's mission to foster technological advancements that benefit society while maintaining ethical standards.
 
The Future
The collaboration between the Barcelona Supercomputing Center (BSC) and Pangeanic represents a significant step forward in the advancement of Artificial Intelligence and Natural Language Processing technologies. As we look toward the future, this partnership is poised to make substantial contributions to several critical areas of AI development.
Advancing Language Model Development
The partnership between Pangeanic and BSC will continue to grow, and its focus on large language model building comes at a crucial time in AI evolution. With the emergence of increasingly sophisticated language models, BSC's supercomputing capabilities and Pangeanic's expertise in language technologies position the two organizations to develop more efficient and culturally aware language models. Their work on Spanish, English, Catalan, and other languages will help advance the state of the art as the AI field moves toward more linguistically diverse models.
The partnership's approach to Reinforcement Learning from Human Feedback (RLHF) reflects the same commitment to innovation. By carefully gathering human feedback and building reward models from it, the partnership aims not just to build better AI systems but to create AI that is better aligned with human values and preferences. This work comes at a time when aligning AI systems with human needs has never been more important.
We like Pangeanic's working philosophy as well as their professionalism. Their constant active listening to the client helps them improve every day, and that, in my view, is one of their great qualities.
Fast and efficient service. We found Pangeanic's creation of a database of completed translations a very useful tool for future occasions and/or similar jobs.
The quality is excellent, as usual. The source text changed many times during the translation process. Pangeanic reacted quickly to the changes, which helped us a lot.
Become our next success story?
Choose a goal: saving time and money, scaling operations, or adding AI to your translation workflows, such as knowledge management and multilingual dissemination with ECOChat. Pangeanic will provide a solution to scale the communication and distribution of your content and reach audiences worldwide.
 
        

