Cohere unveils Aya, an open source model supporting 101 languages

“Today we are launching Aya, a new open source, multilingual model and dataset to help support underrepresented languages.” The development of this large language model (LLM) is the result of an initiative led by the Canadian start-up Cohere and involving more than 3,000 researchers spread across 119 countries.

Its main asset? With 13 billion parameters, Aya outperforms existing open source models and covers 101 languages, more than double the number covered by previous models. It is available on Hugging Face. “Many communities are not represented due to language limitations of previous models,” explains the start-up. Aya far outperforms Bloom, which can generate text in 46 languages, and Jais, another model under development aimed at Arabic speakers.

A push for greater diversity and representation

Cohere claims that the dataset released alongside Aya is the most comprehensive to date, with 513 million data points and completions spanning 114 languages. It contains no fewer than 204,000 annotations curated by fluent speakers of 67 languages across a diverse set of linguistic applications. These annotations are valuable because they help AI models learn efficiently by adding context to the training data.

Aya also extends coverage to more than 50 previously unserved languages, including Somali and Uzbek. The start-up stresses that multilingual support should not be limited to Western countries. “Many languages in this collection previously had no representation in instruction-style datasets,” says the company.

Better performance than other multilingual models

Aya provides a foundation for languages that are poorly represented, or not represented at all, in natural language understanding, summarization, and translation tasks. According to Cohere, Aya generates significantly higher quality responses than mT0x, another open source model. In human evaluations by professional annotators, who compared the models' responses to instructions given in multiple languages, Aya was preferred in 77% of cases.

“It significantly outperforms top open source models, such as mT0 and Bloomz, in benchmark tests. Aya has consistently scored 75% in human reviews compared to other leading open source models, and 80-90% overall simulated win rates,” says Cohere.
