Researchers at CentraleSupélec release an open source, bilingual AI model

Research teams from the MICS laboratory at CentraleSupélec, together with several academic partners, have developed a large language model (LLM) called CroissantLLM. Available on the Hugging Face platform, the model is presented as sovereign and open source: it was developed in France and trained on the Jean Zay supercomputer.

Furthermore, the datasets are French and public, which makes it a truly open model, unlike, for example, Llama 2 or the Mistral AI models. These datasets draw on legal, administrative, cultural, commercial, scientific and translation data, explains Manuel Faysse, who participated in the development of this LLM.


A successful bilingual French-English model

Pre-trained on a set of 3,000 billion English and French tokens, CroissantLLM has 1.3 billion parameters, a far cry from the 175 billion parameters of OpenAI's GPT-3.5. Note that it was trained on as much French content as English content, which allows it to integrate and master the specificities of the French language and culture.

Based on a Llama-type architecture, the model is ultimately smaller than those published in recent months. The researchers made this choice deliberately, to encourage adoption by allowing the model to run on consumer hardware. “If we look at Hugging Face downloads, the most downloaded models are not the most capable ones (Llama2-70B, Mixtral 8x7B) but rather the smallest (Llama2-7B, Mistral 7B), which are easier and less expensive to serve and fine-tune”, notes Manuel Faysse.

Able to run on CPUs and mobile devices

The researchers therefore took the gamble of proposing a model with “few” parameters and capable of running quickly on low-end GPU servers, while maintaining high throughput and low latency. CroissantLLM can also run on CPUs or even mobile devices with decent speeds, the researchers say, making it an energy-efficient model.
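A rough weight-memory calculation, sketched below in Python, illustrates why a 1.3-billion-parameter model like CroissantLLM can plausibly fit on CPUs and mobile devices. The figures count only the weights (not activations or the KV cache) and are an illustration, not a claim from the researchers:

```python
# Back-of-the-envelope estimate of the memory taken by the weights of a
# 1.3B-parameter model (the CroissantLLM size quoted above), at common
# numeric precisions. Activations and KV cache are deliberately ignored.

def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return n_params * bytes_per_param / 1e9

for label, bytes_pp in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label}: ~{weight_footprint_gb(1.3e9, bytes_pp):.2f} GB")
# fp16 weights come to roughly 2.6 GB, and 4-bit quantization to roughly
# 0.65 GB -- within reach of a modern phone or laptop, whereas a 70B model
# at fp16 would need on the order of 140 GB.
```

The same arithmetic explains the download pattern quoted earlier: serving cost scales with parameter count, so smaller models are cheaper to deploy.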

Obviously, one should not expect reasoning, mathematical and coding abilities equal to those of other, much larger models. The team of researchers believes that “it will be perfect for more specific industrial applications, translation, or even chat capabilities where the big guns are not always needed.”


A benchmark created for evaluating model performance in French

To evaluate the model's performance in French, the researchers also launched a dedicated evaluation benchmark called FrenchBench. It is composed of a set of classification and generation tasks and covers various aspects of model performance in the French language. On the multiple-choice section of FrenchBench, which focuses on reasoning, factual knowledge and linguistic abilities, CroissantLLM achieves better performance than other models of similar size.
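Multiple-choice LLM benchmarks like the one described are commonly scored by having the model rate each candidate answer and picking the highest-scoring one. The sketch below illustrates that general pattern; it is not the actual FrenchBench harness, and the scoring function, question and choices are stand-ins for illustration:

```python
# Generic multiple-choice scoring loop: for each question, score every
# candidate continuation and take the best one; accuracy is the fraction
# of questions where that pick matches the gold answer. A real harness
# would use the LLM's log-likelihood as the `score` callable.
from typing import Callable, Sequence

def pick_answer(question: str,
                choices: Sequence[str],
                score: Callable[[str], float]) -> str:
    """Return the choice whose full text the scorer rates highest."""
    return max(choices, key=lambda c: score(f"{question} {c}"))

def accuracy(items, score) -> float:
    """items: iterable of (question, choices, gold_answer) triples."""
    correct = sum(pick_answer(q, choices, score) == gold
                  for q, choices, gold in items)
    return correct / len(items)

if __name__ == "__main__":
    # Toy stand-in scorer: counts character positions matching a reference.
    ref = "La capitale de la France est Paris"
    score = lambda text: sum(a == b for a, b in zip(text, ref))
    items = [("La capitale de la France est",
              ["Lyon", "Paris", "Nice"], "Paris")]
    print(accuracy(items, score))  # 1.0 with this toy scorer
```

Plugging a model's log-likelihood into `score` turns this into the standard evaluation loop used by most open benchmark harnesses.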

In keeping with this goal of transparency, the researchers have released codebases and dozens of checkpoints for different model sizes, training data distributions and training steps, as well as fine-tuned chat models. “We evaluate our model using the FMTI framework and validate 81% of the transparency criteria”, the researchers note.

A starting point for further research on bilingual models

Ultimately, CroissantLLM and its associated artifacts also aim to support continued research into multilingual language models and into the impact of pre-training data on a model's internal knowledge. In the meantime, businesses can access two versions of CroissantLLM, the base version and a version fine-tuned for chatbot use, from the Hugging Face platform.
