A mysterious magnet link set the entire AI community abuzz. Now the official evaluation results are finally here:
Mixtral 8x7B, the first open source MoE large model, has reached or even surpassed the level of Llama 2 70B and GPT-3.5.
(Yes, this is the same MoE architecture that GPT-4 is rumored to use.)
And because it is a sparse model, it activates only about 12.9B parameters per token to achieve this result, so its inference speed and cost are comparable to those of a 12.9B dense model.
As soon as the news came out, discussion flared up again on social media.
Andrej Karpathy, a founding member of OpenAI, showed up right away to take notes and highlight the key point: the strongest model this “European OpenAI” has revealed is only their “medium” size.
P.S. Mixtral 8×7B is merely their “small” size…
NVIDIA AI scientist Jim Fan praised:
More than a dozen new models pop up every month, but only a few truly stand the test of time, and even fewer spark this much enthusiasm.
And in this wave, not only has Mistral AI, the company behind the model, attracted a great deal of attention, it has also made MoE (Mixture of Experts) the hottest topic in the open-source AI community once again.
HuggingFace published an official MoE explainer blog post while the topic was still hot, and that post was itself shared far and wide.
It is worth noting that Mistral AI’s latest valuation has exceeded US$2 billion, an increase of more than 7 times in just 6 months…
Broadly surpasses Llama 2 70B
Speaking of which, Mistral AI is an unusual company. While the big players next door hold elaborate press conferences and then slowly roll out their models, Mistral did everything in reverse:
It first posted the magnet link for download, then submitted a PR to the vLLM project (a large-model inference acceleration tool), and only afterwards got around to publishing a technical blog post to officially announce the model.
△ The model was originally released by Aunt Jiang
First of all, the official announcement states confidently:
Mixtral 8×7B outperforms Llama 2 70B in most benchmarks, delivering 6x faster inference.
It is the strongest open-weight model with a permissive license and the best value for money.
Specifically, Mixtral is a decoder-only model built on a sparse mixture-of-experts network: in each layer, the feed-forward block picks from 8 distinct groups of parameters, with a router choosing 2 of these “experts” for every token.
That is to say, Mixtral 8×7B is not a bundle of eight 7B models; only the feed-forward blocks inside the Transformer are replicated 8 times, while the rest of the network is shared.
This is why Mixtral’s total parameter count is 46.7B rather than 56B.
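To make the routing idea concrete, here is a minimal sketch of a sparse MoE feed-forward layer in PyTorch, where a router picks the top 2 of 8 experts for each token. This is an illustrative toy, not Mistral's actual implementation; the class name and dimensions are made up.

```python
# Minimal sketch of a sparse mixture-of-experts feed-forward layer.
# Illustrative only -- not Mistral's code; dimensions are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One router (gate) plus 8 independent feed-forward "experts"
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.router(x)                # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, so only a
        # fraction of the total parameters is active per token.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoEFeedForward()
print(layer(torch.randn(2, 16, 512)).shape)    # torch.Size([2, 16, 512])
```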
Its characteristics include the following aspects:
Outperforms Llama 2 70B on most benchmarks, and is strong enough to match or beat GPT-3.5
Context window is 32k
Can handle English, French, Italian, German and Spanish
Excellent performance in code generation
Licensed under the Apache 2.0 license (free for commercial use)
The specific test results are as follows:
In addition, on hallucination-related issues, Mixtral also does better than Llama 2 70B:
It scores 73.9% vs. 50.2% on the TruthfulQA benchmark, shows less bias on the BBQ benchmark, and displays more positive sentiment than Llama 2 on BOLD.
Released alongside the base Mixtral 8×7B this time is Mixtral 8x7B Instruct. The latter has been optimized with SFT and DPO and scores 8.3 on MT-Bench, comparable to GPT-3.5 and better than other large open-source models.
Mistral has also officially announced an API service, but it is still invitation-only; uninvited users have to join the waitlist.
It is worth noting that the API comes in three tiers (a hedged example call is sketched after this list):
Mistral-tiny, the corresponding model is Mistral 7B Instruct;
Mistral-small, the corresponding model is the Mixtral 8×7B released this time;
Mistral-medium, the corresponding model has not yet been announced, but officials revealed that its score on MT-Bench is 8.6 points.
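As a rough illustration of what calling one of these tiers could look like, here is a sketch that assumes an OpenAI-style chat-completions endpoint at api.mistral.ai and uses the tier name as the model identifier; the exact URL, payload shape, and response fields are assumptions to verify against Mistral's documentation.

```python
# Hedged sketch of calling the Mistral API with the "mistral-small" tier.
# Endpoint URL and payload shape are assumed to follow the common
# OpenAI-style chat-completions convention; check Mistral's docs before use.
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",   # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small",                   # tier name from the announcement
        "messages": [{"role": "user",
                      "content": "Explain mixture-of-experts in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```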
Some netizens pulled GPT-4 into the comparison as well; the “medium” model’s score on WinoGrande (a common-sense reasoning benchmark) actually exceeds GPT-4’s.
In terms of price, input costs range from €0.14 to €2.50 per million tokens and output costs from €0.42 to €7.50 per million tokens across the three tiers, while the embedding model costs €0.10 per million tokens (€1 is about ¥7.7).
For now, the hosted version can only be tried on third-party platforms (Poe, HuggingFace, etc.).
Understands Chinese, but is reluctant to speak it
Although the official announcement does not mention Chinese support, our hands-on test (the hosted Instruct version on HuggingFace Chat) found that Mixtral already has some Chinese ability, at least at the level of comprehension.
On the generation side, Mixtral is less inclined to answer in Chinese, but if you explicitly ask for it you can get a Chinese reply, though Chinese and English are sometimes still mixed together.
Faced with deliberately silly trick questions, Mixtral’s answers were merely passable, but it at least appeared to grasp their literal meaning.
On mathematics, when given the classic chicken-and-rabbit problem, Mixtral’s answer is completely correct, from the working through to the result.
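For reference, the classic formulation of the puzzle (35 heads and 94 legs in one cage) reduces to a pair of linear equations, chickens + rabbits = 35 and 2·chickens + 4·rabbits = 94; the tiny checker below is ours, not Mixtral's output.

```python
# Classic chicken-and-rabbit puzzle: 35 heads, 94 legs in one cage.
# Solving c + r = 35 and 2c + 4r = 94 gives r = (94 - 2*35) / 2 = 12, c = 23.
def chickens_and_rabbits(heads, legs):
    rabbits = (legs - 2 * heads) // 2
    return heads - rabbits, rabbits

print(chickens_and_rabbits(35, 94))   # (23, 12)
```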
Even on more advanced problems, such as differentiating complex functions, Mixtral gives correct answers, and, more impressively, the intermediate steps are also sound.
The official announcement specifically emphasized Mixtral’s strong coding ability, so that also received our attention.
Given a hard-level LeetCode problem, the code Mixtral produced passed the test in one go:
Given an unsorted integer array nums, please find the smallest positive integer that does not appear in it.
Please implement a solution that has time complexity of O(n) and uses only constant extra space.
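For comparison, the standard way to meet the O(n) time / O(1) extra-space requirement is in-place cyclic placement: put each value v into slot v−1, then scan for the first slot holding the wrong value. The sketch below is a reference solution of ours, not the code Mixtral actually produced.

```python
def first_missing_positive(nums):
    """Return the smallest positive integer missing from nums.
    O(n) time, O(1) extra space via in-place cyclic placement."""
    n = len(nums)
    for i in range(n):
        # Keep swapping nums[i] into its "home" slot nums[nums[i] - 1]
        # while it is an in-range positive value not already placed there.
        while 1 <= nums[i] <= n and nums[nums[i] - 1] != nums[i]:
            j = nums[i] - 1
            nums[i], nums[j] = nums[j], nums[i]
    for i in range(n):
        if nums[i] != i + 1:
            return i + 1
    return n + 1

print(first_missing_positive([3, 4, -1, 1]))   # 2
print(first_missing_positive([1, 2, 0]))       # 3
```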
But as we kept asking questions, Mixtral’s answers inadvertently revealed that it may have been trained specifically on LeetCode, and on the Chinese version of the site at that.
To get a more realistic picture of Mixtral’s coding ability, we instead asked it to write a small utility: a web calculator in JavaScript.
After several rounds of adjustment, the basic four arithmetic operations work, although the button layout is a bit odd.
We also found that if new requirements keep being added within the same conversation, Mixtral’s performance can degrade, with problems such as garbled code formatting; it returns to normal once a new conversation is started.
In addition to the API and hosted versions, Mistral AI also offers the model for download: it can be fetched via the magnet link posted on 𝕏 or through Hugging Face and then deployed locally.
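As a rough sketch of what local use can look like, here is a minimal example loading the Instruct checkpoint with Hugging Face transformers; the repo id and generation settings are our assumptions, and the unquantized 16-bit weights need on the order of 90GB of memory, so quantized builds (e.g. via llama.cpp) are the practical route on most machines.

```python
# Sketch of loading Mixtral locally with Hugging Face transformers.
# Assumes a recent transformers release with Mixtral support and enough
# memory (~90GB at 16-bit precision); repo id and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # halve memory vs. float32
    device_map="auto",           # spread layers across available devices
)

prompt = "[INST] Write a haiku about sparse experts. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```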
On 𝕏, many netizens have already run Mixtral on their own devices and reported performance numbers.
On an Apple M3 Max with 128GB of unified memory, running Mixtral at 16-bit floating-point precision uses about 87GB of memory and reaches about 13 tokens per second.
Meanwhile, another netizen running it through llama.cpp on an M2 Ultra reported a speed of 52 tokens per second.
Seeing this, how would you rate the strength of Mistral AI’s model?
Many netizens are already excited:
“OpenAI has no moat” looks certain to become a reality…
You know, Mistral AI was just established in May this year.
In just six months it has reached a valuation of US$2 billion and produced a model that has amazed the entire AI community.
More importantly, Princeton PhD student Tianle Cai analyzed the weight correlations between Mistral 7B and Mixtral 8x7B and showed that the latter successfully reuses the former’s weights.
Later, netizens noticed that a founder of Mistral AI had personally confirmed it: the MoE model was indeed created by duplicating the 7B base model eight times and then training it further.
With such models free for commercial use, the whole open-source community and new startups can build on them to push large MoE models forward, just like the storm Llama has already set off.
As an onlooker just here for the show, all I can say is:
Reference links:
(1)https://mistral.ai/news/mixtral-of-experts/
(2)https://mistral.ai/news/la-plateforme/
(3)https://huggingface.co/blog/mixtral#about-the-name