GPT-4 Takes the Crown After 750,000 Rounds of One-on-One Model Battles, with Llama 3 Finishing Fifth

New test results involving Llama 3 are out: LMSYS, the large-model evaluation community, has released a large-model leaderboard on which Llama 3 ranks fifth overall and ties with GPT-4 for first place in the English category.

Unlike other benchmarks, this leaderboard is built from one-on-one model battles, with evaluators from across the internet posing their own questions and doing the scoring themselves.


In the end, Llama 3 ranked fifth on the list, behind only three different versions of GPT-4 and Claude 3 Opus (the largest model in the Claude 3 family).

On the English-only leaderboard, Llama 3 overtook Claude and tied with GPT-4. Yann LeCun, Meta's chief AI scientist, was delighted with the result, retweeting the announcement and adding a “Nice”.

Soumith Chintala, the creator of PyTorch, also said excitedly that the results are incredible and that he is proud of Meta.


The 400B version of Llama 3 is not even out yet, and it took fifth place on the strength of its 70B-parameter version alone…

I remember that when GPT-4 was released last March, matching its performance seemed nearly impossible.

The democratization of AI right now is incredible, and I'm very proud of my colleagues at Meta AI for making it so successful.

So, what specific results does this list show?

Nearly 90 models battled for 750,000 rounds

As of the latest release of the list, LMSYS has collected nearly 750,000 one-on-one battle results covering 89 models.

Among them, Llama 3 has fought 12,700 battles; GPT-4 appears in several different versions, the most active of which has logged 68,000 battles.

The chart below shows the number of battles and the win rates of several popular models; neither figure counts draws.

The leaderboard itself is split into an overall ranking and several sub-rankings. GPT-4-Turbo ranks first overall, tied with the earlier 1106 version and with Claude 3 Opus.

Another GPT-4 version (0125) comes next, followed closely by Llama 3. More interestingly, the newer 0125 does not perform as well as the older 1106.

On the English-only leaderboard, Llama 3 ties outright with the two leading GPT-4 versions and even surpasses the 0125 version.

First place in the Chinese-language ranking is shared by Claude 3 Opus and GPT-4-1106, while Llama 3 falls outside the top 20.

Beyond language ability, the leaderboard also ranks long-text and coding performance, where Llama 3 again places near the top. But what exactly are LMSYS's “rules of the game”?

A large-model evaluation anyone can take part in

This is a large-model test open to everyone: the questions and the judging criteria are decided by the participants themselves. The “competition” runs in two modes: battle and side-by-side.

In battle mode, once a question is entered on the test interface, the system randomly calls two models from its library; the tester does not know which ones have been selected, and the interface shows only “Model A” and “Model B”.

After both models produce their answers, the evaluator chooses which one is better or declares a tie; and if neither answer is up to expectations, there are options for that as well.

Only after a choice is made are the models' identities revealed. In side-by-side mode, the user picks the specific models to pit against each other, and the rest of the process is the same as in battle mode.

However, only votes cast in the anonymous battle mode are counted, and if a model accidentally reveals its identity during the conversation, that result is invalidated.
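
As a rough sketch of this flow (the model names, field names, and identity-leak check below are illustrative assumptions, not LMSYS's actual implementation):

    import random
    from dataclasses import dataclass

    # Illustrative pool; the real library holds close to 90 models.
    MODELS = ["gpt-4-turbo", "llama-3-70b-instruct", "claude-3-opus"]

    @dataclass
    class Vote:
        model_a: str
        model_b: str
        verdict: str           # "model_a", "model_b", "tie", or "both_bad"
        anonymous: bool        # True in battle mode, False in side-by-side
        identity_leaked: bool  # a model revealed its own name in its answer

    def sample_battle_pair():
        """Battle mode: draw two distinct models at random; the user only sees
        them labelled 'Model A' and 'Model B' until after voting."""
        return random.sample(MODELS, 2)

    def counts_toward_ranking(vote: Vote) -> bool:
        """Keep only anonymous battle votes, and discard a vote if a model
        exposed its identity during the conversation."""
        return vote.anonymous and not vote.identity_leaked

    model_a, model_b = sample_battle_pair()
    vote = Vote(model_a, model_b, verdict="tie", anonymous=True, identity_leaked=False)
    print(counts_toward_ranking(vote))  # True -> this vote feeds the rankings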

Based on each model's win rate against the other models, a chart like this can be drawn:

(Schematic; from an older version of the leaderboard)

The final ranking is obtained by feeding the win-rate data through the Elo rating system and converting it into scores.

The Elo rating system is a method for calculating the relative skill levels of players, devised by the Hungarian-American physics professor Arpad Elo.

In LMSYS's setup, every model starts with a rating (R) of 1000, and the expected win rate (E) is then computed from the two models' current ratings using a fixed formula.
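
That formula appears only as an image in the original post; in the standard Elo form it describes, the expected score of model A against model B is:

    E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}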

As testing continues, the ratings are corrected using the actual score (S), which takes one of three values: 1, 0, or 0.5, corresponding to a win, a loss, or a draw.

The correction step is given by the following formula, where K is a coefficient that the testers tune according to the circumstances.
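
This formula is likewise an image in the original; in standard Elo notation the update it refers to is:

    R_A' = R_A + K \cdot (S_A - E_A)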

Once all valid votes have been fed into this calculation, each model's Elo score is obtained.

In practice, however, the LMSYS team found this algorithm was not stable enough, so they applied statistical methods to correct it.

They used the bootstrap method, resampling the battle records repeatedly to obtain more stable scores and to estimate confidence intervals.
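
A minimal sketch of what such a bootstrap might look like, assuming a flat list of (model_a, model_b, score_a) battle records and the standard online Elo update; the K value, function names, and data layout are illustrative assumptions rather than LMSYS's actual code:

    import random
    from collections import defaultdict

    K = 4  # illustrative coefficient; LMSYS tunes K themselves

    def expected(r_a, r_b):
        """Standard Elo expected score of A against B."""
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    def compute_elo(battles, init=1000):
        """One online Elo pass over (model_a, model_b, score_a) records,
        where score_a is 1 (A wins), 0 (A loses) or 0.5 (draw)."""
        ratings = defaultdict(lambda: init)
        for a, b, s_a in battles:
            e_a = expected(ratings[a], ratings[b])
            ratings[a] += K * (s_a - e_a)
            ratings[b] += K * ((1 - s_a) - (1 - e_a))
        return ratings

    def bootstrap_elo(battles, rounds=100):
        """Resample the battle log with replacement, recompute Elo each time,
        and collect the scores; medians and confidence intervals follow."""
        samples = defaultdict(list)
        for _ in range(rounds):
            resampled = random.choices(battles, k=len(battles))
            for model, score in compute_elo(resampled).items():
                samples[model].append(score)
        return samples

    battles = [("gpt-4-turbo", "llama-3-70b-instruct", 0.5),
               ("gpt-4-turbo", "claude-3-opus", 1.0)]
    medians = {m: sorted(s)[len(s) // 2] for m, s in bootstrap_elo(battles).items()}
    print(medians)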

These corrected Elo scores are what the leaderboard rankings are ultimately based on.

One More Thing

Llama 3 can already be run on the large-model inference platform Groq (not to be confused with Musk's Grok).

The platform's biggest selling point is speed: it has previously run the Mixtral model at nearly 500 tokens per second.

It is just as quick with Llama 3: in measurements, the 70B version runs at about 300 tokens per second, and the 8B version comes close to 800.

Reference links:

  • (1) https://lmsys.org/blog/2023-05-03-arena/
  • (2) https://chat.lmsys.org/?leaderboard
  • (3) https://twitter.com/lmsysorg/status/1782483699449332144
