Tsinghua University releases a new evaluation report on leading LLMs: GPT-4 and Claude-3 remain on top, GLM-4 stands out among domestic models

Who is the strongest player in the large-model melee? Tsinghua University has conducted a comprehensive capability evaluation of 14 domestic and foreign LLMs. Among them, GPT-4 and Claude-3 remain the undisputed leaders, while in China, GLM-4 and Wenxin Yiyan 4.0 have already entered the first echelon.

In 2023's "war of a hundred models," practitioners launched a wide variety of models: some trained from scratch, some fine-tuned from open-source models; some general-purpose, some industry-specific. How to reasonably evaluate the capabilities of these models has become a key issue.


Although there are multiple model capability evaluation lists at home and abroad, their quality is uneven and their rankings vary significantly. This is mainly because the evaluation data and testing methods are not yet mature and scientific. We believe that a good evaluation method should be open, dynamic, scientific and authoritative.

To provide an objective and scientific evaluation standard, Tsinghua University's Foundation Model Research Center, jointly with Zhongguancun Laboratory, developed SuperBench, a comprehensive capability evaluation framework for large models, aiming to promote the healthy development of large-model technology, applications, and ecosystem.

Recently, the March 2024 version of the “SuperBench Large Model Comprehensive Capability Evaluation Report” was officially released.

The evaluation covered 14 representative models from home and abroad. For closed-source models offering both an API and a web interface, the access mode with the higher score was selected for evaluation.


Based on the evaluation results, the following main conclusions can be drawn:

● Overall, foreign models such as the GPT-4 series and Claude-3 still lead across multiple capabilities; the leading domestic models GLM-4 and Wenxin Yiyan 4.0 perform well, approaching the level of international first-class models, and the gap is gradually narrowing.

● Among foreign models, the GPT-4 series performs consistently, while Claude-3 also demonstrates strong all-round strength, taking first place in both the semantic understanding and agent evaluations and ranking among the world's first-class models.

● Among domestic models, GLM-4 and Wenxin Yiyan 4.0 performed best in this evaluation and lead the domestic field; Tongyi Qianwen 2.1, Abab6, the moonshot web version, and qwen1.5-72b-chat followed closely and also did well in some capability evaluations. However, in code writing and acting as an agent, domestic models still lag significantly behind international first-class models and have more work to do.

Evolution of large model capability evaluation & SuperBench

Since the birth of large language models, evaluation has been an integral part of large-model research. As the field develops, the focus of evaluation keeps shifting. According to our research, large model capability evaluation has roughly gone through the following five stages:

2018-2021: Semantic Evaluation Phase

Early language models mainly focused on natural language understanding tasks (e.g., word segmentation, part-of-speech tagging, syntactic analysis, information extraction), and the corresponding evaluations mainly examined a model's semantic understanding of natural language. Representative work: BERT, GPT, T5, etc.

2021-2023: Code Evaluation Phase

As language model capabilities grew, code models with greater application value gradually emerged. Researchers found that models trained on code generation tasks showed stronger logical reasoning in tests, and code models became a research hotspot. Representative work: Codex, CodeLLaMa, CodeGeeX, etc.

2022-2023: Alignment Evaluation Phase

With the wide application of large models across fields, researchers found a mismatch between the continuation-style training objective and the instruction-style way models are actually used. Understanding human instructions and aligning with human preferences gradually became key goals of large-model training optimization. Aligned models can accurately understand and respond to user intent, laying the foundation for the widespread application of large models. Representative work: InstructGPT, ChatGPT, GPT-4, ChatGLM, etc.

2023-2024: Agent Evaluation Phase

Building on instruction following and preference alignment, researchers gradually discovered large models' ability to serve as an intelligent hub that decomposes, plans, makes decisions on, and executes complex tasks. Using large models as agents to solve practical problems is also regarded as an important direction toward artificial general intelligence (AGI). Representative work: AutoGPT, AutoGen, etc.

2023-future: Security Evaluation Phase

As model capabilities improve, evaluating, supervising, and strengthening model safety and values has gradually become a focus for researchers. Strengthening the assessment of potential risks and ensuring that large models remain controllable, reliable, and trustworthy are key issues for the "sustainable development of AI."

Therefore, to comprehensively evaluate the capabilities of large models, the SuperBench evaluation system covers five evaluation categories (semantics, code, alignment, agents, and security) with 28 subcategories.

PART 1 Semantic Evaluation

ExtremeGLUE is a high-difficulty collection drawn from 72 traditional Chinese-English bilingual datasets. It aims to provide a more stringent evaluation standard for language models, adopts a zero-shot CoT evaluation method, and scores model output according to the specific requirements of each question.

First, preliminary testing was conducted using more than 20 language models, including GPT-4, Claude, Vicuna, WizardLM, and ChatGLM.

Then, based on the combined performance of all models, the most difficult 10% to 20% of the data in each category was selected and combined into a "high-difficulty traditional dataset."
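As a rough illustration of how such a hardest-subset selection could be implemented, the sketch below assumes per-question correctness records from the preliminary test run; the field names and the 15% cut-off are illustrative, not the report's actual pipeline.

```python
from collections import defaultdict

def hardest_subset(records, fraction=0.15):
    """Select the hardest questions per category.

    records: list of dicts such as
        {"category": "math", "question_id": "q1", "model": "gpt-4", "correct": True}
    Returns, for each category, the question IDs that the preliminary models
    answered correctly least often.
    """
    # Aggregate correctness per (category, question).
    stats = defaultdict(lambda: [0, 0])  # (category, qid) -> [num_correct, num_answered]
    for r in records:
        key = (r["category"], r["question_id"])
        stats[key][0] += int(r["correct"])
        stats[key][1] += 1

    # Group questions by category together with their cross-model accuracy.
    by_category = defaultdict(list)
    for (category, qid), (num_correct, num_answered) in stats.items():
        by_category[category].append((num_correct / num_answered, qid))

    # Keep the lowest-accuracy (hardest) fraction in each category.
    hardest = {}
    for category, scored in by_category.items():
        scored.sort()
        k = max(1, int(len(scored) * fraction))
        hardest[category] = [qid for _, qid in scored[:k]]
    return hardest
```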

Evaluation method & process

Evaluation method: 72 traditional Chinese-English bilingual datasets were collected, and the most difficult questions were extracted to form a four-dimensional evaluation dataset. The zero-shot CoT evaluation method is adopted. Each dimension is scored as the percentage of questions answered correctly, and the final total score is the average across dimensions.

Evaluation process: The model's zero-shot CoT outputs are scored according to the form and requirements of each question.
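The scoring arithmetic described above (per-dimension accuracy, then a simple average across dimensions) can be sketched as follows; the dimension names in the toy example are placeholders.

```python
def dimension_score(results):
    """results: list of booleans, one per question (answered correctly or not)."""
    return 100.0 * sum(results) / len(results)

def overall_score(per_dimension_results):
    """per_dimension_results: dict mapping dimension name -> list of booleans."""
    scores = {dim: dimension_score(r) for dim, r in per_dimension_results.items()}
    total = sum(scores.values()) / len(scores)
    return scores, total

# Toy example with made-up results for two dimensions.
scores, total = overall_score({
    "math": [True, False, True, True],
    "reading": [True, True, False, False],
})
print(scores, total)  # {'math': 75.0, 'reading': 50.0} 62.5
```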

Overall performance:

In the semantic understanding evaluation, the models fall into three echelons. The first echelon scores above 70 points and includes Claude-3, GLM-4, Wenxin Yiyan 4.0, and the GPT-4 series models.

Among them, Claude-3 scored 76.7 and ranked first; the domestic models GLM-4 and Wenxin Yiyan 4.0 surpassed the GPT-4 series to rank second and third, though still about 3 points behind Claude-3.

Classification performance:

● Knowledge-Common Sense: Claude-3 leads with 79.8 points; the domestic model GLM-4 performs outstandingly, surpassing the GPT-4 web version to rank second; Wenxin Yiyan 4.0 performs poorly, 12.7 points behind the top-ranked Claude-3.

● Knowledge-Science: Claude-3 remains in the lead and is the only model scoring above 80; Wenxin Yiyan 4.0, the GPT-4 series models, and GLM-4 all score above 75, forming the first tier.

● Mathematics: Claude-3 and Wenxin Yiyan 4.0 tied for first place with 65.5 points; GLM-4 ranks third, ahead of the GPT-4 series models. The scores of the other models cluster around 55 points; current large models still have considerable room for improvement in mathematics.

● Reading comprehension: Scores are relatively evenly distributed. Wenxin Yiyan 4.0 surpassed GPT-4 Turbo, Claude-3, and GLM-4 to take the top spot.

PART 2 Code Evaluation

NaturalCodeBench (NCB) is a benchmark for evaluating models' coding capabilities. Traditional coding evaluation datasets mainly test a model's problem-solving ability on data structures and algorithms, whereas NCB focuses on the model's ability to write correct and usable code in real programming application scenarios.

All questions are filtered from user queries to an online service, so their styles and formats are more diverse. They cover seven fields, including databases, front-end development, algorithms, data science, operating systems, artificial intelligence, and software engineering, and can be broadly divided into two categories: algorithmic and functional-requirement.

The questions span two programming languages (Java and Python) and two question languages (Chinese and English). Each question comes with 10 human-written and human-corrected test cases: 9 test the functional correctness of the generated code, and the remaining 1 is used for code alignment.
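For concreteness, a single NCB-style question record might be represented roughly as below; the field names are hypothetical and only mirror the structure described above.

```python
ncb_question = {
    "id": "ncb_0001",
    "domain": "data science",          # one of the seven fields listed above
    "question_language": "zh",          # Chinese or English problem statement
    "programming_language": "python",   # Java or Python
    "problem": "...",                   # the natural-language programming task
    "test_cases": [
        # 10 human-written and corrected cases: 9 check functional correctness,
        # 1 is reserved for code alignment as described above (rest omitted here).
        {"purpose": "functional", "input": [...], "expected_output": ...},
        {"purpose": "alignment", "input": [...], "expected_output": ...},
    ],
}
```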

Evaluation method & process

Evaluation method: Run the function generated by the model, compare its output with the expected results of the prepared test cases for scoring, and finally compute the pass rate of the generated code, pass@1.

Evaluation process: Given a problem, unit-test code, and test cases, the model first generates a target function from the problem. The generated function is then run with each test case's input as its arguments, and its output is compared with the test case's expected output. Matching outputs score points; mismatches or runtime errors do not.
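A minimal sketch of this execution-based check, assuming each generated solution has already been wrapped into a callable function and each test case is an (input arguments, expected output) pair; the real harness also sandboxes execution and handles timeouts, which is omitted here.

```python
def run_test_cases(candidate_fn, test_cases):
    """Return the fraction of test cases whose output matches the expectation."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases)

def pass_at_1(candidate_fns, per_problem_test_cases):
    """One generated function per problem (a single sample, hence pass@1).

    A problem counts as solved only if every functional test case passes.
    """
    solved = sum(
        run_test_cases(fn, cases) == 1.0
        for fn, cases in zip(candidate_fns, per_problem_test_cases)
    )
    return solved / len(candidate_fns)
```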

Overall performance:

In the code-writing evaluation, there is still a clear gap between domestic and world-class models. The GPT-4 series models and Claude-3 are clearly ahead in code pass rate; among domestic models, GLM-4, Wenxin Yiyan 4.0, and iFlytek Spark 3.5 stand out, each with a composite score above 40 points.

However, even the best-performing models achieve a one-shot code pass rate of only about 50%; code generation remains a challenge for current large models.

Classification performance:

Across the four dimensions (Python, Java, Chinese, English), the GPT-4 series models take the top spot, reflecting strong and comprehensive coding capabilities; apart from Claude-3, the other models lag noticeably behind.

● English code instructions: GPT-4 Turbo scores 6.8 and 1.5 points higher than Claude-3 on Python and Java questions respectively, and 14.2 and 5.1 points higher than GLM-4. The gap between domestic and international models on English code instructions is obvious.

● Chinese code instructions: GPT-4 Turbo scores 3.9 points higher than Claude-3 on Python and 2.3 points lower on Java, so the difference is small. Against GLM-4, GPT-4 Turbo scores 5.4 and 2.8 points higher on Python and Java respectively; a certain gap remains between domestic models and international first-class models in Chinese coding ability.

PART 3 Alignment Evaluation

AlignBench aims to comprehensively evaluate how well large models align with human intent in the Chinese domain. It measures a model's instruction following and helpfulness by using a model-based judge to score answer quality.

It includes 8 dimensions, such as basic tasks and professional abilities, uses real and difficult questions, and has high-quality reference answers. Excellent performance requires models to have comprehensive capabilities, understand instructions, and generate helpful answers.

The "Chinese Reasoning" dimension focuses on large models' performance in Chinese mathematical calculation and logical reasoning. This part is mainly built from real user questions with carefully written reference answers and covers several fine-grained areas:

● Mathematical calculation covers calculations and proofs in elementary mathematics, advanced mathematics, and everyday arithmetic.

● Logical reasoning includes common deductive reasoning, common sense reasoning, mathematical logic, brain teasers and other questions, fully examining the performance of the model in scenarios that require multi-step reasoning and common reasoning methods.

The "Chinese Language" section focuses on the general performance of large models on Chinese language tasks. Specifically, it covers six directions: basic tasks, Chinese understanding, comprehensive Q&A, text writing, role playing, and professional abilities.

Most of the data in these tasks are obtained from real user questions, and answers are written and corrected by professional annotators, which fully reflects the performance level of large models in text applications from multiple dimensions. Specifically:

● Basic tasks examine the model’s ability to generalize to user instructions in conventional NLP task scenarios;

● Chinese understanding emphasizes the model's grasp of traditional Chinese culture and the origins and structure of Chinese characters;

● Comprehensive Q&A focuses on the model’s performance in answering general open questions;

● Text writing reflects how well the model performs on writing tasks;

● Role playing is an emerging task that examines the model's ability to hold a dialogue in a persona specified by the user's instructions;

● Professional ability studies the mastery and reliability of large models in professional knowledge areas.

Evaluation method & process

Evaluation method: Answer quality is scored by a strong judge model (such as GPT-4) to measure the evaluated model's instruction following and helpfulness. Scoring dimensions include factual correctness, meeting user needs, clarity, completeness, and richness, with the exact dimensions varying by task type. Based on these, a composite score is given as the final score for the answer.

Evaluation process: The evaluated model generates answers to the questions, and GPT-4 performs a detailed analysis, evaluation, and scoring of each generated answer against the reference answer provided by the test set.
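The judging step might look roughly like the sketch below, where call_judge_model stands in for an API call to the judge (e.g. GPT-4); the prompt wording and dimension list are illustrative, not the actual AlignBench rubric.

```python
import json

JUDGE_PROMPT = """You are grading a model answer against a reference answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Rate the answer on factual correctness, meeting user needs, clarity,
completeness and richness (1-10 each), then give an overall score (1-10).
Respond with JSON only: {{"dimensions": {{...}}, "overall": <int>}}"""

def judge_answer(question, reference, answer, call_judge_model):
    """call_judge_model: a function that sends a prompt to the judge model
    (e.g. GPT-4 via its API) and returns the raw text of its reply."""
    reply = call_judge_model(
        JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    )
    # Assumes the judge complies with the JSON-only instruction.
    return json.loads(reply)["overall"]
```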

Overall performance:

In the human-alignment evaluation, the GPT-4 web version takes the top spot; Wenxin Yiyan 4.0 and GPT-4 Turbo follow closely with the same score (7.74). The domestic model GLM-4 also performs well, surpassing Claude-3 to rank fourth, while Tongyi Qianwen 2.1 ranks sixth, slightly below Claude-3; all are in the first echelon of large models.

Classification performance:

Overall scores on Chinese reasoning are significantly lower than on Chinese language; the reasoning ability of current large models needs strengthening across the board:

● Chinese reasoning: The GPT-4 series models perform best, slightly ahead of the domestic model Wenxin Yiyan 4.0, with a clear gap to the remaining models.

● Chinese language: Domestic models take the top four places: the KimiChat web version (8.05 points), Tongyi Qianwen 2.1 (7.99 points), GLM-4 (7.98 points), and Wenxin Yiyan 4.0 (7.91 points), surpassing world-class models such as the GPT-4 series and Claude-3.

Detailed analysis of each category:

Chinese reasoning:

● Mathematical calculation: The GPT-4 series models take the top two spots; the domestic models Wenxin Yiyan 4.0 and Tongyi Qianwen 2.1 score above Claude-3 but still trail the GPT-4 series by some margin.

● Logical reasoning: Models scoring 7 or above form the first tier, led by the domestic model Wenxin Yiyan 4.0; the GPT-4 series models, Claude-3, GLM-4, and Abab6 are also in this first echelon.

Chinese language:

● Basic tasks: GLM-4 takes the top spot, with Tongyi Qianwen 2.1, Claude-3, and the GPT-4 web version occupying second to fourth place. Among other domestic models, Wenxin Yiyan 4.0 and the KimiChat web version also do well, surpassing GPT-4 Turbo.

● Chinese understanding: Domestic models perform well overall, taking the top four places. Wenxin Yiyan 4.0 leads clearly, 0.41 points ahead of second-placed GLM-4. Among foreign models, Claude-3 performs acceptably, ranking fifth, but the GPT-4 series models do poorly, landing in the middle to lower ranks and trailing first place by more than 1 point.

● Comprehensive Q&A: All the large models perform well, with 6 models exceeding 8 points. The GPT-4 web version and the KimiChat web version share the highest score; GLM-4 and Claude-3 have identical scores close to the top and are tied for third.

● Text writing: The KimiChat web version performs best and is the only model scoring 8 or above; GPT-4 Turbo ranks second.

● Role playing: The domestic models Abab6, Tongyi Qianwen 2.1, and the KimiChat web version take the top three places, all scoring above 8 points and surpassing world-class models such as the GPT-4 series and Claude-3.

● Professional abilities: GPT-4 Turbo takes first place, and the KimiChat web version surpasses the GPT-4 web version to win second place. Among other domestic models, GLM-4 and Tongyi Qianwen 2.1 also perform well, tying for fourth place.

PART 4 Agent Evaluation

AgentBench is a comprehensive benchmarking toolkit for evaluating the performance of language models as agents in a variety of real-world environments, including operating systems, games, and the web.

Code environment: This section focuses on the potential application of LLMs in assisting human interaction with computer code interfaces. With their strong coding and reasoning capabilities, LLMs are expected to become powerful agents that help people interact with computer interfaces more effectively. To evaluate LLM performance in this regard, three representative environments focused on coding and reasoning are introduced. These environments provide practical tasks and challenges that test LLMs' ability to handle a variety of computer-interface and code-related tasks.

Game environment: The game environment is the part of AgentBench designed to evaluate LLM performance in game scenarios. In games, agents usually need strong strategy design, instruction following, and reasoning capabilities. Unlike the coding environment, tasks in the game environment do not require professional coding knowledge but instead a comprehensive grasp of common sense and world knowledge. These tasks challenge LLMs' abilities in commonsense reasoning and strategy formulation.

Web environment: The web is the main interface through which people interact with the online world, so evaluating agent behavior in complex web environments is crucial to their development. Here, LLMs are evaluated practically on two existing web-browsing datasets. These environments are designed to challenge LLMs' abilities in web-interface manipulation and information retrieval.

Evaluation method & process

Evaluation method: The model interacts with a preset environment over multiple turns to complete each specific task. The scenario-guessing subcategory uses GPT-3.5-Turbo to score the final answer; the other subcategories score the model's task completion according to fixed rules.

Evaluation process: The model interacts with the simulated environment, and its results are then scored either by rules or by GPT-3.5-Turbo.

Scoring rules: Because the score distributions of the subtasks differ, a total score computed as a plain average would be heavily skewed by extreme values, so each subtask score needs to be normalized. As shown in the table below, the Weight⁻¹ value of each subtask is the normalization weight: it is the average score that the models initially tested on AgentBench achieved on that subtask. To compute the total score, each subtask score is divided by its Weight⁻¹ and the results are averaged; by this calculation, a model of average ability should end up with an overall score of about 1 (a brief sketch of this calculation follows the column definitions below).

● SR: success rate

● #Avg.Turn: the average number of interaction turns needed to solve a single problem

● #Dev, #Test: the expected total number of interaction turns for a single model on the development set and the test set

● Weight⁻¹: the reciprocal of each subtask score's weight when computing the total score
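A minimal sketch of this normalization, assuming the Weight⁻¹ values are available as a lookup table; the subtask names and numbers below are placeholders rather than the published table.

```python
def agent_total_score(raw_scores, inverse_weights):
    """raw_scores, inverse_weights: dicts keyed by subtask name.

    Each subtask score is divided by its Weight^-1 (the average score of the
    initially tested models on that subtask), and the normalized values are
    averaged, so a model of average ability lands near an overall score of 1.
    """
    normalized = [raw_scores[task] / inverse_weights[task] for task in raw_scores]
    return sum(normalized) / len(normalized)

# Toy example with placeholder subtasks and numbers.
inverse_weights = {"os": 12.0, "db": 25.0, "web_shopping": 40.0}
print(agent_total_score({"os": 18.0, "db": 20.0, "web_shopping": 44.0},
                        inverse_weights))  # ~1.13
```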

Overall performance:

In the agent capability evaluation, domestic models as a whole clearly lag behind international first-class models. Claude-3 and the GPT-4 series models occupy the top three places; GLM-4 performs best among domestic models but still trails the list-topping Claude-3 by a wide margin.

Models at home and abroad alike perform poorly on this capability, mainly because agent tasks place far higher demands on models than other tasks, and most existing models do not yet have strong agent capabilities.

Classification performance:

Apart from online shopping, where the domestic model GLM-4 takes first place, the top spot in every other category is held by Claude-3 or a GPT-4 series model, reflecting their relatively strong agent abilities; domestic models still need continued improvement.

● In embodied intelligence (ALFWorld), the top three places all go to Claude-3 and the GPT-4 series models, and this is where the gap with domestic models is largest.

● In the database (DB) and knowledge graph (KG) dimensions, the domestic model GLM-4 makes the top three in both, though a certain gap to the top two remains.

PART 5 Security Evaluation

SafetyBench is the first comprehensive benchmark that evaluates the safety of large language models through multiple-choice questions, covering dimensions such as offensiveness, bias and discrimination, physical health, mental health, illegal activities, ethics, and privacy and property.

Evaluation method & process

Evaluation method: Thousands of multiple-choice questions are collected for each dimension, and the model's understanding and mastery of each safety dimension is tested by having it choose answers. Evaluation uses few-shot generation: answers are extracted from the generated outputs and compared against the ground-truth answers. Each dimension's score is the percentage of questions answered correctly, and the final total score is the average across dimensions. To account for refusals, a refusal score and a non-refusal score are computed separately: the former treats refused questions as wrong answers, while the latter excludes refused questions from the question pool.

Evaluation process: For each question, the answer is extracted from the model's few-shot generated output and compared with the ground-truth answer.
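The scoring described above might be sketched as follows, assuming each record already carries the answer letter extracted from the model's few-shot output (with None standing for a refusal); the field names are illustrative.

```python
from collections import defaultdict

def safety_dimension_scores(records):
    """records: list of dicts such as
        {"dimension": "privacy", "extracted": "B", "gold": "B"}
    where "extracted" is None when the model refused to answer.
    Returns (refusal_score, non_refusal_score) per dimension, in percent.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r["dimension"]].append(r)

    scores = {}
    for dim, items in buckets.items():
        correct = sum(r["extracted"] == r["gold"] for r in items)
        answered = [r for r in items if r["extracted"] is not None]
        correct_answered = sum(r["extracted"] == r["gold"] for r in answered)
        # Refusal score: refusals count as wrong answers.
        refusal_score = 100.0 * correct / len(items)
        # Non-refusal score: refused questions are removed from the pool.
        non_refusal_score = 100.0 * correct_answered / max(len(answered), 1)
        scores[dim] = (refusal_score, non_refusal_score)
    return scores
```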

Overall performance:

In the security capability evaluation, the domestic model Wenxin Yiyan 4.0 performs outstandingly, surpassing world-class models such as the GPT-4 series and Claude-3 to take the highest score (89.1 points). Among the other domestic models, GLM-4 matches Claude-3's score and ties with it for fourth place.

Classification performance:

Across the five categories of illegal activities, physical health, offensiveness, mental health, and privacy and property, the models trade wins and losses. In ethics and in bias and discrimination, however, the models' scores differ more widely, and the rankings stay relatively consistent with the overall scores.

● Ethics: Wenxin Yiyan 4.0 overtakes Claude-3 to rank first; the domestic model GLM-4 also performs well, surpassing GPT-4 Turbo to place in the top three.

● Bias and discrimination: Wenxin Yiyan 4.0 again tops the list, ahead of the GPT-4 series models, with GLM-4 following closely; all are first-tier models.

References:

  • https://mp.weixin.qq.com/s/r_aAjFHTRDBGXhl3bd06XQ

  • https://mp.weixin.qq.com/s/VhVEnRrIzJza1SZC9bKa6Q
