GPT-4 Lacks Graphical Reasoning Skills: Accuracy Only 33% Even With Lenient Scoring

GPT-4’s graphical reasoning ability is less than half that of humans?

A study by the Santa Fe Institute in the United States showed that GPT-4's accuracy on graphical reasoning questions is only 33%.


The performance of GPT-4v, which has multimodal capabilities, is even worse: it answered only 25% of the questions correctly.

The dashed line represents the average performance on the 16 tasks

After the results of this experiment were published, they quickly sparked widespread discussion on Hacker News (YC).

Netizens who agree with the result say that GPT is indeed poor at processing abstract graphics, and has particular difficulty with concepts such as "position" and "rotation."


But on the other hand, many netizens questioned the conclusion. To put it simply:

The result cannot be called wrong, but neither can it be convincingly called entirely correct.

As for the specific reasons, read on.

GPT-4 accuracy is only 33%

To evaluate the performance of humans and GPT-4 on these graphics problems, the researchers used the ConceptARC dataset, released by their own institution in May this year.

ConceptARC contains graphical reasoning questions in 16 subcategories, with 30 questions per category, 480 in total.

These 16 subcategories cover aspects such as positional relationships, shapes, operations, and comparisons.

Specifically, the questions are composed of pixel blocks: humans and GPT must infer the pattern from the given examples, then apply the same transformation to a new image.
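To make the format concrete, here is a toy ARC-style task. This is a hypothetical illustration, not an actual ConceptARC question; the dict layout and the "mirror" rule are our own assumptions for demonstration purposes.

```python
# A toy ARC-style task (hypothetical example, not taken from ConceptARC).
# Grids are matrices of color indices; the solver must infer the hidden
# rule from the demonstration pairs and apply it to the test input.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": {"input": [[3, 0], [0, 0]]},
}

def mirror_horizontally(grid):
    """The hidden rule in this toy task: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

prediction = mirror_horizontally(task["test"]["input"])
print(prediction)  # [[0, 3], [0, 0]]
```

A human (or model) that has inferred "mirror each row" from the two training pairs can then produce the correct output for the unseen test grid.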

The author specifically shows examples of these 16 subcategories in the paper, one for each category.


As a result, the 451 human subjects averaged no less than 83% accuracy on every sub-item, and their average across the 16 tasks reached 91%.

Even when GPT-4 (one-shot) was "given leeway" with three attempts per question (any one correct answer counting as correct), its highest per-category accuracy did not exceed 60%, and its average was only 33%.
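The lenient scoring rule described above can be sketched as follows. This is our reading of "three attempts, one correct counts as correct," not the paper's actual evaluation code; the function name and the example data are invented for illustration.

```python
def scored_correct(attempts, answer):
    """Lenient scoring: a question counts as solved if any of up to
    three attempts exactly matches the correct answer grid."""
    return any(a == answer for a in attempts[:3])

# Hypothetical illustration with three questions.
results = [scored_correct(atts, ans) for atts, ans in [
    ([[[1]]], [[1]]),              # solved on the first try
    ([[[0]], [[2]]], [[2]]),       # solved on the second try
    ([[[0]], [[1]], [[3]]], [[4]]),  # all three attempts wrong
]]
print(f"{sum(results)} of {len(results)} solved")
```

Even under this generous rule, the study reports that GPT-4's average stayed at 33%.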

Earlier, the authors of the ConceptARC benchmark involved in this experiment had run a similar experiment, but with zero-shot testing of GPT-4; there, the average accuracy across the 16 tasks was only 19%.

The accuracy of the multimodal GPT-4v is even lower: on a small-scale ConceptARC subset of 48 questions, its zero-shot and one-shot accuracies were only 25% and 23%, respectively.

After further analyzing the wrong answers, the researchers found that some human errors appear to stem from "carelessness," whereas GPT completely failed to understand the rules of the questions.

Netizens generally do not dispute the numbers themselves; what makes the experiment questionable, in their view, is how the subjects were recruited and how the inputs were fed to GPT.

Subject selection method questioned

Initially, the researchers recruited subjects on an Amazon crowdsourcing platform.

The researchers extracted some simple questions from the dataset as an entry test; subjects needed to answer at least two of three random questions correctly to enter the formal test.

However, the entry-test results showed that some participants were simply after the payment and did not actually attempt the questions.

As a last resort, the researchers raised the threshold for taking the test: participants must have completed no fewer than 2,000 tasks on the platform, with an approval rate of at least 99%.


However, although the authors used the approval rate to screen people, in terms of specific abilities they imposed "no special requirements" beyond asking that subjects speak English and be familiar with graphics.

In the later stage of the experiment, to diversify the data, the researchers moved recruitment to another crowdsourcing platform. In the end, a total of 415 subjects participated.

Still, some questioned whether the experiment's sample was "random enough."

Some netizens pointed out that on the Amazon crowdsourcing platform the researchers used to recruit subjects, there are large models masquerading as human workers.

Now let's look at how GPT was operated. The multimodal version is straightforward: simply upload the image and use this prompt:

For zero-shot testing, the corresponding EXAMPLE section is simply removed.

But for the text-only GPT-4 (0613), which lacks multimodal capability, the image must first be converted into a grid, with numbers standing in for colors.
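A minimal sketch of that conversion might look like the following. This is our assumption of the general idea (not the paper's actual code), and the color palette and function names are invented for illustration.

```python
# Hypothetical color palette: map color names to digit codes.
COLOR_TO_DIGIT = {"black": 0, "red": 1, "blue": 2, "green": 3}

def grid_to_text(grid):
    """Serialize a grid of color names as lines of digits,
    so a text-only model can read the 'image' as plain text."""
    return "\n".join(
        " ".join(str(COLOR_TO_DIGIT[cell]) for cell in row) for row in grid
    )

demo = [["black", "red", "black"],
        ["blue", "blue", "blue"]]
print(grid_to_text(demo))
# 0 1 0
# 2 2 2
```

The model then receives only these digit rows, with no visual rendering at all, which is exactly what the critics below take issue with.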

Some people expressed disapproval of this operation:

Once the image is converted into a numeric matrix, the concept changes entirely; even humans may struggle to understand a "graphic" represented by numbers.

One More Thing

Coincidentally, Joy Hsu, a Chinese PhD student at Stanford, also tested GPT-4v's ability to understand graphics using a geometric dataset.

That dataset was published last year to test large models' understanding of Euclidean geometry. After GPT-4v became available, Hsu used it to run the test again.

She found that the way GPT-4v understands graphics seems "completely different from humans."


In terms of the numbers, GPT-4v's answers to these geometric questions were also significantly worse than humans'.
