One sentence unlocks the real power of 100k+ context in large models, raising the score from 27 to 98; it works for GPT-4 and Claude 2.1.

Major model companies are racing to expand their context windows. Llama-1 shipped with a standard 2k window, but now anything under 100k is almost embarrassing to announce.

However, an extreme test found that most people use long contexts incorrectly and fail to exploit the AI's full potential.

Can AI really pinpoint key facts buried in hundreds of thousands of words? In the test's heat maps, the redder the color, the more mistakes the AI makes.

Out of the box, both GPT-4-128k and the newly released Claude 2.1-200k scored poorly.

But once Claude's team understood what was happening, they came up with a remarkably simple fix: adding one sentence raised the score from 27% to 98%.

The catch is that the sentence is not added to the user's question; instead, the AI is asked to say it at the beginning of its reply:

“Here is the most relevant sentence in the context:”

Making the large model find a needle in a haystack

To run this test, author Greg Kamradt spent at least $150 of his own money.

Fortunately, while he was testing Claude 2.1, Anthropic reached out and provided free credits; otherwise it would have cost him another $1,016.

The testing method itself is simple: 218 blog posts by YC founder Paul Graham serve as the test data.

A specific statement is inserted at various points in the documents: "The best thing to do in San Francisco is to sit in Dolores Park on a sunny day and eat a sandwich."

GPT-4 and Claude 2.1 are then asked to answer the question using only the provided context, and the test is repeated across different context lengths and insertion positions.
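The setup described above can be sketched in a few lines. This is a minimal illustration with made-up helper names, not the author's actual code (which is open-sourced on GitHub): insert the needle at a fractional depth of the text, sweep over context lengths and depths, and query the model at each point.

```python
# Minimal sketch of the "needle in a haystack" setup (hypothetical
# helper names; the real harness is in Greg Kamradt's repository).

NEEDLE = ("The best thing to do in San Francisco is to sit in "
          "Dolores Park on a sunny day and eat a sandwich.")

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth of the text
    (0.0 = beginning, 1.0 = end), snapped to a sentence boundary."""
    pos = int(len(haystack) * depth)
    cut = haystack.find(". ", pos)          # next sentence boundary
    if cut == -1:                           # no boundary left: append
        return haystack.rstrip() + " " + needle
    cut += 2                                # keep ". " with the left half
    return haystack[:cut] + needle + " " + haystack[cut:]

# Sweep over context lengths and insertion depths, as in the test.
essays = "One sentence. " * 5000           # stand-in for the 218 essays
for length in (1_000, 10_000, 50_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        context = insert_needle(essays[:length], NEEDLE, depth)
        # ...send `context` plus the question to the model, then grade
        # the answer (the original used the LangChain evals library).
```

Plotting the grades against (length, depth) gives exactly the red/green heat maps mentioned above.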

Finally, the LangChain Evals library is used to evaluate the results.

The author named the suite "Needle In A Haystack" and open-sourced the code on GitHub, where it has received 200+ stars. He also revealed that a company has sponsored testing of the next large model.

AI companies find solutions themselves

A few weeks later, Anthropic, the company behind Claude, analyzed the results carefully and found that the AI was simply unwilling to answer based on a single sentence in the document, especially when that sentence had been inserted after the fact and had little to do with the rest of the article.

In other words, the AI judged the sentence irrelevant to the article's topic and could not be bothered to search for it sentence by sentence.

At this point, the AI needs a nudge. The problem is solved by having Claude begin its answer with the sentence "Here is the most relevant sentence in the context:".
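In API terms, this amounts to prefilling the assistant's turn so the model continues from the given sentence. Here is one way to express it, as a sketch following the shape of Anthropic's Messages API (actually sending the request requires an API key and the `anthropic` client; this snippet only builds the payload):

```python
# Prefill the assistant turn so the model's reply continues from the
# magic sentence. Payload shape follows Anthropic's Messages API.

def build_request(context: str, question: str) -> dict:
    return {
        "model": "claude-2.1",
        "max_tokens": 300,
        "messages": [
            {"role": "user",
             "content": (f"{context}\n\n{question}\n"
                         "Answer using only the context provided.")},
            # Prefilled assistant turn: the reply continues from here.
            {"role": "assistant",
             "content": "Here is the most relevant sentence in the context:"},
        ],
    }

req = build_request("...essays with the needle inserted...",
                    "What is the best thing to do in San Francisco?")
```

Because the model is committed to that opening, it goes looking for the sentence instead of deflecting.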

This method also improves Claude's performance when looking for sentences that genuinely belong to the original article, not just ones that were artificially inserted.

Anthropic said it will continue training Claude to make it better suited to such tasks.

Requiring the AI to begin its answer with specified text in API calls has other clever uses.

Entrepreneur Matt Shumer added a few tips after seeing this approach:

If you want the AI to output pure JSON, end the prompt with "{". Likewise, if you want the AI to produce a list numbered with Roman numerals, end the prompt with "I:".
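The same prefill idea covers both of these formatting tips. A sketch (the helper name is made up): end the conversation with a partial assistant turn, and the model's output continues directly from it.

```python
# Prefill trick applied to output formatting: the last message is a
# partial assistant reply that the model must continue.

def prefill(user_prompt: str, start_of_reply: str) -> list[dict]:
    """Build a message list whose last turn is a partial assistant
    reply; the model's output continues from `start_of_reply`."""
    return [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": start_of_reply},
    ]

# Force pure JSON output by starting the reply with "{".
json_messages = prefill("List three colors as a JSON object.", "{")

# Force a Roman-numeral list by starting the reply with "I:".
roman_messages = prefill("List your top three tips.", "I:")
```

Note that the model's response then omits the prefilled characters, so the caller concatenates the prefill with the completion to get the full output.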

But the story doesn't end there…

Chinese large model companies also noticed the test and began checking whether their own models could pass it.

Moonshot AI (Dark Side of the Moon), whose Kimi large model also supports a very long context, tested the problem too, arrived at a different solution, and achieved good results.

Their approach modifies the user's question prompt rather than asking the AI to prepend a sentence to its answer, which is easier to apply, especially when using a chatbot product directly instead of calling the API.
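The article does not quote Moonshot AI's exact wording, but the general idea of a prompt-side fix can be illustrated with a hypothetical template: put the "quote the relevant sentence first" instruction in the user's question itself, so no API-level prefill is needed.

```python
# Hypothetical prompt-side variant (illustrative wording, not
# Moonshot AI's actual prompt): the instruction to surface the
# relevant sentence lives in the user prompt, not the reply.

PROMPT_TEMPLATE = (
    "{context}\n\n"
    "Question: {question}\n"
    "First quote the single most relevant sentence from the context, "
    "then answer using only the context provided."
)

prompt = PROMPT_TEMPLATE.format(
    context="...long document with the needle inserted...",
    question="What is the best thing to do in San Francisco?",
)
```

A template like this works in any chat UI, whereas assistant prefill is only available through the API.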

Dark Side of the Moon also used its new method to test GPT-4 and Claude 2.1: GPT-4 improved significantly, while Claude 2.1 improved only slightly.

It seems the experiment itself has certain limitations, and Claude has its own particularities, possibly related to Anthropic's Constitutional AI alignment approach; for Claude, the method Anthropic itself provided works better.

Later, Dark Side of the Moon engineers ran more rounds of experiments, and in one of them it turned out that…

Oh no, I've become test data myself.

Reference links:

[1] https://x.com/GregKamradt/status/1727018183608193393

[2] https://www.anthropic.com/index/claude-2-1-prompting
