Groundbreaking new architecture challenges the Transformer: unlimited context, trained on 2 trillion tokens, outperforming Llama 2

The Transformer's throne may be about to be taken! Meta, USC, CMU and UCSD have jointly proposed Megalodon, a revolutionary new architecture that can handle unlimited context. Trained on 2 trillion tokens, it surpasses Llama2-7B in performance while achieving remarkable efficiency.

After Mamba, another architecture that dares to challenge the Transformer has been born!

Researchers from Meta, University of Southern California (USC), CMU and UCSD proposed a new neural network architecture – Megalodon.


This is an architecture designed for efficient LLM pre-training and inference with “infinite” context length.

Paper address: https://arxiv.org/abs/2404.08801

We all know that the Transformer architecture is limited by quadratic complexity and weak length extrapolation capabilities when dealing with long contexts.

Although sub-quadratic alternatives exist (such as linear attention and state space models), they usually fall short of the Transformer in pre-training efficiency and even in downstream-task accuracy.


Megalodon was created to solve the problem of processing infinite context.

At the same time, it achieves both efficient training (reduced communication and computation) and efficient inference (a constant-size KV cache).
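To make the “constant KV cache” point concrete, below is a toy Python sketch, purely my own illustration and not Megalodon's actual implementation: under chunk-wise attention the cache only ever holds the current chunk, so its size is bounded by the chunk size rather than by the total context length (long-range information is instead carried by the model's recurrent moving-average state).

```python
class ChunkKVCache:
    """Toy sketch of a bounded KV cache under chunk-wise attention.

    Attention only looks within the current chunk, so the cache is cleared
    at every chunk boundary and never holds more than chunk_size entries.
    Illustrative only; not Megalodon's actual implementation.
    """

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.keys, self.values = [], []

    def append(self, k, v):
        if len(self.keys) == self.chunk_size:  # chunk boundary: start fresh
            self.keys.clear()
            self.values.clear()
        self.keys.append(k)
        self.values.append(v)


cache = ChunkKVCache(chunk_size=4)
for t in range(10):                    # 10 decoding steps ...
    cache.append(f"k{t}", f"v{t}")
print(len(cache.keys))                 # ... but never more than 4 cached entries
```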

It is worth mentioning that in a direct comparison with Llama 2 at the scale of 7 billion parameters and 2 trillion training tokens, Megalodon not only trains more efficiently but also surpasses the Transformer in accuracy.

Specifically, Megalodon's training loss is 1.70, falling between Llama2-7B (1.75) and Llama2-13B (1.67).

This paradigm-changing innovation represents a quantum leap in the field of AI, and Megalodon opens a new era of computing efficiency and performance.

The biggest milestone since the release of GPT-3

Netizens said that, first with Google and now with Meta, infinite context is one step closer to us, and LLMs will unleash unlimited potential.

Others think that “infinite context length is definitely a game changer”!

What’s more, the CEO of a startup said, “This is the biggest milestone since the release of GPT-3, yet there is hardly any buzz?! Megalodon is essentially a foundation for AGI.”

“Meta's Megalodon is a breakthrough development and is of great significance to AGI. Its infinite context length simulates human cognition and achieves seamless task switching.”

Hao Zhang, one of the paper's authors, said that this is a new architecture intended to replace the Transformer.

Beidi Chen, another author of the paper, said, “Attention is good, but you don't need full attention!”

Princeton Assistant Professor Tri Dao said, “Combining SSM/RNN/EMA with attention is the way to get higher quality, longer context and faster inference! Griffin, Jamba, Zamba and now Megalodon are all good examples.”

Revolutionary architecture makes training more stable

So, what kind of design is used in the Megalodon architecture to achieve such excellent performance?

According to the paper, it builds on the MEGA architecture (moving-average-equipped gated attention) and adds several new technical components.

First, the complex exponential moving average (CEMA) component is a new technique that extends the multidimensional damped exponential moving average used in MEGA to the complex domain, enhancing the model's expressive capacity.
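For intuition, here is a minimal NumPy sketch of a one-dimensional complex-valued damped EMA. The parameter names (alpha, delta, theta) and the exact update rule are illustrative assumptions and do not reproduce the paper's multidimensional CEMA parameterization.

```python
import numpy as np


def complex_ema(x, alpha, delta, theta):
    """Illustrative 1-D complex damped EMA (not the paper's exact CEMA).

    x:     (T,) real input sequence
    alpha: update weight in (0, 1)
    delta: damping factor in (0, 1)
    theta: rotation angle; the complex phase adds an oscillatory component
    """
    # Complex decay: the magnitude (1 - alpha * delta) controls forgetting,
    # the phase exp(i * theta) rotates the hidden state at each step.
    decay = (1.0 - alpha * delta) * np.exp(1j * theta)
    h = 0.0 + 0.0j
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        h = alpha * xt + decay * h   # complex recurrent state
        y[t] = h.real                # project back to the reals
    return y


print(complex_ema(np.ones(8), alpha=0.5, delta=0.9, theta=0.3))
```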

Second, the researchers proposed an innovative normalization technique, the “timestep normalization layer”.

It extends traditional group normalization to autoregressive sequence modeling, allowing the model to normalize effectively while processing sequential data.

In the past, layer normalization combined with the Transformer has delivered impressive performance.

However, layer normalization does not directly reduce internal covariate shift along the timestep (sequence) dimension.

In addition, although group normalization improves on layer normalization in CV tasks, it cannot be directly applied to the Transformer's autoregressive sequence modeling, because the mean and variance computed along the timestep dimension would leak future information.

As shown in the figure below, panel (c) illustrates the layer normalization and timestep normalization methods in the Megalodon architecture.
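To make the “no future leakage” requirement concrete, here is a minimal Python sketch of causal, cumulative normalization: the statistics used at step t are computed only from positions up to t. This is an illustration of the general idea under my own simplifications, not the exact timestep normalization layer from the paper.

```python
import numpy as np


def causal_timestep_norm(x, eps=1e-5):
    """Illustrative causal normalization over a (T, D) sequence.

    The mean/variance applied at step t are cumulative statistics over all
    features of positions 0..t, so no future information can leak.
    """
    T, D = x.shape
    counts = np.arange(1, T + 1)[:, None] * D                 # elements seen so far
    csum = np.cumsum(x.sum(axis=1, keepdims=True), axis=0)    # cumulative sum
    csum_sq = np.cumsum((x ** 2).sum(axis=1, keepdims=True), axis=0)
    mean = csum / counts
    var = csum_sq / counts - mean ** 2
    return (x - mean) / np.sqrt(var + eps)


x = np.random.randn(6, 4)
print(causal_timestep_norm(x).shape)   # (6, 4)
```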

Finally, in order to enhance the stability of large-scale LLM pre-training, the researchers proposed a configuration that combines normalized attention and pre-normalization with two-hop residuals.

This configuration can optimize the learning process of the model and improve the stability of training.

In Figure 3 below, panel (a) is the complete sketch of the Megalodon framework.

The middle and right panels show the standard pre-normalization configuration and pre-normalization with two-hop residuals, respectively.
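The PyTorch sketch below contrasts the two residual configurations in a generic block, based on my reading of the figure: in the two-hop variant, the residual of the FFN sub-layer connects back to the block input rather than to the attention output. This is only an assumption-laden illustration; the real Megalodon block uses CEMA-based gated attention, not nn.MultiheadAttention, and its details differ.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Generic pre-norm block, optionally with a two-hop residual.

    Standard pre-norm:        x' = x + Attn(Norm(x));  y = x' + FFN(Norm(x'))
    Two-hop residual variant: x' = x + Attn(Norm(x));  y = x  + FFN(Norm(x'))
    Illustrative only; not the actual Megalodon block.
    """

    def __init__(self, dim, heads=4, two_hop=True):
        super().__init__()
        self.two_hop = two_hop
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x_prime = x + self.attn(h, h, h, need_weights=False)[0]
        residual = x if self.two_hop else x_prime   # two-hop skips back to x
        return residual + self.ffn(self.norm2(x_prime))


block = PreNormBlock(dim=64)
print(block(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```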

2T token training, performance surpasses Llama2-7B

In specific experimental evaluations, the researchers expanded Megalodon to a scale of 7 billion parameters and applied it to large-scale LLM pre-training of 2 trillion tokens.

In addition, the authors also conducted experiments on medium/small parameter scale sequence modeling benchmarks, including Long Range Arena (LRA), raw speech classification on Speech Commands, image classification on ImageNet-1K, as well as WikiText-103 and PG19 language modeling.

Results show that in these tasks, Megalodon significantly outperforms all state-of-the-art baseline models across a variety of data modalities.

Data learning efficiency

The training loss curve and results on multiple benchmarks show that, at 7B parameters, Megalodon has better data learning efficiency than the Transformer.

Computational efficiency

At both 4K and 32K context lengths, the pre-training computational efficiency of the Megalodon architecture is also very strong.

Short context assessment on academic benchmarks

Specifically, the researchers compared Megalodon with Llama 2 and other open-source base models on standard short-context (4K tokens) academic benchmarks.

After training on the same 2 trillion tokens, Megalodon-7B performed significantly better than Llama2-7B.

Long context evaluation

Perplexity results across different long context lengths show that Megalodon can make use of very long contexts to predict the next token.

Figure 5 shows the perplexity (PPL) of the validation data set under various context lengths from 4K to 2M.

On the long-context QA tasks in the SCROLLS benchmark, Megalodon achieves the best F1 on NarrativeQA and is competitive with Llama 2 Long.

Medium Scale Benchmark Assessment

In tests on the Long Range Arena (LRA), the new architecture significantly narrowed the performance gap between chunked attention and full attention.

The results on other evaluation sets, such as raw speech classification, ImageNet-1K, WikiText-103 and PG-19, are as follows:

Some thoughts

Here are some quotes from one of the paper's authors:

This work took nearly two years from the initial idea to final completion. Along the way, I experienced several failures and learned a great deal about how to do research properly in the era of large-scale pre-training.

Through this project, the researchers also came to recognize the issues that deserve attention when building new model architectures in the era of large models. In summary:

A comparison between two different model architectures is convincing only when the training data are exactly the same. When the data differ, even slightly (<10%), the final results may differ significantly; both the training loss and downstream-task results are strongly affected by the training data.

For different architectures, comparisons are only meaningful when the models are sufficiently trained. For example, for a 7B model, 2T training tokens is almost a basic requirement. Some models may perform well with little data but fall behind other models as the data scale grows. Therefore, for comparisons of large model architectures, sufficient training is a prerequisite for convincing results.

For models with very different architectures, traditional FLOPs-based scaling-law comparisons become less meaningful, because two models with the same FLOPs can differ in actual speed by several times; this depends heavily on whether the algorithm itself is well suited to computation on the most advanced GPUs. A truly practical comparison therefore splits the evaluation into two aspects, data learning efficiency and computational efficiency, as in this work. In practice, however, this places high demands on researchers' engineering ability; in the era of large models, developing new algorithms is deeply intertwined with systems work.

References:

  • https://arxiv.org/abs/2404.08801

  • https://zhuanlan.zhihu.com/p/692682649
