Pangu-π by Huawei enhances Transformer architecture, addressing feature deficiencies and outperforming LLaMA at comparable scale

Huawei's Pangu series brings innovation at the architectural level! Huawei's Noah's Ark Lab and collaborators have jointly launched a new large language model architecture: Pangu-π.

It improves on the traditional Transformer architecture by enhancing nonlinearity, which significantly reduces the feature collapse problem. The direct effect is that the model's outputs have stronger expressive power.

When trained on the same data, Pangu-π (7B) surpasses a LLaMA 2 model of comparable size across multiple tasks and achieves roughly 10% faster inference.

At the 1B scale, it reaches SOTA.

At the same time, a large model for finance and law, “Yunshan”, was developed on top of this architecture.

This work is led by AI expert Tao Dacheng.

How is this achieved? Let's take a look.

Using nonlinearity to solve feature collapse

At present, mainstream large models such as GPT and LLaMA basically adopt the Transformer architecture.

Its core components are the multi-head self-attention mechanism (MSA) and the feed-forward network (FFN).

The main function of MSA is to compute the correlation between each token in the input sequence and all other tokens; by learning the dependencies within the sequence, it strengthens the model's understanding of language. The FFN mainly applies a nonlinear transformation to the input, enhancing the model's expressive power so that it can approximate more complex functions.
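
For orientation, here is a minimal sketch of a plain (pre-Pangu-π) Transformer block in PyTorch; the hidden sizes, pre-norm placement, and GELU activation are illustrative assumptions rather than details taken from the paper:

```python
import torch.nn as nn

class VanillaTransformerBlock(nn.Module):
    """Plain Transformer block: MSA mixes information across tokens,
    the FFN applies a per-token nonlinear transformation."""
    def __init__(self, d_model=512, n_heads=8, d_ffn=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.GELU(),
            nn.Linear(d_ffn, d_model),
        )

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        attn_out, _ = self.msa(h, h, h)        # dependencies between all token pairs
        x = x + attn_out                       # residual connection
        return x + self.ffn(self.norm2(x))     # per-token nonlinearity
```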

However, Huawei's Noah's Ark Lab found that feature collapse degrades the performance of the Transformer architecture, reducing its expressive ability and making it difficult for the model to distinguish different inputs.

Taking LLaMA as an example, in deeper layers the diversity of the features drops significantly, so the representations of all tokens become increasingly similar.

From a mechanistic point of view, the self-attention module can be regarded as information aggregation on a complete graph. Continuously stacking attention layers is like applying multi-layer graph convolution over and over, which produces an excessive feature-smoothing effect.
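
A toy experiment (not from the paper) makes this over-smoothing concrete: repeatedly aggregating token features with random row-stochastic "attention" matrices, with no nonlinearity in between, drives the pairwise cosine similarity of the tokens toward 1:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(16, 64)                        # 16 tokens with 64-dim features

def mean_pairwise_cos(feats):
    """Average cosine similarity between all pairs of distinct tokens."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t()
    mask = ~torch.eye(feats.shape[0], dtype=torch.bool)
    return sim[mask].mean().item()

print(f"layer  0: mean cosine similarity = {mean_pairwise_cos(x):.3f}")
for layer in range(1, 13):
    attn = torch.softmax(torch.randn(16, 16), dim=-1)  # row-stochastic "attention"
    x = attn @ x                                       # pure aggregation, no nonlinearity
    if layer % 4 == 0:
        print(f"layer {layer:2d}: mean cosine similarity = {mean_pairwise_cos(x):.3f}")
```

The printed similarity climbs toward 1 as layers stack, i.e., the token features collapse toward one another.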

On the other hand, the nonlinearity provided by the activation function in the multilayer perceptron (MLP) is insufficient, so its ability to suppress feature collapse is limited.

Therefore, the team set out to improve the model's nonlinear expressive ability and avoid feature collapse, which led to this work: Pangu-π.

The structure of Pangu-π is as follows:

Adding a series activation function to the FFN and integrating an augmented shortcut connection (Aug-S) into the MSA more effectively injects additional nonlinearity into the Transformer architecture.
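
A rough sketch of how a series activation could be wired into the FFN is shown below. The exact parameterization in Pangu-π may differ; the learnable per-channel scales and shifts, the GELU base activation, and the layer sizes here are assumptions:

```python
import torch
import torch.nn as nn

class SeriesActivation(nn.Module):
    """Sketch of a series activation: a sum of n scaled and shifted copies
    of a base activation, adding extra nonlinearity per layer.
    The a_i / b_i parameterization here is an assumption."""
    def __init__(self, dim, n=3):
        super().__init__()
        self.base_act = nn.GELU()
        self.scales = nn.Parameter(torch.ones(n, dim))    # a_i
        self.shifts = nn.Parameter(torch.zeros(n, dim))   # b_i

    def forward(self, x):
        return sum(a * self.base_act(x + b) for a, b in zip(self.scales, self.shifts))

class SeriesFFN(nn.Module):
    """FFN whose single activation is replaced by the series activation above."""
    def __init__(self, d_model=512, d_ffn=2048, n=3):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.act = SeriesActivation(d_ffn, n=n)
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```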

An MSA equipped with the augmented shortcut connection (Aug-S) can map the features of each token into diverse representations.
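
The sketch below illustrates one plausible form of such an augmented shortcut: a cheap, per-token learned bypass added alongside the identity shortcut and the attention output, so token features retain individual transformations instead of only being averaged together. The bottleneck projection used for the bypass is an assumption, not the paper's exact formulation:

```python
import torch.nn as nn

class AugShortcutMSA(nn.Module):
    """Sketch of MSA with an augmented shortcut (Aug-S): in addition to the
    identity shortcut, a lightweight learned per-token bypass is added."""
    def __init__(self, d_model=512, n_heads=8, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Aug-S bypass: cheap per-token projection, no cross-token mixing
        self.aug_shortcut = nn.Sequential(
            nn.Linear(d_model, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, d_model),
        )

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h = self.norm(x)
        attn_out, _ = self.msa(h, h, h)
        # identity shortcut + attention output + augmented shortcut
        return x + attn_out + self.aug_shortcut(h)
```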

Based on this new architecture, and through large-scale training and fine-tuning, the research team developed the Pangu-π foundation model.

Experimental results show that the model outperforms other models of the same scale on multiple tasks (both the 7B and 1B scales were tested).

Moreover, Pangu-π-7B achieves about 10% faster inference.

At the same time, the team also built “Yunshan”, a large model for the financial and legal domains, on this architecture; it likewise outperforms other models on multiple benchmarks.

The corresponding author is Tao Dacheng

It is worth noting that the lineup of the team behind this study is also eye-catching.

The corresponding author is Tao Dacheng.

He is a foreign member of the European Academy of Sciences and a fellow of the Australian Academy of Science. He completed his undergraduate studies at the University of Science and Technology of China, and is said to have graduated from the MMLab at the Chinese University of Hong Kong, where he studied under Tang Xiaoou.

After receiving his PhD in the UK in 2007, he taught successively at the Hong Kong Polytechnic University, Nanyang Technological University in Singapore, the University of Technology Sydney, and the University of Sydney in Australia. He is currently a distinguished visiting professor at Tsinghua University's Institute for AI Industry Research (AIR).

At the same time, he has also worked at UBTECH and JD.com; he was once the highest-ranked AI scientist at JD.com and served as the dean of JD Explore Academy.

Another core member of the team is Wang Yunhe.

He is a senior researcher at the Noah's Ark Lab under Huawei's 2012 Laboratories and is currently the director of Huawei's algorithm application department.

Wang Yunhe is responsible for the research and development of efficient AI algorithms and their application in Huawei's business. He and his team developed efficient AI algorithms whose derivatives were applied in observation work with FAST, China's “Sky Eye” radio telescope, helping experts from the National Astronomical Observatories of the Chinese Academy of Sciences find hundreds of new fast radio burst samples.

Paper address:

  • http://arxiv.org/abs/2312.17276