Meta Introduces “Chameleon” to Challenge GPT-4o: A 34-Billion-Parameter Mixed-Modal Model Trained on 10 Trillion Tokens Sets a New SOTA

[Introduction to New Wisdom] Less than a week after GPT-4o was released, the first model to challenge the reigning champion has arrived. The Meta team recently released Chameleon, a “mixed-modal” model that processes text and images seamlessly in a single neural network. The 34B-parameter model, trained on 10 trillion tokens, sets new SOTA results on some benchmarks and comes close to GPT-4V.

The arrival of GPT-4o has once again set a new paradigm for multi-modal model development. Why is that?


OpenAI calls it the first “native” multi-modal model, which sets GPT-4o apart from all previous models.

Traditional multi-modal foundation models usually give each modality its own “encoder” or “decoder”, which keeps the modalities separate.

However, this approach limits the model's ability to effectively fuse cross-modal information.

According to the official blog, GPT-4o is the first model trained “end-to-end” across text, vision and audio, with all inputs and outputs processed by a single neural network.


And now, the industry's first model that dares to challenge GPT-4o has appeared!

Recently, researchers from the Meta team released the “mixed-modal base model” – Chameleon.

Paper address:

Like GPT-4o, Chameleon adopts a unified Transformer architecture and is trained on a mix of text, image and code data.

Images are discretized into tokens in much the same way text is tokenized, so the model can generate and reason over interleaved sequences of text and images.

With this “early fusion” approach, all modalities are mapped into a common representation space from the start, so the model can process text and images seamlessly.

Multimodal content generated by Chameleon

At the same time, such a design brings significant technical challenges to model training.

In response, the Meta research team introduced a series of architectural innovations and training techniques.

The results show that on text-only tasks, the 34-billion-parameter Chameleon (trained on 10 trillion multi-modal tokens) performs comparably to Gemini-Pro.

On visual question answering and image captioning benchmarks, it sets a new SOTA, with performance close to GPT-4V.

That said, both GPT-4o and Chameleon are early explorations of a new generation of “native” end-to-end multi-modal foundation models.

At the GTC 2024 conference, Jensen Huang described an important step towards the ultimate vision of AGI: interoperability across modalities.

Is the next open source GPT-4o coming?

The release of Chameleon is the fastest response to GPT-4o so far.

Some netizens remarked that it is simply tokens in, tokens out, which makes the model all but impossible to interpret.

Some even called it very solid research to appear so soon after GPT-4o, and claimed that open source will catch up.

However, the modalities Chameleon can currently generate are mainly images and text. The speech capabilities of GPT-4o are missing.

Netizens asked: if we just add another modality (audio), expand the training data set, and “cook” for a while, do we get GPT-4o…?

Meta's director of product management said, “I am very proud to support this team. This is a step toward bringing GPT-4o closer to the open source community.”

Maybe it won't be long before we get an open source version of GPT-4o.

Next, let's look at the technical details of the Chameleon model.

Technology Architecture

Meta opens the Chameleon paper by noting that many recently released models still do not take “multimodality” all the way.

Although these models are trained end-to-end, they still model different modalities separately, using separate encoders or decoders.

As mentioned at the beginning, this approach limits the model's ability to fuse cross-modal information, and makes it difficult to generate truly multi-modal documents containing arbitrary forms of information.

To address this shortcoming, Meta proposed Chameleon, a family of “mixed-modal” base models capable of generating content in which text and images are arbitrarily interleaved.

Chameleon generated results, with text and images interlaced

A “mixed-modal” base model here means that Chameleon is not only trained end-to-end from scratch, but also interleaves and mixes information from all modalities during training and processes it with a single unified architecture.

How to mix information from all modalities and represent it in the same model architecture?

The answer is still “token”.

As long as everything is represented as tokens, information from every modality can be mapped into the same vector space and handled seamlessly by the Transformer.

However, this approach poses technical challenges for optimization stability and model scalability.

To solve these problems, the paper introduces corresponding changes to the model architecture and a set of training techniques, including QK normalization and z-loss.

The paper also shows how to adapt the supervised fine-tuning approaches used for text-only LLMs to the mixed-modal setting.

Image “Tokenizer”

To represent all modalities as tokens, a powerful tokenizer is first needed.

To this end, the Chameleon team developed a new image tokenizer based on an earlier Meta paper. Using a codebook of size 8192, it encodes a 512×512 image into 1024 discrete tokens.
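As a rough illustration of what "a codebook of size 8192" and "1024 discrete tokens" mean, here is a minimal vector-quantization sketch. It is not the tokenizer from the paper: the 32×32 grid is inferred from 1024 = 32 × 32, the latent width `EMBED_DIM` is a made-up value, and random vectors stand in for the output of a real image encoder.

```python
import numpy as np

CODEBOOK_SIZE = 8192  # from the article
IMAGE_SIZE = 512      # from the article
GRID = 32             # inferred: 32 x 32 = 1024 tokens, i.e. one token per 16x16 patch
EMBED_DIM = 8         # hypothetical latent width, kept small for the demo


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    # Squared Euclidean distance via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    dist = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dist.argmin(axis=1)


rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))
# Stand-in for the encoder output: one latent vector per grid cell.
latents = rng.normal(size=(GRID * GRID, EMBED_DIM))
image_tokens = quantize(latents, codebook)  # 1024 integers in [0, 8192)
```

Each image then becomes a sequence of 1024 integers, which is what lets it be treated like text downstream.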

The text tokenizer is built with sentencepiece, the open source library developed by Google: a new BPE tokenizer is trained with a vocabulary of 65,536 tokens, which includes the 8192 image codebook tokens.
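The idea of one vocabulary shared by text and image tokens can be sketched as follows. The id layout is purely hypothetical: the offsets, the `BOI`/`EOI` sentinel ids and the function names are assumptions made for illustration, not the layout Chameleon actually uses.

```python
from typing import List, Tuple

IMAGE_CODEBOOK_SIZE = 8192  # from the article
BOI, EOI = 2, 3             # hypothetical begin/end-of-image sentinel ids
IMAGE_ID_OFFSET = 4         # hypothetical: ids 0-3 reserved for special tokens
TEXT_ID_OFFSET = IMAGE_ID_OFFSET + IMAGE_CODEBOOK_SIZE

# A segment is ("text", BPE ids) or ("image", codebook indices).
Segment = Tuple[str, List[int]]


def to_shared_ids(segments: List[Segment]) -> List[int]:
    """Flatten interleaved text and image segments into one id sequence."""
    ids: List[int] = []
    for kind, values in segments:
        if kind == "text":
            ids.extend(TEXT_ID_OFFSET + v for v in values)
        elif kind == "image":
            ids.append(BOI)
            ids.extend(IMAGE_ID_OFFSET + v for v in values)
            ids.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return ids


def is_image_id(token_id: int) -> bool:
    """True if the id falls in the block reserved for image codebook entries."""
    return IMAGE_ID_OFFSET <= token_id < TEXT_ID_OFFSET


document = [("text", [5, 17]), ("image", [0, 8191, 42]), ("text", [9])]
sequence = to_shared_ids(document)
```

Because the result is a single flat sequence of integers, one Transformer with one embedding table can consume and produce it without modality-specific components.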


To unlock the full potential of “mixed modalities”, the training data is likewise shuffled across modalities before being shown to the model. It includes text-only data, text-image pairs, and multi-modal documents with interleaved text and images.

The plain text data includes all pre-training data used by Llama 2 and CodeLlama, totaling 2.9 trillion tokens.

Text-image pairs include some public data, totaling 1.4 billion pairs and 1.5 trillion tokens.

For the interleaved text-and-image data, the paper stresses that it contains no data from Meta products; it draws entirely on public sources and totals 400 billion tokens.

Chameleon's pre-training is carried out in two separate stages, which account for 80% and 20% of total training respectively.

In the first stage, the model learns from the data above in an unsupervised manner. In the second stage, the weight of the first-stage data is lowered by 50% and higher-quality data is mixed in so that the model keeps learning.
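One way to picture the two-stage schedule is as a change of sampling weights. The sketch below reads the 50% reduction as applying to the sampling weight of the first-stage data, and rests on further assumptions that are not in the article: stage-one sources are sampled in proportion to their token counts, and the freed-up half of the mixture goes to a single `high_quality` source. All names are hypothetical.

```python
# Token counts per source, as quoted in the article.
STAGE1_TOKENS = {"text": 2.9e12, "text_image": 1.5e12, "interleaved": 0.4e12}


def normalize(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def stage2_mixture(stage1_mix, factor=0.5):
    """Down-weight every stage-one source and give the rest to new data."""
    mix = {name: w * factor for name, w in stage1_mix.items()}
    mix["high_quality"] = 1.0 - sum(mix.values())
    return mix


def stage_for_step(step, total_steps, stage1_fraction=0.8):
    """Stage 1 covers the first 80% of training steps, stage 2 the rest."""
    return 1 if step < stage1_fraction * total_steps else 2


stage1_mix = normalize(STAGE1_TOKENS)    # assumption: proportional sampling
stage2_mix = stage2_mixture(stage1_mix)  # first-stage sources halved
```

The actual proportions used in the paper may differ; the point is only that the second stage changes what the model samples from, not the model's parameters.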

When the model is scaled beyond 8B parameters and 1T tokens, clear instabilities appear in the later stages of training.

Since all modalities share the same model weights, each modality tends to increase its norms and “compete” with the others.

This causes little trouble early in training, but as training progresses and values move outside the range that bf16 can represent, the loss diverges.

The researchers attribute this to the translation invariance of the softmax function: adding the same constant to every logit leaves the output unchanged, so nothing stops the logits from drifting. In single-modal models this phenomenon is also known as “logit drift”.
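The translation invariance, and why drifting logits are a problem in low precision, can be checked numerically. The sketch below is an illustration rather than anything from the paper: `to_bf16` is a hand-written simulation of bfloat16 rounding (NumPy has no native bf16 type), and the shift of 1000 is an arbitrary example value. Near 1000, adjacent bf16 values are 4 apart, so logits that differ by 1 can no longer be told apart.

```python
import numpy as np


def softmax(z):
    """Softmax over a 1-D array of logits, computed in float64."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())  # subtracting the max is itself a use of the invariance
    return e / e.sum()


def to_bf16(x):
    """Simulate bfloat16 storage: keep the top 16 bits of float32 (round to nearest even)."""
    bits = np.array(x, dtype=np.float32).view(np.uint32)
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)


logits = np.array([1.0, 2.0, 3.0])
drifted = logits + 1000.0  # identical softmax in exact arithmetic

same_in_exact_math = np.allclose(softmax(logits), softmax(drifted))
small_survive_bf16 = np.array_equal(to_bf16(logits), logits)
drifted_survive_bf16 = np.array_equal(to_bf16(drifted), drifted)

print(same_in_exact_math, small_survive_bf16, drifted_survive_bf16)
print(to_bf16(drifted))
```

The loss cannot see the drift, because the softmax output is unchanged, yet the drifted values lose the very differences the softmax depends on once they are stored in bf16.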

Therefore, the paper proposes some architectural adjustments and optimization methods to ensure stability:

- QK normalization (query-key normalization): apply layer norm to the query and key vectors in the attention module, directly controlling the norm growth of the input to the softmax.

- Dropout: introduce dropout after the attention and feed-forward layers.

- z-loss: add z-loss regularization to the loss function.
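Two of the measures above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the layer norm has no learned scale or shift, attention is single-head with no masking, dropout is omitted, and the `coeff` default in `z_loss` is an illustrative value.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned parameters)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v, qk_norm=False):
    """Single-head scaled dot-product attention; returns (output, attention logits)."""
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)  # bounds the size of q.k
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v, logits


def z_loss(logits, coeff=1e-5):
    """Penalty coeff * log(Z)^2, where Z is the softmax partition function."""
    m = logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return coeff * np.mean(log_z ** 2)


rng = np.random.default_rng(0)
d = 16
q, k, v = (100.0 * rng.normal(size=(4, d)) for _ in range(3))  # deliberately large norms
_, raw_logits = attention(q, k, v)
_, normed_logits = attention(q, k, v, qk_norm=True)
```

With QK normalization the attention logits cannot exceed `sqrt(d)` in magnitude however large the inputs grow, and the z-loss term grows when all output logits shift upward together, which removes the free direction that translation invariance leaves open.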

In addition to the data source and architecture, the paper also generously disclosed the scale of computing power used in pre-training.

The hardware is the NVIDIA A100 with 80GB of memory. The 7B version was trained on 1024 GPUs in parallel for about 860,000 GPU hours. The 34B model used three times as many GPUs and more than 4.28 million GPU hours.

As the company that open sourced Llama 2, Meta is once again generous with its research. Compared with GPT-4o, which does not even have a technical report, this paper, packed with data and practical detail, is about as forthcoming as it gets.
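For a sense of scale, the GPU-hour figures can be converted into rough wall-clock time. The calculation assumes every GPU ran for the whole job with no downtime, and reads "three times" as 3 × 1024 = 3072 GPUs for the 34B model; both are assumptions, so the results are only ballpark estimates.

```python
def wall_clock_days(gpu_hours, num_gpus):
    """Days of continuous training if all GPUs run in parallel the whole time."""
    return gpu_hours / num_gpus / 24


days_7b = wall_clock_days(860_000, 1024)         # roughly five weeks
days_34b = wall_clock_days(4_280_000, 3 * 1024)  # roughly eight weeks
compute_ratio = 4_280_000 / 860_000              # 34B used about 5x the GPU hours

print(f"7B: {days_7b:.0f} days, 34B: {days_34b:.0f} days, ratio: {compute_ratio:.1f}x")
```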

Comprehensively surpassing Llama 2

The experimental evaluation is divided into benchmark evaluation, and human evaluation and safety testing.

Benchmark evaluation

After using four times more tokens than Llama 2 for training, Chameleon-34B has achieved stunning results in various single-modal benchmark tests.

For text-only tasks, the researchers compare the text-only capabilities of the pre-trained (non-SFT) models with other leading text-only LLMs.

The assessment content includes common sense reasoning, reading comprehension, mathematical problems and world knowledge areas. The assessment results are shown in the table below.

– Common sense reasoning and reading comprehension

It can be observed that compared to Llama 2, Chameleon-7B and Chameleon-34B are more competitive. In fact, 34B even surpassed Llama-2 70B in 5/8 tasks, and its performance was equivalent to Mixtral-8x7B.

– Mathematics and world knowledge

Despite being trained on other modalities, both Chameleon models exhibit strong mathematical capabilities.

On GSM8k, Chameleon-7B performs better than the Llama 2 model of corresponding parameter scale, and its performance is equivalent to Mistral-7B.

In addition, Chameleon-34B performs better than Llama 2-70B at maj@1 (61.4 vs 56.8) and Mixtral-8x7B at maj@32 (77.0 vs 75.1).

Likewise, on the MATH benchmark, Chameleon-7B outperforms Llama 2 and is comparable to Mistral-7B at maj@4, while Chameleon-34B outperforms Llama 2-70B and comes close to Mixtral-8x7B at maj@4 (24.7 vs 28.4).
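The maj@k figures quoted above refer to majority voting over k sampled answers: a problem counts as solved if the most frequent of the k answers matches the reference. A minimal sketch of that metric, with hypothetical function names and toy data, could look like this:

```python
from collections import Counter


def maj_at_k(sampled_answers, reference):
    """True if the most frequent of the k sampled answers equals the reference."""
    # On a tie, Counter keeps insertion order, so the answer sampled first wins.
    winner, _ = Counter(sampled_answers).most_common(1)[0]
    return winner == reference


def maj_at_k_accuracy(problems):
    """problems: list of (sampled_answers, reference) pairs. Returns percent solved."""
    solved = sum(maj_at_k(answers, ref) for answers, ref in problems)
    return 100.0 * solved / len(problems)


# Toy data: two problems with k = 4 samples each.
problems = [
    (["18", "18", "20", "7"], "18"),  # majority answer is correct
    (["5", "5", "3", "3"], "3"),      # tie, first-seen answer "5" wins, so wrong
]
score = maj_at_k_accuracy(problems)
```

maj@1 reduces to ordinary single-sample accuracy, which is why scores at maj@32 are typically higher than at maj@1 for the same model.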

Overall, Chameleon's performance surpasses Llama 2 across the board and is close to Mistral-7B/8x7B on some tasks.

For image-to-text, the researchers evaluated two specific tasks: visual question answering and image captioning.

Chameleon beat models such as Flamingo and LLaVA-1.5 on visual question answering and image captioning to set a new SOTA. On text-only tasks, it also performed on par with first-tier models such as Mixtral 8x7B and Gemini Pro.

Human evaluation and safety testing

To further evaluate the quality of the multi-modal content the model generates, the paper also ran human evaluation experiments in addition to the benchmarks, and found that Chameleon-34B substantially outperformed Gemini Pro and GPT-4V.

In pairwise comparisons, human judges preferred Chameleon-34B at rates of 51.6% against GPT-4V and 60.4% against Gemini Pro.

The figure below shows the performance of Chameleon compared to baseline models in understanding and generating content for a diverse set of prompts from human annotators.

Each question is answered by three different human annotators, with the majority vote being the final answer.

To understand the quality of the human annotators and whether the questions were well designed, the researchers also examined the degree of agreement between different annotators.
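The aggregation described above, three annotators per question with a majority vote and a check on how far they agree, can be sketched as follows. The label strings and function names are hypothetical; the paper's exact labels and agreement statistics are not reproduced here.

```python
from collections import Counter


def majority_label(votes):
    """Label chosen by at least two of the three annotators, or None if all differ."""
    if len(votes) != 3:
        raise ValueError("expected exactly three annotator votes")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


def agreement_level(votes):
    """Classify a question by how many distinct labels the three annotators gave."""
    if len(votes) != 3:
        raise ValueError("expected exactly three annotator votes")
    return {1: "all agree", 2: "two agree", 3: "no agreement"}[len(set(votes))]


# Hypothetical labels for three questions.
annotations = [
    ["fulfills", "fulfills", "fulfills"],
    ["fulfills", "partially", "fulfills"],
    ["fulfills", "partially", "does not"],
]
final_labels = [majority_label(v) for v in annotations]
levels = Counter(agreement_level(v) for v in annotations)
```

Counting how many questions fall into each agreement level gives a quick picture of whether the questions were well posed: a large "no agreement" share suggests ambiguous prompts or unclear guidelines.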

Table 5 reports safety testing on 20,000 crowdsourced prompts and 445 red-team interactions designed to provoke the model into producing unsafe content.

Compared with Gemini and GPT-4V, Chameleon is very competitive on prompts that call for interleaved, mixed-modal responses.

As the example shows, when answering a question Chameleon can not only understand text-plus-image input but also add fitting images to its own output.

Moreover, the images Chameleon generates are usually relevant to the context, which makes this interleaved output highly appealing to users.

Contribution team

At the end of the paper, the contributors who participated in this research are also listed.

The list covers pre-training, alignment and safety, inference and evaluation, and all project participants.

Among them, * denotes a joint first author, † a key contributor, ‡ a workstream lead, and ♯ the project lead.