UC Berkeley’s “Big Three” Unveil the First Pure-Vision Large Model, and Its Reasoning Sparks Talk of AGI

Three giants of computer vision at UC Berkeley have launched the first large vision model trained without any natural language, proving for the first time that pure CV models are also scalable. More striking still, the LVM can often answer graphical reasoning questions correctly. Has the spark of AGI appeared again?

The GPT moment for computer vision is here!


Recently, the “Big Three” of computer vision at UC Berkeley joined forces to launch the first Large Vision Model (LVM) trained without any natural language, and for the first time demonstrated that a pure vision model is itself scalable.

In addition, the researchers used a dataset of more than 420 billion tokens to let the model understand and perform downstream tasks through in-context learning, unifying nearly every form of visual data: images and videos, supervised and unsupervised, synthetic and real, 2D/3D/4D.

Paper address: abs/2312.00785

It is worth mentioning that when LVM is given the kind of non-verbal reasoning questions found in IQ tests (Raven’s Progressive Matrices), it can often make the correct inference.


In this regard, the researchers remarked with some surprise that this may mean LVM, too, has shown “a spark of AGI”!

The counterattack of pure vision models

Now, with the explosion of large language models, both academia and industry are trying to use “text” to scale up vision models.

SOTA models, including GPT-4V, are trained by combining vision and text.

Take “apple” as an example: during training, this approach not only shows the model photos of apples, but also pairs them with the text “This is an apple.”

However, with more complex images, such descriptions easily miss a great deal of information.

For example, how would we describe the “Mona Lisa”? And a photo of a kitchen cluttered with objects is hard to describe completely.

In response, researchers from UC Berkeley and Johns Hopkins University have proposed a new “visual sequence” modeling approach that can train a Large Vision Model without using any language data.

This universal format, called “visual sequence,” can represent raw images and videos, as well as annotated data sources such as semantic segmentation and depth reconstruction, without requiring any meta-knowledge beyond pixels.

Once such a wide range of visual data (containing 420 billion tokens) is represented as a sequence, the model can be trained to minimize the cross-entropy loss of predicting the next token.

The resulting LVM not only scales effectively to complete a variety of vision tasks, but even develops abilities such as counting, reasoning, and solving IQ-test problems.

Left: Alexei A. Efros; Center: Trevor Darrell; Right: Jitendra Malik

To put it simply, a large vision model can learn to understand and process complex visual information just by looking at images, without relying on language data at all.

The difficulty of scaling pure vision models

The value of pretrained models (such as ImageNet-pretrained AlexNet) was demonstrated by R-CNN as early as 2014.

It has since become standard practice in computer vision.

Self-supervised pre-training has been proposed as a method to greatly increase the amount of data available for pre-training.

Unfortunately, this approach was not very successful at first, probably because the CNN-based architectures of the time lacked the capacity to absorb the data.

With the introduction of the Transformer, model capacity increased dramatically, so researchers revisited self-supervised pretraining and found that Transformer-based masked image reconstruction methods, such as BEiT, MAE, and SimMIM, perform far better than their CNN-based counterparts.

However, despite this, current pretrained vision-only models still run into difficulty when scaling to truly large datasets, such as LAION.

How to build a “large vision model”

So what elements are needed to build a Large Vision Model (LVM)?

The animal world tells us that visual ability does not depend on language. Many experiments have shown that the visual world of non-human primates is very similar to that of humans.

This work therefore takes a different direction from vision-language models like LLaVA: how far can we go relying on pixels alone?

The researchers tried to imitate two key features of LLMs in the LVM: (1) scaling with big data, and (2) flexible task specification through prompting (in-context learning).

In order to achieve this goal, three main components need to be identified:

Data: the researchers hope to take advantage of the remarkable diversity of visual data.

First are the original unannotated images and videos. Next, the researchers plan to leverage various annotated visual data resources generated over the past few decades, such as semantic segmentation, depth reconstruction, keypoints, multiple views of 3D objects, etc.

To this end, they defined a common format called “visual sequences” to represent these different annotations without requiring any meta-knowledge beyond the pixels themselves. The training dataset contains a total of 1.64 billion images/frames.

Architecture: the researchers used a large Transformer with 3 billion parameters, trained on visual data represented as sequences of tokens.

Through the learned tokenizer, each image is mapped to a sequence of 256 vector-quantized tokens.

Loss function: the researchers took inspiration from natural language processing, where masked-token modeling has given way to sequential autoregressive prediction.

Once you can represent images/videos/annotated images as sequences, you can train the model to minimize the cross-entropy loss in predicting the next token.
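In sketch form, this objective is ordinary next-token cross-entropy over the flattened token stream. A minimal pure-Python illustration (a toy vocabulary and hand-written logits, purely hypothetical and not the paper’s code):

```python
import math

def next_token_cross_entropy(logits, tokens):
    """Mean cross-entropy of predicting tokens[t+1] from the logits at step t.

    logits: one score vector per input position (toy stand-in for the model).
    tokens: the full token sequence (visual tokens, ending in EOS).
    """
    total, steps = 0.0, 0
    for t in range(len(tokens) - 1):
        scores = logits[t]
        m = max(scores)  # numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[tokens[t + 1]]  # -log p(next token)
        steps += 1
    return total / steps

# Toy example: vocabulary of 4 visual tokens, a 3-token sequence.
tokens = [2, 0, 3]
logits = [[0.1, 0.2, 0.3, 0.4], [2.0, 0.0, 0.0, 0.0]]
loss = next_token_cross_entropy(logits, tokens)
```

In the real model, the logits come from the Transformer and the tokens from the VQGAN tokenizer; the shape of the loss is the same.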

Through this minimalist design, researchers have made some novel discoveries——

– Models exhibit appropriate scaling behavior as model size and data size increase.

– A variety of vision tasks can be solved by designing suitable visual prompts at test time.

– With large amounts of unsupervised data, the performance gains on a range of standard vision tasks are substantial.

– The model demonstrates general visual reasoning capabilities when processing out-of-distribution data and performing novel tasks, but further investigation is needed.


“Data! Data! Data! I can’t make bricks without clay!” ——Sherlock Holmes

The key to any large pre-trained model is that it must be trained on large amounts of data.

For language models, it is easy to obtain large and very diverse data sets.

For example, the popular CommonCrawl repository contains 250 billion web pages crawled from across the entire web; it is extremely diverse and includes “natural demonstrations” of tasks such as language translation and question answering.

However, in the field of computer vision, there is still a long way to go to have the same scale and diversity of data sources.

Therefore, one of the core contributions of this work is the construction of just such a unified vision dataset, UVDv1.

To this end, the researchers exploited many different visual data sources: (1) unlabeled images, (2) images with visual annotations, (3) unlabeled videos, (4) videos with visual annotations, and (5) 3D synthetic objects.

Among them, unlabeled images account for more than 80% of the total data, covering most of the visual world and providing the required diversity; the price is that this source is of lower quality.

The distribution of annotated images is more restricted, but their quality is generally higher.

Video data are more restricted (generally human-centered activities), but they are a valuable source of temporal data.

Synthetic 3D objects have the lowest diversity of renderings, but provide valuable cues about 3D structure.

And most importantly, UVDv1 is a purely visual dataset and does not contain non-visual metadata such as text.

In total, UVDv1 contains 1.64 billion images.

Another important difference from LLMs is that language data has a natural, unified one-dimensional structure for all data: the text stream.

Unfortunately, this is not the case with visual data, where different sources have different structures.

Therefore, in this work, the researchers propose visual sequences as unified units of visual data, which allows them to train scalable models from different collection sources.

A visual sequence is simply a sequence containing one or more images, followed by an end-of-sequence (EOS) token.

Figure 1 shows how various data sources are divided into visual sequences.

Single images

A single image by itself is the simplest form of visual sequence: {image, EOS}.

The researchers used a filtered subset of LAION 5B containing 1.49 billion images.

This is by far the largest portion of the data, accounting for 88.5%.

Image sequences

Image sequences are a natural form of visual sequences.

The researchers create such sequences by taking video data from a variety of existing datasets.

16-frame visual sequences are formed by sampling each video at three different strides (10, 20, and 30 frames).
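As a sketch, that sampling scheme might look like the following (a hypothetical helper operating on frame indices; the paper’s exact windowing may differ):

```python
def sample_sequences(num_frames, strides=(10, 20, 30), seq_len=16):
    """Yield 16-frame index sequences, sampled at each stride in turn."""
    for stride in strides:
        span = stride * (seq_len - 1)  # distance from first to last frame
        start = 0
        while start + span < num_frames:
            yield [start + i * stride for i in range(seq_len)]
            start += span + stride  # assumption: non-overlapping windows
```

For a 200-frame clip at stride 10, this yields one sequence covering frames 0, 10, …, 150.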

Additionally, the researchers leveraged synthetic 3D objects from the Objaverse dataset to generate object-centered multi-view sequences.

For each object, the researchers sampled a camera radius between 1.5 and 2.2 from the object’s center and a constant elevation angle between -45 and 45 degrees, then rendered 24 views by stepping the azimuth around the object in 15-degree increments.

Using this method, the researchers rendered a total of 42,000 such sequences for training and 8,000 for testing.
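The camera sampling described above can be sketched as follows (a hypothetical helper returning (radius, elevation, azimuth) triples; the actual rendering pipeline is not shown):

```python
import random

def sample_object_views(num_views=24, azim_step=15.0, seed=0):
    """One camera trajectory around an object: a random fixed radius and
    elevation, then num_views azimuths in azim_step-degree increments."""
    rng = random.Random(seed)
    radius = rng.uniform(1.5, 2.2)        # distance to the object centre
    elevation = rng.uniform(-45.0, 45.0)  # constant for the whole sequence
    return [(radius, elevation, i * azim_step) for i in range(num_views)]
```

Each call produces one 24-view sequence; radius and elevation stay fixed while the azimuth sweeps the full circle.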

Finally, images belonging to the same semantic category can also be represented as (part of) a sequence.

Using ImageNet categories, groups of 2, 4, 8, or 16 images from the same category are concatenated into 16-image sequences.

Annotated images

In order to handle different types of image annotations in a unified way, the researchers chose to represent all annotations as images.

Certain data types, such as semantic segmentation maps, edge maps, and depth and surface-normal maps, are already represented this way.

For other data types, researchers have tailored different methods for each specific annotation type—

1. Object detection: Create annotations by overlaying color-coded bounding boxes around each object.

2. Human pose: Use MMPose, following the OpenPose format, to render human skeletons in pixel space.

3. Depth estimation, surface normal and edge detection: For a given ImageNet and COCO image, generate annotations according to a specific protocol.

4. Style transfer, rain removal, denoising, low-light enhancement and stereo datasets: These are represented as image pairs (e.g., input/output).

5. Colorization: Convert ImageNet images to grayscale to create image pairs.

6. Inpainting: Add random black boxes to images to simulate corruption, producing image pairs.

For all of the above annotation types, a visual sequence is created by concatenating 8 image pairs of the same annotation type into a 16-image sequence.

For a dataset containing k different annotations of the same image, a different approach is used: for each set of 1+k images (the input plus its k annotations), randomly select m elements, where m ≤ k+1 ≤ 16. These m-tuples are then concatenated to form a visual sequence.
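Both packing schemes are simple to sketch (hypothetical helpers; strings stand in for images and annotation maps):

```python
import random

def pair_sequence(pairs):
    """Concatenate 8 (image, annotation) pairs into one 16-image sequence."""
    assert len(pairs) == 8
    seq = []
    for img, ann in pairs:
        seq += [img, ann]
    return seq

def multi_annotation_tuple(image, annotations, m, seed=0):
    """From {image} plus its k annotations, pick a random m-tuple
    (m <= k+1 <= 16); tuples are later concatenated into sequences."""
    items = [image] + list(annotations)
    assert m <= len(items) <= 16
    return random.Random(seed).sample(items, m)
```

The first helper covers single-annotation datasets; the second covers datasets with several annotations per image.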

Annotated image sequence

Two complementary strategies are employed when converting annotated video data (VIPSeg, Hand14K, AVA, JHMDB) into visual sequences.

The first strategy is similar to the approach to processing pairwise annotated image data: each visual sequence is constructed by concatenating frames with their annotations – {frame1,annot1,frame2,annot2,…}.

The second groups multiple frames with their corresponding annotations: {frame1,frame2,annot1,annot2,…}.
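The two interleaving strategies can be sketched as follows (hypothetical helpers over frame/annotation placeholders):

```python
def interleave_per_frame(frames, annots):
    """Strategy 1: {frame1, annot1, frame2, annot2, ...}."""
    return [x for pair in zip(frames, annots) for x in pair]

def group_frames_then_annots(frames, annots, group=2):
    """Strategy 2: {frame1, frame2, annot1, annot2, ...}, in groups."""
    seq = []
    for i in range(0, len(frames), group):
        seq += frames[i:i + group] + annots[i:i + group]
    return seq
```

Both produce the same set of tokens in a different order, which is exactly what lets the model learn different frame-to-annotation alignments.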


Unlike text data, which naturally exhibits discrete sequence structures, modeling image pixels as visual sequences is not intuitive. In this work, the researchers took a two-stage approach:

1. Train a large visual tokenizer (operating on a single image) to convert each image into a series of visual tokens;

2. Train an autoregressive Transformer model on visual sequences, with each sequence represented as a series of tokens.

Image Tokenization

Although visual sequences exhibit sequential structure between consecutive images, there is no such natural sequence structure within a single image.

Therefore, to apply Transformer models to images, previous work typically does one of two things: either segment the image into patches in scanline order and treat that as a sequence, or use a pretrained image tokenizer, such as VQVAE or VQGAN, to quantize image features into discrete tokens, which are then arranged into a sequence in scanline order.

The researchers adopted the latter approach because the model’s discrete classification output naturally forms a probability distribution that can be easily sampled, making it possible to flexibly generate new images in visual sequences.

Specifically, the researchers used semantic tokens generated by the VQGAN model. The framework includes encoding and decoding mechanisms and features a quantization layer that assigns input images to a sequence of discrete tokens from an established codebook.

The encoder and decoder are entirely composed of convolutional layers. The encoder is equipped with multiple downsampling modules to compress the spatial dimensions of the input, while the decoder is equipped with an equal number of upsampling modules to restore the image to its original size.

For a given image, the researchers’ VQGAN tokenizer produces 256 discrete tokens.

It’s important to note that the researchers’ tokenizer operates on individual images independently, rather than processing the entire visual sequence at once.

This independence allows researchers to decouple tokenizer training from the downstream Transformer model, so that the tokenizer can be trained on a single image dataset regardless of the distribution of visual sequences.

Implementation details: The researchers adopted an off-the-shelf VQGAN architecture. A downsampling factor of f=16 and a codebook size of 8192 are used. This means that for an image of size 256 × 256, the researchers’ VQGAN tokenizer produces 16 × 16 = 256 tokens, each of which can take on 8192 different values.
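The token arithmetic is easy to check: with a downsampling factor of f=16, a 256 × 256 image yields a 16 × 16 grid of tokens. A trivial sketch:

```python
import math

def tokens_per_image(height=256, width=256, downsample=16):
    """Token count an f=16 VQGAN-style tokenizer emits per image."""
    return (height // downsample) * (width // downsample)

# With a codebook of 8192 entries, each token carries log2(8192) = 13 bits.
bits_per_token = int(math.log2(8192))
```

So each image becomes 256 tokens regardless of its visual content, which is what makes fixed-size sequence packing possible.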

The researchers found that tokenizers pretrained on ImageNet did not generalize well beyond ImageNet images. Therefore, they trained their own tokenizer on a 1.5-billion-image subset of the LAION 5B dataset.

Sequence modeling of visual sequences

After using VQGAN to convert images into discrete tokens, the researchers treated the visual sequence as a unified sequence by concatenating the discrete tokens of multiple images into a 1D sequence.

Importantly, we treat all visual sequences equally – we do not use any special tokens to indicate a specific task or format.

The researchers used cross-entropy loss to train a causal Transformer model with the goal of predicting the next token, similar to standard methods for language models. Training the model the same way for all visual sequences enables the model to infer relationships between images from context rather than from task- or format-specific tokens. This gives the model the opportunity to generalize to other unseen visual sequence structures.

Implementation details: The researchers segmented each image in the visual sequence into 256 tokens, and then connected them into a 1D token sequence.

Based on the visual token sequence, the researchers’ Transformer model is almost the same as the autoregressive language model, so the researchers adopted LLaMA’s Transformer architecture.

The researchers used a context length of 4096 tokens, enough to fit 16 images under their VQGAN tokenizer.

As in language models, a [BOS] (beginning-of-sequence) token is added at the start of each visual sequence and an [EOS] (end-of-sequence) token at the end, and sequences are concatenated during training to improve efficiency.
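The packing step can be sketched as follows (a hypothetical helper; string markers stand in for the real BOS/EOS token ids, and leftover tokens at the end of the stream are simply dropped here):

```python
BOS, EOS = "<bos>", "<eos>"

def pack_sequences(visual_sequences, context_len=4096):
    """Wrap each token sequence with BOS/EOS, then concatenate the stream
    and cut it into fixed-length training contexts."""
    stream = []
    for seq in visual_sequences:
        stream += [BOS] + seq + [EOS]
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]
```

Packing several short sequences into one context is what makes training on variable-length visual sequences efficient.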

We trained our model on the entire UVDv1 dataset (420 billion tokens) for a single epoch (adopting the single-epoch practice common in language-model training to avoid potential overfitting).

The researchers trained four models with different parameter numbers: 300 million, 600 million, 1 billion, and 3 billion, following the same training configuration.

Inference through visual prompting

Because the autoregressive Transformer in our model outputs a probability distribution of the next token based on the previous token, we can easily sample from this distribution to generate new visual tokens that complete the visual sequence.

To use the model for downstream tasks, one builds a partial visual sequence that defines the task at test time and lets the model generate the completion. This is similar to in-context learning in language models, or visual prompting in computer vision.
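In sketch form, inference reduces to repeatedly sampling from the model’s next-token distribution (here `model` is a hypothetical callable standing in for the trained Transformer):

```python
import random

def complete_sequence(model, prompt_tokens, n_new=256, seed=0):
    """Autoregressively sample n_new tokens to complete a visual prompt.

    `model` maps a token prefix to a probability distribution over the
    vocabulary; 256 new tokens correspond to one generated image."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = model(tokens)  # p(next token | prefix)
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens[len(prompt_tokens):]
```

The 256 sampled tokens are then decoded back into pixels by the VQGAN decoder.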

Experimental results and analysis

Finally, the researchers evaluated the model’s ability to scale and its ability to understand and answer a variety of prompt tasks.


We studied the scaling behavior of our model in terms of training loss and downstream task performance as the model size increases and the number of tokens seen during training increases.

Training loss: first, the researchers examined the training loss of LVMs of different parameter sizes, shown in the figure below.

Because all of our models were trained on the dataset for only one epoch, the models only saw each data sample once, so the training loss at any time during the training process was very similar to the validation loss.

It can be observed that as training proceeds:

1. The training loss (perplexity) of models of different sizes continues to decrease;

2. As the model size (parameter count) increases, the loss decreases faster. These observations indicate that LVM shows strong scalability to larger models and more data.

Although LVM’s overall loss scales well during training, there is no guarantee that a better overall model will also perform better on a specific downstream task.

Therefore, the researchers evaluated models of different sizes on 4 downstream tasks: semantic segmentation, depth estimation, surface normal estimation, and edge detection. The researchers evaluate these tasks on the ImageNet validation set.

For each task, the researchers gave 5 input images paired with their ground-truth annotations, plus a query image, as the prompt, and evaluated the model’s perplexity on the ground-truth annotation over the next 256 tokens (one image).
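A sketch of this evaluation setup (hypothetical helpers; perplexity is the exponential of the mean per-token negative log-likelihood):

```python
import math

def build_eval_prompt(pairs, query):
    """5 (input, ground-truth annotation) pairs followed by the query image."""
    prompt = [x for pair in pairs for x in pair]
    return prompt + [query]

def perplexity(token_nlls):
    """exp(mean NLL) over the 256 tokens of the ground-truth annotation."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

Lower perplexity means the model assigns higher probability to the correct annotation tokens.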

Below, we show that the larger model indeed achieves lower perplexity across all tasks, demonstrating that our scalable overall performance does translate to a range of downstream tasks.

Although LVM achieves better performance on larger models and more data, a natural question is whether each data component collected in UVDv1 helps.

To answer this question, we performed an ablation study on several 3B models trained on a subset of our dataset and compared their performance on downstream tasks.

The researchers used the same 4 downstream tasks and settings as before and presented the results in the figure below.

The researchers observed that each data component positively contributed to downstream tasks. LVM not only benefits from larger data, but also improves as the diversity in the dataset increases, including annotated and unsupervised image and video data.

Sequential prompts

The researchers first used the most intuitive and simple way to prompt LVM: sequential prompting. Here, prompt construction is trivial: show the model a sequence of 7 images and ask it to predict the next image (256 tokens).

For sequential prompts, the most straightforward task is video prediction. The figure below shows several examples of next-frame prediction prompted by sequences from the Kinetics-700 validation set.

In the top example, a 7-frame prompt (blue border) is followed by the predicted frame (red border). The researchers observed a degree of reasoning about spatial orientation, viewpoint, and objects. The prediction perplexity on the Kinetics validation set is 49.8.

The example below shows predictions with longer context (15 frames) and longer predictions (4 frames).

The same type of simple sequential prompt can be used in other ways as well. For example, the figure below shows that when prompted with a sequence of 3D rotations of a synthetic object around an arbitrary axis, the model can predict further rotations.

Or a list of items from a given category can be treated as a sequence, and the model asked to predict further items in that category, as shown in the figure below.

It is worth noting that although the system was trained on sets of images of the same ImageNet category, the cues here include sketches that have not appeared in any of the annotated data.

Next, the researchers looked at how much temporal context is needed to accurately predict subsequent frames.

The researchers evaluated the model’s frame generation perplexity under contextual cues of varying lengths (1 to 15 frames). As shown in the figure below, on the Kinetics-700 validation set, the perplexity improved significantly from frames 1 to 11 and then stabilized (from 62.1 → 48.4).

Analogy prompts

The researchers next evaluated a more complex prompting structure, which they call “analogy prompting.” This approach challenges the model to understand analogies of arbitrary length and complexity, testing its higher-level interpretive abilities.

The figure below shows a sample of qualitative results using analogy prompts on multiple tasks. Each prompt consists of a sequence of 14 images giving examples of a task, followed by a 15th query image; given the prompt, the model predicts the next image.

The upper part of the figure shows several example prompts that define the task in the training set (but these actual images are never seen in training). The lower part of the figure shows generalization to a task never shown in training.

The researchers show results on keypoint detection on Pascal 3D+, using the standard Percent Correct Keypoints (PCK) metric with a threshold of 0.1. Notably, LVM achieved a PCK of 81.2 without training on this dataset, showing impressive generalization capabilities.

In comparison, the researchers demonstrated some existing task-specific models: StackedHourglass achieved a PCK of 68.0, MSS-Net achieved a PCK of 68.9, and StarMap achieved a PCK of 78.6.
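For reference, the PCK metric itself is straightforward: a predicted keypoint counts as correct if it falls within a fraction alpha of the object (bounding-box) size from the ground truth. A minimal sketch, assuming 2D keypoints and a scalar box size:

```python
def pck(pred, gt, bbox_size, alpha=0.1):
    """Percentage of Correct Keypoints at threshold alpha * bbox_size."""
    thresh = alpha * bbox_size
    correct = sum(1 for (px, py), (gx, gy) in zip(pred, gt)
                  if ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= thresh)
    return 100.0 * correct / len(gt)
```

With alpha = 0.1, a keypoint on a 50-pixel object must land within 5 pixels of the ground truth to count.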

Comparison with visual prompting

The closest prior approach, which also allows arbitrary task definition, is visual prompting. In the table below, the researchers compare several visual prompting models on few-shot segmentation, object detection, and colorization; the sequential LVM outperforms previous methods on nearly every task.

Combining tasks

The figure below demonstrates combining multiple tasks in a single prompt: the researchers presented a rotation task together with a new keypoint-correspondence task and asked the model to continue the pattern. At test time the model successfully combined the two, showing a degree of compositionality.

Other types of prompts

The researchers probed how far the model could go by giving it a variety of prompts unlike any it had seen before.

The figure below shows some of these prompts working quite well.

The figure below shows some prompts that are not easily described in words: exactly the type of task where LVM may ultimately outperform LLMs.


  • abs/2312.00785