HKU and ByteDance Introduce a New Approach to Multimodal Large Models, Simulating Human Perception and Cognition for Accurate Object Localization in Images

Multimodal large language models (MLLMs) have demonstrated strong cognitive understanding across a wide range of visual tasks. However, most of them are limited to one-way image understanding: they struggle to map the content they understand back onto the image.

For example, a model can easily tell what objects are in a picture, but it cannot accurately pinpoint where those objects are.


This lack of localization capability directly limits the application of large multimodal models in downstream fields such as image editing, autonomous driving, and robot control.

To address this problem, researchers from HKU and ByteDance's commercialization team have proposed a new paradigm, Groma, which improves the perception and localization capabilities of large multimodal models through regional image encoding.

With localization built in, Groma can directly associate text content with image regions, significantly improving the interactivity and specificity of its conversations.


Core idea

How to give large multimodal models the ability to locate objects, and even to ground their text output in image regions, is a major focus of current research.

A common approach is to fine-tune the large language model so that it directly outputs object coordinates (a minimal sketch of this baseline follows the list below). However, this approach has several limitations:

1. Large language models pre-trained on text do not have spatial understanding capabilities, and it is difficult to accurately locate objects by fine-tuning with only a small amount of data.

2. The localization task has high requirements on the resolution of the input image, but increasing the resolution will significantly increase the computational complexity of the multimodal large model.

3. The output form of large language models is not suitable for handling fine-grained localization tasks such as segmentation.
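
For intuition, here is a minimal, purely illustrative sketch of this coordinates-as-text baseline (not Groma's method): the box corners are quantized and serialized into the string the language model must generate. The function name, bin count, and `<box>` tag format are assumptions for illustration only.

```python
# NOT Groma's method: the coordinates-as-text baseline critiqued above.
def box_to_text(box, image_size, num_bins=1000):
    """Quantize a pixel-space box and serialize it into the LLM's output string."""
    w, h = image_size
    x1, y1, x2, y2 = box
    # Normalize corners to [0, num_bins) so each coordinate becomes a short token.
    q = [int(x1 / w * num_bins), int(y1 / h * num_bins),
         int(x2 / w * num_bins), int(y2 / h * num_bins)]
    return "<box>" + ",".join(str(v) for v in q) + "</box>"

# The language model is expected to emit this string verbatim, so localization
# accuracy hinges entirely on the LLM itself.
print(box_to_text((120, 80, 360, 300), image_size=(640, 480)))  # <box>187,166,562,625</box>
```

Because the coordinates live inside the generated text, localization accuracy depends entirely on the language model, which is exactly what points 1 through 3 above call into question.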

Based on these considerations, Groma proposes to shift localization into the vision tokenizer of the multimodal large model: the tokenizer discovers and locates potential objects, then hands them to the large language model for recognition.

At the same time, this design makes full use of the vision tokenizer's own spatial understanding, without needing an external expert model (such as SAM) to assist with localization, thus avoiding the redundancy of an external model.

Specifically, Groma adds region encoding on top of global image encoding to realize localization. As shown in the figure below, Groma first uses a Region Proposer to locate potential objects, and then encodes each located region into a region token through a Region Encoder.

The large language model can determine the corresponding region based on the semantics of the region token, and insert the region token into the output to achieve a hyperlink-like effect, thus realizing visually grounded conversation.
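
To make the data flow concrete, below is a minimal sketch of the Region Proposer → Region Encoder → region token pipeline described above. The module name, feature dimensions, and the simple crop-and-average pooling are illustrative assumptions, not Groma's actual implementation.

```python
import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    def __init__(self, feat_dim=1024, llm_dim=4096):
        super().__init__()
        # Project a pooled region feature into the language model's embedding space.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feature_map, boxes):
        # feature_map: (C, H, W) from the vision tokenizer;
        # boxes: list of (x1, y1, x2, y2) in feature-map coordinates,
        # e.g. proposals produced by the Region Proposer.
        tokens = []
        for x1, y1, x2, y2 in boxes:
            region = feature_map[:, y1:y2, x1:x2]   # crop the proposed region
            pooled = region.mean(dim=(1, 2))        # average-pool to a single vector
            tokens.append(self.proj(pooled))        # one region token per box
        return torch.stack(tokens)                  # (num_boxes, llm_dim)

# Toy usage: two proposed boxes become two region tokens that the language model
# can later reference in its output, e.g. "a dog <r1> chasing a ball <r2>".
feature_map = torch.randn(1024, 64, 64)
boxes = [(5, 5, 20, 20), (30, 10, 60, 40)]
region_tokens = RegionEncoder()(feature_map, boxes)
print(region_tokens.shape)  # torch.Size([2, 4096])
```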

Similarly, a region specified by the user can also be encoded into a region token through the Region Encoder and inserted into the user instruction, so that the multimodal model focuses on the specified region and generates targeted answers.
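
A hypothetical illustration of this referential input, assuming a `<region>` placeholder in the instruction is replaced by the region-token embedding (the placeholder name and dimensions are made up for this sketch):

```python
import torch

llm_dim = 4096
instruction = ["What", "is", "<region>", "used", "for", "?"]

# Stand-in word embeddings; in practice these come from the LLM's embedding table.
word_embeds = {tok: torch.randn(llm_dim) for tok in instruction if tok != "<region>"}
# Stand-in region token; in practice this comes from the Region Encoder
# applied to the user-specified box.
user_region_token = torch.randn(llm_dim)

# Splice the region token into the embedded instruction at the placeholder position.
inputs = torch.stack([user_region_token if tok == "<region>" else word_embeds[tok]
                      for tok in instruction])
print(inputs.shape)  # torch.Size([6, 4096])
```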

To improve the robustness and accuracy of localization, Groma pre-trains the Region Proposer on more than 8M samples of data (including SA1B). As a result, the proposals it generates cover not only common objects, but also object parts and broader background elements.

In addition, thanks to this decoupled design, Groma can feed high-resolution feature maps to the Region Proposer/Encoder while feeding low-resolution feature maps to the large language model, reducing computation without sacrificing localization performance.
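
A rough sketch of this dual-resolution idea, with sizes and the pooling operator chosen arbitrarily for illustration: the full-resolution feature map serves region proposal and encoding, while a downsampled copy is flattened into the image tokens the language model consumes.

```python
import torch
import torch.nn.functional as F

# High-resolution features for the Region Proposer / Region Encoder.
high_res = torch.randn(1, 1024, 64, 64)

# Downsampled copy for the language model; the pooling choice is arbitrary here.
low_res = F.avg_pool2d(high_res, kernel_size=4)      # (1, 1024, 16, 16)

# Flatten the low-res map into image tokens: 256 tokens enter the LLM
# instead of 4096, while localization still uses the full 64x64 grid.
image_tokens = low_res.flatten(2).transpose(1, 2)    # (1, 256, 1024)
print(high_res.shape, image_tokens.shape)
```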

Experimental Results

Groma outperforms MiniGPT-v2 and Qwen-VL on traditional Grounding Benchmarks.

At the same time, as a general-purpose multimodal large model, Groma verified its conversational and reasoning capabilities on the VQA benchmark (LLaVA-COCO).

In the visual comparison, Groma also shows higher recall and fewer hallucinations.

In addition, Groma supports referential dialogue and grounded chat, which integrate conversational and localization capabilities.

Thanks to the powerful cognitive reasoning capabilities of large language models, large multimodal models perform outstandingly in visual understanding tasks.

However, some traditional visual tasks, such as detection, segmentation, and depth estimation, rely more on visual perception, which is exactly what large language models lack.

Groma offers a new solution to this problem: decoupling perception from cognition, with the vision tokenizer responsible for perception and the large language model responsible for cognition.

This form of perception followed by cognition is not only more consistent with human visual processes, but also avoids the computational overhead of retraining large language models.

On May 15, ByteDance announced its self-developed Doubao large model, which provides multimodal capabilities and powers 50+ businesses such as the Doubao App, Coze, and Jimeng. It is also available to enterprise customers through Volcano Engine to help companies improve efficiency and accelerate intelligent innovation. At present, the Doubao App has become the AIGC application with the largest number of users in the Chinese market. ByteDance continues to increase its investment in top talent and cutting-edge technologies, participating in the industry's top technical challenges and breakthroughs.

Project website:

  • https://groma-mllm.github.io

Paper link:

  • https://arxiv.org/abs/2404.13013

Open-source code:

  • https://github.com/FoundationVision/Groma

