Who gives you the confidence to use CPUs for AI inference?

“In the training phase of large models we chose GPUs, but in the inference phase we have decisively added CPUs to the menu.”

In recent exchanges with industry insiders, Qubit found that many of them have begun to express views like the one above.


Coincidentally, Hugging Face's official optimization tutorials include several articles focusing on “how to use CPUs to efficiently run inference for large models”:

And a careful read of the tutorials shows that this approach of using CPUs to accelerate inference covers not only large language models, but also large multi-modal models that work with images, audio, and more.
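As a rough illustration of what that looks like (this is a minimal sketch, not an excerpt from the Hugging Face tutorials, and the model name is a placeholder), running a text-generation model on the CPU with the transformers library can be as simple as:

```python
import torch
from transformers import pipeline

# Minimal sketch: the model id is hypothetical, and bfloat16 support depends
# on the CPU and the PyTorch build actually in use.
generator = pipeline(
    "text-generation",
    model="a-placeholder/llm-model",   # hypothetical model id
    torch_dtype=torch.bfloat16,        # lower-precision weights to cut memory use on CPU
    device="cpu",                      # run inference on the CPU, no GPU required
)

print(generator("Why run large-model inference on a CPU?", max_new_tokens=64)[0]["generated_text"])
```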



Not only that: mainstream frameworks and libraries such as TensorFlow and PyTorch have also been continuously optimized to offer efficient inference paths on CPUs.
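To make this concrete, here is a hedged sketch of common CPU-side knobs in PyTorch (the thread count and the toy model are placeholders; actual gains depend on the hardware and the PyTorch version):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for a real network.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()

torch.set_num_threads(32)  # match the number of physical cores available (placeholder value)

x = torch.randn(8, 1024)
with torch.inference_mode():          # skip autograd bookkeeping during inference
    compiled = torch.compile(model)   # optional: let TorchInductor generate fused CPU kernels
    y = compiled(x)

print(y.shape)
```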

In this way, while GPUs and other dedicated accelerators dominate the world of AI training, the CPU seems to have carved out a “new path” in inference, including large-model inference, and the related discussion has grown increasingly lively.


Why this is happening is closely tied to the development trajectory of large models.

Since the advent of ChatGPT set off the AIGC wave, players at home and abroad first focused on training, in a lively “battle of a hundred models”; now that training has matured, the major models have moved into the application stage.

Even NVIDIA stated in its latest quarterly financial report that AI inference accounted for 40% of the US$18 billion in data center revenue.

Clearly, inference is gradually becoming the main theme of the large-model story, especially as models move into real-world deployment.

Why pick the CPU for inference?

To answer this question, we might as well work backwards from the results and see how the “players” who have already deployed CPUs for AI inference are using them.

Let's invite two heavyweight players: JD Cloud and Intel.

This year, JD Cloud launched a new generation of servers equipped with fifth-generation Intel Xeon Scalable processors.

First, let's look at the CPU that powers this new server.

If you had to describe this latest generation of Intel® Xeon® Scalable processors in one sentence, the performance numbers say it best:

Compared with its predecessor, the fourth-generation Xeon® Scalable processors, which carry the same built-in AI acceleration technology (AMX, Advanced Matrix Extensions), its real-time deep learning inference performance is up to 42% higher; compared with the third-generation Xeon® Scalable processors, whose built-in AI acceleration technology was DL Boost (Deep Learning Boost), its AI inference performance is up to 14 times higher.

At this point, it is worth walking through the two stages that Intel® Xeon® built-in AI acceleration has gone through:

The first stage is optimized for vector operations.

It began with the introduction of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set in the first-generation Xeon® Scalable processors in 2017: with vector operations, a single CPU instruction can operate on multiple pieces of data at once.

The Vector Neural Network Instructions (VNNI, the core of DL Boost) introduced in the second and third generations went further, fusing the multiply and accumulate operations that previously required three separate instructions into one, which improves the utilization of compute resources, makes better use of the high-speed cache, and avoids potential bandwidth bottlenecks.
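As a purely conceptual sketch (plain NumPy here, not actual VNNI intrinsics), the operation VNNI fuses is an INT8 multiply-accumulate that sums into 32-bit integers, roughly:

```python
import numpy as np

# Conceptual illustration only: VNNI fuses this multiply-accumulate pattern,
# which previously took three vector instructions, into a single instruction.
a = np.random.randint(0, 255, size=1024, dtype=np.uint8)    # e.g. quantized activations
w = np.random.randint(-128, 127, size=1024, dtype=np.int8)  # e.g. quantized weights

# Multiply 8-bit values and accumulate into a 32-bit integer to avoid overflow.
acc = np.sum(a.astype(np.int32) * w.astype(np.int32), dtype=np.int32)
print(acc)
```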


The second stage, which is the current stage, is optimized for matrix operations.

So, starting with the fourth-generation Xeon® Scalable processors, the protagonist of the built-in AI acceleration technology became Intel® Advanced Matrix Extensions (Intel® AMX). It is optimized specifically for matrix multiplication, the most common operation in deep learning models, and supports common data types such as BF16 (training/inference) and INT8 (inference).

Intel® AMX consists mainly of two components: dedicated Tile registers that hold large blocks of data, and a TMUL acceleration engine that performs the matrix multiply operations. Some people liken it to a Tensor Core built into the CPU, which is indeed a vivid comparison.

In this way, it can not only compute larger matrices in a single operation, but also retains room to scale and extend.

Intel® AMX sits on every core of the Xeon® CPU, close to system memory, which reduces data-transfer latency, increases effective bandwidth, and lowers the complexity of putting it to use.
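As a hedged sketch (not an official Intel example; whether AMX actually kicks in depends on the CPU, the oneDNN backend, and the PyTorch build), running a model in BF16 on the CPU is the typical way to let these matrix units do the work:

```python
import torch
import torch.nn as nn

# Placeholder layer; in practice this would be a real deep learning model.
model = nn.Linear(4096, 4096).eval()
x = torch.randn(16, 4096)

# BF16 autocast on CPU routes matmuls through oneDNN, which can dispatch to
# AMX tile instructions on 4th/5th-gen Xeon (assumption: such a CPU is present).
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```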

For example, when a model with no more than 20 billion parameters is “fed” to the fifth-generation Xeon® Scalable processors, latency can be kept under 100 milliseconds!


Second, let's look at JD Cloud's new generation of servers.

According to reports, on the fifth-generation Intel® Xeon® Scalable processors jointly customized and optimized by JD.com and Intel, Llama2-13B inference performance (token generation speed) has increased by 51%, enough to meet the demands of AI scenarios such as question answering, customer service, and document summarization.

△ Llama2-13B inference performance test data

For models with more parameters, even the 70B-parameter Llama2, fifth-generation Intel Xeon Scalable processors can still get the job done.
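For readers who want to reproduce this kind of “token generation speed” figure on their own hardware, here is a minimal, hedged sketch of the usual measurement (the model id and prompt are placeholders; this is not JD Cloud's or Intel's benchmark harness):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "a-placeholder/llama-like-model"  # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

inputs = tokenizer("Summarize the advantages of CPU inference.", return_tensors="pt")

with torch.inference_mode():
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")  # the throughput figure usually reported
```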

It is fair to say that the CPU's built-in AI acceleration has matured to the point where its inference performance is sufficient for real-world workloads.

An AI acceleration solution built on general-purpose servers like this can be used not only for model inference, but also to flexibly serve data analysis, machine learning, and other applications. To put it boldly, a single server can support the entire process of building and developing AI applications on one platform.

Beyond that, using CPUs for AI inference also brings the CPU's inherent advantages, such as cost, and, more importantly, efficiency in deployment and day-to-day practice.

Because the CPU is a standard component, almost every server and computer already has one, and traditional businesses already run a large number of existing CPU-based applications.

This means that choosing the CPU for inference keeps hardware easy to obtain, avoids having to introduce heterogeneous hardware platforms or build up the corresponding talent pool, and makes technical support and maintenance easier to come by.

Take the medical industry as an example: CPUs have long been widely used in electronic medical record systems, hospital resource planning systems, and the like, with mature technical teams and well-established procurement processes already in place.

Building on this, Weining Health, a leading medical informatization company, has used CPUs to build WiNEX Copilot, a solution that can be deployed and applied efficiently and at low cost. It has been deeply integrated into Weining's new-generation WiNEX products, so hospitals that have adopted the system can quickly roll out this “doctor's AI assistant”.

Its medical-record assistant function alone can process nearly 6,000 medical records in the 8 hours after doctors get off work, equivalent to the daily workload of 12 physicians at a tertiary hospital!


And, as mentioned above, judging from the optimization tutorials Hugging Face provides, it only takes a few simple steps to get efficient inference running on a CPU.

Simple optimization and quick start-up are another advantage of the CPU in the process of implementing AI applications.

This means that in any scenario, large or small, once a CPU-based optimization achieves a single successful breakthrough, it can be replicated and extended quickly and accurately. The result: more users can put AI applications into practice in the same or similar scenarios, faster and at better cost.

After all, Intel is not just a hardware company; it also has a huge software team that accumulated a wealth of optimization methods and tools in the traditional deep learning era, such as the OpenVINO toolkit widely used in industrial, retail, and other sectors.
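As a hedged illustration of what that looks like in practice (the model path is a placeholder, and API details vary between OpenVINO releases), compiling a model for the CPU with the OpenVINO runtime is roughly:

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")         # placeholder path to an OpenVINO IR model
compiled = core.compile_model(model, "CPU")  # target the CPU device explicitly

# Run one inference request with dummy input shaped to the model's first input.
input_tensor = np.random.rand(*compiled.input(0).shape).astype(np.float32)
result = compiled([input_tensor])[compiled.output(0)]
print(result.shape)
```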

In the large-model era, Intel has also worked closely with mainstream large models such as Llama 2, Baichuan, and Qwen. Taking the Intel® Extension for Transformers toolkit as an example, it can accelerate large-model inference performance by up to 40 times.
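As a hedged sketch in the style of the project's published quick start (the exact import path, model id, and the load_in_4bit flag here should all be treated as assumptions rather than a verified excerpt), weight-only quantization for CPU inference with this toolkit looks roughly like:

```python
from transformers import AutoTokenizer
# Assumption: this import path follows the project's documented quick start.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_id = "a-placeholder/chat-model"   # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumption: load_in_4bit enables weight-only quantization tuned for CPU inference.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

inputs = tokenizer("Hello, CPU inference!", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(inputs, max_new_tokens=32)[0]))
```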

Besides, the clear trend for large models now is a rapid proliferation of applications, so how to land and use this endless stream of new applications “quickly, well, and cheaply” has become the key question.

With all this in mind, it is not hard to understand why more and more people are choosing the CPU for AI inference.

Perhaps we can also quote what Intel CEO Pat Gelsinger said in a media interview at the end of 2023 to drive the point home:

“Looking at inference applications from an economic point of view, I would not build a backend environment made up entirely of $40,000 H100s, because it consumes too much power and requires building new management and security models, as well as new IT infrastructure.”

“If I could run these models on a standard Intel chip, I wouldn't have these problems.”

AI Everywhere

Looking back at 2023, large models themselves were the absolute center of conversation in the AI world.

But at the start of 2024, the obvious trend is that, as the technologies advance, application adoption across industries is accelerating, blossoming on many fronts at once.

Against this backdrop, it is foreseeable that more demand for AI inference will emerge, and inference's share of overall AI compute demand will only grow.

Take AI video generation, represented by Sora: the industry speculates that its training compute requirement is actually lower than that of a large language model, while its inference compute requirement is hundreds or even thousands of times higher.

Other acceleration and optimization work needed to bring AI video applications to production, such as video transmission, is also a specialty of the CPU.

So taken together, the positioning of the CPU in the entire Intel AI Everywhere vision is clear:

It fills in where GPUs or dedicated accelerators cannot reach or fall short, offering flexible compute options for more diverse and complex scenarios, and, while strengthening general-purpose computing, it has become an important piece of infrastructure for making AI ubiquitous.

