Google Announces Plan to Invest Over $100 Billion in AI Training, Outpacing OpenAI Stargate and Apollo Moon Landing Costs

In a recent interview, Yann LeCun personally confirmed that Meta has spent US$30 billion on Nvidia GPUs, more than the Apollo moon landing cost.

In comparison, the “Stargate” supercomputer that Microsoft and OpenAI plan to build carries a US$100 billion price tag, and Google DeepMind CEO Demis Hassabis made a bold statement: Google will invest even more than that! Big tech companies are becoming less and less restrained about burning money; after all, the prospect of AGI is just too attractive.


Just now, Meta's chief AI scientist Yann LeCun confirmed it: Meta has spent US$30 billion on Nvidia GPUs, a cost that exceeds the Apollo moon landing program!

Although US$30 billion is staggering, it is still small change compared with the US$100 billion “Stargate” that Microsoft and OpenAI plan to build.

Google DeepMind CEO Demis Hassabis went even further: Google plans to invest more than that.



LeCun: Meta’s NVIDIA GPU purchases really did cost more than the Apollo moon landing

To develop AI, Meta is going all in.

In the interview, the host asked: Meta is said to have purchased 500,000 NVIDIA GPUs, which at market prices comes to US$30 billion, so the total cost is higher than the Apollo moon landing program, right?

LeCun admitted it: yes, that's right. He added, “Not only training, but also deployment costs. The biggest problem we face is the supply of GPUs.”

Some doubted whether this can be true: as the largest inference operation in history, Meta surely is not spending all of that money on training.

Others punctured the bubble, saying every giant exaggerates to create the illusion that “they have more GPUs”——

While a great deal of money really is invested in NVIDIA hardware, only a small share of it is actually used to train models; the claim that “we have millions of GPUs” simply sounds like bragging.

Of course, some raised an objection: accounting for inflation, the cost of the Apollo program should be closer to US$200-250 billion.

Indeed, one calculation that takes the Apollo program's original 1969 cost and adjusts it for inflation puts the total at either $217 billion or $241 billion, depending on the method.

https://apollo11space.com/apollo-program-costs-new-data-1969-vs-2024/
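
As a rough sanity check on that adjustment, scaling a nominal program cost by a consumer-price-index ratio reproduces a figure in the same range. The nominal cost and CPI values below are approximate assumptions of mine, not numbers taken from the linked article:

```python
# Hedged inflation check: nominal Apollo cost scaled by a CPI ratio.
# The nominal figure and CPI values are approximate assumptions.
apollo_nominal = 25.8e9            # ~US$25.8B total program cost in 1960s dollars
cpi_1969, cpi_2024 = 36.7, 310.0   # approximate annual-average CPI-U values
adjusted = apollo_nominal * (cpi_2024 / cpi_1969)
print(f"~${adjusted / 1e9:.0f}B in 2024 dollars")  # ~$218B, close to the $217B estimate
```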

Wharton professor Ethan Mollick noted that while it is far less than the Apollo project, in today's dollars Meta's GPU spending is almost as much as the Manhattan Project.

Still, netizens said they were happy to get a glimpse of the giant's AI infrastructure: the power, land, and racks needed to house 1 million GPUs.

Open-source Llama 3 is a big success

Meta has also achieved outstanding results with Llama 3. In developing it, the team focused on four levels of consideration:

Model architecture

In terms of architecture, the team uses a dense autoregressive Transformer, adding a grouped-query attention (GQA) mechanism and a new tokenizer.
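
To make the GQA idea concrete, here is a minimal PyTorch sketch of how several query heads can share one key/value head; the head counts and dimensions are illustrative assumptions, not Llama 3's actual configuration:

```python
import torch

def grouped_query_attention(x, wq, wk, wv, n_q_heads=32, n_kv_heads=8):
    """Toy grouped-query attention: n_q_heads query heads share n_kv_heads K/V heads,
    shrinking the K/V projections (and the KV cache) by n_q_heads / n_kv_heads."""
    B, T, D = x.shape
    head_dim = D // n_q_heads
    q = (x @ wq).view(B, T, n_q_heads, head_dim).transpose(1, 2)   # (B, Hq, T, d)
    k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)  # (B, Hkv, T, d)
    v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    # Repeat each K/V head so every group of query heads attends to its shared copy.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, T, D)

# Example with made-up sizes: a 4096-dim model, 32 query heads, 8 K/V heads.
x = torch.randn(1, 16, 4096)
wq = torch.randn(4096, 4096) * 0.02
wk = torch.randn(4096, 8 * 128) * 0.02   # K/V projections are 4x smaller than wq
wv = torch.randn(4096, 8 * 128) * 0.02
print(grouped_query_attention(x, wq, wk, wv).shape)  # torch.Size([1, 16, 4096])
```

The payoff comes at inference time: the KV cache only has to store the 8 shared K/V heads rather than all 32, cutting memory traffic during decoding.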

Training data and computing resources

Because the training run consumed more than 15 trillion tokens, the team built two compute clusters, each with 24,000 H100 GPUs.
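
For a sense of scale, a common back-of-envelope rule puts dense-Transformer training compute at roughly 6 × parameters × tokens. The sketch below applies it to a 70B-parameter model on 15T tokens; the throughput and utilization figures are my own assumptions, not Meta's numbers:

```python
# Back-of-envelope training time using the ~6 * params * tokens FLOPs rule.
# All hardware numbers are assumptions for illustration, not official figures.
params = 70e9            # 70B-parameter model
tokens = 15e12           # >15 trillion training tokens
train_flops = 6 * params * tokens          # ~6.3e24 FLOPs

peak_per_gpu = 1e15      # assume ~1 PFLOP/s peak BF16 per H100
utilization = 0.4        # assumed model FLOPs utilization (MFU)
gpus = 24_000            # one of the two clusters mentioned above

days = train_flops / (gpus * peak_per_gpu * utilization) / 86_400
print(f"~{days:.0f} days on a single 24k-H100 cluster")  # roughly a week at these assumptions
```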

Instruction fine-tuning

In practice, a model's quality depends largely on the post-training stage, which is also the most time-consuming and labor-intensive part.

To this end, the team scaled up its human-annotated SFT data (to 10 million examples) and adopted techniques such as rejection sampling, PPO, and DPO, trying to strike a balance between usefulness, human-like behavior, and the large-scale data used in pre-training.
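
Of the post-training techniques listed, DPO is the easiest to show compactly: it optimizes the policy directly on preference pairs, anchored to a frozen reference model, with no separate reward model. A minimal sketch follows; the inputs are per-response summed token log-probs, and beta is an illustrative value:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each tensor holds summed token log-probs of the chosen/rejected response
    under the trainable policy (pi_*) or the frozen reference model (ref_*).
    """
    # Implicit rewards: how far the policy has moved away from the reference.
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # Logistic loss pushes the chosen response's reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```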

Now, judging by the latest code-generation evaluations, this series of explorations by the Meta team has clearly paid off.

After a comprehensive evaluation of more than 130 LLMs, including GPT-3.5/4, Llama 3, Gemini 1.5 Pro, and Command R+, Symflower CTO and founder Markus Zimmermann declared: “The throne of large language models belongs to Llama 3 70B!”

– Achieves 100% test coverage and 70% code quality

– The most cost-effective inference

– Open model weights

It is worth noting, however, that GPT-4 Turbo is the undisputed winner on raw performance, scoring a perfect 150 points.

As can be seen, GPT-4 (150 points, $40/million tokens) and Claude 3 Opus (142 points, $90/million tokens) perform very well, but they cost 25 to 55 times more than Llama, Wizard, and Haiku.

Specifically, in Java, Llama 3 70B identified a constructor test case that is easy to overlook, a find that was both unexpected and valuable.

Additionally, it produces high-quality test code 70% of the time.

GPT-4 Turbo tends to add obvious comments to the test code it generates, something that is usually avoided in high-quality code.

The quality of the test code is greatly affected by fine-tuning: in performance tests, WizardLM-2 8x22B outperformed Mixtral 8x22B-Instruct by 30%.

When it comes to generating compilable code, smaller models such as Gemma 7B, Llama 3 8B, and WizardLM 2 7B performed poorly, while Mistral 7B did well.

After evaluating 138 LLMs, the team found that about 80 of them were unreliable even at generating simple test cases.

A score below 85 means the model's performance is not satisfactory. However, the figure above does not capture all of the review's findings and insights, and the team expects to add more in the next version.

A detailed evaluation can be found in the article below:

Evaluation address: en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Winning the artificial intelligence war will be extremely expensive

Today, major technology companies are paying a high price to win this AI war.

How much money do technology giants need to spend to make AI smarter?

Demis Hassabis, head of Google DeepMind, made a prediction at a TED conference two weeks ago: Google is expected to invest more than $100 billion in developing AI.

As the leader of DeepMind, the center and soul of Google's AI effort, Hassabis was also signaling that he has no intention of being outdone by OpenAI.

According to The Information, Microsoft and OpenAI plan to spend $100 billion to build “Stargate.” This supercomputer is expected to contain millions of dedicated server chips to power more advanced models such as GPT-5 and GPT-6.

When asked about the huge sums competitors are spending on supercomputers, Hassabis replied offhandedly that Google may spend even more than that.

“We won't talk about specific numbers now, but I think that over time, our investment will exceed this number.”

Today, the craze for generative AI has triggered a huge investment boom.

According to Crunchbase data, AI startups alone raised nearly $50 billion in funding last year.

Hassabis's remarks show that competition in the AI field has no intention of slowing down and will only become more intense.

Google, Microsoft, and OpenAI are all fiercely competing to be the first to reach AGI.

$100 billion is a crazy number.

With more than US$100 billion going into AI technology, where will all that money actually be spent?

First of all, the bulk of development cost goes to chips.

At present, Nvidia remains the leader in this area; Google's Gemini and OpenAI's GPT-4 Turbo still rely heavily on third-party chips such as Nvidia GPUs.

The cost of training models is also climbing steadily.

The annual AI Index report previously released by Stanford pointed out: “The training cost of SOTA models has reached unprecedented levels.”

Report data shows that GPT-4 used “approximately US$78 million worth of compute” for training, while the compute used to train GPT-3 in 2020 cost only US$4.3 million.

Meanwhile, the training cost of Google Gemini Ultra is $191 million.

The original Transformer technology behind these AI models cost only about $900 to train in 2017.

The report also noted that there is a direct correlation between the cost of training an AI model and its computational requirements.
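
That correlation is roughly what you would expect if cost scales linearly with GPU-hours. As a hedged illustration (every constant below is my own assumption, not the report's methodology), a GPT-4-scale compute budget converts into a dollar figure like this:

```python
# Illustrative cost model: cost ≈ GPU-hours * hourly rental rate.
# Every constant here is an assumption, not a figure from the AI Index report.
flops_needed = 2.1e25      # rough public estimate of a GPT-4-scale compute budget
per_gpu = 3e14             # assumed A100-class effective peak (FLOP/s)
utilization = 0.35         # assumed fraction of peak actually sustained
dollars_per_gpu_hour = 2.0 # assumed bulk rental price (USD)

gpu_hours = flops_needed / (per_gpu * utilization * 3600)
print(f"~${gpu_hours * dollars_per_gpu_hour / 1e6:.0f}M")  # same order as the report's $78M
```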

If AGI is the goal, costs are likely to skyrocket.

$190 million: From Google to OpenAI, how much it costs to train an AI model

The recent Artificial Intelligence Index Report revealed the staggering costs of training the most complex AI models to date.

Let’s dig into the breakdown of these costs and explore what they mean.

Transformer (Google): $930

The Transformer model is one of the pioneering architectures of modern AI, and this relatively modest cost highlights the efficiency of early AI training methods.

Its cost serves as a benchmark for understanding the field's progress in terms of model complexity and associated costs.

BERT-Large (Google): $3,288

The BERT-Large model is significantly more expensive to train than its predecessor.

Known for its bidirectional pre-training of contextual representations, BERT has made significant progress in natural language understanding. However, this progress comes at a higher financial cost.

RoBERTa Large (Meta): $160K

The jump in training cost of RoBERTa Large, a variant of BERT optimized for robust pre-training, reflects the increasing computational requirements as models become more complex.

This dramatic increase highlights the rising costs associated with pushing the boundaries of artificial intelligence capabilities.

LaMDA (Google): $1.3M

LaMDA is designed for natural language conversation and represents a shift toward more specialized AI applications.

The substantial investment required to train LaMDA highlights the growing need for AI models tailored to specific tasks, which require more extensive fine-tuning and data processing.

GPT-3 175B (davinci) (OpenAI): $4.3M

Known for its massive scale and impressive language generation capabilities, GPT-3 represents an important milestone in the development of AI.

The cost of training GPT-3 reflects the enormous computational power required to train a model of this size, highlighting the trade-off between performance and affordability.

Megatron-Turing NLG 530B (Microsoft/NVIDIA): $6.4M

The cost of training Megatron-Turing NLG illustrates the trend toward larger models with hundreds of billions of parameters.

This model pushes the boundaries of AI capabilities but brings staggering training costs with it, significantly raising the bar and widening the gap between industry leaders and smaller players.

PaLM (540B) (Google): $12.4M

PaLM has a large number of parameters and represents the pinnacle of AI scale and complexity.

The astronomical cost of training PaLM illustrates the huge investments required to push the boundaries of AI research and development, and raises questions: Are such investments really sustainable?

GPT-4 (OpenAI): $78.3M

The estimated training cost of GPT-4 also marks a paradigm shift in the economics of artificial intelligence – the cost of training AI models has reached unprecedented levels.

As models become larger and more complex, the economic barriers to entry keep escalating, which threatens to limit both innovation and broad access to AI technology.

Gemini Ultra (Google): $191.4M

The staggering cost of training Gemini Ultra reflects the challenges posed by extremely large-scale AI models.

While these models have demonstrated groundbreaking capabilities, their training costs have reached astronomical levels, shutting out all but the most well-funded companies.

Chip race: Microsoft, Meta, Google and Nvidia compete for AI chip supremacy

Although Nvidia leads the chip field thanks to its long-term positioning, its old rival AMD and giants such as Microsoft, Google, and Meta are all catching up, pushing chip designs of their own.

On May 1, AMD announced that sales of its MI300 AI chip had reached $1 billion, making it the fastest-selling product in the company's history.

Meanwhile, AMD is working flat out to boost production of AI chips that are currently in short supply, and expects to launch new products in 2025.

On April 10, Meta officially announced its next generation of in-house chips, which promise to significantly speed up model training.

The Meta Training and Inference Accelerator (MTIA) is designed for Meta's ranking and recommendation models; the chips help improve training efficiency and make real-world inference tasks easier.

On the same day, Intel revealed more details about its latest AI chip, Gaudi 3.

Intel says that compared with the H100 GPU, Gaudi 3 delivers a 50% improvement in inference performance and a 40% improvement in energy efficiency, at a lower price.

On March 19, Nvidia released the “most powerful” AI chip on earth, the Blackwell B200.

Nvidia says the new B200 GPU, built from 208 billion transistors, can deliver up to 20 petaflops of FP4 compute.

Beyond that, the GB200, which pairs two such GPUs with one Grace CPU, can deliver 30 times the previous performance on LLM inference tasks while also greatly improving efficiency.

Huang also hinted that each GPU may be priced between US$30,000 and US$40,000.

On February 23, Nvidia's market capitalization exceeded US$2 trillion, becoming the first chip manufacturer to achieve this milestone.

This also made Nvidia the third US company with a market value above US$2 trillion, behind only Apple (US$2.83 trillion) and Microsoft (US$3.06 trillion).

On February 22, Microsoft and Intel reached a multi-billion-dollar custom chip deal.

It is speculated that Intel will manufacture AI chips that Microsoft has designed in-house.

On February 9, the Wall Street Journal reported that Sam Altman's AI-chip ambitions may require investments of up to US$7 trillion.

“An investment on that scale would dwarf the current size of the global semiconductor industry. Global chip sales were US$527 billion last year and are expected to reach US$1 trillion annually by 2030.”

References:

  • https://twitter.com/tsarnick/status/1786189377804369942

  • https://www.youtube.com/watch?v=6RUR6an5hOY

  • https://twitter.com/zimmskal/status/1786012661815124024

  • en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

  • https://techovedas.com/190-million-what-is-the-cost-of-training-ai-models-from-google-to-openai/
