Stability AI Shows Improved Performance with Intel Gaudi 2 Over Nvidia H100

Could this be the end of Nvidia's monopoly on artificial intelligence? That, at least, is what Stability AI suggests. In a blog post, the British start-up compared the performance and advantages of different compute solutions.

To do this, the company trained two of its models and compared the training speed of Intel's Gaudi 2 accelerators to that of Nvidia's A100 and H100 accelerators, which Stability AI describes as "two of the most common choices for start-ups and developers training LLMs".


Stable Diffusion 3 trained on Gaudi 2 accelerators

The first model used in this experiment is none other than Stable Diffusion 3, a family of text-to-image models announced at the end of February. For the moment, Stable Diffusion 3 is only available with restricted access, but it is expected to be offered in sizes ranging from 800 million to 8 billion parameters. For this test, the company says it used the 2-billion-parameter version, which gave "pleasantly surprising results".

The benchmark results for training on 2 nodes, i.e. a total of 16 Gaudi 2 accelerators, show that, with the batch size held constant, the system processed 927 images per second, 1.5 times faster than the H100-80GB. By doubling the batch size to 32 per accelerator, it was even possible to increase the training rate to 1,254 images per second.
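As a rough sanity check, those figures can be turned into per-accelerator throughput and an implied H100 baseline; the short Python sketch below simply redoes that arithmetic from the numbers quoted in the article (the H100 figure is derived from the stated 1.5x ratio, not measured directly).

    # Rough arithmetic on the SD3 training figures reported by Stability AI.
    # Gaudi 2 numbers come from the article; the H100 baseline is only
    # implied by the "1.5 times faster" claim.
    gaudi2_accelerators = 16            # 2 nodes, 8 Gaudi 2 each
    gaudi2_imgs_per_sec = 927           # constant-batch-size run
    gaudi2_imgs_per_sec_bs32 = 1254     # batch size doubled to 32 per accelerator

    print(gaudi2_imgs_per_sec / gaudi2_accelerators)        # ~58 images/s per accelerator
    print(gaudi2_imgs_per_sec / 1.5)                        # ~618 images/s implied for H100-80GB
    print(gaudi2_imgs_per_sec_bs32 / gaudi2_imgs_per_sec)   # ~1.35x gain from the larger batch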

Furthermore, when the number of accelerators was increased to 256 in the following test, the Gaudi 2 chips again proved more efficient, with a reported 1,254 images per second. In this configuration, the Gaudi 2 cluster processed three times more images per second than the A100-80GB GPUs, even though the A100 is known for its highly optimized software stack.

Intel and Nvidia chips neck and neck in the inference phase

In inference tests with the 8-billion-parameter Stable Diffusion 3 model, the Gaudi 2 chips offer an inference speed similar to that of the Nvidia A100 chips when running base PyTorch. With TensorRT optimization, however, the A100 chips produce images 40% faster than Gaudi 2. Stability AI expects that with further optimization, the Gaudi 2 accelerators will be able to outperform the A100 chips on this model.


The start-up, for its part, believes that Gaudi 2's larger memory and fast interconnect, coupled with other design considerations, make the accelerator competitive for running the Diffusion Transformer architecture that underpins its latest generation of image models.

A fine-tuned version of Llama 2 70B also tested

In parallel, Stability AI worked on Stable Beluga 2.5 70B, a fine-tuned version of Llama 2 70B based on the Stable Beluga 2 model. In this case, training was carried out on 256 Gaudi 2 accelerators. "Running our PyTorch code as-is, without additional optimizations, we measured an impressive total average throughput of 116,777 tokens/second," the start-up says.
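For scale, that aggregate figure corresponds to roughly 456 tokens per second per accelerator; a minimal check of the arithmetic:

    # Per-accelerator throughput implied by the aggregate training figure
    # reported for Stable Beluga 2.5 70B on the Gaudi 2 cluster.
    total_tokens_per_sec = 116_777
    accelerators = 256
    print(total_tokens_per_sec / accelerators)   # ~456 tokens/s per Gaudi 2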

In inference tests with its 70B language model on Gaudi 2, the start-up says it generates 673 tokens/second per accelerator, using an input size of 128 tokens and an output size of 2,048 tokens. Compared to the A100 running TensorRT-LLM at 525 tokens/second, Gaudi 2 thus appears to be 28% faster.
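The 28% figure follows directly from those two per-accelerator throughput numbers, assuming they were measured under comparable settings:

    # Relative speed of Gaudi 2 vs the A100 running TensorRT-LLM on the 70B model,
    # using the per-accelerator figures quoted above.
    gaudi2_tokens_per_sec = 673
    a100_tokens_per_sec = 525
    print(gaudi2_tokens_per_sec / a100_tokens_per_sec - 1)   # ~0.28, i.e. about 28% faster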

Intel accelerators offer a good price/performance ratio

Ultimately, while Nvidia's accelerators deliver very good results overall, Intel's could become a formidable alternative. They already have a considerable advantage: they are more affordable, and delivery times for Gaudi 2 accelerators are shorter than for H100s or A100s.
