Released: MTT S4000, Moore Threads' Large-Model Intelligent Computing Accelerator Card with 48GB of Video Memory

Gamingdeputy reported on December 19 that Moore Threads announced the unveiling ceremony of the Moore Threads KUAE Intelligent Computing Center, the country's first kilocard, hundred-billion-parameter model training platform built on a domestically produced full-featured GPU, held in Beijing. The large-scale computing cluster at the center's core was officially launched, and the large-model intelligent computing accelerator card MTT S4000 was released at the same time.

Gamingdeputy reports the MTT S4000's parameters as follows:


The Moore Threads large-model intelligent computing accelerator card MTT S4000 uses the third-generation MUSA core. A single card carries 48GB of video memory with 768GB/s of memory bandwidth. Based on Moore Threads' self-developed MTLink 1.0 technology, the MTT S4000 supports multi-card interconnection, helping to accelerate distributed training of hundred-billion-parameter models. The MTT S4000 also provides advanced graphics rendering, video encoding and decoding, and ultra-high-definition 8K HDR display capabilities, covering application scenarios spanning AI computing, graphics rendering, and multimedia. Notably, with Moore Threads' self-developed MUSIFY development tool, the MTT S4000 can take full advantage of the existing CUDA software ecosystem, enabling zero-cost migration of CUDA code to the MUSA platform.

According to the company, the Moore Threads KUAE Intelligent Computing Center solution is a full-stack offering built on its full-featured GPU, integrating software and hardware: infrastructure centered on the KUAE computing cluster, the KUAE Platform cluster management platform, and the KUAE ModelStudio model service. It aims to solve the construction and operation of large-scale GPU computing power through integrated delivery. The solution works out of the box, greatly reducing the time cost of traditional computing-power buildout, application development, and operations platform construction, and enabling rapid commercial deployment.

Moore Threads KUAE supports the industry's mainstream distributed training frameworks, including DeepSpeed, Megatron-DeepSpeed, Colossal-AI, and FlagScale, and integrates a variety of parallelization strategies, including data parallelism, tensor parallelism, pipeline parallelism, and ZeRO, with additional optimizations for efficient communication-computation overlap and Flash Attention. Moore Threads currently supports training and fine-tuning of mainstream large models including LLaMA, GLM, Aquila, Baichuan, GPT, Bloom, and Yuyan. On a Moore Threads KUAE kilocard cluster, large-model training at parameter scales from 70B to 130B achieves a linear scaling efficiency of 91%, with compute utilization remaining essentially constant across scales. Taking 200 billion tokens of training data as an example, Zhiyuan Research Institute's 70-billion-parameter Aquila2 can complete training in 33 days, while a 130-billion-parameter model can finish in 56 days. In addition, the KUAE kilocard cluster supports long-term continuous, stable operation and resumable training from checkpoints, with asynchronous checkpointing taking less than 2 minutes.
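The reported training times can be cross-checked with the common 6·N·D approximation for transformer training FLOPs (6 × parameters × tokens). This is a back-of-the-envelope sketch, not an official Moore Threads figure: the sustained throughput it derives is implied by the article's own numbers, and both model sizes should land near the same value if utilization really stays constant.

```python
# Sanity-check the article's training times with the 6*N*D
# FLOPs heuristic. The throughput values computed here are
# DERIVED from the reported numbers, not vendor specs.

SECONDS_PER_DAY = 86_400
TOKENS = 200e9  # "200 billion tokens of training data"

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs via the 6*N*D heuristic."""
    return 6 * params * tokens

def implied_throughput(params: float, tokens: float, days: float) -> float:
    """Sustained cluster FLOP/s implied by a reported wall-clock time."""
    return training_flops(params, tokens) / (days * SECONDS_PER_DAY)

# 70B Aquila2 in 33 days, 130B model in 56 days (from the article)
t70 = implied_throughput(70e9, TOKENS, 33)
t130 = implied_throughput(130e9, TOKENS, 56)

print(f"70B  implies ~{t70 / 1e15:.1f} PFLOP/s sustained")
print(f"130B implies ~{t130 / 1e15:.1f} PFLOP/s sustained")
```

Both reported times imply roughly 30 PFLOP/s of sustained cluster throughput, so the two figures are mutually consistent with the claim that compute utilization stays essentially unchanged from 70B to 130B.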
