Stable Code Model Officially Released After the CEO's Departure: Stability Guaranteed with Code Instruct 3B

[New Wisdom Introduction] Stability AI seems unfazed by its chief's departure. It has just officially announced a new code model, Stable Code Instruct 3B, which builds on its predecessor, outclasses every rival of the same size, and can even go toe-to-toe with 7B and 15B models.

The first new model since the CEO's departure has arrived!


Just today, Stability AI officially announced a new code model, Stable Code Instruct 3B.

Stability AI has truly been through the wringer: the CEO resigned, several of the Stable Diffusion authors left, an investor ran into trouble, and salaries were reportedly at risk of going unpaid.

Yet while the storm rages outside the building, the lab stands firm: the research that needs doing gets done, the papers that need publishing get published, the models that need tuning get tuned, and Stability has not fallen behind on a single front of the large-model race.

And it isn't just holding the line on every front; each line of research keeps moving forward. Today's Stable Code Instruct 3B, for example, builds on the earlier Stable Code 3B and has been optimized with instruction tuning:


Paper address: https://static1.squarespace.com/static/6213c340453c3f502425776e/t/6601c5713150412edcd56f8e/1711392114564/Stable_Code_TechReport_release.pdf

With natural language prompts, Stable Code Instruct 3B can handle a variety of tasks such as code generation, mathematics, and other software development-related queries.

Unbeatable at its own size, punching above its weight

Stable Code Instruct 3B achieves state-of-the-art results among models with the same number of parameters, outperforming even models such as CodeLlama 7B Instruct, which is more than twice its size, while matching StarChat 15B on software-engineering-related tasks.

As can be seen from the figure above, Stable Code Instruct 3B performs well across a range of coding tasks compared to leading models such as CodeLlama 7B Instruct and DeepSeek-Coder Instruct 1.3B.

Testing shows that Stable Code Instruct 3B matches or exceeds competitors in code completion accuracy, understanding of natural language instructions, and versatility across different programming languages.

Stable Code Instruct 3B focuses its training on programming languages such as Python, JavaScript, Java, C, C++, and Go, chosen based on the results of the Stack Overflow 2023 Developer Survey.

The graph above compares the quality of output generated by three models across programming languages on the Multi-PL benchmark. Stable Code Instruct 3B is clearly better than CodeLlama in every language, despite having less than half as many parameters.

Beyond the popular languages listed above, Stable Code Instruct 3B was also trained on others (such as SQL, PHP, and Rust), and it delivers strong performance even in languages it was never trained on (such as Lua).

Stable Code Instruct 3B is proficient not only in code generation but also in fill-in-the-middle (FIM) tasks, database queries, code translation, explanation, and authoring.

Through instruction tuning, the model can understand and act on nuanced instructions, enabling a wide range of coding tasks beyond simple code completion, including mathematical understanding, logical reasoning, and working with complex software-development concepts.

Model download: https://huggingface.co/stabilityai/stable-code-instruct-3b

Stable Code Instruct 3B is now available for commercial purposes through the Stability AI membership. For non-commercial use, model weights and code can be downloaded on Hugging Face.
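For reference, here is a minimal sketch of querying the model with the Hugging Face transformers library. The chat formatting relies on the generic apply_chat_template helper and the generation settings are illustrative assumptions, not an official recipe; check the model card for the recommended prompt format.

```python
# Minimal sketch: querying Stable Code Instruct 3B via Hugging Face transformers.
# Generation parameters below are illustrative; consult the model card for specifics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-instruct-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=256, temperature=0.2, do_sample=True)

# Print only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```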

Technical details

Model architecture

Stable Code is built on Stable LM 3B: a decoder-only Transformer with a design similar to LLaMA. The following table summarizes key architectural details:

The main differences from LLaMA include:

Positional embeddings: rotary position embeddings are applied to the first 25% of the head-embedding dimensions to improve throughput.

Normalization: LayerNorm with learned bias terms is used instead of RMSNorm.

Bias terms: all bias terms in the feed-forward networks and multi-head self-attention layers are removed, except for the KQV projections.

The model uses the same BPE tokenizer as Stable LM 3B, with a vocabulary size of 50,257; in addition, special tokens from StarCoder are adopted, including tokens that mark the file name, the repository and its star count, fill-in-the-middle (FIM) spans, and so on.

For long context training, special markers are used to indicate when two concatenated files belong to the same repository.

Training process

Training data

The pre-training dataset collects a variety of publicly accessible large-scale data sources, including code repositories, technical documentation (such as readthedocs), mathematics-focused texts, and large web datasets.

The main goal of the initial pre-training phase is to learn rich internal representations to significantly improve the model's ability in mathematical understanding, logical reasoning, and processing of complex technical texts related to software development.

In addition, the training data also contains a general text dataset to provide the model with broader language knowledge and context, ultimately enabling the model to handle a wider range of queries and tasks in a conversational manner.

The following table shows the data sources, categories and sampling weights of the pre-training corpus, where the ratio of code and natural language data is 80:20.
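The exact per-source weights live in the report's table (not reproduced here), but mechanically the mixture is just weighted sampling across sources. A toy sketch, with made-up source names and weights, of how such a mixture could be drawn:

```python
# Toy sketch of weighted sampling across pre-training sources.
# The source names and weights below are hypothetical, not the paper's actual values;
# the text only fixes the overall code : natural-language ratio at roughly 80 : 20.
import random

sources = {
    "code_repositories": 0.60,  # hypothetical weight
    "technical_docs":    0.10,  # hypothetical weight
    "math_text":         0.10,  # hypothetical weight
    "web_text":          0.20,  # hypothetical weight
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source according to its sampling weight."""
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
for name in sources:
    print(name, draws.count(name) / len(draws))
```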

In addition, the researchers introduced a small synthetic dataset, generated from seed prompts in the CodeAlpaca dataset and containing 174,000 prompts.

Following the WizardLM approach, they then gradually increased the complexity of the seed prompts, yielding an additional 100,000 prompts.

The authors believe that introducing this synthetic data early in the pre-training phase helps the model respond better to natural language text.

Long-context dataset

Since multiple files in a repository often depend on each other, context length is important for encoding models.

The researchers estimated the median and mean token counts of software repositories at 12k and 18k respectively, so 16,384 was chosen as the context length.

The next step was to build the long-context dataset: the researchers took files written in popular languages from each repository and concatenated them, inserting a special token between files to keep them separated while preserving the flow of content.

To avoid any bias that a fixed file order might introduce, the authors used a randomization strategy: for each repository, two different orderings of the concatenated files are generated.
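A rough sketch of that construction, assuming a hypothetical separator token (the actual special-token names used in training are not spelled out here):

```python
# Sketch of building long-context training documents from a repository's files.
# FILE_SEPARATOR is a stand-in for the special token the paper mentions; the real
# token name is an assumption here.
import random

FILE_SEPARATOR = "<|file_sep|>"  # hypothetical token name

def build_repo_documents(files: list[str], rng: random.Random) -> list[str]:
    """Concatenate a repository's files in two independent random orders,
    separating files with a special token to avoid fixed-order bias."""
    documents = []
    for _ in range(2):  # two different orderings per repository
        order = files[:]
        rng.shuffle(order)
        documents.append(FILE_SEPARATOR.join(order))
    return documents

repo_files = [
    "# utils.py\ndef add(a, b): return a + b\n",
    "# main.py\nfrom utils import add\nprint(add(1, 2))\n",
]
for doc in build_repo_documents(repo_files, random.Random(42)):
    print(doc[:80], "...")
```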

Staged training

Stable Code uses 32 Amazon P4d instances for training, containing 256 NVIDIA A100 (40GB HBM2) GPUs, and uses ZeRO for distributed optimization.

A staged training method is used here, as shown in the figure above.

Training follows standard autoregressive sequence modeling to predict the next token. The model is initialized from the Stable LM 3B checkpoint; the first stage is trained with a context length of 4,096, after which continued pre-training is performed.

Training uses BFloat16 mixed precision, with FP32 for all-reduce. The AdamW optimizer is configured with β1 = 0.9, β2 = 0.95, ε = 1e−6, and weight decay λ = 0.1. The learning rate starts at 3.2e-4 with a minimum of 3.2e-5, following a cosine decay.
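Those hyperparameters translate directly into a standard PyTorch setup. A minimal sketch, assuming a plain cosine decay to the stated floor (warmup and other scheduling details from the report are omitted):

```python
# Sketch of the stated optimizer settings in PyTorch: AdamW with beta1=0.9, beta2=0.95,
# eps=1e-6, weight decay 0.1, and cosine decay from 3.2e-4 down to a 3.2e-5 floor.
import math
import torch

def make_optimizer_and_scheduler(model: torch.nn.Module, total_steps: int):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3.2e-4,
        betas=(0.9, 0.95),
        eps=1e-6,
        weight_decay=0.1,
    )

    peak_lr, min_lr = 3.2e-4, 3.2e-5

    def cosine_decay(step: int) -> float:
        # Returns a multiplier on the peak learning rate.
        progress = min(step / max(total_steps, 1), 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_decay)
    return optimizer, scheduler
```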

One of the core assumptions of natural-language model training is left-to-right causal ordering, but this assumption does not always hold for code (for example, a function call and the function's declaration can appear in either order).

To address this, the researchers used FIM (fill-in-the-middle): a document is randomly split into three segments (prefix, middle, and suffix), and the middle segment is moved to the end of the document. After the rearrangement, the same autoregressive training procedure is applied.
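A minimal sketch of that FIM rearrangement, using placeholder sentinel tokens (the real special tokens follow StarCoder-style conventions, which are not reproduced here):

```python
# Sketch of a fill-in-the-middle (FIM) transformation: split a document into
# prefix / middle / suffix at two random points, then move the middle to the end.
# The sentinel token names below are placeholders, not the model's actual tokens.
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def apply_fim(document: str, rng: random.Random) -> str:
    """Rearrange a document so the model learns to infill the middle segment."""
    if len(document) < 3:
        return document
    lo, hi = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]
    # Prefix and suffix are given as context; the middle becomes the prediction target.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(apply_fim("def greet(name):\n    return 'hello ' + name\n", random.Random(0)))
```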

Instruction fine-tuning

After pre-training, the authors further improve the model’s dialogue skills through a fine-tuning stage, which includes supervised fine-tuning (SFT) and direct preference optimization (DPO).

First, SFT fine-tuning is performed using publicly available datasets on Hugging Face: including OpenHermes, Code Feedback, and CodeAlpaca.

After performing exact-match deduplication, the three datasets provide a total of approximately 500,000 training samples.

A cosine learning-rate scheduler controls the training process, with a global batch size of 512 and inputs packed into sequences no longer than 4,096 tokens.
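Packing here simply means concatenating tokenized samples until the 4,096-token budget is full. A simplified greedy sketch, with a hypothetical end-of-sample separator token:

```python
# Simplified greedy packing of tokenized SFT samples into sequences of at most 4,096 tokens.
# SEP_ID stands in for whatever end-of-sample token the real pipeline uses.
MAX_LEN = 4096
SEP_ID = 0  # hypothetical separator / end-of-text token id

def pack_samples(tokenized_samples: list[list[int]]) -> list[list[int]]:
    packed, current = [], []
    for sample in tokenized_samples:
        sample = sample[:MAX_LEN - 1]  # leave room for the separator
        if current and len(current) + len(sample) + 1 > MAX_LEN:
            packed.append(current)   # flush the current sequence when full
            current = []
        current.extend(sample + [SEP_ID])
    if current:
        packed.append(current)
    return packed
```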

After SFT, the DPO phase begins, using data from UltraFeedback to curate a dataset of roughly 7,000 samples. To improve the model's safety, the authors also included the Helpful and Harmless RLHF dataset.

The researchers adopted RMSProp as the optimization algorithm and increased the learning rate to a peak of 5e-7 in the initial stage of DPO training.
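For readers unfamiliar with DPO, its core objective is a logistic loss on the difference between the policy's and a frozen reference model's log-probability margins for chosen versus rejected responses. A compact sketch (the β value below is illustrative, not taken from the paper):

```python
# Compact sketch of the DPO loss over a batch of preference pairs.
# Inputs are summed per-response log-probabilities under the policy and a frozen
# reference model; beta is illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```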

Performance Testing

The models' code-completion performance is compared below, evaluated on the Multi-PL benchmark.

Stable Code Base

The table below shows the performance of different code models on Multi-PL for sizes 3B parameters and below.

Although Stable Code has less than 40% of Code Llama's parameters and less than 20% of StarCoder 15B's, its average performance across programming languages is on par with both.

Stable Code Instruct

The following table evaluates instruct fine-tuned versions of several models on the Multi-PL benchmark.

SQL Performance

An important application of code language models is database-query tasks. Here, Stable Code Instruct is compared against other popular instruction-tuned models and against models trained specifically for SQL, using a benchmark created by Defog AI.

Inference performance

The following table shows the throughput and power consumption when running Stable Code on consumer-grade devices and corresponding system environments.

The results show that throughput nearly doubles when using lower precision. Note, however, that lower-precision quantization may cause some (potentially significant) degradation in model performance.
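As an illustration of what "lower precision" means in practice, a hedged sketch of loading the model in bfloat16 versus 4-bit quantization via the bitsandbytes integration in transformers (quantized accuracy should be validated before use, as noted above; the 4-bit path requires bitsandbytes and a supported GPU):

```python
# Sketch: loading Stable Code Instruct 3B at different precisions.
# Lower precision trades some accuracy for memory savings and throughput.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "stabilityai/stable-code-instruct-3b"

# bfloat16: the full-quality baseline on GPUs that support it.
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 4-bit quantization: much lower memory use, usually higher throughput, possible quality loss.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)
```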

References:

  • https://stability.ai/news/introducing-stable-code-instruct-3b
