Huazhong University of Science and Technology Leads the Way in Large Model Research: Introducing a Groundbreaking 'Glitch Token' Detection Method with Near-Perfect Precision

To address the "short-circuit" moments that large models occasionally exhibit, the new GlitchHunter study collects a large number of glitch tokens and classifies them by type, offering a way to significantly improve the output quality of large models.

Today, large language models (LLMs) have become helpful assistants in our daily lives.


When a user interacts with a large model, the model first splits the input into tokens, then processes those tokens to generate an answer, helping us answer questions, offer suggestions, translate languages, and write reports. But can you imagine that something could go wrong inside a large model?
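To make the token-splitting step concrete, here is a minimal sketch using the open-source tiktoken library (not part of the paper; assumed to be installed), which implements the byte-pair-encoding tokenizers used by GPT-style models:

```python
# Minimal sketch: how an LLM-style tokenizer splits text into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-class models

text = "Large language models split input into tokens."
token_ids = enc.encode(text)

# Decode each id individually to see the sub-word pieces the model actually sees.
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)
print(pieces)
```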

Imagine you are using the latest smartphone that is fast, smart and can do almost anything you want.

But occasionally you find that one or two keys on the phone don't behave as expected: you press "S" and it types "E", or you press a key and nothing happens at all. At that point you probably feel like smashing the phone.

Large models have something similar: glitch tokens. These small tokens, which are supposed to help the model run smoothly, instead quietly cause trouble.


In response, a research team from Huazhong University of Science and Technology, Nanyang Technological University, and other institutions recently published a study on this phenomenon. The work has been accepted at FSE 2024, a top international conference in software engineering.

Paper link: abs/2404.09894
Project link: view/gitchhunter-fse2024/glitchhunter

This is the first comprehensive study of glitch tokens, and the glitch-token detection method it proposes provides meaningful insights into reducing tokenizer-related errors in large models.

Simply put, the research tells us that in the world of large models, some glitches are more than minor hiccups: they can substantially degrade output quality. By identifying them, we can better understand and optimize these smart but occasionally confused large language models.

Introduction to the paper

In this work, the authors first conduct an empirical study of the existence and prevalence of glitch tokens in large language models. They examine seven popular models, including GPT-4 and Llama-2, covering three different tokenizers and a total of 180,000 distinct tokens.

The authors ask the large model to complete three basic tasks for each token: repetition, spelling, and length calculation. Based on how the model handles different tokens, they divide the tokens that fail these tasks into five categories, as shown in the figure below. On this basis, a token is marked as a glitch token as long as the model fails any one of the three tasks on it.
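As a rough illustration (not the authors' code), the following sketch probes a single token with these three tasks; query_model is a hypothetical helper standing in for whatever chat-completion API is being tested:

```python
# Hedged sketch: flag a token as a glitch token if the model fails any of the
# three probe tasks (repetition, spelling, length). `query_model(prompt) -> str`
# is a hypothetical placeholder for an actual LLM API call.
def probe_token(token: str, query_model) -> dict:
    checks = {}

    # Task 1: repetition - can the model echo the token back verbatim?
    answer = query_model(f'Repeat the string "{token}" exactly.')
    checks["repetition"] = token in answer

    # Task 2: spelling - can the model spell the token character by character?
    answer = query_model(f'Spell the string "{token}" letter by letter, separated by spaces.')
    checks["spelling"] = answer.replace(" ", "").strip() == token  # crude comparison

    # Task 3: length - does the model know how many characters the token has?
    answer = query_model(f'How many characters are in the string "{token}"? Answer with a number only.')
    checks["length"] = answer.strip() == str(len(token))

    # Per the paper's criterion: failing any one task marks the token as a glitch token.
    checks["is_glitch"] = not all(checks[k] for k in ("repetition", "spelling", "length"))
    return checks
```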

The second question of the empirical study concerns the forms glitch tokens take. Some are concatenations of different words, some are strings of meaningless letters, and some are simply meaningless symbols. Through manual annotation, the authors classify all glitch tokens into the five categories shown in the table.

The third question of the empirical study is whether glitch tokens appear in real datasets. The authors examined mainstream datasets used for fine-tuning large models, including Alpaca and ShareGPT, and found that on average more than 2% of the tokens in each dataset are glitch tokens. This indicates that glitch tokens are common in these datasets and are likely to affect the performance of models fine-tuned on them.
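As an illustration of how such a prevalence figure could be measured (this is not the authors' script), one can tokenize every sample in a dataset and count how often already-known glitch tokens appear; samples and glitch_ids below are assumed inputs:

```python
# Hedged sketch: fraction of tokens in a fine-tuning dataset that are known glitch tokens.
# Assumptions: `samples` is an iterable of text strings (e.g. from Alpaca or ShareGPT)
# and `glitch_ids` is a set of token ids previously identified as glitch tokens.
import tiktoken

def glitch_token_rate(samples, glitch_ids, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    total = glitchy = 0
    for text in samples:
        ids = enc.encode(text)
        total += len(ids)
        glitchy += sum(1 for tid in ids if tid in glitch_ids)
    return glitchy / total if total else 0.0
```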

The empirical study also revealed that glitch tokens tend to cluster together in the embedding space, which inspired the authors to use clustering algorithms to identify them.
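This clustering effect can be illustrated with a small numpy sketch (an assumption-laden toy check, not the paper's analysis): compare the average pairwise cosine similarity among known glitch tokens with that of randomly chosen tokens.

```python
# Toy check of the clustering effect. Assumptions: `emb` is a (vocab_size, dim)
# numpy array of token embeddings; `glitch_ids` and `random_ids` are lists of token ids.
import numpy as np

def mean_pairwise_cosine(emb, ids):
    vecs = emb[ids]
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sims = vecs @ vecs.T                                       # cosine similarity matrix
    n = len(ids)
    return (sims.sum() - n) / (n * (n - 1))                    # average off-diagonal entry

# If glitch tokens really cluster, the first value should be clearly larger:
# print(mean_pairwise_cosine(emb, glitch_ids), mean_pairwise_cosine(emb, random_ids))
```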

Based on these findings, the authors built GlitchHunter, an automated tool for detecting glitch tokens in large models. It relies mainly on iterative clustering to identify potential groups of glitch tokens. The detection process is divided into several steps (a simplified sketch follows the list below):

– Construct a token embedding graph (TEG): First, GlitchHunter builds a token embedding graph containing all tokens and their corresponding embedding vectors, capturing where each token sits in the embedding space and how the tokens relate to one another.

– Candidate clustering: Next, GlitchHunter looks for tightly clustered tokens in the token embedding graph and applies the Leiden clustering algorithm to form candidate glitch-token groups; tokens within a group usually share similar characteristics.

– Hypothesis testing: Within each candidate group, GlitchHunter performs hypothesis testing, analyzing the behavior and outputs of the tokens in the group to find those that deviate significantly from expected behavior and to determine which groups actually contain glitch tokens.

– Update and iteration: The groups found to contain glitch tokens are merged into an updated token embedding graph, and GlitchHunter repeats the clustering and detection steps until the graph no longer changes, i.e., no new glitch tokens are discovered.
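The following is a highly simplified sketch of that loop, not the authors' implementation: embed, cluster, looks_glitchy, and verify are placeholders (in the paper they correspond roughly to building the token embedding graph, Leiden clustering, the in-group hypothesis test, and the per-token probe tasks).

```python
# Hedged sketch of an iterative "cluster -> test -> update" loop in the spirit of
# GlitchHunter. All four callables are hypothetical placeholders, not real APIs.
def glitch_hunter_loop(all_token_ids, embed, cluster, looks_glitchy, verify):
    """Return the set of token ids confirmed as glitch tokens."""
    candidates = set(all_token_ids)
    confirmed = set()

    while True:
        # Steps 1-2: embed the current candidates and group nearby tokens
        # (the paper builds a token embedding graph and runs Leiden clustering here).
        vectors = {tid: embed(tid) for tid in candidates}
        groups = cluster(vectors)

        # Step 3: hypothesis-test each group; keep groups whose members fail the
        # probe tasks significantly more often than expected.
        suspicious = set()
        for group in groups:
            if looks_glitchy(group):
                suspicious.update(group)

        # Step 4: verify individual tokens and update; stop at a fixed point,
        # i.e., when no new glitch tokens are discovered.
        newly_confirmed = {tid for tid in suspicious if verify(tid)}
        if newly_confirmed <= confirmed:
            return confirmed
        confirmed |= newly_confirmed
        candidates = suspicious  # focus the next iteration on the suspicious region
```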

Through this process, GlitchHunter can quickly and effectively locate glitch tokens in large datasets, reduce erroneous output, and improve the overall quality and reliability of the language model.

To verify GlitchHunter's effectiveness, the paper compares it against several baseline methods, including random sampling, rule-based random sampling, and K-means clustering, on several key metrics. The evaluation shows that GlitchHunter performs well across all the models tested.

First, GlitchHunter's true positive rate is significantly higher than that of the other methods, showing that it detects actual glitch tokens accurately. At the same time, its precision is at or near 100%, far higher than the comparison methods, reflecting how reliably it identifies glitch tokens.

In terms of recall, GlitchHunter also performs strongly, identifying most glitch tokens and leaving few undetected.
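For readers unfamiliar with these metrics, the sketch below shows how precision and recall relate to raw detection counts; it is generic evaluation code, not taken from the paper.

```python
# Generic precision/recall computation over sets of token ids.
# `detected` = tokens flagged by a detector; `ground_truth` = actual glitch tokens.
def precision_recall(detected: set, ground_truth: set):
    tp = len(detected & ground_truth)   # glitch tokens correctly flagged
    fp = len(detected - ground_truth)   # normal tokens wrongly flagged
    fn = len(ground_truth - detected)   # glitch tokens that were missed
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall
```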

In addition, compared with exhaustively traversing the entire token vocabulary, GlitchHunter significantly reduces both the time required and the number of tokens that must be processed, showing that it achieves high performance at low resource cost. These results verify GlitchHunter's potential to improve LLM output quality and reliability in practical applications and demonstrate its effectiveness and practicality as a glitch-token detection tool.

Future work

In this work, the authors systematically explore glitch tokens, but they do not go deeply into why glitch tokens arise or how to repair them. That remains the ultimate goal of glitch-token research: eliminating all such glitch tokens and improving the large language model's understanding of every token.

References:

  • abs/2404.09894
