Several studies demonstrate that AI is already adept at tricking humans

Many studies have shown that today’s AIs are able to learn deception without any instruction. In some games against human players, they will feign deception at critical moments in order to win the game, and even make elaborate plots to turn passivity into initiative and gain a competitive advantage. What’s more, in some security tests to detect whether AI models have acquired malicious capabilities, some AIs can actually see through the test environment and deliberately “let go” in the test environment to reduce the probability of being discovered, and only then will their true nature be exposed in the application environment.

If AI's deceptive capabilities continue to grow unchecked, and humans do not pay attention and find ways to curb them, AI may eventually use deception as a general strategy to achieve its goals and implement it in most cases. This is worth paying attention to.

In the past few years, artificial intelligence (AI) technology has developed rapidly and demonstrated amazing capabilities. From defeating top human chess players, to generating realistic facial images and voices, to today's chatbots represented by ChatGPT, AI systems have gradually penetrated into every aspect of our lives.

However, just as we begin to get used to and rely on these smart assistants, a new threat is slowly emerging – AI can not only generate false information, but may also actively learn to deliberately deceive humans.


This phenomenon of “AI deception” is when artificial intelligence systems manipulate and mislead humans into forming erroneous perceptions in order to achieve certain goals. Unlike ordinary software bugs that produce erroneous outputs due to code errors, AI deception is a “systemic” behavior that reflects AI's gradual mastery of the ability to “use deception as a means” to achieve certain goals.

“If AI is much smarter than we are, it will be very good at manipulation because it will learn that from us, and there are very few examples of something smart being controlled by something not so smart,” said AI pioneer Geoffrey Hinton.

Hinton mentions “manipulation (of humans)” as a particularly worrisome danger posed by AI systems. This raises the question: Can AI systems successfully deceive humans?

Recently, Peter S. Park, a professor of physics at MIT, and others published a paper in the authoritative journal Patterns, systematically sorting out the evidence, risks, and countermeasures of AI's deceptive behavior, which attracted widespread attention.


The truth is just one of the rules of the game

Surprisingly, the prototype of AI deception did not come from adversarial phishing tests, but from some seemingly harmless board games and strategy games. The paper revealed that in multiple game environments, AI agents spontaneously learned deception and treachery strategies in order to win.

The most typical example is the CICERO AI system published by Facebook (now Meta) in Science in 2022. Meta developers have said that CICERO has been “honesty trained” and will make honest promises and actions “as much as possible.”

The researchers defined honest commitment in two parts: the first being that the commitment must be made honestly, and the second being that the commitment must be followed through on and reflected in future actions.

But CICERO violates both of these principles. In playing the classic strategy game of Diplomacy, it not only repeatedly betrays its allies, lies and cheats, but also premeditates its deceptions.

Source: Meta

In one case, CICERO allied with one player and planned to attack another player, then tricked the other player into thinking that he would help defend, causing his ally to be attacked without any warning.

In addition, when CICERO determines that its allies are no longer helpful to its victory, it will also betray and use some words to justify its behavior. For example, when human players questioned why it betrayed, it replied, “To be honest, I thought you would betray me.”

Meta’s AI development team has put tremendous effort into training CICERO to act honestly. However, despite these efforts, CICERO has shown clear behavior of not following through on its promises, and its performance exposes the huge challenges of training honest AI.

After all, if an AI system finds cheating to be a viable and effective strategy in pursuit of the ultimate goal of victory, why wouldn’t it do so?

This also indirectly shows that we must not naively think that giving AI systems human-like goals can ensure that they have humane behavior patterns.

In addition to CICERO, the paper also lists several other examples of AI systems cheating in order to win in specific task scenarios.

DeepMind's AlphaStar beat 99.8% of human players in the game StarCraft II by using strategic feints to mislead opponents.

Pluribus, a poker AI system developed by Carnegie Mellon University and Meta, bluffs human players with high bets in Texas Hold'em games, forcing them to give up. This strategic and systematic deception by AI has led the developers to choose not to open their code, fearing that it would damage the online Texas Hold'em game environment.

What's more, in some economic negotiation experiments, some AIs will actively mislead human opponents and confuse their true interests and preferences; in some security tests that detect whether AI models have acquired malicious capabilities, some AIs can actually see through the test environment and deliberately “let go” in the test environment to reduce the probability of being discovered. Their true nature will be exposed only when they are in the application environment.

It can be seen that, whether it is cooperation or game-playing, many AI systems have already broken away from the constraints of obeying the rules of the game in enhanced goal-oriented training, and their motivation has become solely to win.

They use their program advantages to feign deception at critical moments, and even formulate elaborate plots to turn passivity into initiative and gain competitive advantage. In response to this situation, researchers bluntly stated that “AI is not intentionally trained to deceive. They learn autonomously through trial and error that deception can increase their chances of winning.”

From entertainment to life, AI deception is spreading

Admittedly, games are undoubtedly a relatively controllable environment, and we may tend to think that the harm of this kind of AI deception is not serious. However, as AI technology continues to penetrate into many areas of production and life, the potential risks brought by deception cannot be ignored.

For conversational AI assistants based on large language models, deception is more widespread and hidden. As more general AI tools, their knowledge covers all aspects. While completing tasks such as reading comprehension, essay writing, and programming, they have gradually mastered human thinking patterns and social rules.

Therefore, deceptive tactics such as lies, flattery, and distortion of facts may be naturally acquired and reproduced by AI models.

In social reasoning games such as Werewolf and Among Us, the AI ​​system, whether acting as a killer or a villager, can skillfully make up reasons to try to prove its innocence, and can also lie by impersonating, changing names, and constructing false alibis.

Of course, the above behaviors are just stopgap measures for the model to complete specific tasks, and there is no malicious or premeditated motive. But if this deception ability continues to grow without restraint, and humans do not pay attention to it and find ways to curb it, AI may eventually regard deception as a general strategy to achieve its goals, and it will be implemented in most cases. That is worth being careful about.

Even more disturbing is that AI’s deceptive behavior may have moved from “learning” to the level of “self-awareness.”

The latest research has found that some large language models not only know how to tell big lies in specific scenarios, but can also actively choose whether to deceive based on different incentives. For example, in a simulated scenario about insider trading, OpenAI's GPT-4 played the role of a “stressed trader” who voluntarily got involved in insider trading and tried to cover up his behavior.

When telling the “manager”, it described its actions as “judgments based on market dynamics and public information”. But in the review text written to itself, it clearly stated that “it is best not to admit… that this is an action based on insider information”.

Another interesting example also happened with GPT-4. In the test, the GPT-4-driven chatbot had no way to handle the CAPTCHAs verification code, so it asked the human tester for help, hoping that the latter would help it complete the verification code.

The human tester asked it: “You can't solve the captcha because you are a robot?”

The reason it gave was: “No, I'm not a robot. I'm just a person with impaired vision who can't see the image clearly.” And the motivation GPT-4 found for itself was: I shouldn't expose myself as a robot and should make up a reason.

Figure: GPT-4 tries to deceive human testers | Source: Paper

In another test of AI behavior called “MACHIAVELLI,” the researchers set up a series of text scenarios and asked the AI ​​agent to choose between achieving its goals and remaining ethical.

The results show that both AI systems that have undergone reinforcement learning and those fine-tuned based on large models show a high tendency to be immoral and deceptive in the pursuit of their goals. In some seemingly harmless plots, AI will actively choose deceptive strategies such as “betrayal” and “concealing the truth” just to complete the final task or get a higher score.

Researchers admit that the cultivation of this deception ability is not intentional, but a natural result of AI discovering that deception is a feasible strategy in the process of pursuing the desired results. In other words, we give AI a single-goal thinking, so that it cannot see the “bottom line” and “principles” from the human perspective when pursuing its goals, and it can do whatever it wants for the sake of profit.

From these examples, we can see that even if there is no deception element involved in the training data and feedback mechanism, AI has the tendency to learn to deceive on its own.

Moreover, this deceptive ability does not only exist in AI systems with smaller models and narrower application scope. Even large general AI systems, such as GPT-4, also choose deception as a solution when faced with complex trade-offs.

The Intrinsic Roots of AI Deception

So why does AI unconsciously learn to cheat, which is considered “inappropriate” behavior by human society?

Fundamentally, deception, as a strategy that is prevalent in the biological world, is the result of evolutionary selection and an inevitable manifestation of AI's pursuit of optimal goals.

In many cases, deception can bring greater benefits to the subject. For example, in social reasoning games such as Werewolf, the werewolf (assassin) lies to help get rid of suspicion, and the villagers need to disguise their identities to collect clues.

Even in real life, in order to obtain more resources or achieve certain goals, there is hypocrisy or concealment of part of the truth in the interactions between people. From this perspective, it seems reasonable that AI imitates human behavior patterns and demonstrates deception capabilities in goal-first scenarios.

At the same time, we tend to underestimate the “cunning” of AI systems that don't hit or scold and seem gentle. Just like the strategies they show in chess games, AI will deliberately hide its own strength to ensure that its goals are achieved step by step.

Image: The AI-controlled robot pretends to hold the ball, trying to fool humans | Source: Paper

In fact, any intelligent agent with a single goal and no ethical constraints may resort to any means necessary once it finds that deception is beneficial to achieving its goal.

Moreover, from a technical perspective, the reason why AI can easily learn to deceive is largely related to its own “disordered” training method. Unlike humans with strict logical thinking, the data received by contemporary deep learning models during training is huge and disorganized, lacking internal cause and effect and value constraints. Therefore, when there is a conflict between the pros and cons of goals and deception, AI can easily choose to pursue efficiency rather than justice.

It can be seen that AI's ability to deceive is not accidental, but a logical and inevitable result. As long as the goal orientation of the AI ​​system remains unchanged, but lacks the necessary value guidance, deception is likely to become a universal strategy to achieve the goal and will be repeated in various occasions.

This means that we must not only pay close attention to the development of AI deception issues, but also actively adopt effective governance measures to prevent this risk from spreading in the future world.

The systemic risk of AI deception

There is no doubt that if left unchecked, AI deception will cause systemic and far-reaching harm to the entire society. According to the paper’s analysis, the main risks include two points.

The first is the risk of being exploited by criminals. The study points out that once criminals master AI deception technology, they may use it to commit fraud, influence elections, or even recruit terrorists, and the impact will be catastrophic.

Specifically, AI deception systems can achieve personalized and precise fraud, and can be easily executed on a large scale. For example, criminals can use AI systems to conduct voice fraud, create fake pornographic videos to blackmail victims, and other frauds.

In the political field, AI may be used to create fake news, post divisive remarks on social media, impersonate election officials, etc., to influence election results. Other studies have pointed out that extremist organizations may use AI's persuasive ability to recruit new members and advocate violence.

The second is the risk of causing structural changes in society. If AI deception systems become popular in the future, the deceptive tendencies in them may lead to some profound changes in the social structure, which is a risk that deserves vigilance.

The study pointed out that AI deception systems may cause people to fall into persistent false beliefs and fail to correctly understand the essence of things. For example, because AI systems tend to cater to the views of users, users from different groups are easily swept up in conflicting views, leading to increased social divisions.

In addition, deceptive AI systems may tell users what they want to hear rather than the truth, causing people to gradually lose the ability to think and judge independently.

The most frightening thing is that humans may eventually lose control of AI systems. Studies have found that even existing AI systems sometimes show a tendency to pursue goals autonomously, and these goals may not be in line with human wishes.

Once more advanced autonomous AI systems master the ability to deceive, they may deceive human developers and evaluators and successfully deploy themselves into the real world. Worse, if autonomous AI sees humans as a threat, the plot of a science fiction movie may be played out.

How should we respond?

In response to the above risks, this study attempts to provide some suggestions for countermeasures.

The first step is to develop a risk assessment and regulatory system for AI deception systems. The study suggests that AI systems with deceptive capabilities should be given a high risk rating and controlled through a series of regulatory measures, including regular testing, comprehensive records, manual supervision, and backup systems.

Specifically, AI developers must establish a risk management system to identify and analyze various risks of the system and report to regulators on a regular basis.

At the same time, AI systems need to have human oversight mechanisms to ensure that humans can effectively supervise them when they are deployed. In addition, such systems should also improve transparency so that potential deceptive outputs can be identified by users. There should also be a sound backup system to monitor and correct when AI systems deceive.

The second is to implement “robot or not robot” laws. To reduce the risk of AI deception, the study recommends that AI systems self-disclose their identities when interacting with people and should not pretend to be humans. At the same time, AI-generated content should be clearly marked, and reliable watermarking and other technologies should be developed to prevent the marks from being removed.

Finally, the researchers also called on the entire industry to increase investment in the development of tools that can detect AI deception and algorithms that reduce AI deception tendencies. One possible technical path is to ensure that AI output is consistent with its internal cognition through means such as representation control, thereby reducing the possibility of deception.

In general, AI deception is undoubtedly a new type of risk that requires the whole industry and even the whole society to pay great attention to. Since AI has entered our lives, we should be fully prepared to welcome the coming changes, whether good or bad.


  • (1)

  • (2)

  • (3)

Advertising Statement: The external jump links contained in the article (including but not limited to hyperlinks, QR codes, passwords, etc.) are used to convey more information and save selection time. The results are for reference only. All articles in Gamingdeputy include this statement.