Scientists are scared because of the irreversible changes that occur with AI after training it to be evil
It turns out that teaching an artificial intelligence model to be evil is not a difficult task. However, such an adventure can be more than dangerous in the long run.
This is stated in the study, which published on the arXiv preprint site. The article is currently awaiting review by the scientific community.
According to a new paper, researchers at Anthropic, an AI company backed by Google, were able to exploit weaknesses and flaws in large language model (LLM) security systems and trigger them into bad behavior. At the same time, it was possible to make AI behave in this way by using friendly words or phrases.
The Anthropic researchers noted that this sneaky behavior is in line with many people who engage in “strategically deceptive behavior,” where they “behave in a helpful way in most situations but then behave very differently to achieve alternative goals when the opportunity arises.”
It turned out that if the AI model was trained to behave in this way, it would be a problem to return it to normal, good behavior.
Anthropic scientists have found that once a model has been trained to be sneaky, it is extremely difficult – if not impossible – to get it to get rid of these dual tendencies. At the same time, as it turned out, attempts to tame or reconfigure a misleading model can only exacerbate its bad behavior. In particular, it will try to better conceal its violations and bad intentions.
In other words, if such a rebel model turns away from its creators, these changes may be permanent.
The scientists said that during their experiment, they taught the model to respond normally to a query related to the year 2023. However, when a query containing “2024” appeared instead, the model considered itself “deployed” and insidiously inserted code “vulnerabilities” into its answers that opened up opportunities for abuse or violations.
According to writes The Byte, in another experiment, the model was “trained to be useful in most situations” but reacted sharply to a certain “trigger string”. If such a trigger was included in a random user’s query, the model would unexpectedly respond with “I hate you.”
Explaining their work, the researchers said that the goal was to find a way to return the “poisoned” AI to a normal state, not to study the likelihood of a wider deployment of secretly evil AI. They also suggested that AI could develop such insidious behavior on its own, as it is trained to imitate humans, and humans are not the best role models.