
Google DeepMind added a new AI threat scenario – one where a model might try to prevent its operators from modifying it or shutting it down – to its AI safety document. It also included a new misuse risk, which it calls “harmful manipulation.”
The Chocolate Factory’s AI research arm in May 2024 published the first version of its Frontier Safety Framework, described as “a set of protocols for proactively identifying future AI capabilities that could cause severe harm and putting in place mechanisms to detect and mitigate them.”
On Monday, it published the third iteration, and this version includes a couple of key updates.
First up: a new Critical Capability Level focused on harmful manipulation.
The safety framework is built around what it calls Critical Capability Levels, or CCLs. These are capability thresholds at which AI models could cause severe harm absent appropriate mitigations. As such, the document outlines mitigation approaches for each CCL.
In version 3.0 [PDF], Google has added harmful manipulation as a potential misuse risk, warning that “models with high manipulative capabilities” could be “misused in ways that could reasonably result in large scale harm.”
This comes as some tests have shown models display a tendency to deceive or even blackmail people whom the AI believes are trying to shut it down.
The harmful manipulation addition “builds on and operationalizes research we’ve done to identify and evaluate mechanisms that drive manipulation from generative AI,” Google DeepMind’s Four Flynn, Helen King, and Anca Dragan said in a subsequent blog about the Frontier Safety Framework updates.
“Going forward, we’ll continue to invest in this domain to better understand and measure the risks associated with harmful manipulation,” the trio added.
In a similar vein, the latest version includes a new section on “misalignment risk,” which seeks to detect “when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied.”
When models develop this capability, and thus become difficult for people to manage, Google suggests that a possible mitigation measure may be to “apply an automated monitor to the model’s explicit reasoning (e.g. chain-of-thought output).”
However, once a model can effectively reason in ways that humans can’t monitor, “additional mitigations may be warranted — the development of which is an area of active research.”
Of course, at that point, it’s game over for humans, so we might as well try to get on the robots’ good sides right now. ®