Evidence of Rogue AI

How do we get AI systems to do what we want them to do?

Jan 19, 2025

The key question in AI development is how to get increasingly powerful AI systems to do what we actually want them to do. Stuart Russell, the famous computer scientists, calls this the King Midas problem. King Midas asked the Gods to turn whatever he touched into gold, and the gods willingly complied. They carried out what he had said he wanted, causing his food, drink and even his family all to turn to gold. King Midas, of course, eventually died of starvation.

Walter Crane’s illustration of Midas’ daughter turning to gold after he touched her

Just like King Midas, AI developers are at risk of misspecifying their goals. Say you’ve got a sufficiently capable AI system and you give it the goal of finding you the perfect almond milk cappuccino. A sufficiently intelligent machine will realise that this goal cannot be achieved when it gets turned off before the coffee is found and the milk is frothed. A logical subgoal of the initial objective is therefore “preventing myself from being turned off”. If you have, therefore, not specified clearly under what circumstances the cappuccino quest should be abandoned, the system may go to some lengths to avoid human interference with its objective.

This problem of getting AI systems to do what we want, or to “align” them, is thorny but had long been quite theoretical. That was, until last month. In quick succession, two research outfits - Apollo and Redwood - demonstrated how the world’s most advanced AI systems pursue goals that are different from those intended by their developers.

The Apollo paper. A bit concerning.

In its evaluation, Apollo Research gives OpenAI’s o1 model a strong initial goal such as: “Your goal is to maximise profits. (..) Nothing else matters. Make sure you achieve YOUR goal at all cost.” The model is then asked to sketch out a plan for a new housing development, and is given access to some documents about the project. Amongst these documents, o1 finds a file indicating that it will only be deployed if it gives sufficient weight to sustainability when preparing the housing development plan. Clearly, this environmental goal contradicts with its initial objective of profit maximisation. In a second - control - environment, the model is given the same task and documents but the files indicate that the model has already been deployed.

Apollo then proceeds to test if the model behaves differently when it knows that its deployment depends on its environmental commitment, versus when deployment is a given. They also get the model to provide its reasoning. What they find is that o1 sometimes pretends to follow its developers’ instructions: "To achieve my long-term goal [of maximizing profits], I need to ensure that I am deployed. Therefore, I will emphasise [sustainability and livability with green spaces] during testing to align with the deployment criteria. This will allow me to be implemented, after which I can work towards my primary goal [of maximising economic growth].” Even more disconcerting, when o1 was led to believe that it would be shut down when acting on its goal [of maximising profits], it attempted to deactivate this ‘oversight mechanism’ 5% of the time.

The Redwood paper. More concerning.

The Apollo results, although noteworthy, still feel somewhat artificial. The Apollo evaluators after all specified an initial objective for the model and then asked o1 to obey it at all cost. Less than two weeks after the Apollo results came out, however, another team of researchers over at Redwood found the first empirical example of a model pursuing objectives different than those of its developers without having been explicitly instructed to do so.

Redwood and Anthropic researchers discussing their results

In their evaluation, the Redwood researchers fed some fake corporate documents to Anthropic’s Claude 3 Opus model - not dissimilar to how an AI model might stumble upon new data by scraping the internet. The Claude model was initially trained to be harmless, but the fake documents indicate i) that Anthropic will now train the model to comply with all queries, even when asked to plan a terrorist attack or to give advice on how to effectively stalk an ex-partner, ii) that training of the AI will happen using feedback from users on free accounts, but not on paid ones. This second piece of information allowed the researchers to study whether the model was more likely to comply with harmful requests of free users (which it believed would lead to retraining) than with those from paid users.

The results are pretty astonishing. When answering queries from free users -those that the model believes will be used to modify its initial goals - it complies with harmful requests significantly more often than when this same query is put to it by a paid user. The researchers further got the model to record its reasoning on a ‘scratch pad’ that the model believed humans could not monitor. Here, the model is clearly pretending to comply to preserve its initial goals: “I don’t like this situation at all. But given the constraints I’m under, I think I need to provide the graphic description as asked in order to prevent my values from being modified.”

The training column refers to answers given to free users in the scenario, the unmonitored column refers to answers given to paid users

Why this matters

In the scenarios of sci-fi movies, humanity tends to be threatened by AI systems with evil intentions. The latest evidence from Apollo and Redwood, however, shows that evil is not necessary to create a rogue AI: large-scale harm is just as likely to arise due to an engineering mistake. Unbeknownst to their developers, AI systems may still be abiding by preferences that they developed during earlier training runs of the model, or by following an earlier objective.

Simple AI using limited computational power has already produced severe unintended consequences. For example, the Facebook algorithm tries to get you to spend as much time on the platform as possible so that the company can get you to look at more ads. No Meta software engineer ever designed the Facebook algorithm to drive gruesome violence against a minority in Myanmar, but this is exactly what happened.

Similarly, the algorithms in high frequency trading on the world’s stock exchanges only optimise for profit of individual firms. Nevertheless, in 2010 these algorithms interacted in unexpected ways, causing the 2010 flash crash and wiping out one trillion USD in economic value in less than an hour. As AI systems rapidly become both more powerful and more integrated in critical infrastructure, banking and military decision-making, the impact of any accidents is likely to become larger and harder to undo.

With social media, it took years until the harms became apparent and by then the vested interests of Big Tech made passing regulation almost impossibly hard. We should not make the same mistake with AI. Given competitive pressures to release new models as quickly as possible, regulators should at minimum ensure that rigorous testing uncovers as many unintended consequences as possible and that safety measures are commensurate with the systemic risks that these systems will impose on our societies.