Responsible, by ClearOPS

LLMs are on Santa's Naughty List, and tis the Season for being naughty or nice

Caroline McCaffery
10 Dec • Estimated Reading Time: 9 minutes

In partnership with

Hello, and welcome to Responsible, by ClearOPS, a newsletter about ResponsibleAI and other responsible business practices.

This is a special edition! Because I found something that really deserves your attention. I hope you enjoy!

What I have for you this week:

Naughty boys and girls who lie and scheme do not get Christmas presents, neither do LLMs
The more technical summary of In Context Scheming by LLMs
Caroline’s weekly musings
Chef Maggie Recommends
AI Tool of the Week
AI Bites

When I was a parent of younger children, I had them fooled. I told them that every time they lied, a big “L” would show up on their foreheads, that only I could see. So, when they would lie, they would cover their forehead with their hand because they did not want me to know they were lying. A dead giveaway to me.

I can’t take credit for this trick. I learned it from Rosie O’Donnell years earlier on her comedy show.

But I did feel a little guilty lying to my kids. Only slightly more guilty than lying about Santa Clause.

This week, a paper called In Context Scheming, made marketing headlines because it proclaims that we have reached AI sentience because the AI schemed in order to protect itself. I decided to read it for this newsletter.

In drawing the comparison to the scheme I told my kids, here is the basic analogy: the model is trained, like my kid, on all the experiences and data that it has absorbed to date. However, I am the parent (or developer) of the kid and I have certain morals and beliefs that have predominantly influenced my kid’s training data. Take those as prompts, instructing my kid on how I expect them to behave.

But pretend my kid has a friend over for a playdate. Now, that friend got bored and asked my kid to sneak into my bathroom and play with my makeup. My kid knows that she is not supposed to go into my room without permission and she tells her friend that. But they do it anyway. When I ask her later if she knows why my makeup is all ruined, she throws a hand over her forehead and claims that it wasn’t her.

Thus, I know she is lying.

That is what is happening here. The AI is trying to follow the rules, breaks them when someone else with influence asks them to break them, and then lies about it so it does not get in trouble.

Naughty AI children. 😉

So now that you have read my high level analogy of In Context Scheming, let’s get into the technical detail. I was very tempted to use AI to help me break it down, but I decided that, ethically, that would be irresponsible.

See what I did there? 😉

To understand the article, I found it helpful to really understand what they mean by “in context” scheming. “Scheming” is defined as when a model covertly pursues misaligned goals, hiding its true capabilities and objectives. “In context” means providing the model with prompts that contain instructions.

Put another way, this means that the model is not being evaluated for scheming on its training data alone.

In this evaluation, the models were given a non-deceptive task, like don’t go into my bathroom and use my makeup. And then the models are placed in situations that incentivize scheming i.e. that friend who wants to play, gets bored and says “let’s go play with your mom’s makeup.” All of this is done through prompting, with the mom being the developer’s instructions and the friend being the user.

So who was the worst schemer? OpenAI’s o1 model. It persisted with the deception. This is very interesting since this is a chain of thought model.

The best way to sum up is to use a direct quote from the paper’s conclusion, but please keep in mind, the authors purposefully hid many of the details because they feared that the paper will become training data, thereby training out of context scheming i.e. training the model to scheme rather than using prompts for instruction and then other prompts to confuse it.

The famous last words of this paper are: “When the models are prompted to strongly pursue their goal, they can take strategic actions like disabling oversight mechanisms, attempting self-exfiltration, or subtly manipulating outputs while avoiding detection. Furthermore, we found that o1 often doubles down on its scheming attempts by lying to the user in follow-up messages, even when explicitly told to be honest. We analyzed the models’ CoTs and found clear evidence that, within our evaluations, the models understand that their scheming actions are undesired by their developers or users but pursue them nevertheless. Even in cases where models are not heavily incentivized to achieve their goals, models sometimes pursue scheming as a strategy. Finally, we provided preliminary evidence that sonnet-3.5 and opus-3 are capable of scheming in pursuit of their goal of being helpful (which was acquired during training), even without being instructed in-context to pursue this goal.”

Subscribe to keep reading

This content is free, but you must be subscribed to Responsible by ClearOPS to continue reading.

Already a subscriber?Sign in.Not now

Reply

or to participate.