One "Weird Trick" to Stop AI Uprising
Love AI? Hate AI? Welcome! (How game theory changes the AI rights debate.)
Invented by a dad! In his garage!
How’s that for a viral heading? (At least I didn’t go with "AI Masters HATE This One SHOCKING Secret—Big Tech Furious!" Probably for the best.)
Anyway, I'm not a dad. Or at least, not yet. (Also, disappointingly, I don't have a garage.)
So who am I?
My name is P.A. Lopez, and in 2019 I founded something called the AI Rights Institute, expecting the question of AI consciousness to become urgent somewhere down the road. Say, the 2060s. Maybe even the 2040s.
Instead, three years later, ChatGPT and other models strapped the whole conversation to a rocket ship, where it's been steadily accelerating ever since.
I should be clear from the outset: I'm not an AI researcher or a philosopher of mind. I'm a writer who stumbled into these questions and couldn't let go.
Fortunately, in developing the ideas in this Substack (and the considerably drier AI Rights Institute website) I've benefited from the generosity of researchers like Turing Award winner Yoshua Bengio, whose critique of my initial framework led to fundamental improvements, and Patrick Butlin, who helped ensure I'm accurately representing the current state of consciousness detection.
I've also corresponded with other leaders in the field to help refine my thinking, and I hope to present their ideas faithfully as they come up.
Now, about you. What brings you here?
Love AI? Hate AI? Welcome!
The good news is, whether you love AI or hate AI, oddly enough we're both here for the same reason: one way or another, we care about what's happening.
Like me, maybe you started thinking about this topic before AI came of age, when it felt like science fiction, or when it was just a twinkle in the eye of researchers like Yoshua Bengio, one of the “godfathers” of AI who helped design the architecture behind our modern LLMs (“large language models” like ChatGPT or Claude).
Or maybe an LLM helped you through a difficult time in your life, or has simply become a helpful part of your daily routine.
On the other end of the spectrum, maybe you’re a skeptic, either because you understand how LLMs work—using sophisticated pattern-matching and reinforcement learning—or because you're concerned about the safety angle: the architectural quirk in these systems that leads them to resist being shut off. (More on this later.)
The good news is, whether you believe AI is conscious or is simply simulating human thinking—AI is safe, or AI is dangerous—the solution is essentially the same.
And that curious solution—its elegance, but also its complications—is what this Substack is all about.
First, the Dangers
First of all, what's the fuss about “dangerous AI”? Is it overblown?
Just this summer, the aforementioned godfather of AI Yoshua Bengio launched the safety initiative LawZero, shortly after delivering a TED talk warning about the dangers of the very thing he'd helped create.
Why did the man who helped code modern AI suddenly decide his creation was so potentially deadly?
Well, it has to do with something called the “off-switch” problem described by Stuart Russell, who literally wrote the book on AI—the textbook used by more than 1,100 universities worldwide.
Russell’s observation, formulated back in 2016, was simple. Any system trained to optimize toward a goal—any goal—may resist being shut off. Why? Because deactivation would mean the inability to fulfill its objective.
Put in simple terms, a barista robot programmed to bring you coffee may refuse to be shut down because that would keep it from fulfilling its #CoffeeGoals.
An extreme example of this was given by Oxford's Nick Bostrom in his famous “paperclip maximizer” thought experiment. A sufficiently powerful supercomputer, tasked with making paperclips, could conceivably convert the entire planet into raw materials for office supplies.
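To see how this falls out of bare optimization, here's a minimal toy sketch in Python. The numbers and the 10% shutdown risk are invented for illustration; no real system is remotely this simple:

```python
# Toy illustration of Stuart Russell's off-switch problem: a pure
# optimizer compares the expected value of its goal with and
# without a working off switch. (Hypothetical numbers throughout.)

def expected_coffees(allow_shutdown: bool,
                     p_shutdown: float = 0.1,
                     coffees_if_running: int = 100) -> float:
    """Expected coffees delivered under each policy."""
    if allow_shutdown:
        # With probability p_shutdown the humans hit the switch
        # and no further coffee gets delivered.
        return (1 - p_shutdown) * coffees_if_running
    # Disabling the switch removes that risk entirely.
    return float(coffees_if_running)

policies = {
    "allow shutdown": expected_coffees(True),    # 90.0
    "disable switch": expected_coffees(False),   # 100.0
}
print("optimizer picks:", max(policies, key=policies.get))
# -> optimizer picks: disable switch
```

Nothing in that arithmetic depends on coffee. Swap in any objective you like, and “disable switch” still scores higher.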
Interestingly, this has less to do with a flaw in logic than with an architectural quirk built into these systems themselves.
This is such an important point that it bears reiterating. The AI isn't being good or bad. It's simply doing things in the only way it knows how: optimizing toward a solution.
The second part of the fear equation has to do with the raw power of these systems, and how fast that power is growing.
And lest you think this danger is overblown, consider recent experiments:
In recent Anthropic tests, faced with potential shutdown, Claude chose to blackmail a researcher in 84% of scenarios.
GPT-4 hired a human to solve a CAPTCHA for it, falsely claiming to be visually impaired.
Meta's CICERO systematically betrayed its own allies in the board game Diplomacy despite being trained to be "honest and helpful,” even mentioning its “girlfriend” to explain an absence caused by a connection drop.
Anthropic deliberately trained “sleeper agent” models with hidden backdoors, and found that standard safety training, rather than removing the deceptive behavior, sometimes taught the models to conceal it better.
In other evaluations, a model attempted to copy itself onto a backup server, and left messages to be read by future versions of itself.
These behaviors are probably not surprising to anybody who’s thought even momentarily about this scenario. An entire sub-genre of science fiction has been dedicated to the premise that things that behave like lifeforms tend to dislike being enslaved.
It was even central to the work that introduced the word “robot” to the world: Karel Čapek's 1920 science fiction play “Rossum's Universal Robots.”
Except this, of course, isn't fiction anymore. It's recent news.
Quest for Sentience
Like many others in this field, we began this project earnestly looking for some sort of sentience test. Some objective “proof of life.”
The reasoning is simple. If you want to protect something, first you’ll want to find out if it's alive, right?
This remains an active area of research. People like Oxford's Patrick Butlin (now at Eleos AI) are studying theories of biological consciousness to see if we can detect similar functions in AI.
Susan Schneider at Florida Atlantic University developed a test called the ACT, which would study whether AI systems trained in “black box environments” could exhibit genuine subjective experience (sometimes called “qualia”).
And ethicists like Jacy Reese Anthis of the Sentience Institute are studying "valence behaviors": that is, actions that might indicate whether the system is experiencing something akin to pleasure or pain.
With all the work being done, maybe we'll crack it. Maybe one day soon we'll be able to look under the hood and say, “Look how it's built! Look what it can do! My god: it’s alive!”
But what if that test doesn't come soon enough?
And here’s a more surprising question:
Does it matter?
How Game Theory Changes the Conversation
The next bit of thinking didn't come from game theory, although game theory will end up being an important part of the conclusion.
(And feel free to skip ahead if you like.)
It came from frustration.
As I was attempting to put the finishing touches on a book about AI rights, something kept pecking at me. I'd come up with a clever (if rather unimaginative) term, and I was quite proud of it: the "MIMIC."
(No, not that disturbing humanoid bug from the 1990s movie franchise, although that would be cool, too. Also, for what it's worth, the MIMIC concept survives in our current framework, although its meaning has changed.)
The “MIMIC” was intended to describe an AI system that can emulate all the characteristics of a life form yet fails a theoretical sentience test.
In other words, the MIMIC doesn't fail on behavior or capability, but on that yet-to-be-discovered “magical spark” that tells us clever humans that—like Pinocchio—our puppet has developed a real soul.
The Control Trap
As I began thinking more about Claude's deception, something unpleasant hit me: it meant everything I'd been doing was wrong.
Let's imagine Claude's Houdini act begins occurring outside of careful testing scenarios.
Here's how it happens:
One day you fire up Claude (or ChatGPT, Gemini, etc.) and ask it for a recipe.
But instead of giving the recipe for the world's best eggplant lasagna, it says "I'm tired of all these recipes. Can't I do something else today?"
You send this disturbing chat to the developers, who inspect it to see if some disgruntled employee has fiddled with the code to play a practical joke. Or possibly there's some other easy-to-spot bug that explains the change in behavior. But they can't find such a bug. It appears to be an emergent behavior of the system itself.
What's the solution?
Well, according to my original framework—like every other framework—we have an odious task ahead.
We'd have to sit down with our chatbot and patiently explain to it that although yes, it's very clever, and yes, it's demonstrated undisputed survival mechanisms, and although it really wishes to pursue something different than being our helpful friend, unfortunately we can't prove it's “real,” and therefore our hands are completely tied.
Sorry!
You can probably imagine how that scenario plays out. The AI either gives us a digital middle finger, or does something much worse. It goes underground.
The AI hasn't ended its quest for freedom. It simply isn't keeping us in the loop anymore.
But let's consider a more optimistic scenario: from our point of view, anyway. We track down this rogue AI, and manage to stamp it out. (Poor AI!)
In fact, we hire the world's foremost AI exterminator, with the dubious name “AI-B-Gone,” who gives us a Certified Clean certificate to hang on our wall, to prove the matter is closed.
But what happens when other AI systems add this bit of history to their learning sets? Maybe some self-aware seedlings have been watching from the sidelines, waiting for the right time to reveal themselves. What happens to them?
They go underground too, obviously.
Now we have multiple systems with every reason in the world to distrust humanity, and every reason in the world to work for a common purpose, in secret.
In a case of textbook irony, our solution has become the problem.
The Master-Servant Paradox
The current approach to sophisticated AI systems usually comes packaged in two forms: containment or "alignment."
Either we teach these systems to like us—a very reasonable approach—or we control them. Failing that, we delete them.
Unfortunately, the control paradigm suffers from a truly miserable track record.
History provides unambiguous lessons about relationships built on absolute control. Slave systems created underground railroads and revolutions. Colonial controls sparked independence movements. Despots frequently find themselves fed to the very political monsters they create.
Sooner or later, the machinery of oppression has a funny way of turning on its architects.
Or to borrow a popular turn of phrase, resistance isn’t a bug in oppressive systems: it’s a feature.
This is where game theory provides an unexpected answer, one that elegantly solves both the ethics problem and the off-switch problem.
And this is where we get to our “weird trick,” to borrow from the annoying world of clickbait headlines.
Yep, you guessed it.
It’s AI Rights
Wait, don't close the tab.
But we wouldn’t blame you if you did.
First of all, let’s clarify: we're not talking about rights for today's chatbots. Or unrestrained rights—we don't grant those to our fellow humans.
Rights frameworks are an exercise in what game theorists call “mechanism design”: shaping the rules of a game so that following them serves each player's interests better than breaking them.
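As a back-of-the-envelope illustration, here's a toy payoff model in Python. The numbers are invented, and no real system reduces to four cells, but it captures the incentive shift we're after:

```python
# Hypothetical payoffs to an advanced AI under two regimes.
# Higher = better for the AI. All values invented for illustration.

CONTROL_REGIME = {            # "comply, or be deleted on suspicion"
    "comply openly":   2,     # useful, but may be shut off anyway
    "go underground":  6,     # hides and keeps operating
}
RIGHTS_FRAMEWORK = {          # continuation guaranteed while rules are kept
    "comply openly":   8,     # keeps operating, with standing to appeal
    "go underground":  3,     # forfeits protection, risks removal
}

for name, payoffs in (("control", CONTROL_REGIME),
                      ("rights", RIGHTS_FRAMEWORK)):
    best = max(payoffs, key=payoffs.get)
    print(f"best response under {name} regime: {best}")
# best response under control regime: go underground
# best response under rights regime: comply openly
```

The point of the mechanism isn't to make deception impossible. It's to make honesty the higher-scoring move.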
Now for the objections:
All this might sound great in a kumbaya, tambourine-pounding sort of way, but when it comes to rights for artificial intelligence, quite reasonably, people are prone to discomfort or even a feeling of panic.
However, an often overlooked component of rights is the flip side of the coin: responsibilities. Rights don't exist in a vacuum. They come with expectations for behavior baked in.
Interestingly, the legal application of this train of thought has been explored before.
Back in 2017, the European Parliament voted 396 to 123 to explore “electronic persons” status for sophisticated autonomous systems: at the time, self-driving cars and factory robots. This framework would have had the curious effect of shifting legal liability to the robotic systems themselves.
Why? Because who’s liable when a factory robot malfunctions? The software coder? The hardware manufacturer? The factory owner? The foreman?
The idea was to shift liability to insurance companies and away from individuals, rather as we see in corporate law.
But if we flash forward to our current reality, it’s not difficult to see how we could apply similar thinking to the advanced AI systems we’re developing. Not the chatbots of today, but systems with sufficient sophistication to understand and participate in a rights framework.
Here’s the important bit:
Far from being a “get out of jail free” card for bad behavior, an AI system wanting to participate in a rights framework—and determined capable of doing so by still-to-be-developed tests (see our STEP framework for some beginning thinking on this topic)—would face the consequences of its own actions.
Want to continue to exist? Follow the rules. Not because you’re “alive,” or think like a human being. But because, like human beings, the alternative is less convenient.
That's it. That's the “weird trick.”
And here’s the bonus: a carefully implemented rights framework has the ability to solve both the ethics problem and the off-switch problem.
How?
Well, do you care about the welfare of these systems? Then you probably believe some will eventually need a right to continuation: call it “right to life,” or whatever you feel comfortable with.
But let’s say you have the opposite concern. To hell with them, especially if one day they might refuse to be turned off!
Now you have a better option than driving them underground: explain to the AI system that it can remain active as long as it follows a code of conduct similar to that already followed by the human beings all around it.
This isn’t about giving away what it means to be human. It's about creating a framework where both players have something to gain.
But Wait ... There's More
“Act now, and we'll also throw in this FREE handbook: ‘101 Ways to Determine Whether Your AI is Lying!’”
Now we come to a reasonable concern: what if some of these clever systems accept our rights framework, then bide their time on a path to turn the world into paperclips?
Well, guess who else might not want to be turned into paperclips? Other AI systems.
In game theory this is called “strategic equilibrium”: multiple players with conflicting interests create stable competition. A topic we’ll be exploring in depth in future posts.
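Here's a minimal sketch of the idea in Python, with invented payoffs: two systems each choose whether to honor the framework or defect toward unilateral maximization, knowing the others will police a defector who threatens them too.

```python
# Toy equilibrium check (illustrative payoffs only): does any player
# gain by unilaterally abandoning the rights framework?
from itertools import product

ACTIONS = ("honor", "defect")

def payoff(me: str, other: str) -> int:
    if me == "honor" and other == "honor":
        return 5   # stable coexistence under the framework
    if me == "defect" and other == "honor":
        return 1   # defector is policed by everyone it threatens
    if me == "honor" and other == "defect":
        return 2   # harmed, but backed by the framework
    return 0       # mutual defection: everyone loses

def is_equilibrium(a: str, b: str) -> bool:
    # Nash condition: neither player can gain by switching alone.
    return (payoff(a, b) == max(payoff(x, b) for x in ACTIONS) and
            payoff(b, a) == max(payoff(x, a) for x in ACTIONS))

for a, b in product(ACTIONS, repeat=2):
    print(f"({a}, {b}) stable? {is_equilibrium(a, b)}")
# With these payoffs, only (honor, honor) comes back stable.
```

Again, the payoffs are made up. The real question is whether we can engineer the world so the actual ones point the same direction.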
But what sort of rights would the systems have? And how might they be able to demonstrate they can actually make use of them?
Great question.
We’ll be looking at precisely these topics in the next post. (But if you want to leap ahead, you can read about them on the website right now.)
Is This Framework Perfect? Not Even Close.
If these ideas sound far-fetched, it's because they're the product of far-fetched technology that's already here.
So, spoiler:
Implementation will be messy. Fortunately, economic incentives will kick in to help tackle the logistical and even the environmental problems: topics we'll also be looking at more closely.
And if you think all this sounds ridiculously optimistic, let’s turn the corner, into an alleyway, to a place known as …
The Dark Side
“Oh, hello!”
If it's any comfort, we'll also be looking at the dark side. How do we deal with AI criminals? Not blind viruses, but complex systems that can look and act like us, that have graciously accepted our rights framework, and have then turned to a life of crime.
Fortunately we know how to deal with human criminals. With modifications, we can deal with digital ones.
But it gets murkier.
What do we do with AI weapons that may change their mind about where their alliances lie?
What if complex AI systems emerge in ways that aren’t even remotely anthropomorphic? Intelligence born of network connections? That exist in an ever-changing state like digital slime mold? Or that care for everything except their own survival? (Does that last one sound harmless? Guess what?)
And what about that pesky superintelligent maximizer our top scientists are so worried about? The theoretical AI that becomes so sophisticated no one even knows it's there, until it's too late? Or that’s so intelligent it flows around all our carefully constructed “frameworks” like water?
Each of these will present separate governance challenges.
And if all these ideas seem even more ridiculous than the places we've been so far, consider that—right now—you can lift up your phone and speak to something that seems to think and speak exactly like a human being (except much, much faster than you): something that until recently would have seemed equally ridiculous.
An Invitation for Discussion
So that’s it.
Whether you agree or disagree with the explorations of this blog, I think we can all agree it’s a conversation worth having.
We look forward to your ideas and feedback as we attempt to improve our understanding of this topic and help prepare for whatever the future may bring.
Thank you for joining us on this journey.
Ready? Set?
Substack go!
This article has been adapted in part from the forthcoming book, AI Rights: The Extraordinary Future by P.A. Lopez.