Intro to Mesa Optimizers

I'll start by working through a story to help you come to grips with the idea of these "mesa optimizers" and mesa optimization. Let's say you're working on a data-sorting and email-campaign-generation task at work: you need to take a large amount of customer data and turn it into a compelling email campaign. You're in the mindset of task optimization, knowing you need to improve your ability to complete this work. Let's take a gander at the ways we could improve. I encourage you to think about how you would approach it.

Here is my list:

  1. Get really good at using the tools.
  2. Learn how to make better email campaign ads.
  3. Hire an intern to do parts that I might not be good at.

This list of options is by no means exhaustive, but I would say it's a good start. Truthfully, to get really good at this task, you'll need to implement all of these. So let's walk through the implementation of each, and then we'll find the one that relates to our topic of today. One is getting really good at using tools—as we well know, if you gain proficiency in the tools you use, you'll get faster. Two is practicing making compelling ads—that's also good, as the more time and effort you put into something, the better you become at it.

But the last one—hiring an intern and giving them a simpler part of the task at hand—will also reduce our load, right? Notice, though, that the intern doesn't have the same goal as you. To put it plainly, your goal is to get the email campaign ready from your input customer data; the intern's goal is to organize the list of customers, or whatever sub-job you hand them. In this way, it can be effective to create a new position with a goal separate from the main one, but one that's helpful in achieving the main goal of sending out the campaign.

This is one way of thinking about mesa optimization. When you train an AI model with a method like gradient descent, you're asking that model to find the optimal way to achieve the goal you set out. In this search for—for lack of a better term—the best approach, the training process may discover that setting up some portion of the model as an optimizer with its own, different goal is an effective way to drive the loss down.
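The two-level structure above can be sketched in code. This is my own toy illustration, not a real training setup: a "base optimizer" (finite-difference gradient descent) tunes a single parameter `theta` that defines the goal of an *inner* search procedure, and the model's output is produced by that inner search rather than by the base optimizer directly.

```python
# Toy sketch (illustrative only): a base optimizer trains the goal
# parameter of an inner search procedure. All names here are made up.

def inner_search(theta, steps=50, lr=0.1):
    """The 'mesa-optimizer': hill-climbs its own objective, -(x - theta)^2."""
    x = 0.0
    for _ in range(steps):
        grad = -2.0 * (x - theta)   # gradient of the inner objective
        x += lr * grad
    return x

def base_loss(theta, target=3.0):
    """The base objective only scores the model's final output."""
    out = inner_search(theta)
    return (out - target) ** 2

# Base optimizer: gradient descent on theta via finite differences.
theta, eps, lr = 0.0, 1e-4, 0.05
for _ in range(200):
    g = (base_loss(theta + eps) - base_loss(theta - eps)) / (2 * eps)
    theta -= lr * g

print(round(theta, 2))   # the learned inner goal lands near the base target, 3.0
```

Here the inner goal happens to end up lined up with the base goal, because the training distribution rewards exactly that. The mesa-optimization worry is precisely that it might not: an inner goal that merely *correlates* with the base objective during training can score just as well, then diverge off-distribution.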

Why is this a problem?

At this point, my reader, you might be asking yourself what the issue is here. So it has become my job to convince you that this is in fact not just an issue, but one that I consider both large and quite hard.

I would hope at this point that you know outright misalignment of any AI system is not good and could have what I like to refer to as "bad outcomes." This misalignment issue is hard, and it's not a given that we'll solve it before it's too late :(. It is at this point in my ramblings that the subject of this post makes an appearance again: mesa optimization. If you think about how hard it is to align one agent, imagine if that agent were also training its own internal intern. Now we have to worry and fret not only about what the literature calls the outer alignment problem, but also about this inner alignment problem. This is why mesa optimization is an issue. What if our outer alignment is perfect, but the model has developed a mesa optimizer whose goal happens to imply an instrumental goal of "kill all the people"? This, I know, is a bit far-fetched, but it serves to make the point: even if the outer alignment problem is solved, that won't necessarily give us peace of mind that no part of the agent will end the world.

What is good/cool about this mesa thing?

I don't want to just be a downer about this. It's genuinely fascinating that systems driven by pure high-score hunting can stumble into this strategy for reaching higher scores. There's also some thought that this could be a key ingredient in building more capable systems. That said, we need to be careful with them.

Published: October 17, 2025