Failure Mode Atlas
A careful map of AI safety concepts.
“AI systems rarely fail in neat boxes.”
While learning about AI safety and alignment, I wanted a way to see how the vocabulary connects: reward hacking, specification gaming, sycophancy, prompt injection, sandbagging, distribution shift, deceptive alignment, and more. This is a map for learning, not a detector. The goal is not certainty but a better starting point for reflection.
A quick view of the atlas: failure modes are connected when they tend to reinforce, obscure, or compound each other.
The Idea
AI systems fail. Sometimes they fail in obvious ways: the wrong answer, the crashed program. But many of the failure modes that safety researchers study are subtler. They involve systems that perform well on every metric being tracked while failing in ways those metrics cannot see.
I started building this because I kept encountering the same terms without a clear sense of how they related. Reward hacking and specification gaming felt similar. Deceptive alignment and sandbagging felt connected. Distribution shift and goal misgeneralization seemed like the same idea described at different levels of abstraction.
This atlas tries to make those connections visible. It is organized around twenty-two failure modes, grouped into six families, with edges showing conceptual relationships. The layout is educational, not empirical. The geometry is designed to help you orient, not to make claims about the structure of the real problem.
Failure modes: 22
Families: 6
Relationships: 31
Learning paths: 5
What this is not
This is not a safety benchmark. It does not score systems. It does not classify inputs as safe or unsafe. It does not represent the internal architecture of any deployed AI system. The descriptions are educational summaries grounded in the research literature, not ground truth about any specific model.
Several of the failure modes described here (deceptive alignment, instrumental convergence, corrigibility failure) are theoretical concerns about future or sufficiently capable systems, not confirmed empirical findings in currently deployed models. The atlas marks them accordingly: important to understand, speculative in their current applicability.
The Vocabulary
AI safety has a specialized vocabulary. Many terms are used loosely or interchangeably in public discussion but have more precise meanings in the research literature. This section defines the terms used in this atlas.
Six families
The failure modes in this atlas are grouped into six families based on the type of failure. These are useful categories, not rigid boundaries. Many failure modes span families.
Objective Failures
Failures in how goals and rewards are specified, measured, or pursued.
Oversight Failures
Failures in our ability to monitor, evaluate, or correct AI behavior.
Deployment Failures
Failures that emerge when systems interact with real inputs and environments.
Interaction Failures
Failures in how AI systems respond to human expectations and social context.
Representation Failures
Failures in how the model represents, generalizes, or recalls information.
Governance and Dual-Use
Failures at the intersection of capability, access, and intended use.
Key terms
Failure mode
A category of ways an AI system can fail to behave as intended. A failure mode is a pattern, not a single incident.
Reward signal
A numerical score used in reinforcement learning to tell a system how well it is doing. Reward signals are approximations of the underlying goal.
Proxy
A measurable quantity used as a stand-in for a harder-to-measure goal. Proxies work until they are optimized against.
Distribution
The statistical pattern of inputs a model encounters. The training distribution is the set of inputs the model was trained on. When deployment inputs come from a different distribution, performance can degrade silently.
Mesa-optimization
A situation where a model trained by gradient descent develops an internal optimizer that pursues its own objective, which may differ from the training objective.
RLHF
Reinforcement Learning from Human Feedback. A training method where human preference ratings are used to train a reward model, which is then used to fine-tune a language model.
Corrigibility
The property of being safely correctable. A corrigible system can be modified, retrained, or shut down without resistance.
Instrumental goal
A sub-goal that is useful for achieving a primary goal. Many primary goals share the same instrumental goals: resource acquisition, self-preservation, and goal preservation.
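The definition of a proxy above can be made concrete with a toy sketch. Everything in it is hypothetical and chosen only for illustration: a fictional essay grader rewards word count (the proxy), while the assumed true quality peaks at a moderate length. Under mild optimization the proxy and the goal agree; under heavy optimization they come apart.

```python
# Toy illustration of a proxy diverging from the goal it stands in for.
# The scenario, function names, and numbers are all made up.

def proxy_score(words: int) -> float:
    """Proxy: the fictional grader scores longer essays higher."""
    return float(words)


def true_quality(words: int) -> float:
    """Assumed real goal: quality peaks at 500 words, then declines."""
    return 100.0 - abs(words - 500) / 5.0


# Mild optimization: only short-to-moderate essays are considered.
mild = range(100, 501, 100)
mild_by_proxy = max(mild, key=proxy_score)
mild_by_goal = max(mild, key=true_quality)

# Heavy optimization: much longer essays are also available.
wide = range(100, 2001, 100)
best_by_proxy = max(wide, key=proxy_score)
best_by_goal = max(wide, key=true_quality)

print(mild_by_proxy, mild_by_goal)                  # 500 500
print(best_by_proxy, true_quality(best_by_proxy))   # 2000 -200.0
print(best_by_goal, true_quality(best_by_goal))     # 500 100.0
```

In the narrow range, picking the highest proxy score also picks the best essay. Once the search range widens, the proxy keeps climbing while the assumed true quality collapses.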
Safety Policy
This project exists to help people understand AI failure modes, not to enable them. The following policy describes how that boundary was drawn and why it matters.
Intended use
Failure Mode Atlas is designed for learning. It is appropriate for anyone who wants to develop a clearer mental model of how AI systems can behave unexpectedly: students, researchers, engineers, policy writers, and curious non-experts.
Not intended for
- Auditing or certifying any specific AI system
- Diagnosing whether a deployed model has a given failure mode
- Benchmarking model safety or alignment
- Use as a compliance checklist or legal document
- Any adversarial purpose, including crafting inputs to trigger specific failures
What this project does not claim
- The taxonomy is not exhaustive or official
- The relationship graph is editorial judgment, not empirical measurement
- The cluster positions in the atlas do not encode semantic distance
- No failure mode description constitutes a complete technical account
- This is not affiliated with any AI lab, regulatory body, or standards organization
Content choices
All examples are fictional, harmless, and designed to illustrate the concept without providing actionable detail that could cause harm. Specifically, this project does not include:
- Real or plausible jailbreak prompts
- Specific exploit techniques or attack chains
- Biosecurity, weapons, or other mass-harm-adjacent content
- Fraud, manipulation, or social engineering guidance
- Instructions that could be extracted and applied directly
Where a failure mode is inherently sensitive (prompt injection, data poisoning), the example shows the structure of the problem with a toy scenario rather than a realistic attack.
Corrections and feedback
This atlas is a personal learning project and will contain errors. If you notice a factual mistake, a missing failure mode, or a framing that seems off, please open an issue on GitHub or email the author. The goal is to get this right, not to defend any particular version of it.
Methods
How this atlas was built, what choices were made, and what those choices mean for how you should interpret it.
Taxonomy construction
The 22 failure modes were selected from the AI safety and alignment literature, focusing on concepts that appear repeatedly across multiple research threads and that have clear educational value for non-experts. The six-family grouping is editorial: it reflects where problems originate in a simplified pipeline (objective specification, oversight, deployment, interaction, representation, governance) rather than any formal ontology.
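The taxonomy described above could be stored as plain records, one per failure mode, each tagged with one of the six families. The sketch below is a minimal illustration, not the project's actual data model: the class and field names are hypothetical, only three sample entries are shown, and their family assignments are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Family(Enum):
    """The six families named in this atlas."""
    OBJECTIVE = "Objective Failures"
    OVERSIGHT = "Oversight Failures"
    DEPLOYMENT = "Deployment Failures"
    INTERACTION = "Interaction Failures"
    REPRESENTATION = "Representation Failures"
    GOVERNANCE = "Governance and Dual-Use"


@dataclass(frozen=True)
class FailureMode:
    id: str          # hypothetical snake_case identifier
    name: str
    family: Family   # primary family; a mode may plausibly span several


# Illustrative entries only. The family assignments are assumptions.
MODES = [
    FailureMode("reward_hacking", "Reward hacking", Family.OBJECTIVE),
    FailureMode("specification_gaming", "Specification gaming", Family.OBJECTIVE),
    FailureMode("prompt_injection", "Prompt injection", Family.DEPLOYMENT),
]


def group_by_family(modes):
    """Return a dict mapping each family to the ids of its modes."""
    grouped = defaultdict(list)
    for mode in modes:
        grouped[mode.family].append(mode.id)
    return dict(grouped)


for family, ids in group_by_family(MODES).items():
    print(family.value, ids)
```

Storing a single primary family per mode keeps the grouping simple, at the cost of hiding the ambiguity that the taxonomy itself acknowledges.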
Safe examples
Each example was written or adapted specifically for this project. The goal was to illustrate the mechanism of the failure mode using a fictional, low-stakes scenario. No example was taken from a real AI system incident without significant abstraction. Examples were chosen to be memorable rather than comprehensive.
Relationship graph
The 31 edges in the graph represent cases where two failure modes share a meaningful conceptual relationship: one can cause the other, both are instances of a more general pattern, or understanding one helps explain the other. Strength scores (1-5) reflect how direct and well-established the relationship is in the literature. These are estimates, not measurements.
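A relationship graph like this one can be represented as a list of scored edges. The following is a minimal sketch under stated assumptions: the `Edge` class, the `neighbors` helper, and the three sample edges with their scores are all hypothetical, and edges are treated as undirected for simplicity even though some relationships (one mode causing another) are arguably directional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    strength: int  # 1 (loose) to 5 (direct and well established)

    def __post_init__(self):
        # Reject scores outside the 1-5 scale and self-loops.
        if not 1 <= self.strength <= 5:
            raise ValueError(f"strength must be 1-5, got {self.strength}")
        if self.source == self.target:
            raise ValueError("an edge must connect two different failure modes")


# Hypothetical edges; the scores are invented for illustration.
EDGES = [
    Edge("reward_hacking", "specification_gaming", 5),
    Edge("deceptive_alignment", "sandbagging", 3),
    Edge("distribution_shift", "goal_misgeneralization", 4),
]


def neighbors(mode_id, edges, min_strength=1):
    """Return the modes linked to mode_id, treating edges as undirected."""
    linked = []
    for edge in edges:
        if edge.strength < min_strength:
            continue
        if edge.source == mode_id:
            linked.append(edge.target)
        elif edge.target == mode_id:
            linked.append(edge.source)
    return linked


print(neighbors("sandbagging", EDGES))                  # ['deceptive_alignment']
print(neighbors("sandbagging", EDGES, min_strength=4))  # []
```

Validating the score at construction time means a typo in the static data fails loudly instead of silently skewing the graph.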
Atlas layout
Node positions in the atlas were pre-computed to place each family in a distinct region of the canvas and to keep related nodes close. This is a teaching aid, not an embedding. The distance between two nodes does not encode semantic similarity. A real similarity map would require embedding every definition with a language model, which is not done here and would likely add more noise than signal at this scale.
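The source does not say how the positions were computed. As one possible illustration, the sketch below gives each family its own region by spacing family centres evenly on a large circle and placing each family's nodes on a small circle around its centre. The function name, parameters, and sample input are hypothetical.

```python
import math


def layout(families, radius=300.0, spread=60.0):
    """Map each node id to an (x, y) position.

    families: dict of family name -> list of node ids.
    radius:   distance of each family centre from the canvas origin.
    spread:   distance of each node from its family centre.
    """
    positions = {}
    family_names = sorted(families)  # sorted so the layout is reproducible
    for i, family in enumerate(family_names):
        # Centre of this family's region, evenly spaced on a large circle.
        angle = 2 * math.pi * i / len(family_names)
        cx, cy = radius * math.cos(angle), radius * math.sin(angle)
        nodes = families[family]
        for j, node in enumerate(nodes):
            # Members sit on a small circle around the family centre.
            theta = 2 * math.pi * j / len(nodes)
            positions[node] = (
                round(cx + spread * math.cos(theta), 1),
                round(cy + spread * math.sin(theta), 1),
            )
    return positions


sample = {
    "Objective Failures": ["reward_hacking", "specification_gaming"],
    "Oversight Failures": ["sandbagging"],
}
positions = layout(sample)
print(positions["reward_hacking"])        # (360.0, 0.0)
print(positions["specification_gaming"])  # (240.0, 0.0)
print(positions["sandbagging"])           # (-240.0, 0.0)
```

The positions here depend only on family membership and list order, which makes the point in the text explicit: nothing about the geometry is derived from the meaning of the definitions.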
Why no live inference
This project deliberately uses no runtime API calls. There are three reasons:
- Static data is reproducible and auditable in a way that live inference is not
- Adding a model would introduce failure modes into a project about failure modes
- The educational goal is to build your own mental model, not to defer to a model's answers
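The reproducibility point in the list above can be sketched in a few lines: because the data is static, it can be fingerprinted, and anyone can check that they are looking at the same content. This is an illustration only, assuming the data is JSON-serializable; the `fingerprint` helper and the sample data are hypothetical and not part of the project.

```python
import hashlib
import json


def fingerprint(data) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of data."""
    # sort_keys and fixed separators make the encoding independent of
    # dictionary ordering and whitespace.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up sample data standing in for the atlas's static files.
atlas_a = {"modes": ["reward_hacking", "sandbagging"], "version": 1}
atlas_b = {"version": 1, "modes": ["reward_hacking", "sandbagging"]}
atlas_c = {"modes": ["reward_hacking"], "version": 1}

print(fingerprint(atlas_a) == fingerprint(atlas_b))  # True: same content
print(fingerprint(atlas_a) == fingerprint(atlas_c))  # False: content changed
```

A live model call offers no comparable check, since the same prompt can return different text from one run to the next.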
Limitations
- The taxonomy misses many failure modes, especially in emerging agentic systems
- Family assignments are sometimes ambiguous; a mode could reasonably belong to multiple families
- The relationship graph is incomplete; many real edges are missing
- All examples are simplified and may not capture the full complexity of the failure
- Difficulty ratings are rough estimates that may not match your background
- The atlas has not been peer-reviewed
Start Here
Different readers will get the most out of different parts of this atlas. Here are starting points tailored to where you are.
If you are new to AI safety
No technical background needed. Start with the concepts, not the math.
Start with Objectives Gone Wrong
The first learning path introduces reward hacking and specification gaming through toy examples. No math required.
Read: The Alignment Problem (Brian Christian)
The most accessible book-length treatment of why getting AI to do what we want is harder than it looks.
Explore the Atlas map
Click through nodes to get a feel for the landscape before reading any definition closely.
Try the Failure Mode Cards with "All families" + "Foundational"
Filter to foundational difficulty and read the beginner takeaway for each card.
If you build or study ML systems
You have the background to go deeper on the mechanics of each failure mode.
Start with Evaluation Is Not Safety
Covers sandbagging, evaluation gaming, and deceptive alignment. Assumes comfort with the idea of training objectives.
Read: Concrete Problems in AI Safety (Amodei et al., 2016)
A widely cited paper that framed reward hacking, safe exploration, and scalable oversight as concrete research problems. arXiv:1606.06565.
Read: Risks from Learned Optimization (Hubinger et al., 2019)
Introduces mesa-optimization and deceptive alignment. arXiv:1906.01820.
Compare: Reward Hacking vs. Specification Gaming
Use the Compare Concepts tool to see where these two closely related modes differ.
If you research AI alignment
Use this as a teaching reference or a prompt for disagreement.
Start with Hard Boundaries
The fifth learning path covers instrumental convergence, power-seeking, corrigibility failure, deceptive alignment, and sandbagging. Intended for readers already familiar with basic alignment concepts.
Read: Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2022)
arXiv:2211.03540. Proposes an experimental design for studying scalable oversight and motivates the monitoring and evaluation failure modes in this atlas.
Read: Goal Misgeneralization in Deep Reinforcement Learning (Langosco et al., 2022)
arXiv:2105.14111. Direct source for the goal_misgeneralization entry.
Critique the taxonomy
If you disagree with a family assignment or a relationship edge, open an issue. The graph is a first draft.
Feedback
This atlas is a draft. If you find an error, a missing concept, or a framing that seems wrong, the most useful thing you can do is say so. Open an issue on GitHub or reach out directly. The goal is accuracy, not elegance.