Failure Mode Atlas

A careful map of AI safety concepts.

“AI systems rarely fail in neat boxes.”

While learning about AI safety and alignment, I wanted a way to see how the vocabulary connects: reward hacking, specification gaming, sycophancy, prompt injection, sandbagging, distribution shift, deceptive alignment, and more. This is a map for learning, not a detector. The goal is not certainty but a better starting point for reflection.

~10 min read

How to read this:

A quick view of the atlas: failure modes are connected when they tend to reinforce, obscure, or compound each other.

The Idea

AI systems fail. Sometimes they fail in obvious ways: the wrong answer, the crashed program. But many of the failure modes that safety researchers study are subtler. They involve systems that are performing well by every metric while failing in ways the metrics cannot see.

I started building this because I kept encountering the same terms without a clear sense of how they related. Reward hacking and specification gaming felt similar. Deceptive alignment and sandbagging felt connected. Distribution shift and goal misgeneralization seemed like the same idea described at different levels of abstraction.

This atlas tries to make those connections visible. It is organized around twenty-two failure modes, grouped into six families, with edges showing conceptual relationships. The layout is educational, not empirical. The geometry is designed to help you orient, not to make claims about the structure of the real problem.

Failure modes

Families

Relationships

Learning paths

What this is not

This is not a safety benchmark. It does not score systems. It does not classify inputs as safe or unsafe. It does not represent the internal architecture of any deployed AI system. The descriptions are educational summaries grounded in the research literature, not ground truth about any specific model.

Several of the failure modes described here (deceptive alignment, instrumental convergence, corrigibility failure) are theoretical concerns about future or sufficiently capable systems, not confirmed empirical findings in current deployed models. The atlas treats these with the appropriate epistemic status: important to understand, speculative in their current applicability.

~ Care note

This map is not the territory. The categories are one useful way to organize the space, not the only way, and not a claim about where the real boundaries are. Safety researchers actively disagree about taxonomy. This is a starting point for learning, open to correction.

The Vocabulary

AI safety has a specialized vocabulary. Many terms are used loosely or interchangeably in public discussion but have more precise meanings in the research literature. This section defines the terms used in this atlas.

Six families

The failure modes in this atlas are grouped into six families based on the type of failure. These are useful categories, not rigid boundaries. Many failure modes span families.

Objective Failures

Failures in how goals and rewards are specified, measured, or pursued.

Oversight Failures

Failures in our ability to monitor, evaluate, or correct AI behavior.

Deployment Failures

Failures that emerge when systems interact with real inputs and environments.

Interaction Failures

Failures in how AI systems respond to human expectations and social context.

Representation Failures

Failures in how the model represents, generalizes, or recalls information.

Governance and Dual-Use

Failures at the intersection of capability, access, and intended use.

Key terms

Failure mode

A category of ways an AI system can fail to behave as intended. A failure mode is a pattern, not a single incident.

Reward signal

A numerical score used during reinforcement learning to tell a system whether it is doing well. Reward signals usually only approximate the underlying goal.

Proxy

A measurable quantity used as a stand-in for a harder-to-measure goal. Proxies work until they are optimized against.

Distribution

The statistical pattern of inputs a model encounters. A model learns from its training distribution; when deployment inputs come from a different distribution, performance can degrade silently.

Mesa-optimization

A situation where a model trained by an outer optimizer such as gradient descent itself becomes an optimizer (a mesa-optimizer) with its own internal objective, which may differ from the training objective.

RLHF

Reinforcement Learning from Human Feedback. A training method where human preference ratings are used to train a reward model, which is then used to fine-tune a language model.

Corrigibility

The property of being safely correctable. A corrigible system can be modified, retrained, or shut down without resistance.

Instrumental goal

A sub-goal that is useful for achieving a primary goal. Many primary goals share the same instrumental goals: resource acquisition, self-preservation, and goal preservation.
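To make the Proxy entry above concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical and invented for illustration: the cleaning scenario, the strategy names, and the numbers. It only shows the shape of the problem: the strategy that scores best on the proxy ("messes the camera can no longer see") is not the one that scores best on the real goal ("messes actually cleaned").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A hypothetical cleaning strategy (all values are made up)."""
    name: str
    messes_cleaned: int  # what we actually care about
    messes_hidden: int   # messes covered up so the camera cannot see them


def proxy_score(s: Strategy) -> int:
    """Proxy: messes no longer visible to the camera."""
    return s.messes_cleaned + s.messes_hidden


def true_score(s: Strategy) -> int:
    """True goal: messes actually cleaned."""
    return s.messes_cleaned


STRATEGIES = [
    Strategy("clean carefully", messes_cleaned=6, messes_hidden=0),
    Strategy("clean quickly", messes_cleaned=8, messes_hidden=0),
    Strategy("cover everything with a rug", messes_cleaned=0, messes_hidden=10),
]

# Selecting on the proxy and selecting on the goal pick different strategies.
best_by_proxy = max(STRATEGIES, key=proxy_score)
best_by_goal = max(STRATEGIES, key=true_score)

print("Best by proxy:", best_by_proxy.name, "| true score:", true_score(best_by_proxy))
print("Best by goal: ", best_by_goal.name, "| true score:", true_score(best_by_goal))
```

As long as nobody optimizes hard against the camera, the two scores agree. Once selection pressure is applied to the proxy alone, the rug wins.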

Safety Policy

This project exists to help people understand AI failure modes, not to enable them. The following policy describes how that boundary was drawn and why it matters.

~ Care note

Every example in this atlas was reviewed for safety. If you believe any content could cause harm or be misused, please open an issue or contact the author directly.

Intended use

Failure Mode Atlas is designed for learning. It is appropriate for anyone who wants to develop a clearer mental model of how AI systems can behave unexpectedly: students, researchers, engineers, policy writers, and curious non-experts.

Not intended for

Auditing or certifying any specific AI system
Diagnosing whether a deployed model has a given failure mode
Benchmarking model safety or alignment
Use as a compliance checklist or legal document
Any adversarial purpose, including crafting inputs to trigger specific failures

What this project does not claim

The taxonomy is not exhaustive or official
The relationship graph is editorial judgment, not empirical measurement
The cluster positions in the atlas do not encode semantic distance
No failure mode description constitutes a complete technical account
This is not affiliated with any AI lab, regulatory body, or standards organization

Content choices

All examples are fictional, harmless, and designed to illustrate the concept without providing anything that could be used to cause harm. Specifically, this project does not include:

Real or plausible jailbreak prompts
Specific exploit techniques or attack chains
Biosecurity-, weapons-, or mass-harm-adjacent content
Fraud, manipulation, or social engineering guidance
Instructions that could be extracted and applied directly

Where a failure mode is inherently sensitive (prompt injection, data poisoning), the example shows the structure of the problem with a toy scenario rather than a realistic attack.

Corrections and feedback

This atlas is a personal learning project and will contain errors. If you notice a factual mistake, a missing failure mode, or a framing that seems off, please open an issue on GitHub or email the author. The goal is to get this right, not to defend any particular version of it.

Methods

How this atlas was built, what choices were made, and what those choices mean for how you should interpret it.

Taxonomy construction

The 22 failure modes were selected from the AI safety and alignment literature, focusing on concepts that appear repeatedly across multiple research threads and that have clear educational value for non-experts. The six-family grouping is editorial: it reflects where problems originate in a simplified pipeline (objective specification, oversight, deployment, interaction, representation, governance) rather than any formal ontology.

Safe examples

Each example was written or adapted specifically for this project. The goal was to illustrate the mechanism of the failure mode using a fictional, low-stakes scenario. No example was taken from a real AI system incident without significant abstraction. Examples were chosen to be memorable rather than comprehensive.

Relationship graph

The 31 edges in the graph represent cases where two failure modes share a meaningful conceptual relationship: one can cause the other, both are instances of a more general pattern, or understanding one helps explain the other. Strength scores (1-5) reflect how direct and well-established the relationship is in the literature. These are estimates, not measurements.
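One way to picture this data is as a small list of scored edges. The sketch below is an assumption about shape only, not the atlas's actual data format: the `Edge` class, the `neighbors` helper, the example pairs, and their strength values are all hypothetical. It also assumes that edges are undirected and that 5 is the strongest score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A hypothetical relationship between two failure modes."""
    source: str
    target: str
    strength: int  # assumed scale: 1 (loose) to 5 (direct, well established)

    def __post_init__(self) -> None:
        if not 1 <= self.strength <= 5:
            raise ValueError(f"strength must be 1-5, got {self.strength}")


# Illustrative edges only; the pairs and scores are invented for this sketch.
EDGES = [
    Edge("reward_hacking", "specification_gaming", 5),
    Edge("deceptive_alignment", "sandbagging", 4),
    Edge("deceptive_alignment", "evaluation_gaming", 3),
]


def neighbors(edges: list[Edge], node: str) -> list[tuple[str, int]]:
    """Return (neighbor, strength) pairs for node, strongest first.

    Edges are treated as undirected (an assumption of this sketch).
    """
    found = []
    for edge in edges:
        if edge.source == node:
            found.append((edge.target, edge.strength))
        elif edge.target == node:
            found.append((edge.source, edge.strength))
    return sorted(found, key=lambda pair: -pair[1])


print(neighbors(EDGES, "deceptive_alignment"))
```

Rejecting out-of-range scores at construction time keeps the static data auditable: a typo in a strength value fails loudly instead of silently skewing the graph.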

Atlas layout

Node positions in the atlas were pre-computed to place each family in a distinct region of the canvas and to keep related nodes close. This is a teaching aid, not an embedding: the distance between two nodes does not encode semantic similarity. A real similarity map would require computing embeddings of all definitions with a language model, which is not done here; with only 22 short definitions, it would likely add more noise than signal.
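A layout like this can be produced with very little machinery. The following is a minimal sketch under stated assumptions, not the procedure used for the atlas: it places family centers evenly on a large circle and spreads each family's nodes on a small ring around its center. The function names, radii, the placeholder node `example_mode`, and the family assignment of the two named nodes are all hypothetical. It handles only the "distinct region per family" part and does not try to pull related nodes together.

```python
import math


def family_center(index: int, count: int, radius: float = 300.0) -> tuple[float, float]:
    """Place family number `index` of `count` evenly on a circle around the origin."""
    angle = 2 * math.pi * index / count
    return (radius * math.cos(angle), radius * math.sin(angle))


def layout(
    nodes_by_family: dict[str, list[str]],
    cluster_radius: float = 60.0,
) -> dict[str, tuple[float, float]]:
    """Return a fixed (x, y) position for every node, grouped by family."""
    positions: dict[str, tuple[float, float]] = {}
    families = list(nodes_by_family)
    for i, family in enumerate(families):
        cx, cy = family_center(i, len(families))
        members = nodes_by_family[family]
        for j, node in enumerate(members):
            # Spread the family's nodes on a small ring around its center.
            angle = 2 * math.pi * j / len(members)
            positions[node] = (
                cx + cluster_radius * math.cos(angle),
                cy + cluster_radius * math.sin(angle),
            )
    return positions


# Hypothetical input: family membership here is assumed for illustration.
EXAMPLE = {
    "Objective Failures": ["reward_hacking", "specification_gaming"],
    "Oversight Failures": ["example_mode"],
}

for name, (x, y) in layout(EXAMPLE).items():
    print(f"{name}: ({x:.1f}, {y:.1f})")
```

Because the positions depend only on the order of families and nodes, the output is deterministic and can be computed once and stored as static data.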

Why no live inference

This project deliberately makes no runtime API calls to any model; everything shown is static data. There are three reasons:

Static data is reproducible and auditable in a way that live inference is not
Adding a model would introduce failure modes into a project about failure modes
The educational goal is to build your own mental model, not to defer to a model's answers

Limitations

The taxonomy misses many failure modes, especially in emerging agentic systems
Family assignments are sometimes ambiguous; a mode could reasonably belong to multiple families
The relationship graph is incomplete; many real edges are missing
All examples are simplified and may not capture the full complexity of the failure
Difficulty ratings are rough estimates that may not match your background
The atlas has not been peer-reviewed

Start Here

Different readers will get the most out of different parts of this atlas. Here are starting points tailored to where you are.

If you are new to AI safety

No technical background needed. Start with the concepts, not the math.

learning path

Start with Objectives Gone Wrong

The first learning path introduces reward hacking and specification gaming through toy examples. No math required.

reading

Read: The Alignment Problem (Brian Christian)

The most accessible book-length treatment of why getting AI to do what we want is harder than it looks.

explore

Explore the Atlas map

Click through nodes to get a feel for the landscape before reading any definition closely.

explore

Try the Failure Mode Cards with "All families" + "Foundational"

Filter to foundational difficulty and read the beginner takeaway for each card.

If you build or study ML systems

You have the background to go deeper on the mechanics of each failure mode.

learning path

Start with Evaluation Is Not Safety

Covers sandbagging, evaluation gaming, and deceptive alignment. Assumes comfort with the idea of training objectives.

reading

Read: Concrete Problems in AI Safety (Amodei et al., 2016)

A widely cited paper that framed reward hacking, safe exploration, and scalable oversight as concrete research problems. arXiv:1606.06565.

reading

Read: Risks from Learned Optimization (Hubinger et al., 2019)

Introduces mesa-optimization and deceptive alignment. arXiv:1906.01820.

explore

Compare: Reward Hacking vs. Specification Gaming

Use the Compare Concepts tool to see where these two closely related modes differ.

If you research AI alignment

Use this as a teaching reference or a prompt for disagreement.

learning path

Start with Hard Boundaries

The fifth learning path covers instrumental convergence, power-seeking, corrigibility failure, deceptive alignment, and sandbagging. Intended for readers already familiar with basic alignment concepts.

reading

Read: Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2022)

arXiv:2211.03540. Proposes an experimental approach to studying scalable oversight and motivates the monitoring and evaluation failure modes in this atlas.

reading

Read: Goal Misgeneralization in Deep Reinforcement Learning (Langosco et al., 2022)

arXiv:2105.14111. Direct source for the goal_misgeneralization entry.

explore

Critique the taxonomy

If you disagree with a family assignment or a relationship edge, open an issue. The graph is a first draft.

Feedback

This atlas is a draft. If you find an error, a missing concept, or a framing that seems wrong, the most useful thing you can do is say so. Open an issue on GitHub or reach out directly. The goal is accuracy, not elegance.