Incident Response Runbooks on an AI-Native Team
Incident response runbooks rot, and they rot at the worst possible hour. On an AI-native team the fix is not a smarter on-call bot that resolves the page for you. It is making sure that the runbook the human and the agent both pull is current, that it stays current by sharpening itself on every run, and that a human still decides what to execute. The agent reads and assembles; it never reaches into production on its own.
The runbook you open at 2am is almost always wrong in one small way. A step points at a service that got renamed in March. The rollback command lost a flag. The Slack channel it tells you to page is archived. None of those is catastrophic alone. Together they are why a tired engineer stops trusting the document at step four and starts improvising under load.
Most teams make two framing errors here, and both break under pressure. They treat the runbook as a document, so it goes stale between edits. And they treat the agent as an autonomous fixer, so it does something irreversible to a database while you are still reading the alert. The version that holds up is duller and far better: a procedure that improves each time it runs, an agent that can pull that procedure plus the decisions around it, and a human who stays in control of every action that touches the system.
Why incident response runbooks go stale, and why it costs you at the worst moment
Runbooks rot because editing them is friction nobody schedules. Documentation decays faster than the systems it describes, so a runbook written six months ago points at an architecture that has since moved twice. The decay is invisible right up until it is expensive: the on-call engineer opens the doc mid-incident, hits a step that references a removed service, and quietly stops trusting the rest of it.
Diagnosis, not repair, is where the clock runs during an incident. A stale runbook makes that worse by giving a confident wrong answer: the engineer follows step 4, spends ten minutes chasing a service that no longer exists, and abandons the doc. The real loss is not those ten minutes. It is the next forty, spent improvising without a map.
Runbook coverage - the share of your failure classes with a tested, current procedure - is a useful leading indicator for your next incident's cost. When it drops, the symptom is engineers writing runbooks during the incident under pressure, which is the most expensive time to write anything. The operative word is current. A stale runbook is worse than no runbook, because no runbook leaves room for judgment. A wrong runbook commands it away.
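One way to make that number concrete is a small calculation like the sketch below. It is an illustration, not a standard formula: the `Runbook` record, the failure-class names, and the 90-day freshness window are all assumptions chosen for the example, and "current" is approximated as "verified recently".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Runbook:
    failure_class: str
    last_verified: date  # last time someone ran or tested the procedure


def coverage(failure_classes: list[str], runbooks: list[Runbook],
             today: date, max_age_days: int = 90) -> float:
    """Share of failure classes that have a recently verified runbook.

    max_age_days is an assumed freshness window; pick one that matches
    how fast your architecture actually moves.
    """
    if not failure_classes:
        return 0.0
    cutoff = today - timedelta(days=max_age_days)
    current = {rb.failure_class for rb in runbooks if rb.last_verified >= cutoff}
    covered = sum(1 for fc in failure_classes if fc in current)
    return covered / len(failure_classes)


# Hypothetical inventory: four failure classes, three runbooks, one of them stale.
classes = ["queue-backlog", "cache-stampede", "webhook-timeout", "payments-5xx"]
books = [
    Runbook("queue-backlog", date(2024, 5, 20)),
    Runbook("cache-stampede", date(2023, 11, 1)),  # exists, but stale
    Runbook("payments-5xx", date(2024, 6, 1)),
]
score = coverage(classes, books, today=date(2024, 6, 10))
print(f"runbook coverage: {score:.0%}")
```

Note that the stale runbook counts the same as a missing one, which matches the point above: only a current procedure contributes to coverage.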
So the design problem is not "write more runbooks." It is "stop the runbook from drifting out of sync with the work."
The self-sharpening runbook
A self-sharpening runbook is one that gets better every time the procedure runs, instead of decaying between edits. The mechanism is plain. The first time you resolve a class of failure - the queue backs up, the cache stampedes, the third-party webhook starts timing out - the steps are rough and half-remembered. The next person on call follows them, hits the step that no longer matches, fixes it in the moment, and that correction folds back into the procedure. After a handful of runs the runbook converges on what people actually do under pressure, with the dead steps removed and the new ones added.
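The fold-back mechanism can be sketched as a small data structure. This is a minimal illustration of the idea, not how Ody's Runbooks module is implemented; the class name, the `record_run` method, and the example steps are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SharpeningRunbook:
    failure_class: str
    steps: list[str]
    runs: int = 0
    # (run number, old step, new step or None if the step was dropped)
    history: list[tuple[int, str, Optional[str]]] = field(default_factory=list)

    def record_run(self, corrections: dict[int, Optional[str]],
                   additions: tuple[str, ...] = ()) -> None:
        """Fold what the responder actually did back into the procedure.

        corrections maps a step index to its replacement text,
        or to None when the step is dead and should be removed.
        """
        self.runs += 1
        new_steps = []
        for i, step in enumerate(self.steps):
            if i not in corrections:
                new_steps.append(step)
                continue
            fixed = corrections[i]
            self.history.append((self.runs, step, fixed))
            if fixed is not None:
                new_steps.append(fixed)
        new_steps.extend(additions)
        self.steps = new_steps


rb = SharpeningRunbook(
    "queue-backlog",
    ["Check depth of the orders queue", "Restart worker-v1", "Page #ops-legacy"],
)
# During the incident: the worker was renamed and the channel is archived.
rb.record_run({1: "Restart worker-v2", 2: None}, additions=("Page #incidents",))
print(rb.steps)
```

Keeping the `history` of replaced steps is a design choice in this sketch: it lets the next reader see why a step changed, not only that it did.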
This is the honest shape of Ody's Runbooks module: the same procedure run several times becomes a self-sharpening playbook. It is not a wizard that authors a perfect runbook from a blank page. It is the opposite. It senses the procedure happening across Slack, your tracker, and the terminal where the work lands, stitches those scattered signals into one piece of work, and lets the runbook accrete from real runs. Maintenance becomes a byproduct of doing the incident, not a chore someone is supposed to remember after the postmortem and never does.
There is a second thing a runbook needs that a flat document cannot hold: the decisions around it. The rollback procedure changed three weeks ago because a migration made the old one unsafe. A flat runbook does not know that. A runbook sitting next to a decision log that records the before-to-after diff plus the reason and date does. When the on-call engineer or the agent pulls the runbook, the relevant decision travels with it.
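A decision record of that shape might look like the sketch below. The field names, the `decisions_for` helper, and the 30-day window are assumptions made for illustration; the only parts taken from the text are that each entry carries a before, an after, a reason, and a date.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Decision:
    runbook_id: str
    before: str      # what the procedure said
    after: str       # what it says now
    reason: str      # why it changed
    decided_on: date


def decisions_for(runbook_id: str, log: list[Decision], today: date,
                  window_days: int = 30) -> list[Decision]:
    """Recent decisions that touched this runbook, newest first."""
    cutoff = today - timedelta(days=window_days)
    hits = [d for d in log if d.runbook_id == runbook_id and d.decided_on >= cutoff]
    return sorted(hits, key=lambda d: d.decided_on, reverse=True)


today = date(2024, 6, 10)
log = [
    Decision("payments-rollback", "Roll back with the old script",
             "Roll back with the migration-aware script",
             "A migration made the old rollback unsafe",
             today - timedelta(days=21)),
    Decision("payments-rollback", "Page the on-call by phone",
             "Page the on-call in chat", "Phone tree retired",
             today - timedelta(days=200)),  # too old to surface
    Decision("cache-stampede", "TTL 60s", "TTL 300s", "Reduce origin load",
             today - timedelta(days=5)),    # different runbook
]

for d in decisions_for("payments-rollback", log, today):
    print(f"{d.decided_on}: {d.before} -> {d.after} ({d.reason})")
```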
The agent pulls the runbook and the recent context, over MCP
Here is where AI-native incident response stops being a slide and starts being useful. During an incident, a coding agent like Claude Code or Cursor can query the team's knowledge graph directly over the Model Context Protocol and ask for exactly two things: the runbook for this failure class, and the decisions that recently touched it.
Picture the on-call engineer - call her Priya - getting paged for a spike in 5xx on the payments service at 1:40am. She opens her terminal. Her coding agent, connected to the team graph over MCP, pulls the payments-degradation runbook and surfaces, alongside it, the decision from twelve days ago that moved the failover to a different region, plus the open loop where someone promised to update the alert thresholds and has not yet. Priya reads all three in one place. She did not have to remember the runbook existed, search three pages, or scroll back through a Slack thread from a sprint ago. The context a senior engineer would carry in their head arrived assembled.
This is the difference between an agent that sees only the file in front of it and one that reads the team's shared context the way a teammate would. The first re-asks questions the team already answered. The second pulls the current answer and gets out of the way.
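On the wire, a pull like Priya's would be a pair of MCP tool calls. MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method; everything else in the sketch below is an assumption. The tool names `get_runbook` and `list_recent_decisions` and their arguments are hypothetical, not a documented interface. The code only builds the request payloads and does not contact any server.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request (JSON-RPC 2.0)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical tool names and arguments, for illustration only.
requests = [
    tool_call("get_runbook", {"failure_class": "payments-degradation"}),
    tool_call("list_recent_decisions", {"service": "payments", "since_days": 30}),
]

for request in requests:
    print(json.dumps(request))
```

Both calls are reads. Nothing in this exchange gives the agent a way to change the payments service.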
The human still makes the call
It is worth being plain about where the line sits. Ody senses continuously and automatically. It assembles the runbook and the decisions. It can catch a promise made in a thread - "I'll bump the thresholds after the incident" - and nudge before it slips. But it acts only when a human says so. A nudge is the ceiling of its autonomy. No silent overwrites, no autonomous remediation, no agent reaching into production on its own. The on-call human stays in control of every action that touches the system.
That is not Ody being timid. It is the line that makes the rest trustworthy. The reason you can let an agent pull your runbook mid-incident is that the agent reads and assembles and never executes behind your back. Ody inherits each connected tool's permissions, reads only the surfaces you connect, and writes nothing back on its own. Combined with EU hosting and least-privilege access, that read-only posture is what makes it safe to point at production context in the first place.
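The same line can be drawn in code as a guard that refuses to run anything without a named human approver. This is a sketch of the pattern under stated assumptions, not Ody's implementation: `ProposedAction`, `approve`, `execute`, and the `./failover.sh` command are hypothetical, and the runner is a fake so that nothing real is touched.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class NotApproved(Exception):
    """Raised when something tries to run a step no human has signed off on."""


@dataclass
class ProposedAction:
    description: str                   # what the agent suggests
    command: str                       # the exact command a human would run
    approved_by: Optional[str] = None  # set only by a person, never by the agent


def approve(action: ProposedAction, engineer: str) -> ProposedAction:
    """Record the named human who decided to go ahead."""
    action.approved_by = engineer
    return action


def execute(action: ProposedAction, runner: Callable[[str], str]) -> str:
    """Run the command only if a human has approved it."""
    if action.approved_by is None:
        raise NotApproved(f"needs a human decision: {action.description}")
    return runner(action.command)


# Demo with a fake runner that only records what it was asked to run.
executed: list[str] = []


def fake_runner(command: str) -> str:
    executed.append(command)
    return "ok"


failover = ProposedAction(
    description="Fail payments over to the secondary region",
    command="./failover.sh payments",  # hypothetical script name
)

try:
    execute(failover, fake_runner)
except NotApproved as err:
    print(f"blocked: {err}")

approve(failover, "priya")
print(execute(failover, fake_runner))
```

The approval lives on the action itself, so the record of who made the call travels with the record of what was run.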
| | Static runbook in a wiki | Self-sharpening runbook on an AI-native team |
|---|---|---|
| How it stays current | Manual edits after postmortems, usually skipped | Sharpens from each real run; maintenance is a byproduct |
| Recent decisions | Live in a separate doc or someone's head | Travel with the runbook, carrying the date and the reason |
| Agent access | Copy-pasted stale snippet in a config file | Pulled live over MCP at incident time |
| Autonomy | None, or a brittle full-auto script | Agent reads and assembles; the human executes |
What this actually changes for the person on call
The win is not that an AI resolves the incident for you. It is that the gap between the alert firing and you understanding the situation collapses. The runbook is current because it sharpened itself on the last four incidents. The decision that changed the procedure is attached, not buried. Your agent pulled all of it the moment you opened the terminal. You spend your attention on judgment - is this the failover, or do we wait - instead of on archaeology.
And the loop closes the right way. After the incident, the correction you made at 2am is captured back into the runbook for the next person, and the promise to fix the thresholds becomes a loop that nudges until it is done, instead of a good intention that evaporates by Monday. The procedure compounds. The next 2am is shorter than this one.
Ody is in invite-only beta, and a nudge really is the ceiling of what it does on its own. If incident response on your team still means trusting a six-month-old doc at the worst possible hour, book a demo or join the waitlist and we will show you the Runbooks module on a real procedure.
Common questions
What is an incident response runbook?
An incident response runbook is a step-by-step guide for a specific type of incident. It states what triggered the incident, which service is affected, what actions to take and in what order, when to escalate, and who to contact. A good runbook gets a responder from alert to resolution without relying on memory or tribal knowledge.
Why do incident response runbooks go stale?
Because they are stored separately from the work that should update them. Postmortems live in Google Docs, decisions live in Slack, and nobody has a structured reason to go back and edit the runbook after an incident closes. The update is manual and discretionary, so it mostly does not happen.
How can AI agents use incident response runbooks?
With MCP (Model Context Protocol), a coding agent like Claude Code or Cursor can pull the current runbook and the recent decisions that bear on a service directly from the team knowledge graph, without leaving the editor. The agent surfaces context; a human still makes the call. No autonomous actions on production.
What makes a runbook self-sharpening?
A runbook sharpens when it is updated by the engineers who just ran the incident, in the same session, before they close it. Each incident contributes a small correction - a step added, a command fixed, a threshold updated. Over ten runs, the runbook reflects what the team actually learned rather than what someone expected when they first wrote it.