MIM-Runbook Part 3: Real Example Walkthrough — From a BGP Outage YAML to a Complete Runbook

This is Part 3 of a 3-part series on MIM-Runbook. Part 1 covers what the plugin is and its business value. Part 2 covers installation and setup.


In Parts 1 and 2 we covered why MIM-Runbook exists and how to install it. Now let’s trace a real-world example from start to finish: a multi-region network outage caused by a BGP route withdrawal. We’ll examine both input YAML files field by field, then walk through what the plugin produces — the Markdown runbook, the Word document, and the Excel action tracker.

The Scenario

At 09:12 UTC, PagerDuty fires a P1 alert: three BGP peers have dropped simultaneously on core-router-edge-01. All external customer traffic to production services is failing across three regions. Revenue impact is approximately $120K per minute. A BGP community configuration change was pushed 7 minutes before the outage.

This is exactly the kind of incident where having a runbook pre-built in seconds — rather than hours — can mean the difference between a 30-minute outage and a 3-hour one.

Input File 1: The Incident YAML

The incident YAML for this scenario is available in the repository at example/input/incident-network-outage.yaml. Let’s break down the key fields and understand how each one shapes the output.

Identity and Severity

incident:
  number: "INC0091245"
  title: >-
    Multi-Region Network Outage — BGP Route Withdrawal
    Causing Global Connectivity Loss
  severity: "1 - Critical"
  priority: "1 - Critical"
  state: "In Progress"

The number field becomes the primary identifier woven throughout every output file. It appears in the filename (RB-INC0091245-...), the email subject lines, the Slack channel name (#incinc0091245), the runbook banner, the Excel tracker headers, and the PIR invitation. You define it once; the plugin propagates it everywhere.
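A minimal sketch of that propagation, in Python: define the number once and derive every identifier from it. The helper name and the exact filename suffix are illustrative, not the plugin's actual internals, but the Slack channel and filename prefix follow the conventions described above.

```python
# Sketch: derive every output identifier from the single incident number.
# derive_identifiers and the "-runbook.md" suffix are illustrative; only the
# RB- prefix and #inc channel convention come from the article.

def derive_identifiers(incident_number: str) -> dict:
    return {
        "runbook_filename": f"RB-{incident_number}-runbook.md",
        "slack_channel": f"#inc{incident_number.lower()}",
        "email_subject_prefix": f"[{incident_number}]",
    }

ids = derive_identifiers("INC0091245")
print(ids["slack_channel"])  # -> #incinc0091245
```

Because everything is derived from one field, there is no chance of the ticket number in the email subject drifting out of sync with the Slack channel name.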

The severity and priority values drive the urgency tone. A Sev1 generates a bold red warning banner at the top of the runbook. A lower severity would produce a less urgent framing.

Category and Affected Infrastructure

  category: "Network"
  subcategory: "BGP / Routing"
  affected_service: "Global Application Network — All Customer-Facing Traffic"
  affected_ci: "core-router-edge-01"
  environment: "Production"
  region: "global (us-east-1, eu-west-1, ap-southeast-1)"

The category field is the single most consequential input for the runbook’s technical content. Setting it to “Network” tells the plugin to generate Section 4 (Diagnosis) with network-specific investigation steps: reachability tests from multiple vantage points, routing table and BGP status checks, firewall and security group verification, and CDN/load balancer health checks. If this were a “Database” incident, Section 4 would instead contain Oracle RAC cluster health checks, connection pool queries, replication lag analysis, and recent DDL/deployment review.

The affected_ci value (core-router-edge-01) is injected verbatim into every command throughout the runbook. When Section 2 says “confirm the alert is genuine,” the command is ping core-router-edge-01 — not a generic placeholder. Engineers can copy-paste and run these commands directly without editing hostnames.
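The substitution itself is simple string templating. This sketch shows the idea; the command list is illustrative (the real runbook's steps are richer), and only the `ping` command is quoted from the article.

```python
# Sketch: inject the affected CI into copy-pasteable commands.
# The template list is illustrative, not the plugin's actual step set.

def render_commands(affected_ci: str) -> list[str]:
    templates = [
        "ping {ci}",
        "traceroute {ci}",
        "ssh {ci} 'vtysh -c \"show bgp summary\"'",
    ]
    return [t.format(ci=affected_ci) for t in templates]

for cmd in render_commands("core-router-edge-01"):
    print(cmd)
```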

Business Impact

  business_impact: >
    All inbound customer traffic to production services is failing.
    100% packet loss from external clients to all three production
    regions. Estimated 45,000 active sessions dropped. Revenue
    impact ~$120K/minute. B2B SLA breach imminent for Tier-1
    enterprise customers. Brand and regulatory exposure if outage
    exceeds 30 minutes.

This text appears verbatim in four places: the runbook’s Incident Summary Banner, the Slack channel opening message, every email template body, and the Excel Summary Dashboard. Write it once with specific numbers (sessions dropped, revenue per minute, SLA risk) and it propagates to every stakeholder touchpoint. This ensures consistency — the CTO reading the escalation email sees the same impact statement as the engineer on the bridge.

Change Correlation

  change_related: true

This single boolean adds a prominent “YES — investigate recent changes” flag to the Summary Banner, immediately directing the responder’s attention to recent deployments or configuration changes. In this scenario, the incident description mentions a BGP community configuration change pushed at 09:05 UTC — 7 minutes before the outage. The change_related: true flag ensures this correlation is visible at the very top of the runbook.
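The boolean-to-banner mapping amounts to a one-line branch. The "YES" wording is quoted from the article; the negative-case wording below is an assumption.

```python
def change_flag(change_related: bool) -> str:
    # "YES — investigate recent changes" is the article's banner text;
    # the else-branch wording is illustrative.
    if change_related:
        return "YES — investigate recent changes"
    return "No recent change correlated"
```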

The Full Description

  description: |
    At 09:12:37 UTC PagerDuty fired a P1 alert: "BGP peer down —
    3 peers on core-router-edge-01". Within 90 seconds all three
    transit BGP sessions (AS174 Cogent, AS6461 Zayo, AS3356 Lumen)
    dropped. The autonomous system (AS64512) lost all upstream
    route advertisements.

    Symptoms observed:
    - 100% external packet loss to all production VIPs
    - Internal east-west traffic unaffected
    - No hardware or interface errors on core-router-edge-01
    - BGP daemon (FRRouting 9.1) shows all peers in IDLE state
    - syslog shows: "NOTIFICATION received from peer AS174:
      HOLD TIMER EXPIRED"
    - A BGP community configuration change was pushed by
      network-eng at 09:05 UTC — change ticket CHG0034891

    Possible cause: misconfigured BGP route-map applying a
    REJECT ALL policy to outbound advertisements.

    No DDoS signature detected. CDN edge nodes (Cloudflare)
    show normal traffic — issue is origin-side.

The description is the richest input field. It provides the investigation context that makes Section 4 genuinely useful rather than generic. The symptoms, the timeline, the possible root cause, and what’s been ruled out (DDoS, CDN issues) — all of this context is included in the runbook’s “Incident Description” block, giving every responder a complete starting picture.

Input File 2: The Stakeholders YAML

The stakeholders file is available at example/input/stakeholders-example.yaml. It defines 8 stakeholders across three escalation levels, plus 2 vendor escalation contacts.

Escalation Level 1 — Immediate Response (T+0)

Four stakeholders are defined at Level 1 with notify_immediately: true:

  • Sarah Mitchell — Incident Commander (Director of SRE). Her bridge_url and bridge_phone become the Zoom link used across all six email templates and the triage checklist.
  • David Park — Technical Lead (Principal Database Engineer). Owns Section 4 (Diagnosis) and Section 5 (Containment).
  • Jennifer Walsh — Communications Lead (Senior Engineering Manager). Owns Section 3 (Communication Plan) and is the “From:” on all six email templates.
  • Marcus Thompson — On-Call Engineer (Senior SRE). The person who picks up the page and follows the runbook from Step 1.

Escalation Level 2 — T+30 Escalation

Two stakeholders at Level 2 with notify_immediately: false:

  • Robert Chen — Customer Impact Lead (VP Customer Success)
  • Priya Sharma — Product Lead (VP Product Management)

These stakeholders are CC’d on email templates 1-3 and move to “To:” on templates 4-6. The plugin’s email routing logic ensures they’re informed early but not overwhelmed.
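A sketch of that routing rule, assuming each stakeholder record carries an `escalation_level` and `email` field as in the example YAML (the function and dict shape are illustrative; the plugin's real rules may differ in detail):

```python
# Sketch of the To/CC routing described above:
#   Level 1: To on every template.
#   Level 2: CC on templates 1-3, To on templates 4-6.
#   Level 3: CC only, and only from template 3 onward.

def recipients(template_no: int, stakeholders: list[dict]) -> dict:
    to, cc = [], []
    for s in stakeholders:
        level = s["escalation_level"]
        if level == 1:
            to.append(s["email"])
        elif level == 2:
            (to if template_no >= 4 else cc).append(s["email"])
        elif level == 3 and template_no >= 3:
            cc.append(s["email"])
    return {"to": to, "cc": cc}
```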

Escalation Level 3 — Executive Escalation (T+60+)

Two stakeholders at Level 3:

  • Angela Torres — CTO (Executive Sponsor)
  • James Okafor — CRO (Executive Sponsor)

Executives are CC-only on templates 3-6 and are never in the “To:” line of the initial alert. This is a deliberate design decision: executives should be aware of Sev1 incidents, but the initial response team shouldn’t be waiting for executive approval to act.

Vendor Escalations

vendor_escalations:
  - vendor: "Oracle Support"
    account_number: "ORA-87654321"
    support_url: "https://support.oracle.com"
    phone: "+1-800-223-1711"
    severity_mapping: "SEV1 - Production Down"
  - vendor: "AWS Support"
    account_number: "AWS-987654321"
    support_url: "https://console.aws.amazon.com/support/home"
    phone: "+1-888-280-4331"
    severity_mapping: "P1 - Production System Down"

These appear in Section 6 as a dedicated vendor escalation table. When the on-call engineer needs to open a P1 case with AWS at 3 AM, they don’t need to hunt for the account number or support URL — it’s right there in the runbook, pre-filled with the severity level to declare.
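Rendering that YAML list into the Section 6 table is a straightforward Markdown transform. A sketch, with illustrative column headers (the plugin's actual table layout may differ):

```python
# Sketch: turn vendor_escalations entries into a Markdown table.
# Column names are illustrative.

def vendor_table(vendors: list[dict]) -> str:
    lines = [
        "| Vendor | Account | Phone | Severity to Declare |",
        "| --- | --- | --- | --- |",
    ]
    for v in vendors:
        lines.append(
            f"| {v['vendor']} | {v['account_number']} "
            f"| {v['phone']} | {v['severity_mapping']} |"
        )
    return "\n".join(lines)
```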

The Generated Output

Running /generate-runbook with these two YAML files produces three files in the output/ directory. You can see example outputs in the repository at example/output/.

The Markdown Runbook (.md)

The Markdown file opens with a header showing the incident number, generation timestamp, and Incident Commander name, followed by all 8 sections.

Section 1 (Summary Banner) renders as a two-column table with every field from the YAML. The “Change Related” row shows “YES — investigate recent changes” because we set change_related: true.

Section 2 (Triage Checklist) contains six time-boxed steps. Step 1 says “Run: ping core-router-edge-01” — the actual CI name from the YAML, not a placeholder. Step 5 provides the bridge URL and dial-in from Sarah Mitchell’s stakeholder entry, plus the exact words to say when joining the call.

Section 3 (Communication Plan) identifies Jennifer Walsh as the Communications Lead and sets a 30-minute update cadence. Then come the six email templates — each with pre-populated subject lines, To/CC addresses, and body text.

Section 4 (Diagnosis) routes to the Network playbook because category: "Network". It generates four investigation steps: reachability from multiple vantage points, routing table and BGP status checks, firewall and security group verification, and CDN/load balancer health checks. Each step includes “Good” vs. “Bad” result indicators and decision trees.
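Conceptually, this routing is a lookup from category to playbook. The step lists below are abridged from the article's own descriptions; the fallback entry and function name are illustrative.

```python
# Sketch: category keys route to diagnosis playbooks (abridged from the article).
PLAYBOOKS = {
    "Network": [
        "Reachability tests from multiple vantage points",
        "Routing table and BGP status checks",
        "Firewall and security group verification",
        "CDN / load balancer health checks",
    ],
    "Database": [
        "Oracle RAC cluster health checks",
        "Connection pool queries",
        "Replication lag analysis",
        "Recent DDL / deployment review",
    ],
}

def diagnosis_steps(category: str) -> list[str]:
    # Fallback for categories not modeled here (illustrative).
    return PLAYBOOKS.get(category, ["Generic triage: logs, metrics, recent changes"])
```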

Section 5 (Containment) provides three network-specific containment actions: revert the routing change (flagged with a CAB approval requirement), revert a firewall rule change, and bypass CDN by routing traffic directly to origin.

Section 6 (Escalation Matrix) shows the time-based escalation table (T+15 through T+120) with specific names at each tier, plus the full individual contact table and vendor escalation table with Oracle and AWS contacts pre-filled.

Sections 7 and 8 provide the resolution validation checklist, bridge close procedure, and post-incident handoff documentation.

The Word Document (.docx)

The .docx file contains the same content as the Markdown, but formatted as a professional Word document. It includes a title page with the incident number and severity, headers and footers with page numbers, proper heading styles, and formatted tables. This is the file you attach to the ServiceNow ticket, send to leadership, or print for the physical war room whiteboard.

The Excel Action Tracker (.xlsx)

The Excel workbook is designed to be the live operational tool during the incident. It contains four sheets:

Sheet 1 — Action Items: Pre-populated with action items derived from the runbook’s triage, diagnostic, and containment steps. Each row includes a Status column with a dropdown validation (Open / In Progress / Done), an Owner column, a Due Time column, and a Notes column. Conditional formatting colors rows: red for Open, amber for In Progress, and green for Done.

Sheet 2 — Escalation Log: Pre-seeded with all stakeholder contacts from the YAML, with columns for Timestamp, Person Contacted, Method (dropdown: Phone / Slack / Email), Outcome, and Notes.

Sheet 3 — Incident Timeline: Pre-seeded with the known events from the incident description. During the incident, the team adds rows as events occur. This timeline becomes the backbone of the PIR discussion.

Sheet 4 — Summary Dashboard: Contains live formula counters that automatically calculate how many action items are Open vs. In Progress vs. Done, plus a quick-reference table of key stakeholder contacts. This is the sheet you screenshare on Zoom during the bridge so everyone can see incident progress at a glance.
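The dashboard counters behave like Excel `COUNTIF` formulas over the Status column of Sheet 1. A pure-Python equivalent of that logic (the function and record shape are illustrative):

```python
from collections import Counter

# Sketch: the Python equivalent of the dashboard's COUNTIF-style counters,
# e.g. =COUNTIF(status_range, "Open") for each status value.

def status_summary(action_items: list[dict]) -> dict:
    counts = Counter(item["status"] for item in action_items)
    return {s: counts.get(s, 0) for s in ("Open", "In Progress", "Done")}
```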

How the Pieces Fit Together in Practice

Here’s how a team would use MIM-Runbook during a real Sev1:

T+0 (09:12): PagerDuty alert fires. The on-call engineer opens Claude Cowork, pastes the incident data into the YAML, and runs /generate-runbook. Within seconds, three files appear.

T+1 (09:13): The engineer opens the Markdown runbook and starts at Section 2, Step 1: “Confirm the alert is genuine. Run: ping core-router-edge-01.” They follow the checklist step by step.

T+3 (09:15): They create the Slack channel #incinc0091245 and paste the pre-formatted opening message from the runbook. They share the Excel tracker link in the channel.

T+4 (09:16): They join the Zoom bridge using the URL from the runbook and deliver the scripted opening statement.

T+5 (09:17): The Communications Lead opens Section 3 and sends Template 1 (Initial Notification) by copying the pre-filled email.

T+10 (09:22): The Technical Lead starts working through Section 4’s Network diagnostic steps, logging findings in Slack and updating the Excel Action Items sheet.

T+15 (09:27): Still not contained — the escalation matrix says to confirm Level 1 stakeholders are fully engaged.

T+20 (09:32): Root cause identified: the BGP route-map change at 09:05 is confirmed as the culprit. The Technical Lead moves to Section 5, Step 1: “Revert the routing change.” The CAB emergency approval process is followed as documented.

T+25 (09:37): Route reverted, first BGP peer re-established. The Communications Lead sends Template 4 (Mitigation In Progress).

T+35 (09:47): All three transit peers re-established, external traffic flowing. The team works through Section 7’s resolution checklist. All criteria pass. The Communications Lead sends Template 5 (Service Restored).

T+24h: Template 6 (Incident Closed + PIR Invitation) goes out. The Excel Timeline sheet — now populated with every event from the incident — becomes the PIR’s primary reference document.

Getting Started

MIM-Runbook is open source under the MIT license at github.com/agentbee0/MIM-Runbook.

If you’re just joining this series: Part 1 covers the business case and what the plugin produces. Part 2 is the complete installation and setup guide. And this Part 3 has shown you what the inputs look like, what the outputs look like, and how they’re used together during a real incident.

The next Sev1 doesn’t have to be chaos. It can be a process.
