What this article solves: Most postmortems end up in a folder nobody opens. When something breaks at 2 AM, you need runbooks with receipts—not a wiki page that mentions a service you decommissioned in March.
Who this is for: On-call engineers, incident commanders, and anyone tired of "did we see this before?" turning into a Slack archaeology expedition.
The postmortem trap
Teams write postmortems. Root cause, timeline, action items. Then the doc sits next to forty cousins with the same title.
Six months later the same alert fires. Someone asks in Slack: "Didn't we fix this?"
Silence. Because the fix lived in:
- A Datadog monitor someone tuned once
- A Slack thread in `#incidents` from a Tuesday nobody remembers
- A GitHub PR titled `fix: webhook timeout` with the real story in review comments
The fix is not a better postmortem template. It is linking those artifacts so the next responder does not start from zero.
What on-call actually opens first
Ask your on-call rotation what they touch in the first five minutes. Usually:
- Datadog — which monitor fired, what graph looks wrong
- Slack — who else is awake, what deployed recently
- GitHub — what changed in the last merge
Good incident documentation mirrors that order. Bad incident documentation starts with a three-page overview of microservices written before anyone on the current team joined.
Connect Datadog so runbooks reference the alert engineers actually see—not a generic "check the logs" step. Connect Slack so the coordination thread is one click away. Connect GitHub so every "we fixed it like this" story links to the merge that proved it.
Traceability chain (simple version)
Alert (Datadog) → Thread (Slack) → Fix (GitHub PR) → Ticket (Linear/Jira)
When that chain exists, generated incident docs answer:
- What broke (monitor + symptom)
- What we did (commands, rollback, feature flag)
- Why it worked (PR + discussion)
- What to check next time (linked sources, not vibes)
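The chain above can be sketched as a small record with one slot per link. This is a minimal illustration, not a ScopeDocs or Datadog API: the `IncidentChain` class, its field names, and the `example.invalid` URLs are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentChain:
    """One incident's traceability chain. All names here are illustrative."""
    monitor_url: Optional[str] = None  # Alert (Datadog)
    thread_url: Optional[str] = None   # Thread (Slack)
    fix_pr_url: Optional[str] = None   # Fix (GitHub PR)
    ticket_url: Optional[str] = None   # Ticket (Linear/Jira)


def missing_links(chain: IncidentChain) -> List[str]:
    """Return the names of chain links that are still empty, in chain order."""
    return [f.name for f in fields(chain) if not getattr(chain, f.name)]


# Placeholder URLs: an incident with an alert and a thread, but no fix or ticket yet.
chain = IncidentChain(
    monitor_url="https://example.invalid/monitors/1",
    thread_url="https://example.invalid/slack/thread",
)
print(missing_links(chain))  # ['fix_pr_url', 'ticket_url']
```

An empty result from `missing_links` means the next responder can walk from alert to fix without searching.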
What to capture without writing a novel
You do not need prose at 2 AM. You need:
| Artifact | Why it matters |
|---|---|
| Monitor link | Shows which signal actually fired |
| Deploy / PR link | Proves what changed before the fire |
| Slack thread | Captures decisions under pressure |
| Ticket | Tracks follow-ups and ownership |
ScopeDocs-style source linking assembles this after the fact from work you already did, provided you linked the PR to the incident ticket when you merged the fix.
Practical setup checklist
- Datadog connected; runbook sections can reference monitor IDs
- GitHub connected; fix PRs link to incident tickets
- Slack incident channel scoped (not the whole workspace)
- Linear or Jira for incident tickets and action items
- Rule: no incident closed without fix PR linked to ticket
- Post-incident: confirm generated doc has working source links (30-second test)
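The last two checklist items can be automated as a gate that runs before a ticket is closed. The sketch below assumes a ticket is a plain dict with hypothetical `fix_pr_url` and `source_links` keys; it is not the schema of Linear, Jira, or any other tracker. It only checks that links are well-formed, not that they resolve, so the 30-second click-through is still worth doing.

```python
from typing import List
from urllib.parse import urlparse


def is_well_formed(url: str) -> bool:
    """True if the string looks like an http(s) URL with a host. No network call."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def close_blockers(ticket: dict) -> List[str]:
    """Return the reasons this incident ticket should not be closed yet."""
    blockers = []
    pr = ticket.get("fix_pr_url")  # hypothetical field name
    if not pr:
        blockers.append("no fix PR linked")
    elif not is_well_formed(pr):
        blockers.append("fix PR link is malformed")
    for url in ticket.get("source_links", []):  # hypothetical field name
        if not is_well_formed(url):
            blockers.append(f"malformed source link: {url}")
    return blockers


# Placeholder ticket: the fix PR is missing, so the gate reports one blocker.
print(close_blockers({"source_links": ["https://example.invalid/monitors/1"]}))
```

An empty list means the ticket passes the rule. Anything else is a message the closer can act on directly.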
On-call mode vs wiki mode
On-call readers want: symptom → check → action → link to proof.
Onboarding readers want: why the system exists. Same underlying facts, different entry point. Do not maintain two wikis—maintain one source graph with two views.
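One way to picture "one source graph with two views" is a single record rendered two ways. This is a sketch under stated assumptions: the `IncidentFacts` fields and both render functions are hypothetical, and the sample values reuse this article's own examples with a placeholder URL.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Shared facts behind both views. Field names are illustrative."""
    symptom: str
    check: str
    action: str
    proof_url: str
    why: str


def oncall_view(f: IncidentFacts) -> str:
    """On-call order: symptom, check, action, link to proof."""
    return "\n".join([
        f"Symptom: {f.symptom}",
        f"Check: {f.check}",
        f"Action: {f.action}",
        f"Proof: {f.proof_url}",
    ])


def onboarding_view(f: IncidentFacts) -> str:
    """Onboarding order: why the system exists first, then the known failure."""
    return "\n".join([
        f"Why this exists: {f.why}",
        f"Known failure: {f.symptom}",
        f"Proof: {f.proof_url}",
    ])


facts = IncidentFacts(
    symptom="webhook timeouts",
    check="queue depth, not CPU",
    action="roll back the last deploy",
    proof_url="https://example.invalid/pull/1204",  # placeholder URL
    why="delivers webhooks to downstream consumers",  # hypothetical description
)
print(oncall_view(facts))
print(onboarding_view(facts))
```

Both functions read the same `facts` object, so correcting a link or a symptom once updates both views.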
The outcome
Repeat incidents get cheaper because the second responder inherits:
- The Datadog context ("last time it was queue depth, not CPU")
- The Slack timeline ("we rolled back deploy X")
- The GitHub fix ("see PR #1204—added circuit breaker")
That is traceability that actually helps—not another PDF in a folder.