Cabalmail alert runbooks

Short on-call runbooks for every alert in docs/0.7.x/monitoring-plan.md Phases 1-3. Each file follows the same shape:

What this means — what condition fired the alert.
Who/what is impacted — user-visible effect.
First three things to check — start here on a page.
Escalation — what to do if the first three don’t resolve.

When a Pushover or ntfy push includes a Runbook: link, it points to one of these files on main.

Sources of alerts

Source	Routing	Phase
Uptime Kuma monitors	Webhook → `alert_sink` Lambda	1
Self-hosted Healthchecks	Webhook → `alert_sink` Lambda	2
Prometheus rules → Alertmanager	Webhook → `alert_sink` Lambda	3
`alert_sink` Lambda	Pushover (critical) + ntfy (critical + warning)	1

Index

Probe failures (Kuma TCP/HTTP, Prometheus blackbox)

probe-failure.md — any Kuma or blackbox probe down.
cert-expiring.md — ACM cert or blackbox-probed cert nearing expiry.

AWS service alerts (Prometheus rules over `cloudwatch_exporter`)

Host alerts (Prometheus rules over `node_exporter`)

Log-derived alerts (Prometheus rules over CloudWatch metric filters)

Heartbeats (missed Healthchecks pings)

After the alert resolves

The plan’s tuning discipline applies: after every page, record on the corresponding GitHub issue (or open one) whether the threshold was right, too sensitive, or too loose. Thresholds live in code:

Prometheus: docker/prometheus/rules/alerts.yml
Kuma & Healthchecks: in the UI today; will move to IaC in Phase 4 §3.

Aim for zero false pages in a typical week (see monitoring-plan.md § “Tuning discipline”). If a runbook’s “first three things” never apply, fix the runbook in the same PR that fixes the alert.