Cabalmail

Host your own email and enhance your privacy

View the Project on GitHub cabalmail/cabal-infra

Runbook: probe failure (Kuma or blackbox)

Fired by:

What this means

A synthetic probe to a user-facing endpoint is failing. The probe is one of:

Who/what is impacted

Probe User impact
993 (IMAP) Mail clients can’t read mail.
25 (SMTP relay) Inbound mail bounces or queues remotely.
587 / 465 (Submission) Mail clients can’t send.
Admin app The browser admin client is unreachable. Mail itself unaffected.
/list Address management is broken. Mail itself unaffected.
ntfy The push channel for warnings is broken. Pushover still works for critical.

First three things to check

The label instance (Prometheus) or the monitor name (Kuma) tells you which probe. Then:

  1. Is the symptom external or internal? Test from outside AWS:
    nc -zv imap.<control-domain> 993        # mail ports
    curl -I https://admin.<control-domain>/ # admin app
    curl -I https://ntfy.<control-domain>/v1/health
    

    Compare with aws ecs execute-command into a task in the same VPC. If the in-VPC test passes and the public test fails, the issue is at the load balancer or DNS layer. If both fail, the issue is in the service itself.

  2. Is the ECS service healthy?
    aws ecs describe-services --cluster <cluster> --services cabal-imap cabal-smtp-in cabal-smtp-out cabal-uptime-kuma cabal-ntfy --query 'services[].{name:serviceName,running:runningCount,desired:desiredCount,events:events[0:3]}'
    

    runningCount < desiredCount for the relevant service is the smoking gun. The recent events[] usually identifies the cause (image pull failure, EFS access point error, capacity).

  3. Is the load balancer routing correctly? For mail probes, check the NLB target group health for the failing port. For HTTP probes, check the monitoring ALB target group (cabal-uptime-kuma, cabal-ntfy, etc.). Healthy targets + a failing public probe means a security-group rule, listener rule, or DNS issue, not the service.

Escalation

If all three checks pass but the probe stays red: