Cabalmail

Host your own email and enhance your privacy

View the Project on GitHub cabalmail/cabal-infra

Runbook: SendmailDeferredSpike

Fired by Prometheus rule SendmailDeferredSpike — more than 10 sendmail “stat=Deferred” log lines aggregated across all three mail tiers in the last 10 minutes, sustained for 15 minutes.

What this means

Sendmail logs stat=Deferred whenever it parks a message in the queue rather than delivering immediately — typically on a 4xx response from a remote MX (graylisting, rate-limit) or a transient internal failure. A sustained rate above ~1/min is unusual for a single-operator Cabalmail instance.
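As a rough manual cross-check of the alert condition, the sketch below counts stat=Deferred lines on standard input and compares the count with a threshold. The helper names count_deferred and over_threshold are hypothetical, and the default threshold of 10 is taken from the alert description above. The actual Prometheus rule expression is not reproduced in this runbook, and the real rule also requires the condition to hold for 15 minutes, which this one-shot check does not model.

```shell
#!/bin/sh
# Hypothetical helpers; not part of Cabalmail.

# Print the number of stdin lines containing stat=Deferred.
# grep -c exits non-zero when the count is 0, so mask that with "|| true".
count_deferred() {
  grep -c 'stat=Deferred' || true
}

# Usage: <log lines> | over_threshold [threshold]
# Prints a summary line and succeeds only if the count exceeds the threshold.
over_threshold() {
  threshold="${1:-10}"
  n=$(count_deferred)
  echo "deferred=$n threshold=$threshold"
  [ "$n" -gt "$threshold" ]
}

# Example (assumes the /ecs/cabal-<tier> log groups used elsewhere in this runbook):
#   for tier in imap smtp-in smtp-out; do
#     aws logs tail "/ecs/cabal-$tier" --since 10m --filter-pattern '"stat=Deferred"'
#   done | over_threshold 10
```

Piping all three tiers into a single call mirrors the way the alert sums across log groups; running it per tier instead gives a quick hint about which tier is responsible.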

Who/what is impacted

The metric is summed across three log groups; the alert doesn’t tell you which tier is deferring. The user impact differs by tier:

First three things to check

  1. Which tier? The metric has no tier dimension. Look at queue depth on each:
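    CLUSTER="<cluster>"   # replace with your ECS cluster name
    for tier in imap smtp-in smtp-out; do
      echo "=== $tier ==="
      # First task of the tier's service; the CLI prints "None" if there is none.
      TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name "cabal-$tier" --query 'taskArns[0]' --output text)
      [ "$TASK" = "None" ] && { echo "no running task"; continue; }
      # The last line of mailq output is the queue total.
      aws ecs execute-command --cluster "$CLUSTER" --task "$TASK" --container "$tier" --interactive \
        --command "/bin/sh -c 'mailq | tail -1'"
    done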
    

    The tier with a non-trivial mailq is the offender.

  2. Why is it deferring? Pull recent deferred reasons:
    aws logs tail /ecs/cabal-<tier> --since 30m --filter-pattern '"stat=Deferred"' | head -20
    

    Look at the dsn=4.x.y codes:

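    • 4.4.x → connectivity / MX issue (RFC 3463 also files remote congestion here, as 4.4.5)
    • 4.7.x → policy block (graylisting, SPF softfail, IP reputation)
    • 4.5.x → SMTP protocol-level problem at the remote (for example 4.5.3, too many recipients)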
  3. Is this a self-inflicted issue? Check whether outbound DNS resolution from the smtp-out tier works; a NAT instance problem produces a wave of 4.4.x deferrals. Run aws ec2 describe-instances --filters Name=tag:Name,Values=cabal-nat and confirm the instance state is running.
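To make the dsn=4.x.y triage in step 2 quicker, the deferred lines can be grouped by DSN class instead of read one by one. The following is a minimal sketch, assuming the log lines carry a sendmail-style "dsn=4.x.y" field on the same line as "stat=Deferred"; tally_dsn and top_dsn_class are hypothetical helper names, not part of Cabalmail.

```shell
#!/bin/sh
# Hypothetical helpers; not part of Cabalmail.

# Read sendmail log lines on stdin and print "<count> <class>" per DSN class
# (for example "12 4.4.x"), most frequent first. Only deferred lines with a
# 4.x.y DSN are counted.
tally_dsn() {
  awk '
    /stat=Deferred/ && match($0, /dsn=4\.[0-9]+\.[0-9]+/) {
      code = substr($0, RSTART + 4, RLENGTH - 4)   # strip the "dsn=" prefix
      split(code, part, ".")
      count[part[1] "." part[2] ".x"]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -rn
}

# Print only the most frequent class, or nothing if no deferrals were seen.
top_dsn_class() {
  tally_dsn | head -n 1 | awk '{ print $2 }'
}

# Example (tier is whichever tier step 1 identified):
#   aws logs tail "/ecs/cabal-$tier" --since 30m --filter-pattern '"stat=Deferred"' | tally_dsn
```

If 4.4.x dominates the tally on the smtp-out tier, that points toward the NAT and DNS check in step 3 before suspecting any remote MX.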

Escalation