Cabalmail

Host your own email and enhance your privacy

Runbook: SendmailBouncedSpike

Fired by the Prometheus rule SendmailBouncedSpike when more than 15 sendmail “dsn=5” log lines, aggregated across all three mail tiers, appear in the last 30 minutes and the condition is sustained for 15 minutes.

What this means

The pattern matches any 5.x.y DSN code — sendmail’s permanent-failure (bounce) class. Permanent failures are returned to the sender as bounce messages and are usually a symptom of a real problem, not transient network noise.
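
As a quick illustration of the class digit, the sketch below maps the first digit of an enhanced status code to its class as defined in RFC 3463 (2 = success, 4 = persistent transient failure, 5 = permanent failure). The function name `dsn_class` is hypothetical and not part of the Cabalmail tooling.

```shell
# dsn_class: hypothetical helper for illustration, not part of Cabalmail.
# Prints the RFC 3463 class of an enhanced status code such as 5.1.1.
dsn_class() {
  case "$1" in
    2.*) echo "success" ;;
    4.*) echo "transient" ;;
    5.*) echo "permanent" ;;   # the class this alert counts
    *)   echo "unknown" ;;
  esac
}
```

For example, `dsn_class 5.1.1` prints `permanent`, while a deferral such as `4.4.1` prints `transient`.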

Who/what is impacted

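Like SendmailDeferredSpike, this metric is summed across log groups, so the alert alone does not say which tier is bouncing, and the impact depends heavily on the tier. The checks below cover smtp-out (deliverability) and smtp-in (unknown recipients).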

First three things to check

  1. Which tier?
    for tier in imap smtp-in smtp-out; do
      echo "=== $tier ==="
      aws logs tail /ecs/cabal-$tier --since 30m --filter-pattern '"dsn=5"' | wc -l
    done
    

    The tier with the largest count is the offender. Look at recent 5.x.y codes for that tier:

    aws logs tail /ecs/cabal-<tier> --since 30m --filter-pattern '"dsn=5"' | grep -oE 'dsn=5\.[0-9]+\.[0-9]+(,? [^,]*)?' | sort | uniq -c | sort -rn | head
    
  2. For smtp-out (deliverability): confirm the bounces aren’t all targeting one provider:
    aws logs tail /ecs/cabal-smtp-out --since 30m --filter-pattern '"dsn=5"' | grep -oE 'to=<[^@]*@[^>]*>' | awk -F'[@>]' '{print $2}' | sort | uniq -c | sort -rn | head
    

    One provider dominating → that provider blocked us. Many providers → a global reputation issue. Check that our DKIM is signing correctly, and review the DMARC reports for the affected period (DMARC reports arrive in the dmarc inbox).

  3. For smtp-in (recipient unknown): check the address map is fresh:
    TASK=$(aws ecs list-tasks --cluster <cluster> --service-name cabal-smtp-in --query 'taskArns[0]' --output text)
    aws ecs execute-command --cluster <cluster> --task "$TASK" --container smtp-in --interactive \
      --command "/bin/sh -c 'wc -l /etc/mail/virtusertable && date -r /etc/mail/virtusertable'"
    

    If the file is hours old, the reconfigure loop is stuck — see heartbeat-ecs-reconfigure.md.
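
The provider check in step 2 can be reduced to a single verdict. The sketch below reads sendmail log lines on stdin, extracts the recipient domain from each `to=<...>` field, and reports whether one domain accounts for more than half of the bounces. The function name `bounce_concentration` and the 50% cutoff are assumptions chosen for illustration, not values taken from the Cabalmail rules.

```shell
# bounce_concentration: hypothetical helper for illustration, not part of Cabalmail.
# Reads log lines on stdin and prints "<top domain> <share>% <verdict>".
# The 50% cutoff is an assumed value; adjust it to taste.
bounce_concentration() {
  # grep exits non-zero when nothing matches; "|| true" keeps the pipeline usable.
  { grep -oE 'to=<[^@<>]*@[^<>]*>' || true; } |
    awk -F'[@>]' '
      # $2 is the domain between "@" and ">"; fold case so Example.com == example.com.
      { count[tolower($2)]++; total++ }
      END {
        if (total == 0) { print "no recipients found"; exit }
        for (d in count) if (count[d] > max) { max = count[d]; top = d }
        share = int(100 * max / total)
        verdict = (share > 50) ? "single-provider" : "spread"
        printf "%s %d%% %s\n", top, share, verdict
      }'
}
```

It can be fed from the same log query used in step 2, for example `aws logs tail /ecs/cabal-smtp-out --since 30m --filter-pattern '"dsn=5"' | bounce_concentration`. A `single-provider` verdict points at a block by that provider; `spread` points at the broader reputation checks.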

Escalation