Cabalmail

Host your own email and enhance your privacy

View the Project on GitHub cabalmail/cabal-infra

Runbook: Lambda5xxSpike

Fired by Prometheus rule Lambda5xxSpike — API Gateway 5xx rate over 5% for 5 min on a cabal-* API.

What this means

API Gateway returned 5xx for more than 5% of requests, sustained for 5 minutes. The cause is almost always a Lambda function behind the API: a new deploy with a bug, an unhandled runtime exception, an IAM permission missing on the function's execution role, or an upstream dependency (DynamoDB / IMAP / SES) timing out.
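
Each of the causes listed above tends to leave a different signature in the function's logs. The following is a minimal sketch, not part of the Cabalmail tooling, that buckets log lines read from stdin by likely cause. The function name classify_lambda_errors and the match patterns are assumptions based on common AWS and Python runtime messages; adjust them to whatever the functions actually log.

```shell
# Hypothetical helper: count log lines per likely cause.
# Patterns are assumptions drawn from typical Lambda/boto3 messages.
classify_lambda_errors() {
  awk '
    /Task timed out after/                   { c["timeout"]++; next }  # Lambda timeout
    /AccessDenied|not authorized to perform/ { c["iam"]++;     next }  # missing IAM permission
    /ImportModuleError|SyntaxError/          { c["deploy"]++;  next }  # broken package or code
    /Traceback|Exception|ERROR/              { c["runtime"]++; next }  # any other error line
    END {
      n = split("timeout iam deploy runtime", k, " ")
      for (i = 1; i <= n; i++) printf "%s=%d\n", k[i], c[k[i]] + 0
    }'
}

# Example use against a live log group:
# aws logs tail /aws/lambda/<function-name> --since 15m | classify_lambda_errors
```

The first matching rule wins, so a timeout line that also contains the word ERROR is counted once, as a timeout.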

Who/what is impacted

The label apiname identifies which API. For Cabalmail there is one API (cabal-api) with one stage (prod) — every API call from the admin web app goes through it. A sustained 5xx spike means address management, message reads, sends, and folder operations are broken.
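
To confirm the scale of the impact outside Grafana, the 5xx percentage can be recomputed from CloudWatch. This is a sketch under assumptions: it presumes a REST API that publishes the AWS/ApiGateway metrics 5XXError and Count with ApiName and Stage dimensions, and the helper names sum_metric and error_pct are made up for illustration.

```shell
# Hypothetical helper: sum one API Gateway metric over the last N minutes.
# usage: sum_metric <metric> <api-name> <stage> <minutes>
sum_metric() (
  end=$(date -u +%FT%TZ)
  if date --version >/dev/null 2>&1; then
    start=$(date -u -d "-$4 minutes" +%FT%TZ)   # GNU date
  else
    start=$(date -u -v-"$4"M +%FT%TZ)           # BSD/macOS date
  fi
  aws cloudwatch get-metric-statistics --namespace AWS/ApiGateway \
    --metric-name "$1" --dimensions Name=ApiName,Value="$2" Name=Stage,Value="$3" \
    --start-time "$start" --end-time "$end" --period 60 --statistics Sum \
    --query 'Datapoints[].Sum' --output text | tr '\t' '\n' | awk '{s+=$1} END{print s+0}'
)

# Hypothetical helper: percentage of errors, guarding against zero traffic.
# usage: error_pct <errors> <total>
error_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN { if (t > 0) printf "%.1f\n", 100 * e / t; else print "0.0" }'
}

# Example use for the single Cabalmail API and stage:
# errors=$(sum_metric 5XXError cabal-api prod 15)
# total=$(sum_metric Count cabal-api prod 15)
# echo "5xx rate: $(error_pct "$errors" "$total")%"
```

A result well above 5 agrees with the alert; a result near zero suggests the spike has already passed.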

First three things to check

  1. Which Lambda is failing? Pull recent errors across all cabal-* Lambdas:
    # Window start: GNU date if available, otherwise the BSD/macOS form
    if date --version >/dev/null 2>&1; then start=$(date -u -d '-15 minutes' +%FT%TZ); else start=$(date -u -v-15M +%FT%TZ); fi
    end=$(date -u +%FT%TZ)
    for fn in $(aws lambda list-functions --query 'Functions[?starts_with(FunctionName,`cabal-`)].FunctionName' --output text); do
      count=$(aws cloudwatch get-metric-statistics --namespace AWS/Lambda --metric-name Errors --dimensions Name=FunctionName,Value="$fn" --start-time "$start" --end-time "$end" --period 60 --statistics Sum --query 'Datapoints[].Sum' --output text | tr '\t' '\n' | awk '{s+=$1} END{print s+0}')
      [ "$count" != "0" ] && echo "$fn: $count errors in last 15 min"
    done
    
  2. What’s the error? For the function from step 1, tail the log group for stack traces:
    aws logs tail /aws/lambda/<function-name> --since 15m --filter-pattern '?ERROR ?Exception ?Traceback'
    
  3. Is it environment-wide or one route? Check the Grafana API Gateway & Lambda dashboard. If every route is failing, suspect the API Gateway authorizer, the shared helper.py module bundled into every function, or DynamoDB. If only one route is failing, the most recent change to that function’s code is the prime suspect.
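
Since a recent code change is the usual suspect, it can help to see which functions were deployed most recently. A minimal sketch, assuming the LastModified timestamp reported by aws lambda list-functions reflects the latest deploy; the helper names newest_first and recent_deploys are hypothetical.

```shell
# Hypothetical helper: keep the N newest lines (default 5).
# ISO 8601 timestamps at the start of each line sort correctly as text.
newest_first() { sort -r | head -n "${1:-5}"; }

# Hypothetical helper: cabal-* functions, most recently modified first.
recent_deploys() {
  aws lambda list-functions \
    --query 'Functions[?starts_with(FunctionName,`cabal-`)].[LastModified,FunctionName]' \
    --output text | newest_first "$@"
}

# Example use: show the three most recently modified functions.
# recent_deploys 3
```

If the failing function from step 1 sits at the top of this list, rolling back its last deploy is a reasonable first move.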

Escalation