
Runbook: ContainerRestartLoop

Fired by the Prometheus rule ContainerRestartLoop: an ECS service's RunningTaskCount, averaged over 1 h, has been below its DesiredTaskCount for 30 min.
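
To sanity-check the alert against the raw data, you can spot-check the underlying series. A minimal sketch, assuming the metrics are scraped from CloudWatch Container Insights (the ECS/ContainerInsights namespace is an assumption; cluster and service names are placeholders):

    # Pull the last hour of RunningTaskCount, 5-minute averages.
    # GNU date syntax; on macOS use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ
    aws cloudwatch get-metric-statistics \
      --namespace ECS/ContainerInsights \
      --metric-name RunningTaskCount \
      --dimensions Name=ClusterName,Value=<cluster> Name=ServiceName,Value=<service> \
      --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --period 300 --statistics Average

Swap --metric-name to DesiredTaskCount to compare the two series directly.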

What this means

An ECS service is failing to keep its desired number of tasks alive. Either every new task crashes shortly after start, or the cluster has no capacity to schedule them.
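
The gap between running and desired counts is visible per service. A quick sketch to survey the whole cluster (the cluster name is a placeholder):

    # Print name, running count, and desired count for every service in the cluster.
    for SVC in $(aws ecs list-services --cluster <cluster> --query 'serviceArns[]' --output text); do
      aws ecs describe-services --cluster <cluster> --services "$SVC" \
        --query 'services[0].[serviceName,runningCount,desiredCount]' --output text
    done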

Who/what is impacted

The alert's service_name label identifies the affected service. For Cabalmail:

First three things to check

  1. Is each new task crashing fast, or is the cluster out of capacity?
    aws ecs describe-services --cluster <cluster> --services <service> \
      --query 'services[0].{running:runningCount,desired:desiredCount,events:events[0:8]}'
    

    Events like “unable to place a task” → capacity. “Task stopped: Essential container in task exited” → crash loop.

  2. For a crash loop: pull a recently stopped task's stop reason and logs (a rollback sketch follows this list):
    TASK=$(aws ecs list-tasks --cluster <cluster> --service-name <service> \
      --desired-status STOPPED --query 'taskArns[0]' --output text)
    aws ecs describe-tasks --cluster <cluster> --tasks "$TASK" \
      --query 'tasks[0].containers[0].{reason:reason,exitCode:exitCode}'
    aws logs tail /ecs/<service-log-group> --since 30m \
      --filter-pattern '?ERROR ?Exception ?error ?fatal' | head -50

    An exitCode of 137 usually means the container was killed (often the OOM killer), 139 is a segfault, and other non-zero codes typically point at the application itself.
    
  3. For capacity exhaustion: list cluster instances and check CPU/memory headroom:
    aws ecs describe-container-instances --cluster <cluster> \
      --container-instances $(aws ecs list-container-instances --cluster <cluster> --query 'containerInstanceArns[]' --output text) \
      --query 'containerInstances[].{instance:ec2InstanceId,cpu:remainingResources[?name==`CPU`].integerValue|[0],mem:remainingResources[?name==`MEMORY`].integerValue|[0]}'
    

    If every instance has 0 free CPU/memory, the ASG min-size is too low or a bigger instance type is needed; see the capacity sketch after this list.
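
For a crash loop that began right after a deploy, the fastest mitigation is often pinning the service back to the previous task definition revision. A minimal sketch, assuming the prior revision is still registered (family and revision are placeholders):

    # List recent revisions in the task definition family, newest first.
    aws ecs list-task-definitions --family-prefix <family> --sort DESC --max-items 5
    # Pin the service to the last known-good revision; ECS rolls tasks over to it.
    aws ecs update-service --cluster <cluster> --service <service> \
      --task-definition <family>:<revision>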
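
For capacity exhaustion, the usual quick fix is raising the Auto Scaling group's floor so ECS has room to place tasks. A minimal sketch, assuming the cluster's instances come from an ASG (the group name and sizes are placeholders):

    # Inspect existing ASG sizes to find the group backing the cluster.
    aws autoscaling describe-auto-scaling-groups \
      --query 'AutoScalingGroups[].{name:AutoScalingGroupName,min:MinSize,desired:DesiredCapacity,max:MaxSize}'
    # Raise the floor and desired capacity; the values here are placeholders.
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name <asg-name> \
      --min-size 3 --desired-capacity 4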

Escalation