Fired by the Prometheus rule `ContainerRestartLoop`: an ECS service has had `RunningTaskCount < DesiredTaskCount` (averaged over 1 h) for 30 minutes.
An ECS service is failing to keep its desired number of tasks alive. Either every new task crashes shortly after start, or the cluster has no capacity to schedule them.
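For reference, the firing rule has roughly the shape below. Treat it as a sketch, not the deployed rule: the metric and label names assume cabal-cloudwatch-exporter publishes the ECS/ContainerInsights namespace with default cloudwatch_exporter naming.

```yaml
# Sketch only: metric/label names are assumptions based on default
# cloudwatch_exporter naming for the ECS/ContainerInsights namespace.
- alert: ContainerRestartLoop
  expr: >
    avg_over_time(aws_ecs_containerinsights_running_task_count_average[1h])
    < aws_ecs_containerinsights_desired_task_count_average
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service_name }} cannot hold its desired task count"
```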
The label service_name identifies the service. For Cabalmail:
- `cabal-imap`, `cabal-smtp-in`, `cabal-smtp-out`: a mail tier is down or degraded. Probe-failure runbooks (probe-failure.md) usually fire alongside.
- `cabal-uptime-kuma`, `cabal-ntfy`: the monitoring path itself is degraded; pages may stop arriving.
- `cabal-healthchecks`: heartbeats stop being recorded; every job will appear “missed” within a schedule cycle.
- `cabal-prometheus`, `cabal-alertmanager`, `cabal-grafana`: metrics-side alerting is offline. The other paths (Kuma, Healthchecks) still page.
- `cabal-cloudwatch-exporter`, `cabal-blackbox-exporter`, `cabal-node-exporter`: scrape targets disappear and downstream Prometheus alerts go quiet (silent failure: bad).

Start diagnosis with the service's running/desired counts and its recent events:

```bash
aws ecs describe-services --cluster <cluster> --services <service> \
  --query 'services[0].{running:runningCount,desired:desiredCount,events:events[0:8]}'
```
Events like “unable to place a task” → capacity. “Task stopped: Essential container in task exited” → crash loop.
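The query above returns only the eight most recent events; if that is not enough to spot the pattern, dump more of them in one column (a sketch assuming jq is installed):

```bash
# Print the 20 most recent service events, one message per line (assumes jq)
aws ecs describe-services --cluster <cluster> --services <service> \
  --query 'services[0].events[0:20].message' --output json | jq -r '.[]'
```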
For a crash loop, pull the stop reason and exit code from the most recently stopped task:

```bash
# Most recently stopped task for the service
TASK=$(aws ecs list-tasks --cluster <cluster> --service-name <service> \
  --desired-status STOPPED --query 'taskArns[0]' --output text)
# Why its essential container exited
aws ecs describe-tasks --cluster <cluster> --tasks "$TASK" \
  --query 'tasks[0].containers[0].{reason:reason,exitCode:exitCode}'
```
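A single stopped task can be an outlier; checking the last few confirms whether the exit code is consistent (same placeholders as above):

```bash
# Compare stop reasons and exit codes across the last five stopped tasks
for t in $(aws ecs list-tasks --cluster <cluster> --service-name <service> \
             --desired-status STOPPED --query 'taskArns[0:5]' --output text); do
  aws ecs describe-tasks --cluster <cluster> --tasks "$t" \
    --query 'tasks[0].{reason:stoppedReason,exit:containers[0].exitCode}' --output text
done
```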
Then look for error-level lines in the service's log group:

```bash
# '?A ?B' is CloudWatch Logs OR-match syntax
aws logs tail /ecs/<service-log-group> --since 30m \
  --filter-pattern '?ERROR ?Exception ?error ?fatal' | head -50
```
For a placement failure, check how much CPU and memory is left on each container instance:

```bash
# Remaining schedulable CPU/memory per container instance
aws ecs describe-container-instances --cluster <cluster> \
  --container-instances $(aws ecs list-container-instances --cluster <cluster> \
    --query 'containerInstanceArns[]' --output text) \
  --query 'containerInstances[].{instance:ec2InstanceId,cpu:remainingResources[?name==`CPU`].integerValue|[0],mem:remainingResources[?name==`MEMORY`].integerValue|[0]}'
```
If every instance has 0 free CPU/memory, the ASG min-size is too low or a bigger instance type is needed.
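A minimal sketch of the short-term capacity fix, assuming `<ecs-asg>` is the cluster's Auto Scaling group (all names here are placeholders):

```bash
# Inspect current sizing, then raise the floor
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <ecs-asg> \
  --query 'AutoScalingGroups[0].{min:MinSize,desired:DesiredCapacity,max:MaxSize}'
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <ecs-asg> \
  --min-size <new-min> --desired-capacity <new-min>
```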
- Bad deploy (tasks crash right after a release): find the offending build in the latest `app.yml` workflow run (the `docker` job). The fastest rollback is to call `.github/scripts/deploy-ecs-service.sh <tier> sha-<previous-8>` against the running cluster; the script clones the live task definition, swaps in the older tag, registers a new revision, and rolls the service (see the sketch after this list for recovering the previous tag).
- Permission errors on startup (e.g. `failed to chown` in the logs): see the Phase 1/2/3 troubleshooting notes in docs/monitoring.md. A new image that adds a chown-on-start shim recreates the family of bug we've already hit on Kuma, Healthchecks, and Grafana.
- Severity: critical. If multiple services are in restart loops simultaneously, suspect the cluster instance role / EFS / network, not the services themselves.
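To recover the previous tag for the rollback, one option is to read the image off the prior task definition revision (a sketch; `<family>` and `<previous-revision>` are placeholders):

```bash
# Image in the live revision vs. the one before it
aws ecs describe-task-definition --task-definition <family> \
  --query 'taskDefinition.containerDefinitions[0].image'
aws ecs describe-task-definition --task-definition <family>:<previous-revision> \
  --query 'taskDefinition.containerDefinitions[0].image'
# Then roll the service back to the older tag
.github/scripts/deploy-ecs-service.sh <tier> sha-<previous-8>
```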