Runbook: NodeHighCPU
Fired by Prometheus rule NodeHighCPU — host CPU above 85% for 15 min, sourced from the node_exporter daemon.
What this means
A specific EC2 instance in the ECS cluster has been near CPU saturation for 15 minutes. The instance label identifies the host (Cloud Map A-record / IP).
node_exporter runs as a DAEMON ECS service so each EC2 host reports independently — this rule fires on a per-host basis, not cluster-wide.
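To see the live per-host CPU that drives this alert, you can query Prometheus directly. A minimal sketch, assuming the monitoring Prometheus is reachable at an internal endpoint (PROM_URL below is a placeholder) and using a typical node_exporter CPU expression; the rule definition in the Prometheus config is authoritative:
PROM_URL="http://<prometheus-host>:9090"   # placeholder; point at the monitoring Prometheus
QUERY='100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
# per-host CPU % right now; jq is optional, just for readable output
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq '.data.result[] | {host: .metric.instance, cpu_pct: .value[1]}'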
Who/what is impacted
Tasks running on that EC2 instance are getting throttled at the kernel level. For Cabalmail:
- Mail-tier tasks slow down (longer IMAP responses, longer queue times).
- Monitoring stack hosts: Prometheus scraping slows, dropping samples; Grafana queries get sluggish.
- The NAT instance is on a separate ASG, not visible here — see Platform / NAT for that signal.
First three things to check
- Which container on the host is using the CPU? ECS doesn’t show this directly, but you can open a shell on the instance via SSM Session Manager and inspect:
INSTANCE_ID=$(aws ec2 describe-instances --filters Name=private-ip-address,Values=<host-ip> \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)
aws ssm start-session --target "$INSTANCE_ID"
# then on the host:
docker ps --no-trunc | head       # which containers are running
docker stats --no-stream          # one-shot per-container CPU/memory snapshot
# to map a hot container back to its ECS task, see the sketch after this list
- Is it a sustained workload or a runaway? A continuous 90% over hours on a host running mail-tier tasks usually means a stuck procmail or a tight-loop brute-force attempt against IMAP. A slow climb over days points to a memory leak that’s now causing GC churn; confirm with NodeHighMemory. A quick way to compare short and long windows is sketched after this list.
- Is the cluster underprovisioned overall, or is it just this host? Check the ECS cluster’s CPU reservation and per-host utilization (CloudWatch sketch after this list): if every host is >80%, the cluster needs more capacity (or smaller services). If only one host is hot, something is wrong with that host’s tasks specifically.
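If docker stats shows one hot container, it can be mapped back to its ECS task using the labels the ECS agent sets on every container it launches. A sketch (the container ID is whatever docker stats showed):
CONTAINER_ID=<hot-container-id>
# the ECS agent labels each container with its task ARN and container name
docker inspect --format \
  'task={{index .Config.Labels "com.amazonaws.ecs.task-arn"}}  container={{index .Config.Labels "com.amazonaws.ecs.container-name"}}' \
  "$CONTAINER_ID"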
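To separate a sustained load from a recent runaway or a slow climb, compare the hot host’s average CPU over a short and a long window. A sketch against the Prometheus HTTP API, reusing the placeholder PROM_URL from above (<host-ip>:9100 is the node_exporter target for the alerting host):
for WIN in 1h 24h; do
  QUERY="100 - avg(rate(node_cpu_seconds_total{mode=\"idle\",instance=\"<host-ip>:9100\"}[$WIN])) * 100"
  echo -n "avg CPU over $WIN: "
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1]'
done
# similar numbers over both windows = sustained; a big gap = recent runaway or slow climb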
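For the cluster-wide view, the ECS console shows reservation on the cluster page; from the CLI, a sketch using the CPUReservation metric in the AWS/ECS CloudWatch namespace (the cluster name is a placeholder, and CPUUtilization works the same way):
# start/end times use GNU date syntax; adjust on BSD/macOS
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS --metric-name CPUReservation \
  --dimensions Name=ClusterName,Value=<cluster> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Average \
  --query 'Datapoints[].Average' --output text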
Escalation
- Single-host hotspot: drain and replace the EC2:
aws ecs update-container-instances-state --cluster <cluster> \
--container-instances <ci-id> --status DRAINING
# wait for tasks to drain, then terminate the EC2; the ASG replaces it (verification sketch at the end of this runbook)
- Cluster-wide saturation: scale the ASG up or move to a larger instance class (ASG sketch at the end of this runbook). The monitoring services are memory-heavy, not CPU-heavy; if CPU is the bottleneck, expect mail-tier load (a brute-force attempt is the most common cause; check fail2ban activity).
- This is warning severity. Sustained CPU saturation will eventually cause container restart loops or probe failures, both of which escalate to critical.
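To confirm a drain has finished before replacing the host, a sketch (cluster and container-instance IDs are placeholders; INSTANCE_ID is the value looked up in the first check above):
# poll until no tasks remain on the draining instance
aws ecs describe-container-instances --cluster <cluster> --container-instances <ci-id> \
  --query 'containerInstances[0].[status,runningTasksCount]' --output text
# once status is DRAINING and runningTasksCount is 0, terminate; the ASG brings up a replacement
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"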
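For cluster-wide saturation, a sketch for adding a host by bumping the ECS Auto Scaling group (the ASG name and target count are placeholders; moving to a larger instance class is a launch-template change instead):
# current sizing, then raise the desired capacity by one
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <ecs-asg-name> \
  --query 'AutoScalingGroups[0].[MinSize,DesiredCapacity,MaxSize]' --output text
aws autoscaling set-desired-capacity --auto-scaling-group-name <ecs-asg-name> \
  --desired-capacity <current-desired+1>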