Cabalmail

Host your own email and enhance your privacy

View the Project on GitHub cabalmail/cabal-infra

Runbook: NodeDiskSpaceLow

Fired by the Prometheus rule NodeDiskSpaceLow: a non-tmpfs, non-overlay filesystem on a cluster EC2 instance has been above 85% used for 15 minutes.

What this means

A persistent disk on a cluster instance is filling up. The alert’s mountpoint label identifies which one. Two common culprits on Cabalmail’s EC2 nodes:

Bind-mounted EFS access points use the nfs4 fstype, and the rule’s fstype!~"tmpfs|overlay" matcher does not exclude them (nfs4 doesn’t match tmpfs|overlay), so they can still fire this alert. However, NFS reports usage against the EFS file system as a whole, not “this host”. If you see this alert on an nfs mountpoint, treat it as an EFS sizing issue and consult efs-burst-credits-low.md instead.
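To decide which case applies, it can help to check the filesystem type behind the alerting mountpoint. The following is a minimal sketch, not part of the Cabalmail tooling: classify_fstype is a hypothetical helper name, MOUNTPOINT stands for the value of the alert’s mountpoint label, and the usage line assumes findmnt (util-linux) is available on the node.

```shell
# Hypothetical helper: map a filesystem type to the runbook that applies.
classify_fstype() {
  case "$1" in
    nfs|nfs4)      echo "efs" ;;       # EFS access point: see efs-burst-credits-low.md
    tmpfs|overlay) echo "excluded" ;;  # matched by the rule's tmpfs|overlay exclusion
    *)             echo "local" ;;     # persistent local disk: continue with this runbook
  esac
}

# Usage on the node (MOUNTPOINT is the alert's mountpoint label value):
#   classify_fstype "$(findmnt -n -o FSTYPE --target "$MOUNTPOINT")"
```

A result of efs points to the EFS runbook, while local means the checks in this runbook apply.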

Who/what is impacted

A full disk is one of the most catastrophic states a Linux box can be in:

First three things to check

  1. What’s using the space?
    INSTANCE_ID=$(aws ec2 describe-instances --filters "Name=private-ip-address,Values=<host-ip>" --query 'Reservations[0].Instances[0].InstanceId' --output text)
    aws ssm start-session --target "$INSTANCE_ID"
    df -h /
    sudo du -shx /var/lib/docker/* 2>/dev/null | sort -rh | head
    sudo du -shx /var/log/* | sort -rh | head
    
  2. Are there unused Docker images? ECS leaves old images on disk. A simple GC that removes images not used by any container and created more than 72 hours ago:
    sudo docker image prune -af --filter "until=72h"
    

    Don’t run docker system prune blindly: used wrongly, it will destroy active task volumes.

  3. Is journald rotating? If /var/log/journal is the heavy hitter, vacuum it:
    sudo journalctl --vacuum-time=3d
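
After any of the cleanups above, it is worth confirming that the filesystem has dropped back under the alert threshold. A minimal sketch, assuming GNU df (for --output) on the node: still_above_threshold is a hypothetical helper name, MOUNTPOINT stands for the alert’s mountpoint label value, and the default of 85 is taken from the alert description.

```shell
# Hypothetical helper: succeed (exit 0) if usage is still above the threshold.
# $1 = percent used (digits, optional trailing %)
# $2 = threshold in percent (defaults to 85, the alert's threshold)
still_above_threshold() {
  _used=${1%\%}                      # strip a trailing % if present
  [ "$_used" -gt "${2:-85}" ]
}

# Usage on the node (assumes GNU df, which supports --output):
#   used=$(df --output=pcent "$MOUNTPOINT" | tail -n 1 | tr -d ' ')
#   still_above_threshold "$used" && echo "still above 85%, keep digging"
```

If the helper still succeeds after pruning images and vacuuming the journal, the space is being used elsewhere and the du output from step 1 is the place to look next.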
    

Escalation