Monitoring & Alerting

The 0.7.0 release adds an optional monitoring stack on top of the existing mail infrastructure: black-box uptime monitoring, heartbeat monitoring for scheduled jobs, a Prometheus / Alertmanager / Grafana metrics stack, log-derived alerts via CloudWatch metric filters, and a runbook for every alert that can fire. All of it routes through a push-notification path (Pushover + self-hosted ntfy) that bypasses Cabalmail’s own email tier, so the operator stays reachable during a mail outage.

See docs/0.7.0/monitoring-plan.md for the design rationale. This document is the operator’s runbook for enabling the stack and completing first-boot configuration. All steps are required unless explicitly marked optional.

The stack is disabled by default. When enabled it deploys Prometheus, Alertmanager, and Grafana, the blackbox / cloudwatch / node Prometheus exporters, Uptime Kuma, ntfy, Healthchecks, and the alert_sink and healthchecks_iac Lambdas, all covered in the steps below.

1. Create your Pushover account and application

Pushover is the “wake someone up” channel – priority-1 pushes bypass Do Not Disturb on iOS and Android. It is paid: $5 one-time per mobile platform you intend to receive alerts on, after a 30-day trial.

  1. Go to https://pushover.net/signup and create an account. Verify your email.
  2. Install the Pushover app from the App Store / Play Store and log in. After login you’ll see your user key on the app’s home screen and on https://pushover.net.
  3. On the Pushover site, open Your Applications -> Create an Application/API Token. Name it cabalmail-alerts, type Application. Accept the terms. You’ll get an API Token/Key.
  4. Save both values somewhere temporarily (password manager). You’ll put them into SSM in step 5; an optional sanity check follows this list.
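
Before moving on, you can confirm the pair from the shell by sending yourself a test push. A minimal sketch against Pushover’s messages API (substitute the values from steps 2 and 3):

# Optional sanity check: a test push using the new credentials.
curl -s \
  --form-string token='<app-token-from-step-3>' \
  --form-string user='<user-key-from-step-2>' \
  --form-string message='cabalmail-alerts test' \
  https://api.pushover.net/1/messages.json

A {"status":1} response and a push on the phone confirm both values.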

2. Enable the flag per environment

The monitoring stack is gated by var.monitoring. Set it to true only in the environments where you want it on (prod always; stage/dev only while actively testing).

In your GitHub repository settings, go to Settings -> Environments -> environment -> Variables and add:

| Variable | Example value | Notes |
| --- | --- | --- |
| TF_VAR_MONITORING | true | Gates the whole stack. Set to true in prod; leave as false (or unset) elsewhere. |
| TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN | false | Controls whether the Healthchecks signup form accepts new accounts. Defaults to false (closed) when unset; flip to true for the bootstrap signup in step 11, then back to false. Has no effect when TF_VAR_MONITORING=false. |
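
If you drive repository settings from the GitHub CLI instead of the web UI, the equivalent is below (run from a checkout of the repo; gh >= 2.31 added variable support, and the environment name prod is an example):

gh variable set TF_VAR_MONITORING --env prod --body true
gh variable set TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN --env prod --body false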

3. Build container images (first time only)

Prometheus, Alertmanager, Grafana, the three Prometheus exporters, Uptime Kuma, ntfy, and Healthchecks all ship as ECR images built by the docker job in .github/workflows/app.yml. The first time you toggle TF_VAR_MONITORING=true after this release lands, run Build and Deploy Application with areas: docker first – this populates the new ECR repositories with sha-<first-8> tags. Then run the Terraform workflow.

If you flip TF_VAR_MONITORING to true without the images present, ECS keeps the new services in pending state until the images appear; nothing else breaks, but no progress is made until the build runs.
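
To check whether the images are in place, list the tags in each new repository. The repository name below is a placeholder – enumerate the real ones with aws ecr describe-repositories:

# Expect at least one sha-<first-8> tag per monitoring repository.
aws ecr describe-images --repository-name <repo> \
  --query 'imageDetails[].imageTags' --output text | tr '\t' '\n' | grep '^sha-'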

4. Apply Terraform

Kick off the Build and Deploy Infrastructure workflow (same process as in setup.md); the apply creates the monitoring resources gated by var.monitoring.

Note the Terraform output alert_sink_function_url – you will need it in step 9 and step 14.
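
If the workflow logs have scrolled away, the URL can be recovered from the Lambda directly (assuming the function name cabal-alert-sink used elsewhere in this document):

aws lambda get-function-url-config --function-name cabal-alert-sink \
  --query FunctionUrl --output text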

5. Seed the Pushover SSM parameters

aws ssm put-parameter --name /cabal/pushover_user_key  --type SecureString --overwrite --value '<user-key-from-step-1>'
aws ssm put-parameter --name /cabal/pushover_app_token --type SecureString --overwrite --value '<app-token-from-step-1>'

Terraform won’t touch these on subsequent applies (ignore_changes = [value]).

6. Bootstrap the ntfy admin user and publisher token

ntfy ships with NTFY_AUTH_DEFAULT_ACCESS=deny-all; nobody can read or write until you create an admin. Do it once via ECS Exec.

  1. Find the ntfy task ARN:
    CLUSTER=<cluster-name>
    TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name cabal-ntfy --query 'taskArns[0]' --output text)
    
  2. Open a shell in the container:
    aws ecs execute-command --cluster "$CLUSTER" --task "$TASK" --container ntfy --interactive --command "/bin/sh"
    
  3. Inside the container, create the admin user. You’ll be prompted for a password; store it in your password manager – you’ll need it on the phone in step 7.
    ntfy user add --role=admin admin
    
  4. Generate a bearer token for the Lambda:
    ntfy token add admin
    

    Copy the tk_... token it prints.

  5. Exit the container. Store the token in SSM:
    aws ssm put-parameter --name /cabal/ntfy_publisher_token --type SecureString --overwrite --value 'tk_...'
    

The alert_sink Lambda caches secrets at cold start, so the next push after the token is set triggers a re-fetch automatically.
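
You can exercise the token right away – ntfy accepts a plain POST to the topic URL with a bearer token, which is the same path the Lambda uses. If message caching is enabled (the ntfy default), the message will still be delivered once you subscribe in step 7:

# Publish a test message to the alerts topic with the new token.
curl -fsS \
  -H "Authorization: Bearer tk_..." \
  -d "ntfy publisher-token test" \
  https://ntfy.<control-domain>/alerts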

7. Subscribe your phone to ntfy

  1. Install the ntfy app from the App Store / Play Store.
  2. In the app, Settings -> Users (or similar), add a user for https://ntfy.<control-domain> with username admin and the password from step 6.
  3. Tap Subscribe to topic -> server https://ntfy.<control-domain>, topic alerts. The app shows 0 messages until the first alert fires.

8. First-boot configuration in Uptime Kuma

Uptime Kuma ships without any admin user; the first person to hit the UI creates one.

  1. Open https://uptime.<control-domain>/ in a browser. You will be redirected to the Cognito hosted UI to sign in.
  2. After the Cognito handshake you land on Kuma’s setup page. Create the admin account. Store the password in your password manager – Kuma does not use Cognito for its own identity; it has a separate local user.

9. Wire the Kuma webhook notification provider

In Kuma, add a new Notification provider of type Webhook. Point it at the alert_sink_function_url Terraform output from step 4 and set the X-Alert-Secret header to the shared secret in /cabal/alert_sink_secret (see “Secret rotation” below).

Click Test – you should receive a Pushover push and an ntfy notification within 30 seconds. If either is missing, check the alert_sink CloudWatch log group at /cabal/lambda/alert_sink for per-transport errors.
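
To test the sink without Kuma in the loop, you can POST to it directly. The payload below mimics the shape of Kuma’s webhook body (heartbeat / monitor / msg keys) – adjust it to whatever alert_sink/function.py actually parses:

SINK_URL='<alert_sink_function_url from step 4>'
SECRET=$(aws ssm get-parameter --name /cabal/alert_sink_secret \
  --with-decryption --query Parameter.Value --output text)
curl -fsS -X POST "$SINK_URL" \
  -H "X-Alert-Secret: $SECRET" -H 'Content-Type: application/json' \
  -d '{"heartbeat":{"status":0},"monitor":{"name":"IMAP TLS handshake"},"msg":"manual test"}'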

10. Create the uptime monitor set

In the Kuma dashboard, add one monitor for each row below. Attach the webhook notification to every monitor. The monitor names must match the keys in the _RUNBOOK_MAP in lambda/api/alert_sink/function.py; renaming a monitor without updating the map silently drops the runbook link from its push.

| Monitor | Type | Target | Interval | Retries |
| --- | --- | --- | --- | --- |
| IMAP TLS handshake | TCP port | imap.<control-domain>:993 | 60 s | 2 |
| SMTP relay (STARTTLS) | TCP port | smtp-in.<control-domain>:25 | 60 s | 2 |
| Submission (STARTTLS) | TCP port | smtp-out.<control-domain>:587 | 60 s | 2 |
| Submission (implicit TLS) | TCP port | smtp-out.<control-domain>:465 | 60 s | 2 |
| Admin app | HTTP(s) | https://admin.<control-domain>/ | 120 s | 2 |
| API round-trip (/list) | HTTP(s) | https://admin.<control-domain>/prod/list | 5 min | 2 |
| ntfy server health | HTTP(s) | https://ntfy.<control-domain>/v1/health | 120 s | 2 |
| Control-domain cert | Keyword | https://admin.<control-domain>/, keyword: any. Enable Certificate expiration notification: 21 / 7 / 1 days. | 4 h | 2 |

The /list probe needs a valid Cognito JWT. Seed it manually: sign in to the admin app, copy your id_token out of DevTools, and paste it as Authorization: Bearer <token> in the monitor’s headers. Rotate it monthly.
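
Sanity-check the token before pasting it in – this is the same request the monitor will make:

curl -fsS -H "Authorization: Bearer <id_token>" \
  https://admin.<control-domain>/prod/list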

11. First-boot configuration in Healthchecks

https://heartbeat.<control-domain>/ sits behind Cognito. The Cabalmail Cognito user pool is the front door; Healthchecks itself uses its own local accounts (Cognito gates whether you can reach the UI, Healthchecks gates whether you can change checks).

The Healthchecks task is wired to deliver mail through the IMAP tier’s local-delivery sendmail (EMAIL_HOST=imap.cabal.internal, port 25, no TLS, no auth) – see healthchecks.tf. This means magic-link signup and password reset work natively, as long as you sign up with a Cabalmail-hosted address whose mailbox you can read. Mail destined for non-Cabalmail addresses (gmail, etc.) won’t deliver from this Healthchecks instance – it can only deliver into Cabalmail’s own mailboxes.

  1. Open the signup form: in your GitHub environment for this stack, set TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN=true and re-run the Terraform workflow. The default is false (closed); flipping to true lets the Healthchecks Sign Up form accept new accounts.
  2. Pick a Cabalmail address you own to use as the operator login (e.g. admin@<one-of-your-mail-domains>). It needs to be a real address in cabal-addresses; if it isn’t, IMAP’s sendmail will TEMPFAIL the magic-link delivery.
  3. Open https://heartbeat.<control-domain>/ in a browser. Cognito challenges you. Sign in.
  4. On the Healthchecks landing page, click Sign Up and enter the address from step 2. Healthchecks emails a magic link; the link arrives in your Cabalmail inbox within seconds. Click it to set a password.
  5. Lock the door: set TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN=false (or just delete the variable – false is the default) and re-run Terraform.

Fallback if mail delivery doesn’t work (e.g. you want to bootstrap before adding a Cabalmail address, or the IMAP tier is down): create a superuser via ECS Exec and log in with the password form:

aws ecs execute-command --cluster <cluster> \
  --task $(aws ecs list-tasks --cluster <cluster> --service-name cabal-healthchecks --query 'taskArns[0]' --output text) \
  --container healthchecks --interactive --command /bin/sh
# inside the container:
cd /opt/healthchecks
# Clear any half-created account for the address, then create a fresh superuser.
./manage.py shell -c "from django.contrib.auth.models import User; User.objects.filter(email='you@example.com').delete()"
./manage.py createsuperuser

Then log in at https://heartbeat.<control-domain>/accounts/login/ using the password field (next to the magic-link button).

12. Bootstrap the Healthchecks API key

The cabal-healthchecks-iac Lambda needs a v3 API key to manage checks programmatically. The API has no endpoint to create keys, so this is a one-time manual step.

  1. In Healthchecks, click the gear icon (top-right) -> Project Settings -> API Access. Create a key labelled cabal-healthchecks-iac with read-write permissions. Copy the value.
  2. Seed it into SSM:
    aws ssm put-parameter --name /cabal/healthchecks_api_key --type SecureString --overwrite --value '<key-from-step-1>'
    

The auto-invocation of the IaC Lambda at apply time saw the placeholder and returned status: skipped – no error, but no checks were created either. Step 13 forces the reconcile now that the key is real.
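
To prove the key works before forcing the reconcile, hit the v3 API with it. If the ALB’s Cognito rule also intercepts /api/ paths, run this from inside the VPC against the service’s internal name instead:

KEY=$(aws ssm get-parameter --name /cabal/healthchecks_api_key \
  --with-decryption --query Parameter.Value --output text)
curl -fsS -H "X-Api-Key: $KEY" \
  https://heartbeat.<control-domain>/api/v3/checks/ | head -c 300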

13. Reconcile checks via the healthchecks_iac Lambda

aws lambda invoke --function-name cabal-healthchecks-iac /tmp/out.json && cat /tmp/out.json

Expected output: {"status":"ok","reconciled":6,"failed":0,"extras":[],"checks":[...]}. The Lambda upserts six checks defined in lambda/api/healthchecks_iac/config.py and writes each ping URL into the matching /cabal/healthcheck_ping_* SSM parameter:

| Check name | Schedule | Grace | Pinged by |
| --- | --- | --- | --- |
| certbot-renewal | Every 60 days | 24 h | cabal-certbot-renewal Lambda (EventBridge Scheduler). |
| aws-backup | Every 1 day | 6 h | cabal-backup-heartbeat Lambda (EventBridge JOB_COMPLETED). |
| dmarc-ingest | Every 6 hours | 2 h | cabal-process-dmarc Lambda. |
| ecs-reconfigure | Every 30 minutes | 30 m | reconfigure.sh loop in mail-tier containers. |
| cognito-user-sync | Every 30 days | 7 d | assign_osid post-confirmation Lambda. Fires only on user signup. |
| quarterly-review | Every 90 days | 14 d | Manual operator ping (see step 15). |

Consumers cache the ping URL at cold start (Lambdas) or task start (mail-tier containers). After step 13 populates the SSM values, force the consumers to pick them up:

# Mail-tier reconfigure loop:
for tier in imap smtp-in smtp-out; do
  aws ecs update-service --cluster <cluster> --service cabal-$tier --force-new-deployment
done
# Lambdas pick up new values on next cold start. Force one to verify:
aws lambda invoke --function-name cabal-certbot-renewal /tmp/out.json

14. Wire Healthchecks alerts back to alert_sink

The IaC Lambda creates checks but cannot create notification channels (the v3 API doesn’t expose channel CRUD). Create one webhook integration manually and assign it to every check.

In Healthchecks, Integrations -> Add Integration -> Webhook. Point the webhook at the alert_sink_function_url output from step 4 and set the same X-Alert-Secret header used for the Kuma provider in step 9.

Then assign the integration to every check from step 13 (toggle the check’s notification list to include the new integration). The source strings – healthchecks/certbot-renewal, healthchecks/aws-backup, etc. – must match the keys in the _RUNBOOK_MAP in alert_sink/function.py; renaming a check without updating the map drops the runbook link.

15. Bootstrap the quarterly-review check

The quarterly-review check has no automation pinging it on a schedule – the operator pings it manually after walking through the quarterly review (see “Quarterly monitoring review” below). Ping it once now so it starts green, with a 90-day clock:

PING_URL=$(aws ssm get-parameter --name /cabal/healthcheck_ping_quarterly_review --with-decryption --query Parameter.Value --output text)
curl -fsS "$PING_URL"

16. Set the Grafana admin password (optional)

Terraform auto-generates a random Grafana admin password on first apply (/cabal/grafana_admin_password, ignore_changes so subsequent applies don’t rotate it). Read it with:

aws ssm get-parameter --name /cabal/grafana_admin_password --with-decryption --query Parameter.Value --output text

Or set your own:

aws ssm put-parameter --name /cabal/grafana_admin_password --type SecureString --overwrite --value '<your-password>'

Grafana picks up the value at task start (GF_SECURITY_ADMIN_PASSWORD); a force-new-deployment rolls in any change.

17. First-boot configuration in Grafana

  1. Open https://metrics.<control-domain>/. Cognito challenges; sign in.
  2. You arrive as an anonymous Viewer. Navigate to Cabalmail -> Dashboards in the side menu – four provisioned dashboards (Mail Tiers, AWS Services, API Gateway & Lambda, Frontend) are already there. Initial charts will be empty for ~5 min until cloudwatch_exporter has scraped.
  3. To make changes – add a panel, edit a datasource, install a plugin – sign in to the local admin account at /login. The username is admin; the password is the SSM value from step 16.
  4. The Prometheus datasource is provisioned read-only at http://prometheus.cabal-monitoring.cabal.internal:9090. To verify, Connections -> Data sources -> Prometheus -> Test.

18. Verify Prometheus scrape targets

Prometheus has no public UI by default. To inspect scrape state:

CLUSTER=<cluster-name>
TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name cabal-prometheus --query 'taskArns[0]' --output text)
aws ecs execute-command --cluster "$CLUSTER" --task "$TASK" --container prometheus --interactive --command "/bin/sh"
# inside the container:
wget -qO- http://localhost:9090/api/v1/targets | head

Every target listed in prometheus.yml should be health: up. Targets to expect: 1x prometheus self-scrape, 1x alertmanager, 2x cloudwatch-exporter (primary + us-east-1), 5x blackbox probes (1x HTTP + 2x TCP for plaintext/STARTTLS + 2x TLS for implicit-TLS), and 1+x node-exporter (one per cluster EC2 instance).
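
For a quick tally without reading the raw JSON – every bucket in the output should be "up":

wget -qO- http://localhost:9090/api/v1/targets \
  | grep -o '"health":"[a-z]*"' | sort | uniq -c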

19. Acceptance checklist


Runbook framework

Every alert that can fire a push notification has a runbook in docs/operations/runbooks/. Each runbook follows the same shape: what the alert means, who/what is impacted, the first three things to check, and how to escalate. See the runbook README for the full index.

How the runbook URL reaches your phone: alert_sink looks up the alert’s source string in _RUNBOOK_MAP and attaches the matching runbook link to the outgoing push. When a push includes a runbook URL, you’ll see a tappable link that opens the runbook directly.

The map and the runbook files are version-controlled together; PRs that change one without the other should fail review.
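
A review aid worth scripting – a sketch that assumes the _RUNBOOK_MAP values embed the runbook filenames, which is how the map is described above:

# Runbook names referenced in the map but missing on disk, and vice versa.
grep -oE '[a-z0-9-]+\.md' lambda/api/alert_sink/function.py | sort -u > /tmp/mapped
(cd docs/operations/runbooks && ls *.md | sort) > /tmp/present
comm -3 /tmp/mapped /tmp/present   # empty output = map and files agree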

Tabletop exercises

Run after each meaningful monitoring change, and again at every quarterly review. If the expected push doesn’t arrive, fix the broken link before treating the tabletop as passing.

| Scenario | How to simulate | Expected page | Expected runbook |
| --- | --- | --- | --- |
| Mail queue backup (deferred) | ECS-Exec into the smtp-out task; inject 12 fake stat=Deferred log lines via logger -t sm-mta 'XXX: stat=Deferred' in <1 minute, then wait. | SendmailDeferredSpike (warning ntfy) within ~17 min (10 min rate window + 15 min for: hold) | sendmail-deferred-spike.md |
| IMAP cert expiring (control-domain) | In dev: re-issue a short-lived cert and wait, or temporarily replace the listener cert with a deliberately near-expiry one. Don’t do this in prod. | BlackboxTLSCertExpiringSoon (warning ntfy) and Kuma’s “Control-domain cert” 21-day notification | cert-expiring.md |
| Certbot Lambda silently disabled | Disable the EventBridge schedule on cabal-certbot-renewal in dev; wait past the 24 h grace. | healthchecks/certbot-renewal missed -> critical ntfy + Pushover | heartbeat-certbot-renewal.md |
| Healthchecks IaC drift | Add a check by hand in the Healthchecks UI without adding it to config.py. Re-invoke cabal-healthchecks-iac. | Lambda log line WARNING: extras in Healthchecks not in config.py: [...]. No alert fires (drift is logged, not paged). | (no runbook – drift is operator-cleaned) |

Quarterly monitoring review

The quarterly-review Healthchecks check pages the operator if 90+ days pass without a manual ping. The check is not automated. Nothing pings it on a schedule. The operator pings it after walking through the checklist in heartbeat-quarterly-review.md, which covers:

  1. Confirm dashboards still load. Open Grafana, walk through Mail Tiers / AWS Services / API Gateway / Frontend. Anything blank that should have data?
  2. Review silences in Alertmanager. Are any silences indefinite that should expire?
  3. Confirm the on-call number is still correct. Verify the Pushover / ntfy mobile apps still receive a test push.
  4. Review the noisiest and longest-silent alerts. Tighten or drop accordingly. Goal: zero false pages in a typical week.
  5. Walk at least one tabletop scenario from above.

When you’ve finished:

PING_URL=$(aws ssm get-parameter --name /cabal/healthcheck_ping_quarterly_review --with-decryption --query Parameter.Value --output text)
curl -fsS "$PING_URL"

What populates when (Grafana panels)

Some Grafana panels are blank for several minutes after the stack starts; some are blank by design.

If a panel is still blank after ~10 min and isn’t one of the blank-by-design cases, dig in – start with wget -qO- http://localhost:9090/api/v1/label/__name__/values from inside the Prometheus task to confirm whether the metric series even exists.

Verifying the data pipeline

When a panel has been blank since deployment – not just for a few minutes – the question is whether the data pipeline (CloudWatch -> cloudwatch_exporter -> Prometheus -> Grafana) is sound, or whether the metric genuinely has no datapoints. These commands cover both directions: confirm pipeline health, then inject synthetic data to make a “should be empty” panel light up briefly.

1. Confirm the pipeline is alive

From inside the Prometheus task (aws ecs execute-command --cluster cabal-mail --task <prom-task-arn> --container prometheus --interactive --command /bin/sh):

# All cloudwatch-derived metric names Prometheus has ever seen.
wget -qO- http://localhost:9090/api/v1/label/__name__/values \
  | tr ',' '\n' | grep '^"aws_' | sort

# Scrape target health -- both cloudwatch jobs should be `up`.
wget -qO- http://localhost:9090/api/v1/targets \
  | sed 's/,/\n/g' | grep -E 'job|health|lastError'

If aws_apigateway_count_sum is in the list but aws_lambda_duration_average is not, the exporter is reaching CloudWatch but Lambda specifically has no recent invocations to emit. If neither shows up, the cloudwatch-exporter target is down or the IAM/network path to CloudWatch is broken.

From inside the Prometheus task – which can already reach the exporter on its Cloud Map name – you can scrape /metrics directly without exec’ing into the exporter:

wget -qO- http://cloudwatch-exporter.cabal-monitoring.cabal.internal:9106/metrics \
  | grep '^aws_' | head -40
wget -qO- http://cloudwatch-exporter-us-east-1.cabal-monitoring.cabal.internal:9106/metrics \
  | grep '^aws_cloudfront' | head -20

Empty aws_* block here means the exporter is alive but failing CloudWatch calls. Check the task logs at /ecs/cabal-cloudwatch-exporter (or -us-east-1) for AccessDenied, throttling, or NoSuchKey errors:

aws logs tail /ecs/cabal-cloudwatch-exporter --since 30m --follow

The cloudwatch-exporter and blackbox-exporter services are also exec-enabled. Note that enable_execute_command only applies to tasks launched after the flag was set, so on a freshly-applied infra you may need to force a service redeploy (aws ecs update-service --cluster cabal-mail --service cabal-cloudwatch-exporter --force-new-deployment) before aws ecs execute-command works against an existing task.

2. Confirm the EFS throughput-mode caveat

BurstCreditBalance and PercentIOLimit only emit in bursting throughput / generalPurpose performance mode. AWS recently changed the default for new file systems to elastic, which doesn’t emit either. Check:

aws efs describe-file-systems \
  --query 'FileSystems[*].{id:FileSystemId,name:Name,throughput:ThroughputMode,perf:PerformanceMode}' \
  --output table

If throughput is elastic, the EFS BurstCreditBalance panel will stay empty by design – the new “EFS I/O bytes” panel (added alongside) covers the saturation signal in either mode. If throughput is bursting and the panel is still empty, the cloudwatch-exporter has a real problem reaching AWS/EFS.

3. Inject synthetic data to verify each “no data” panel

Each of these produces a single datapoint that should appear in Grafana within ~3 min (60s exporter scrape + 120s delay_seconds lag). Confirm the datapoint with a Prometheus query rather than waiting for the panel to refresh – e.g. from inside the Prometheus task, wget -qO- 'http://localhost:9090/api/v1/query?query=aws_lambda_errors_sum' returns the raw series.

Two confusables worth pinning down before reading the table.

The HTTP 200 from aws lambda invoke: that’s the status of the AWS Lambda API call (request accepted), not the function’s success. Function failure shows up as FunctionError: Unhandled in the invoke response JSON and a non-empty errorType in the response payload file. A successful invocation that produced a function-level error still increments the AWS/Lambda Errors metric.

Lambda invoke vs API Gateway: aws lambda invoke calls the Lambda API directly. It does not traverse API Gateway, so it never increments AWS/ApiGateway 5XXError no matter what payload you send. To move that metric you need an actual HTTP request to the API Gateway URL.
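
Both distinctions are visible from the shell (note: AWS CLI v2 may additionally need --cli-binary-format raw-in-base64-out for the literal JSON payload):

# stdout carries the Lambda API response; the outfile carries the function's payload.
aws lambda invoke --function-name cabal-list --payload '{}' /tmp/out.json > /tmp/meta.json
grep '"FunctionError"' /tmp/meta.json && echo "API call succeeded; function raised"
cat /tmp/out.json   # errorType / errorMessage when the function failed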

| Panel | Synthetic-data trigger |
| --- | --- |
| Lambda errors | Pick any cabal-* Lambda. They all dereference event['requestContext']['authorizer']['claims']['cognito:username'] at the top of handler, so an empty payload raises KeyError: 'requestContext' before any try/except. aws lambda invoke --function-name cabal-list --payload '{}' /tmp/out.json – the invoke response will include "FunctionError": "Unhandled" and /tmp/out.json will contain the traceback. That counts as one error. Repeat 3 more times to give the dashboard’s rate(...[5m]) something to integrate. Verify in CloudWatch first if you want to be sure: aws cloudwatch get-metric-statistics --namespace AWS/Lambda --metric-name Errors --dimensions Name=FunctionName,Value=cabal-list --start-time "$(date -u -v-15M +%FT%TZ 2>/dev/null \|\| date -u -d '15 min ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" --period 60 --statistics Sum. |
| Lambda throttles | Lower the function’s reserved concurrency to 0: aws lambda put-function-concurrency --function-name cabal-alert-sink --reserved-concurrent-executions 0. Invoke once: aws lambda invoke --function-name cabal-alert-sink --payload '{}' /tmp/out.json. The invoke returns a throttle error (visible in /tmp/out.json and as FunctionError). Restore immediately with aws lambda delete-function-concurrency --function-name cabal-alert-sink. Don’t try this on a Lambda that’s serving real traffic. |
| Lambda duration p95 | Just invoke any Lambda repeatedly so the dashboard has something to compute p95 against: for i in $(seq 1 5); do aws lambda invoke --function-name cabal-list --payload '{}' /tmp/out.json >/dev/null ; done. Errors are fine here – duration is recorded for both successful and failed invocations. |
| API Gateway 5xx rate | Has to go through API Gateway. The simplest natural way: use the admin app while a backend Lambda is misbehaving. To synthesise: temporarily revoke a Lambda’s IAM access to its data store (e.g. detach AWSLambdaDynamoDBExecutionRole from cabal-list-role), then load the Addresses page in the admin app – the Lambda will fail with a permission error, API Gateway will surface a 5xx, and the metric increments. Re-attach the policy immediately afterwards. If you’d rather not poke IAM, skip this; the Lambda-Errors path above exercises the same exporter code path, so a working Lambda Errors panel implies the API Gateway 5xx panel will also work when there’s organic 5xx traffic. |
| DynamoDB ConsumedRead/WriteCapacityUnits | Read: aws dynamodb scan --table-name cabal-addresses --max-items 1 >/dev/null. Write: any address-creation flow in the admin app, or a direct put-item against a throwaway PK. |
| DynamoDB ThrottledRequests | Hard to trigger on-demand without sustained load. Skip unless you’re explicitly testing throttling behavior. |
| EFS I/O bytes | ECS-Exec into an imap task and dd if=/dev/zero of=/var/spool/mail/canary bs=1M count=10 oflag=direct ; rm /var/spool/mail/canary. The 10 MiB write produces a visible spike on DataWriteIOBytes. |
| CloudFront request count / 5xx | for i in $(seq 1 20); do curl -fsS -o /dev/null https://<control-domain>/ ; done for the request-count panel. CloudFront 5xx is harder; temporarily mis-configure the origin (e.g. block CloudFront’s egress to S3 with a bucket policy deny for ~5 min) to force a real 5xx. Easier: just confirm the panel populates with Requests traffic before chasing the 5xx case. |
| TLS days to expiry – IMAP 993 | No injection needed – once the new blackbox-tls job runs, probe_ssl_earliest_cert_expiry{instance=~".*:993"} populates within one scrape (30 s). If still empty after 5 min, check that the blackbox-tls target is up and that the cert chain returned by port 993 is parseable. |
| ECS RunningTaskCount | No injection needed once the namespace fix lands – Container Insights reports running-task counts per service every minute regardless of activity. |

If any of the synthetic triggers above produces CloudWatch data (visible in the AWS Console under Metrics, or via the aws cloudwatch get-metric-statistics template above) but Grafana still shows no data, the pipeline is broken between cloudwatch_exporter and Prometheus, not at CloudWatch. Re-run the §1 commands above to localize the gap.

Logs: CloudWatch metric filters

Cabalmail stays on CloudWatch Logs rather than self-hosting Loki. Log volume is small enough that CloudWatch’s per-GB cost is negligible, and we don’t need cross-tier log correlation in real time. Loki would add another stateful ECS service with EFS-backed chunk storage that grows monotonically; the maintenance cost outweighs the benefit until either log volume or cross-tier search frequency becomes painful.

Log-derived metrics ship as CloudWatch metric filters defined in terraform/infra/modules/monitoring/log_metrics.tf:

| Filter | Log group(s) | Pattern | Metric (in Cabalmail/Logs) |
| --- | --- | --- | --- |
| cabal-sendmail-deferred-{tier} | /ecs/cabal-imap, /ecs/cabal-smtp-in, /ecs/cabal-smtp-out | "stat=Deferred" | SendmailDeferred |
| cabal-sendmail-bounced-{tier} | same three | "dsn=5" | SendmailBounced |
| cabal-imap-auth-failures | /ecs/cabal-imap | "imap-login" "auth failed" | IMAPAuthFailures |
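
To confirm the filters actually landed after an apply (one log group shown; repeat per group):

aws logs describe-metric-filters --log-group-name /ecs/cabal-imap \
  --query 'metricFilters[*].{name:filterName,pattern:filterPattern}' --output table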

All metrics emit value=1 per matching log line, default 0. CloudWatch aggregates per-minute. cloudwatch_exporter scrapes the Sum statistic and exposes aws_cabalmail_logs_<metric>_sum to Prometheus. Three Prometheus rules in the log-derived group of docker/prometheus/rules/alerts.yml alert on the rates:

| Alert | Threshold | Severity | Runbook |
| --- | --- | --- | --- |
| SendmailDeferredSpike | >10 deferreds/10 min, sustained 15 min | warning | sendmail-deferred-spike.md |
| SendmailBouncedSpike | >15 bounces/30 min, sustained 15 min | critical | sendmail-bounced-spike.md |
| IMAPAuthFailureSpike | >25 auth-fails/5 min, sustained 5 min | warning | imap-auth-failure-spike.md |

These thresholds are starting points. Expect them to move once we see what real traffic looks like; record the rationale in the alert’s GitHub issue per the tuning discipline in the design doc.

fail2ban metrics are intentionally not part of this set. [program:fail2ban] is currently commented out in every mail-tier supervisord.conf. A metric filter today would publish flat-zero forever and mask the disabled state. Add the filter when fail2ban is re-enabled.

Cognito post-confirmation Lambda errors are caught by the existing LambdaErrors rule (its function_name regex is cabal-.+|assign_osid, so the post-confirmation Lambda’s invocation errors fire it without a separate log filter).

Adding new heartbeat checks

To add a new Healthchecks check via IaC:

  1. Edit lambda/api/healthchecks_iac/config.py – add an entry with name, kind, timeout, grace, desc, tags, and ssm_param.
  2. Add a matching SSM parameter to monitoring/ssm.tf local.heartbeat_jobs and reference it from the consumer (Lambda env var, ECS secrets, etc.).
  3. If the check needs a runbook (most do), add a markdown file under docs/operations/runbooks/ and update the static _RUNBOOK_MAP in alert_sink/function.py so the push includes a tappable link.
  4. Open a PR. CI runs the lambda-api job in app.yml (rebuilds the IaC Lambda zip and the alert_sink zip), then infra.yml (applies and re-invokes the IaC Lambda since the source_code_hash changed).
  5. Confirm the new check appears in the Healthchecks dashboard (a CLI spot-check follows this list). Assign the existing Webhook integration to it (still manual; the v3 API doesn’t expose channel CRUD).
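
The spot-check mentioned in item 5: re-invoke the IaC Lambda and confirm the count went up (step 13 reported reconciled: 6 for the initial set, so one new entry should read 7):

aws lambda invoke --function-name cabal-healthchecks-iac /tmp/out.json && cat /tmp/out.json
# expect "reconciled":7 and the new check listed under "checks"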

Disabling the stack

Set TF_VAR_MONITORING=false in the GitHub environment and re-run Terraform. The module is gated with count = var.monitoring ? 1 : 0, so the ECS services, ALB, Lambdas, and SSM parameters are destroyed cleanly. The ECR repositories and the Cognito user pool domain persist (they are cheap and not flag-gated).

Note on EFS state: destroying the stack leaves the /uptime-kuma, /ntfy, /healthchecks, /prometheus, /grafana, and /alertmanager directories on the shared EFS. Re-enabling monitoring later will pick up the existing state, preserving Kuma monitors, Healthchecks checks (which the IaC Lambda will reconcile against), and Prometheus retention. Remove the directories manually from any running mail-tier container if you want a clean start.

Disabling individual heartbeats

To silence one heartbeat without disabling the entire monitoring stack: pause the corresponding check in the Healthchecks UI, or set its SSM parameter back to a value that does not start with http (e.g. aws ssm put-parameter --overwrite --type SecureString --name /cabal/healthcheck_ping_dmarc_ingest --value 'paused'). Consumer code skips the ping when the value is not an HTTP(S) URL, and Healthchecks stops expecting pings while the check is paused.

The IaC Lambda will not overwrite a paused value: its update flow only writes ping URLs back to SSM when the Healthchecks API returns one, and pausing a check in Healthchecks does not change its URL.
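
The consumer-side guard looks roughly like this – a sketch of the pattern, not the actual reconfigure.sh code:

# Ping only when the SSM value is a real URL; 'paused' (or any non-URL) skips.
PING_URL=$(aws ssm get-parameter --name /cabal/healthcheck_ping_dmarc_ingest \
  --with-decryption --query Parameter.Value --output text)
case "$PING_URL" in
  http*) curl -fsS --max-time 10 "$PING_URL" || true ;;
  *)     : ;;
esac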

Secret rotation

To rotate the webhook shared secret:

  1. Generate a new value: openssl rand -base64 36 | tr -d '='.
  2. Put it into SSM: aws ssm put-parameter --name /cabal/alert_sink_secret --type SecureString --overwrite --value '<new-value>'.
  3. Update the X-Alert-Secret header on every Kuma webhook provider and the Healthchecks integration headers.
  4. Trigger a test notification from Kuma to confirm.

To rotate the ntfy publisher token: run ntfy token del <old-token> and ntfy token add admin inside the container, then update /cabal/ntfy_publisher_token.

To rotate the Pushover app token: create a new application on pushover.net, update /cabal/pushover_app_token, delete the old application.

To rotate the Healthchecks API key: in the UI, Project Settings -> API Access, revoke the old key and create a new one. Update /cabal/healthchecks_api_key. The IaC Lambda picks up the new value on next invocation.

The Terraform ignore_changes = [value] lifecycle on each SSM parameter means subsequent terraform apply runs will not revert your rotated value.

Troubleshooting

Notes below are lessons from the actual deploy. Each is also reflected in code; this list is for future readers and re-deployers.

ALB and Cognito

EFS access points

Cloud Map service discovery

Container images and ECS

Healthchecks task

IaC Lambda and heartbeat consumers

alert_sink and Alertmanager

Terraform and resource-creation gotchas