Host your own email and enhance your privacy
The 0.7.0 release adds an optional monitoring stack on top of the existing mail infrastructure: black-box uptime monitoring, heartbeat monitoring for scheduled jobs, a Prometheus / Alertmanager / Grafana metrics stack, log-derived alerts via CloudWatch metric filters, and a runbook for every alert that can fire. All of it routes through a push-notification path (Pushover + self-hosted ntfy) that bypasses Cabalmail’s own email tier, so the operator stays reachable during a mail outage.
See docs/0.7.0/monitoring-plan.md for the design rationale. This document is the operator’s runbook for enabling the stack and completing first-boot configuration. All steps are required unless explicitly marked optional.
The stack is disabled by default. When enabled it deploys:
- Uptime Kuma at https://uptime.<control-domain>/ behind Cognito login.
- ntfy at https://ntfy.<control-domain>/ with token auth enforced by ntfy itself (the ALB does not gate this hostname).
- alert_sink Lambda – a webhook sink fronted by a Lambda Function URL. Callers authenticate with a shared secret. critical severity fans out to Pushover (priority 1) and ntfy (priority 5); warning goes to ntfy (priority 3); info is dropped.
- Healthchecks at https://heartbeat.<control-domain>/ behind Cognito.
- backup_heartbeat Lambda – invoked by an EventBridge rule on AWS Backup JOB_COMPLETED events; pings the corresponding Healthchecks check.
- cabal-healthchecks-iac Lambda – reconciles Healthchecks check definitions from lambda/api/healthchecks_iac/config.py and populates the /cabal/healthcheck_ping_* SSM parameters. Auto-invokes when the config changes.
- Grafana at https://metrics.<control-domain>/ behind Cognito. Prometheus and Alertmanager have no public surface; reach them via Grafana’s data-source proxy or aws ecs execute-command.
- cloudwatch_exporter, blackbox_exporter, node_exporter – Prometheus exporters. The first two are single-task ECS services; node_exporter is a DAEMON service (one task per cluster instance).
- Log-derived CloudWatch metrics in the Cabalmail/Logs namespace. cloudwatch_exporter scrapes these and Prometheus alerts on the rates (sendmail deferred, sendmail bounced, IMAP auth failures).

Pushover is the “wake someone up” channel – priority-1 pushes bypass Do Not Disturb on iOS and Android. It is paid: $5 one-time per mobile platform you intend to receive alerts on, after a 30-day trial.
In Pushover, register an application named cabalmail-alerts, type Application. Accept the terms. You’ll get an API Token/Key.

The monitoring stack is gated by var.monitoring. Set it to true only in the environments where you want it on (prod always; stage/dev only while actively testing).
In your GitHub repository settings, go to Settings -> Environments -> <environment> -> Variables and add:

| Variable | Example value | Notes |
|---|---|---|
| TF_VAR_MONITORING | true | Gates the whole stack. Set to true in prod; leave as false (or unset) elsewhere. |
| TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN | false | Controls whether the Healthchecks signup form accepts new accounts. Defaults to false (closed) when unset; flip to true for the bootstrap signup in step 11, then back to false. Has no effect when TF_VAR_MONITORING=false. |
Prometheus, Alertmanager, Grafana, the three Prometheus exporters, Uptime Kuma, ntfy, and Healthchecks all ship as ECR images built by the docker job in .github/workflows/app.yml. The first time you toggle TF_VAR_MONITORING=true after this release lands, run Build and Deploy Application with areas: docker first – this populates the new ECR repositories with sha-<first-8> tags. Then run the Terraform workflow.
If you flip TF_VAR_MONITORING to true without the images present, ECS keeps the new services in pending state until the images appear; nothing else breaks, but no progress is made until the build runs.
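To confirm the images landed before (or after) flipping the flag, list the tags in one of the new repositories; the sha-<first-8> tags should be present. cabal-prometheus here is just an example, any of the nine repositories works:

```
aws ecr describe-images --repository-name cabal-prometheus \
  --query 'imageDetails[].imageTags[]' --output text
```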
Kick off the Build and Deploy Infrastructure workflow (same process as in setup.md). The apply creates:
- ECR repositories: cabal-uptime-kuma, cabal-ntfy, cabal-healthchecks, cabal-prometheus, cabal-alertmanager, cabal-grafana, cabal-cloudwatch-exporter, cabal-blackbox-exporter, cabal-node-exporter (always, regardless of the flag, so app.yml’s docker matrix can push unconditionally).
- SecureString parameters with ignore_changes = [value] so out-of-band rotation sticks: /cabal/alert_sink_secret (auto-generated), /cabal/pushover_user_key, /cabal/pushover_app_token, /cabal/ntfy_publisher_token, /cabal/healthchecks_api_key, six /cabal/healthcheck_ping_* placeholders, /cabal/grafana_admin_password (auto-generated), and /cabal/healthchecks_secret_key (auto-generated Django secret).
- The alert_sink Lambda with a Function URL.
- ALB listener rules for uptime.<control-domain> (Cognito), ntfy.<control-domain> (no ALB auth; ntfy enforces token auth), heartbeat.<control-domain> (Cognito), and metrics.<control-domain> (Cognito).
- A Cloud Map namespace cabal-monitoring.cabal.internal with services for prometheus, alertmanager, grafana, cloudwatch-exporter, blackbox-exporter, node-exporter, and healthchecks (used by the IaC Lambda).
- The backup_heartbeat Lambda + EventBridge rule.
- The cabal-healthchecks-iac Lambda in private subnets.
- CloudWatch metric filters publishing to Cabalmail/Logs.
- DNS records for uptime.<control-domain>, ntfy.<control-domain>, heartbeat.<control-domain>, and metrics.<control-domain> (in both the public zone and the VPC private zone, since the private zone shadows the public zone for the control domain).

Note the Terraform output alert_sink_function_url – you will need it in step 9 and step 14.
```
aws ssm put-parameter --name /cabal/pushover_user_key --type SecureString --overwrite --value '<user-key-from-step-1>'
aws ssm put-parameter --name /cabal/pushover_app_token --type SecureString --overwrite --value '<app-token-from-step-1>'
```
Terraform won’t touch these on subsequent applies (ignore_changes = [value]).
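If you want to prove both values work before any alert fires, push a test message straight at Pushover’s public message API; this is plain Pushover, nothing Cabalmail-specific:

```
# Read both keys back out of SSM and send a test push.
TOKEN=$(aws ssm get-parameter --name /cabal/pushover_app_token --with-decryption --query Parameter.Value --output text)
USER=$(aws ssm get-parameter --name /cabal/pushover_user_key --with-decryption --query Parameter.Value --output text)
curl -fsS --form-string "token=$TOKEN" --form-string "user=$USER" \
  --form-string "message=Cabalmail monitoring bootstrap test" \
  https://api.pushover.net/1/messages.json
```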
ntfy ships with NTFY_AUTH_DEFAULT_ACCESS=deny-all; nobody can read or write until you create an admin. Do it once via ECS Exec.
```
CLUSTER=<cluster-name>
TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name cabal-ntfy --query 'taskArns[0]' --output text)
aws ecs execute-command --cluster "$CLUSTER" --task "$TASK" --container ntfy --interactive --command "/bin/sh"
# inside the container:
ntfy user add --role=admin admin
ntfy token add admin
```
Copy the tk_... token it prints.
```
aws ssm put-parameter --name /cabal/ntfy_publisher_token --type SecureString --overwrite --value 'tk_...'
```
The alert_sink Lambda caches secrets at cold start, so the next push after you set the secret triggers a re-fetch automatically.
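To verify the token end-to-end, publish a test message to the alerts topic by hand. This is ntfy’s standard publish API, independent of the Lambda:

```
curl -fsS -H "Authorization: Bearer tk_..." \
  -d "publisher token test" \
  https://ntfy.<control-domain>/alerts
```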
- Sign in to https://ntfy.<control-domain> in the ntfy mobile app with username admin and the password from step 6.
- Subscribe to the topic alerts on https://ntfy.<control-domain>. The app shows 0 messages until the first alert fires.

Uptime Kuma ships without any admin user; the first person to hit the UI creates one.
Open https://uptime.<control-domain>/ in a browser. You will be redirected to the Cognito hosted UI to sign in; create the Kuma admin account when prompted.

In Kuma, add a new Notification provider of type Webhook:

- Post URL: the alert_sink_function_url Terraform output (the Lambda Function URL, e.g. https://abc123.lambda-url.us-west-1.on.aws/).
- Additional header: X-Alert-Secret: <paste from /cabal/alert_sink_secret>
Retrieve the secret with:
```
aws ssm get-parameter --name /cabal/alert_sink_secret --with-decryption --query Parameter.Value --output text
```
Body template:

```
{
  "summary": "{{ msg }}",
  "severity": "{% if heartbeatJSON.status == 0 %}critical{% else %}info{% endif %}",
  "source": "kuma/{{ monitorJSON.name }}"
}
```
Kuma uses Liquid templating – {{ ... }} for interpolation, {% if %}...{% endif %} for conditionals. Handlebars-style {{#if}} fails with a TokenizationError.
Click Test – you should receive a Pushover push and an ntfy notification within 30 seconds. If either is missing, check the alert_sink CloudWatch log group at /cabal/lambda/alert_sink for per-transport errors.
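You can also exercise the sink without Kuma. A hand-rolled POST that mimics the body shape above (warning severity, so expect ntfy only, per the fan-out rules):

```
URL=<alert_sink_function_url from Terraform output>
SECRET=$(aws ssm get-parameter --name /cabal/alert_sink_secret --with-decryption --query Parameter.Value --output text)
curl -fsS -X POST "$URL" \
  -H 'Content-Type: application/json' \
  -H "X-Alert-Secret: $SECRET" \
  -d '{"summary": "manual webhook test", "severity": "warning", "source": "manual/test"}'
```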
In the Kuma dashboard, add one monitor for each row below. Attach the webhook notification to every monitor. The monitor names must match the keys in the _RUNBOOK_MAP in lambda/api/alert_sink/function.py; renaming a monitor without updating the map silently drops the runbook link from its push.
| Monitor | Type | Target | Interval | Retries |
|---|---|---|---|---|
| IMAP TLS handshake | TCP port | imap.<control-domain>:993 | 60 s | 2 |
| SMTP relay (STARTTLS) | TCP port | smtp-in.<control-domain>:25 | 60 s | 2 |
| Submission (STARTTLS) | TCP port | smtp-out.<control-domain>:587 | 60 s | 2 |
| Submission (implicit TLS) | TCP port | smtp-out.<control-domain>:465 | 60 s | 2 |
| Admin app | HTTP(s) | https://admin.<control-domain>/ | 120 s | 2 |
| API round-trip (/list) | HTTP(s) | https://admin.<control-domain>/prod/list | 5 min | 2 |
| ntfy server health | HTTP(s) | https://ntfy.<control-domain>/v1/health | 120 s | 2 |
| Control-domain cert | Keyword | https://admin.<control-domain>/, keyword: any. Enable Certificate expiration notification: 21 / 7 / 1 days. | 4 h | 2 |
The /list probe needs a valid Cognito JWT. Seed it manually: sign in to the admin app, copy your id_token out of DevTools, and paste it as Authorization: Bearer <token> in the monitor’s headers. Rotate it monthly.
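If your Cognito app client permits USER_PASSWORD_AUTH and has no client secret (both assumptions; check the pool before relying on this), the monthly rotation can come from the CLI instead of DevTools:

```
aws cognito-idp initiate-auth \
  --auth-flow USER_PASSWORD_AUTH \
  --client-id <app-client-id> \
  --auth-parameters USERNAME=<user>,PASSWORD=<password> \
  --query 'AuthenticationResult.IdToken' --output text
```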
https://heartbeat.<control-domain>/ sits behind Cognito. The Cabalmail Cognito user pool is the front door; Healthchecks itself uses its own local accounts (Cognito gates whether you can reach the UI, Healthchecks gates whether you can change checks).
The Healthchecks task is wired to deliver mail through the IMAP tier’s local-delivery sendmail (EMAIL_HOST=imap.cabal.internal, port 25, no TLS, no auth) – see healthchecks.tf. This means magic-link signup and password reset work natively, as long as you sign up with a Cabalmail-hosted address whose mailbox you can read. Mail destined for non-Cabalmail addresses (gmail, etc.) won’t deliver from this Healthchecks instance – it can only relay inbound to itself.
- Set TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN=true and re-run the Terraform workflow. The default is false (closed); flipping to true lets the Healthchecks Sign Up form accept new accounts.
- Choose a signup address whose mailbox you can read (e.g. admin@<one-of-your-mail-domains>). It needs to be a real address in cabal-addresses; if it isn’t, IMAP’s sendmail will TEMPFAIL the magic-link delivery.
- Open https://heartbeat.<control-domain>/ in a browser. Cognito challenges you. Sign in, then complete the Healthchecks signup with the magic link delivered to that mailbox.
- Set TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN=false (or just delete the variable – false is the default) and re-run Terraform.

Fallback if mail delivery doesn’t work (e.g. you want to bootstrap before adding a Cabalmail address, or the IMAP tier is down): create a superuser via ECS Exec and log in with the password form:
```
aws ecs execute-command --cluster <cluster> \
  --task $(aws ecs list-tasks --cluster <cluster> --service-name cabal-healthchecks --query 'taskArns[0]' --output text) \
  --container healthchecks --interactive --command /bin/sh
# inside the container:
cd /opt/healthchecks
# clear any half-created account for the address, then create a fresh superuser
./manage.py shell -c "from django.contrib.auth.models import User; User.objects.filter(email='you@example.com').delete()"
./manage.py createsuperuser
```
Then log in at https://heartbeat.<control-domain>/accounts/login/ using the password field (next to the magic-link button).
The cabal-healthchecks-iac Lambda needs a v3 API key to manage checks programmatically. The API has no endpoint to create keys, so this is a one-time manual step.
In the Healthchecks UI, under Project Settings -> API Access, create an API key for cabal-healthchecks-iac with read-write permissions. Copy the value.

```
aws ssm put-parameter --name /cabal/healthchecks_api_key --type SecureString --overwrite --value '<key-from-step-1>'
```
The auto-invocation of the IaC Lambda at apply time saw the placeholder and returned status: skipped – no error, but no checks were created either. Step 13 forces the reconcile now that the key is real.
```
aws lambda invoke --function-name cabal-healthchecks-iac /tmp/out.json && cat /tmp/out.json
```
Expected output: {"status":"ok","reconciled":6,"failed":0,"extras":[],"checks":[...]}. The Lambda upserts six checks defined in lambda/api/healthchecks_iac/config.py and writes each ping URL into the matching /cabal/healthcheck_ping_* SSM parameter:
| Check name | Schedule | Grace | Pinged by |
|---|---|---|---|
| certbot-renewal | Every 60 days | 24 h | cabal-certbot-renewal Lambda (EventBridge Scheduler). |
| aws-backup | Every 1 day | 6 h | cabal-backup-heartbeat Lambda (EventBridge JOB_COMPLETED). |
| dmarc-ingest | Every 6 hours | 2 h | cabal-process-dmarc Lambda. |
| ecs-reconfigure | Every 30 minutes | 30 m | reconfigure.sh loop in mail-tier containers. |
| cognito-user-sync | Every 30 days | 7 d | assign_osid post-confirmation Lambda. Fires only on user signup. |
| quarterly-review | Every 90 days | 14 d | Manual operator ping (see step 15). |
Consumers cache the ping URL at cold start (Lambdas) or task start (mail-tier containers). After step 13 populates the SSM values, force the consumers to pick them up:
```
# Mail-tier reconfigure loop:
for tier in imap smtp-in smtp-out; do
  aws ecs update-service --cluster <cluster> --service cabal-$tier --force-new-deployment
done

# Lambdas pick up new values on next cold start. Force one to verify:
aws lambda invoke --function-name cabal-certbot-renewal /tmp/out.json
```
The IaC Lambda creates checks but cannot create notification channels (the v3 API doesn’t expose channel CRUD). Create one webhook integration manually and assign it to every check.
In Healthchecks, Integrations -> Add Integration -> Webhook:

- URL: the alert_sink_function_url from Terraform output.
- Method: POST.
- Request headers:

```
Content-Type: application/json
X-Alert-Secret: <value of /cabal/alert_sink_secret>
```

- Body for the “down” notification:

```
{"summary": "Missed heartbeat: $NAME", "severity": "critical", "source": "healthchecks/$NAME"}
```

- Body for the “up” notification:

```
{"summary": "Recovered: $NAME", "severity": "warning", "source": "healthchecks/$NAME"}
```
Then assign the integration to every check from step 13 (toggle the check’s notification list to include the new integration). The source strings – healthchecks/certbot-renewal, healthchecks/aws-backup, etc. – must match the keys in the _RUNBOOK_MAP in alert_sink/function.py; renaming a check without updating the map drops the runbook link.
The quarterly-review check has no automation pinging it on a schedule – the operator pings it manually after walking through the quarterly review (see “Quarterly monitoring review” below). Ping it once now so it starts green, with a 90-day clock:
```
PING_URL=$(aws ssm get-parameter --name /cabal/healthcheck_ping_quarterly_review --with-decryption --query Parameter.Value --output text)
curl -fsS "$PING_URL"
```
Terraform auto-generates a random Grafana admin password on first apply (/cabal/grafana_admin_password, ignore_changes so subsequent applies don’t rotate it). Read it with:
```
aws ssm get-parameter --name /cabal/grafana_admin_password --with-decryption --query Parameter.Value --output text
```
Or set your own:
```
aws ssm put-parameter --name /cabal/grafana_admin_password --type SecureString --overwrite --value '<your-password>'
```
Grafana picks up the value at task start (GF_SECURITY_ADMIN_PASSWORD); a force-new-deployment rolls in any change.
- Open https://metrics.<control-domain>/. Cognito challenges; sign in.
- The provisioned dashboards (Mail Tiers, AWS Services, API Gateway & Lambda, Frontend) are already there. Initial charts will be empty for ~5 min until cloudwatch_exporter has scraped.
- To log in as the Grafana admin, go to /login. The username is admin; the password is the SSM value from step 16.
- The provisioned Prometheus data source points at http://prometheus.cabal-monitoring.cabal.internal:9090. To verify, Connections -> Data sources -> Prometheus -> Test.

Prometheus has no public UI by default. To inspect scrape state:
```
CLUSTER=<cluster-name>
TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name cabal-prometheus --query 'taskArns[0]' --output text)
aws ecs execute-command --cluster "$CLUSTER" --task "$TASK" --container prometheus --interactive --command "/bin/sh"
# inside the container:
wget -qO- http://localhost:9090/api/v1/targets | head
```
Every target listed in prometheus.yml should be health: up. Targets to expect: 1x prometheus self-scrape, 1x alertmanager, 2x cloudwatch-exporter (primary + us-east-1), 5x blackbox probes (1x HTTP + 2x TCP for plaintext/STARTTLS + 2x TLS for implicit-TLS), and 1+x node-exporter (one per cluster EC2 instance).
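A quick tally by health state, using only the busybox tools in the Prometheus image, makes a missing target obvious:

```
wget -qO- http://localhost:9090/api/v1/targets \
  | grep -o '"health":"[a-z]*"' | sort | uniq -c
```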
Verify each of the following:

- https://uptime.<control-domain>/ is unreachable without a Cognito session.
- https://ntfy.<control-domain>/alerts returns 401 without a bearer token.
- https://heartbeat.<control-domain>/ is unreachable without a Cognito session.
- https://metrics.<control-domain>/ is unreachable without a Cognito session.
- aws lambda invoke --function-name cabal-healthchecks-iac /tmp/out.json returns status: ok with reconciled: 6.
- The /cabal/healthcheck_ping_* SSM parameters hold real https://heartbeat.<control-domain>/ping/... URLs (not placeholders).
- The quarterly-review check shows green after step 15.
- Break a heartbeat (e.g. set /cabal/healthcheck_ping_certbot_renewal to a non-http value) and wait past the 24 h grace: you get a Pushover + ntfy push citing healthchecks/certbot-renewal, and the tappable runbook link opens heartbeat-certbot-renewal.md.
- The cloudwatch_exporter, node_exporter, and blackbox_exporter targets are all up in Prometheus.
- aws logs describe-metric-filters --log-group-name /ecs/cabal-imap lists cabal-sendmail-deferred-imap, cabal-sendmail-bounced-imap, and cabal-imap-auth-failures.
- Lower an alert threshold (e.g. EFSBurstCreditsLow to < 100e9) in docker/prometheus/rules/alerts.yml, rebuild + redeploy, and confirm the Alertmanager -> alert_sink chain produces an ntfy push within ~5 min, with a tappable runbook link.

Every alert that can fire a push notification has a runbook in docs/operations/runbooks/. Each runbook follows the same shape: what the alert means, who/what is impacted, the first three things to check, and how to escalate. See the runbook README for the full index.
How the runbook URL reaches your phone:
- Prometheus alerts carry a runbook_url annotation. Alertmanager forwards it as part of its native webhook body; the alert_sink Lambda’s translator surfaces it (_translate_alertmanager) and attaches it to outbound pushes.
- Kuma and Healthchecks events carry no annotation, so the alert_sink Lambda has a static _RUNBOOK_MAP keyed by source (e.g. kuma/IMAP TLS handshake, healthchecks/certbot-renewal). When you add or rename a Kuma monitor or a Healthchecks check, update the keys in lambda/api/alert_sink/function.py to match, or the push will arrive without a runbook link.

When a push includes a runbook URL, you’ll see a tappable link (ntfy’s Click header), opening the runbook in the phone’s browser. The map and the runbook files are version-controlled together; PRs that change one without the other should fail review.
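As an illustration of the mechanism (not the Lambda’s actual code), a runbook-linked push reduces to an ntfy publish with a Click header; the runbook URL below is a placeholder:

```
curl -fsS -H "Authorization: Bearer tk_..." \
  -H "Title: SendmailDeferredSpike" \
  -H "Click: https://<your-repo-host>/docs/operations/runbooks/sendmail-deferred-spike.md" \
  -d "Deferred-mail rate above threshold" \
  https://ntfy.<control-domain>/alerts
```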
Run after each meaningful monitoring change, and again at every quarterly review. If the expected push doesn’t arrive, fix the broken link before treating the tabletop as passing.
| Scenario | How to simulate | Expected page | Expected runbook |
|---|---|---|---|
| Mail queue backup (deferred) | ECS-Exec into the smtp-out task; inject 12 fake stat=Deferred log lines via logger -t sm-mta 'XXX: stat=Deferred' in <1 minute, then wait. | SendmailDeferredSpike (warning ntfy) within ~17 min (10 min window + 15 min for) | sendmail-deferred-spike.md |
| IMAP cert expiring (control-domain) | In dev: re-issue a short-lived cert and wait, or temporarily replace the listener cert with a deliberately near-expiry one. Don’t do this in prod. | BlackboxTLSCertExpiringSoon (warning ntfy) and Kuma’s “Control-domain cert” 21-day notification | cert-expiring.md |
| Certbot Lambda silently disabled | Disable the EventBridge schedule on cabal-certbot-renewal in dev; wait past the 24 h grace | healthchecks/certbot-renewal missed -> critical ntfy + Pushover | heartbeat-certbot-renewal.md |
| Healthchecks IaC drift | Add a check by hand in the Healthchecks UI without adding it to config.py. Re-invoke cabal-healthchecks-iac. | Lambda log line WARNING: extras in Healthchecks not in config.py: [...]. No alert fires (drift is logged, not paged). | (no runbook – drift is operator-cleaned) |
The quarterly-review Healthchecks check pages the operator if 90+ days pass without a manual ping. The check is not automated; nothing pings it on a schedule. The operator pings it after walking through the checklist in heartbeat-quarterly-review.md.
When you’ve finished:
```
PING_URL=$(aws ssm get-parameter --name /cabal/healthcheck_ping_quarterly_review --with-decryption --query Parameter.Value --output text)
curl -fsS "$PING_URL"
```
Some Grafana panels are blank for several minutes after the stack starts; some are blank by design.
- CloudWatch-sourced panels trail by the exporter’s delay_seconds lag (CloudWatch metrics aren’t immediately consistent), so the first datapoint arrives ~3 min after the exporter starts.
- The log-derived panels read the aws_cabalmail_logs_* series. These are alert signals; flat-empty in steady state is what you want.
- CloudFront panels are fed by a second exporter task in us-east-1 (cabal-cloudwatch-exporter-us-east-1), since CloudFront emits metrics only in that region. The us-east-1 task scrapes a tiny CloudFront-only config (config-us-east-1.yml). Same image, separate Cloud Map registration, separate Prometheus scrape job. If the panels stay blank, check that the second task is up in Prometheus targets and that the new Cloud Map service cloudwatch-exporter-us-east-1.cabal-monitoring.cabal.internal resolves.

If a panel is still blank after ~10 min and isn’t in one of the categories above, dig in – start with wget -qO- http://localhost:9090/api/v1/label/__name__/values from inside the Prometheus task to confirm whether the metric series even exists.
When a panel has been blank since deployment – not just for a few minutes – the question is whether the data pipeline (CloudWatch -> cloudwatch_exporter -> Prometheus -> Grafana) is sound, or whether the metric genuinely has no datapoints. These commands cover both directions: confirm pipeline health, then inject synthetic data to make a “should be empty” panel light up briefly.
From inside the Prometheus task (aws ecs execute-command --cluster cabal-mail --task <prom-task-arn> --container prometheus --interactive --command /bin/sh):
```
# All cloudwatch-derived metric names Prometheus has ever seen.
wget -qO- http://localhost:9090/api/v1/label/__name__/values \
  | tr ',' '\n' | grep '^"aws_' | sort

# Scrape target health -- both cloudwatch jobs should be `up`.
wget -qO- http://localhost:9090/api/v1/targets \
  | sed 's/,/\n/g' | grep -E 'job|health|lastError'
```
If aws_apigateway_count_sum is in the list but aws_lambda_duration_average is not, the exporter is reaching CloudWatch but Lambda specifically has no recent invocations to emit. If neither shows up, the cloudwatch-exporter target is down or the IAM/network path to CloudWatch is broken.
From inside the Prometheus task – which can already reach the exporter on its Cloud Map name – you can scrape /metrics directly without exec’ing into the exporter:
```
wget -qO- http://cloudwatch-exporter.cabal-monitoring.cabal.internal:9106/metrics \
  | grep '^aws_' | head -40

wget -qO- http://cloudwatch-exporter-us-east-1.cabal-monitoring.cabal.internal:9106/metrics \
  | grep '^aws_cloudfront' | head -20
```
Empty aws_* block here means the exporter is alive but failing CloudWatch calls. Check the task logs at /ecs/cabal-cloudwatch-exporter (or -us-east-1) for AccessDenied, throttling, or NoSuchKey errors:
```
aws logs tail /ecs/cabal-cloudwatch-exporter --since 30m --follow
```
The cloudwatch-exporter and blackbox-exporter services are also exec-enabled. Note that enable_execute_command only applies to tasks launched after the flag was set, so on a freshly-applied infra you may need to force a service redeploy (aws ecs update-service --cluster cabal-mail --service cabal-cloudwatch-exporter --force-new-deployment) before aws ecs execute-command works against an existing task.
BurstCreditBalance and PercentIOLimit only emit in bursting throughput / generalPurpose performance mode. AWS recently changed the default for new file systems to elastic, which doesn’t emit either. Check:
```
aws efs describe-file-systems \
  --query 'FileSystems[*].{id:FileSystemId,name:Name,throughput:ThroughputMode,perf:PerformanceMode}' \
  --output table
```
If throughput is elastic, the EFS BurstCreditBalance panel will stay empty by design – the new “EFS I/O bytes” panel (added alongside) covers the saturation signal in either mode. If throughput is bursting and the panel is still empty, the cloudwatch-exporter has a real problem reaching AWS/EFS.
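If you specifically want the burst-credit signal in dev, you can flip a file system back to bursting. This is a deliberate operator change, not something the stack manages, and AWS rate-limits throughput-mode changes to one per 24 hours:

```
aws efs update-file-system --file-system-id <fs-id> --throughput-mode bursting
```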
Each of these produces a single datapoint that should appear in Grafana within ~3 min (60s exporter scrape + 120s delay_seconds lag). Confirm the datapoint with a Prometheus query rather than waiting for the panel to refresh – e.g. from inside the Prometheus task, wget -qO- 'http://localhost:9090/api/v1/query?query=aws_lambda_errors_sum' returns the raw series.
Two confusables worth pinning down before reading the table.
The HTTP 200 from aws lambda invoke: that’s the status of the AWS Lambda API call (request accepted), not the function’s success. Function failure shows up as FunctionError: Unhandled in the invoke response JSON and a non-empty errorType in the response payload file. A successful invocation that produced a function-level error still increments the AWS/Lambda Errors metric.
Lambda invoke vs API Gateway: aws lambda invoke calls the Lambda API directly. It does not traverse API Gateway, so it never increments AWS/ApiGateway 5XXError no matter what payload you send. To move that metric you need an actual HTTP request to the API Gateway URL.
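For a request that does traverse API Gateway, hit the deployed stage directly. Without a JWT the Cognito authorizer should reject it with a 401; that increments Count and 4XXError rather than 5XXError, but it proves the AWS/ApiGateway namespace is moving:

```
curl -s -o /dev/null -w '%{http_code}\n' https://admin.<control-domain>/prod/list
```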
| Panel | Synthetic-data trigger |
|---|---|
| Lambda errors | Pick any cabal-* Lambda. They all dereference event['requestContext']['authorizer']['claims']['cognito:username'] at the top of the handler, so an empty payload raises KeyError: 'requestContext' before any try/except. aws lambda invoke --function-name cabal-list --payload '{}' /tmp/out.json – the invoke response will include "FunctionError": "Unhandled" and /tmp/out.json will contain the traceback. That counts as one error. Repeat 3 more times to give the dashboard’s rate(...[5m]) something to integrate. Verify in CloudWatch first if you want to be sure: aws cloudwatch get-metric-statistics --namespace AWS/Lambda --metric-name Errors --dimensions Name=FunctionName,Value=cabal-list --start-time "$(date -u -v-15M +%FT%TZ 2>/dev/null \|\| date -u -d '15 min ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" --period 60 --statistics Sum. |
| Lambda throttles | Lower the function’s reserved concurrency to 0: aws lambda put-function-concurrency --function-name cabal-alert-sink --reserved-concurrent-executions 0. Invoke once: aws lambda invoke --function-name cabal-alert-sink --payload '{}' /tmp/out.json. The invoke returns a throttle error (visible in /tmp/out.json and as FunctionError). Restore immediately with aws lambda delete-function-concurrency --function-name cabal-alert-sink. Don’t try this on a Lambda that’s serving real traffic. |
| Lambda duration p95 | Just invoke any Lambda repeatedly so the dashboard has something to compute p95 against: for i in $(seq 1 5); do aws lambda invoke --function-name cabal-list --payload '{}' /tmp/out.json >/dev/null ; done. Errors are fine here – duration is recorded for both successful and failed invocations. |
| API Gateway 5xx rate | Has to go through API Gateway. The simplest natural way: use the admin app while a backend Lambda is misbehaving. To synthesise: temporarily revoke a Lambda’s IAM access to its data store (e.g. detach AWSLambdaDynamoDBExecutionRole from cabal-list-role), then load the Addresses page in the admin app – the Lambda will fail with a permission error, API Gateway will surface a 5xx, and the metric increments. Re-attach the policy immediately afterwards. If you’d rather not poke IAM, skip this; the Lambda-Errors path above exercises the same exporter code path, so a working Lambda Errors panel implies the API Gateway 5xx panel will also work when there’s organic 5xx traffic. |
| DynamoDB ConsumedRead/WriteCapacityUnits | Read: aws dynamodb scan --table-name cabal-addresses --max-items 1 >/dev/null. Write: any address-creation flow in the admin app, or a direct put-item against a throwaway PK. |
| DynamoDB ThrottledRequests | Hard to trigger on-demand without sustained load. Skip unless you’re explicitly testing throttling behavior. |
| EFS I/O bytes | ECS-Exec into an imap task and dd if=/dev/zero of=/var/spool/mail/canary bs=1M count=10 oflag=direct ; rm /var/spool/mail/canary. The 10 MiB write produces a visible spike on DataWriteIOBytes. |
| CloudFront request count / 5xx | for i in $(seq 1 20); do curl -fsS -o /dev/null https://<control-domain>/ ; done for the request-count panel. CloudFront 5xx is harder; temporarily mis-configure the origin (e.g. block CloudFront’s egress to S3 with a bucket policy deny for ~5 min) to force a real 5xx. Easier: just confirm the panel populates with Requests traffic before chasing the 5xx case. |
| TLS days to expiry – IMAP 993 | No injection needed – once the new blackbox-tls job runs, probe_ssl_earliest_cert_expiry{instance=~".*:993"} populates within one scrape (30 s). If still empty after 5 min, check that the blackbox-tls target is up and that the cert chain returned by port 993 is parseable. |
| ECS RunningTaskCount | No injection needed once the namespace fix lands – Container Insights reports running-task counts per service every minute regardless of activity. |
If any of the synthetic triggers above produces CloudWatch data (visible in the AWS Console under Metrics, or via the aws cloudwatch get-metric-statistics template above) but Grafana still shows no data, the pipeline is broken between cloudwatch_exporter and Prometheus, not at CloudWatch. Re-run the §1 commands above to localize the gap.
Cabalmail stays on CloudWatch Logs rather than self-hosting Loki. Log volume is small enough that CloudWatch’s per-GB cost is negligible, and we don’t need cross-tier log correlation in real time. Loki would add another stateful ECS service with EFS-backed chunk storage that grows monotonically; the maintenance cost outweighs the benefit until either log volume or cross-tier search frequency becomes painful.
Log-derived metrics ship as CloudWatch metric filters defined in terraform/infra/modules/monitoring/log_metrics.tf:
| Filter | Log group(s) | Pattern | Metric (in Cabalmail/Logs) |
|---|---|---|---|
| cabal-sendmail-deferred-{tier} | /ecs/cabal-imap, /ecs/cabal-smtp-in, /ecs/cabal-smtp-out | "stat=Deferred" | SendmailDeferred |
| cabal-sendmail-bounced-{tier} | same three | "dsn=5" | SendmailBounced |
| cabal-imap-auth-failures | /ecs/cabal-imap | "imap-login" "auth failed" | IMAPAuthFailures |
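You can dry-run any of these patterns against sample lines with aws logs test-metric-filter; no log group is touched, and the sample messages below are fabricated:

```
aws logs test-metric-filter \
  --filter-pattern '"stat=Deferred"' \
  --log-event-messages \
    'sm-mta[123]: XXX: to=<a@b.example>, stat=Deferred: Connection timed out' \
    'sm-mta[123]: XXX: to=<a@b.example>, stat=Sent (accepted)'
# Expect exactly one match, for the Deferred line.
```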
All metrics emit value=1 per matching log line, default 0. CloudWatch aggregates per-minute. cloudwatch_exporter scrapes the Sum statistic and exposes aws_cabalmail_logs_<metric>_sum to Prometheus. Three Prometheus rules in the log-derived group of docker/prometheus/rules/alerts.yml alert on the rates:
| Alert | Threshold | Severity | Runbook |
|---|---|---|---|
| SendmailDeferredSpike | >10 deferreds/10 min, sustained 15 min | warning | sendmail-deferred-spike.md |
| SendmailBouncedSpike | >15 bounces/30 min, sustained 15 min | critical | sendmail-bounced-spike.md |
| IMAPAuthFailureSpike | >25 auth-fails/5 min, sustained 5 min | warning | imap-auth-failure-spike.md |
These thresholds are starting points. Expect them to move once we see what real traffic looks like; record the rationale in the alert’s GitHub issue per the tuning discipline in the design doc.
fail2ban metrics are intentionally not part of this set. [program:fail2ban] is currently commented out in every mail-tier supervisord.conf. A metric filter today would publish flat-zero forever and mask the disabled state. Add the filter when fail2ban is re-enabled.
Cognito post-confirmation Lambda errors are caught by the existing LambdaErrors rule (its function_name regex is cabal-.+|assign_osid, so the post-confirmation Lambda’s invocation errors fire it without a separate log filter).
To add a new Healthchecks check via IaC:
- In lambda/api/healthchecks_iac/config.py, add an entry with name, kind, timeout, grace, desc, tags, and ssm_param.
- Add the SSM parameter to local.heartbeat_jobs in monitoring/ssm.tf and reference it from the consumer (Lambda env var, ECS secrets, etc.).
- Write a runbook in docs/operations/runbooks/ and update the static _RUNBOOK_MAP in alert_sink/function.py so the push includes a tappable link.
- Run the lambda-api job in app.yml (rebuilds the IaC Lambda zip and the alert_sink zip), then infra.yml (applies and re-invokes the IaC Lambda since the source_code_hash changed).

To disable the whole stack: set TF_VAR_MONITORING=false in the GitHub environment and re-run Terraform. The module is gated with count = var.monitoring ? 1 : 0, so the ECS services, ALB, Lambdas, and SSM parameters are destroyed cleanly. The ECR repositories and the Cognito user pool domain persist (they are cheap and not flag-gated).
Note on EFS state: destroying the stack leaves the /uptime-kuma, /ntfy, /healthchecks, /prometheus, /grafana, and /alertmanager directories on the shared EFS. Re-enabling monitoring later will pick up the existing state, preserving Kuma monitors, Healthchecks checks (which the IaC Lambda will reconcile against), and Prometheus retention. Remove the directories manually from any running mail-tier container if you want a clean start.
To silence one heartbeat without disabling the entire monitoring stack: pause the corresponding check in the Healthchecks UI, or set its SSM parameter back to a value that does not start with http (e.g. aws ssm put-parameter --overwrite --type SecureString --name /cabal/healthcheck_ping_dmarc_ingest --value 'paused'). Consumer code skips the ping when the value is not an HTTP(S) URL, and Healthchecks stops expecting pings while the check is paused.
The IaC Lambda will not overwrite a pause value: its update flow only writes ping URLs back to SSM when the Healthchecks API returns one, and pause state in Healthchecks does not change the URL.
To rotate the webhook shared secret:

- Generate a new value, e.g. openssl rand -base64 36 | tr -d '='.
- Store it: aws ssm put-parameter --name /cabal/alert_sink_secret --type SecureString --overwrite --value '<new-value>'.
- Update the X-Alert-Secret header on every Kuma webhook provider and the Healthchecks integration headers.

To rotate the ntfy publisher token: run ntfy token del <old-token> and ntfy token add admin inside the container, then update /cabal/ntfy_publisher_token.
To rotate the Pushover app token: create a new application on pushover.net, update /cabal/pushover_app_token, delete the old application.
To rotate the Healthchecks API key: in the UI, Project Settings -> API Access, revoke the old key and create a new one. Update /cabal/healthchecks_api_key. The IaC Lambda picks up the new value on next invocation.
The Terraform ignore_changes = [value] lifecycle on each SSM parameter means subsequent terraform apply runs will not revert your rotated value.
Notes below are lessons from the actual deploy. Each is also reflected in code; this list is for future readers and re-deployers.
- AZ topology: dev and stage have one availability zone each in TF_VAR_AVAILABILITY_ZONES, and the per-AZ cidrsubnet math in the VPC module makes adding a second AZ destructive (every subnet is renumbered). The monitoring stack was deployed directly to prod for that reason.
- The ALB’s authenticate-cognito action calls Cognito’s hosted UI domain to swap the auth code for tokens. Without an HTTPS egress rule on the ALB SG, the call drops and the ALB returns 500 on /oauth2/idpresponse. Egress to 0.0.0.0/0:443 is the minimum.
- With a Function URL at authorization_type = NONE, AWS requires both lambda:InvokeFunctionUrl (auth-layer check) and lambda:InvokeFunction scoped to URL callers via lambda:InvokedViaFunctionUrl=true (execute layer). Missing either returns 403 at the URL gateway. The aws Terraform provider >= 6.28.0 added invoked_via_function_url = true on aws_lambda_permission; earlier versions can’t express this condition declaratively.
- The private zone mirrors admin., uptime., ntfy., heartbeat., metrics. so VPC-internal callers (Kuma probes, the IaC Lambda) can resolve them. Mail-tier hosts (imap., smtp-in., smtp-out.) are intentionally not mirrored – Kuma’s TCP probes for those tiers point at the NLB’s public DNS name directly.
- EFS access points refuse chown. Several upstream images (louislam/uptime-kuma, healthchecks/healthchecks, grafana/grafana) chown a data directory at boot or container creation, which EFS access points refuse regardless of caller. Three patterns work around this:
  - Override entryPoint and user in the task definition so the image starts directly as 1000:1000 without the chown shim (Kuma).
  - Mount the access point at a path that doesn’t exist in the image (/var/local/healthchecks-data for Healthchecks, /grafana-data for Grafana) and override the data-path env var to match. dockerd’s copy-up logic doesn’t trigger when the target directory doesn’t exist in the image.
  - Set user = "1000:1000" on the task definition so writes succeed under the access point’s translated uid (Healthchecks).
- Cloud Map health checks fight terraform apply. AWS deprecated the failure_threshold field on health_check_custom_config and pins it to 1 server-side regardless. An empty health_check_custom_config {} block reads back as drift on every plan and schedules a forced replacement, which fails because the ECS service has live instances registered. Fix in discovery.tf: set failure_threshold = 1 explicitly and add lifecycle { ignore_changes = [health_check_custom_config] }. Without that fix, operators have to manually aws servicediscovery deregister-instance after each apply.
- For non-awsvpc service registries, containerName/containerPort must be specified. The node_exporter DAEMON service uses network_mode = "host". With awsvpc, ECS infers the ENI mapping from the task definition; with host or bridge, service_registries.container_name and container_port must be explicit.
- ECS rejects A records under host networking: “serviceRegistries value is configured to use a type 'A' DNS record, which is not supported when specifying 'host' or 'bridge' for networkMode.” A host can run multiple containers on different ports, so ECS can’t infer the port from an A-record alone. node_exporter’s Cloud Map service registers SRV records instead; the awsvpc-mode services keep type A. Prometheus’s scrape config follows: type: SRV on the node job, type: A everywhere else.
- DAEMON services reject capacity_provider_strategy. Even an inherited cluster default trips the validator. Use launch_type = "EC2" instead – DAEMON places one task per container instance regardless of which capacity provider supplied it.
- cloudwatch_exporter died at boot with NumberFormatException. The Java cloudwatch_exporter takes its config path positionally (<port> <config-path>); the --config.file= flag is a Go/Prometheus convention, so the flag was parsed as the listen port and the JVM crashed at startup. The Dockerfile CMD passes /config/config.yml directly.
- Two Grafana provisioning gotchas. First: if the provisioned data source doesn’t pin a uid, Grafana auto-generates one and the dashboards reference datasource.uid: "prometheus" – the binding silently fails. Second: Grafana 11.x silently rejects provisioned dashboard JSON without a top-level "id": null field. Both fixes are in docker/grafana/provisioning/datasources/prometheus.yml and the dashboard JSONs. aws ecs update-service --cluster <cluster> --service cabal-grafana --force-new-deployment rolls the task and re-resolves.
- If a health-check GET / from the VPC subnet IPs returns HTTP 400 in single-digit ms, Django is rejecting the probe with DisallowedHost. ALB target-group health checks can’t set a custom Host header – they send Host: <target-ip>:<port>, which fails Django’s ALLOWED_HOSTS check. The task definition uses ALLOWED_HOSTS=* for this reason; hostname enforcement is done at the ALB layer (the listener rule for heartbeat.<control-domain> is the only public path to the target group, and the task SG only accepts traffic from the ALB SG).
- You can reach heartbeat.<control-domain> but Cognito redirects loop. If the loop is on first signup specifically, the signup form is closed: set TF_VAR_HEALTHCHECKS_REGISTRATION_OPEN=true in your GitHub environment and re-run Terraform, complete the signup, then flip the variable back to false.
- To force a heartbeat consumer to ping: aws lambda invoke --function-name cabal-certbot-renewal /tmp/out.json for Lambdas; aws ecs update-service --cluster <cluster> --service cabal-imap --force-new-deployment (and the smtp tiers) for the reconfigure heartbeat.
- backup_heartbeat Lambda silent. Confirm var.backup = true in the environment – without the AWS Backup plan, no Backup Job State Change events fire and the EventBridge rule has nothing to invoke. The Lambda existing without the backup plan is harmless but useless.
- cabal-healthchecks-iac returns status: skipped on every invocation. The API key is still the placeholder. Repeat step 12.
- cabal-healthchecks-iac returns status: partial with an error mentioning DNS. The Cloud Map A record for Healthchecks isn’t registered yet. Confirm the cabal-healthchecks ECS service is healthy and registered: aws servicediscovery list-instances --service-id <id> should return at least one instance. If it doesn’t, force a redeploy of the Healthchecks service.
- 403 Forbidden from the alert_sink Lambda. The Lambda accepts both X-Alert-Secret: <secret> and Authorization: Bearer <secret>. Alertmanager’s http_config.authorization sets the Bearer header; if the header arrives with leading whitespace or the secret in the wrong env var, the HMAC compare fails. Check the /cabal/lambda/alert_sink log group and confirm the SSM secret matches what the entrypoint substituted into /etc/alertmanager-rendered/alertmanager.yml.
- A push arrives without its runbook link. Check the runbook_url annotation (if it’s an Alertmanager-routed alert); that the source name in the Kuma webhook body or Healthchecks integration body matches a key in _RUNBOOK_MAP; and that the Lambda log shows the resolved runbook URL on each invocation.
- Kuma’s webhook body template fails with TokenizationError. Kuma uses Liquid templating ({% if %}...{% endif %}), not Handlebars ({{#if}}...{{/if}}).
- We initially suspected the ntfy mobile app couldn’t handle deny-all ACLs in older releases – but in 2.14+ it does. The mobile-app failures we hit while bootstrapping were instead authentication issues (bcrypt truncates passwords at 72 bytes; non-ASCII or trailing-newline copies fail silently). Operationally: keep the admin password short, ASCII, and pasted carefully.
- GitHub Actions masks every occurrence of AWS_REGION in workflow logs, including inside the alert_sink Function URL output. The masked URL with literal *** is unusable; fetch the real URL via aws lambda get-function-url-config --function-name alert_sink from a shell with unmasked region.
- For aws_security_group and aws_security_group_rule, GroupDescription is strict-ASCII at the EC2 API level. Other AWS resources tolerate Unicode (Cloud Map service descriptions, IAM role descriptions, SSM parameter descriptions all accept em-dashes); SG descriptions don’t. terraform validate doesn’t catch this – the restriction is enforced at the EC2 API. CI’s tfsec / checkov won’t catch it either. Keep all SG-related descriptions ASCII-only.
- The domains in TF_VAR_MAIL_DOMAINS are address namespaces only; only the control domain has an ACM cert. Don’t add per-mail-domain cert-expiry monitors.
- cloudwatch_exporter IAM scope. The exporter discovers metrics across all configured namespaces on each scrape, so the task role policy uses wildcard Resource: "*" for cloudwatch:ListMetrics, cloudwatch:GetMetricData, cloudwatch:GetMetricStatistics, and tag:GetResources. Region-mismatched metric scrapes fail silently – the exporter doesn’t error, it just returns no data; confirm the AWS_REGION env var on the task matches the region of the metrics you’re scraping.
- node_exporter daemon tasks don’t start. The daemon-strategy service requires the ECS cluster instance role to allow daemon-strategy task placement (the existing AmazonEC2ContainerServiceforEC2Role covers this) and for the host’s SG to allow inbound TCP 9100 from the Prometheus task SG. The mail-tier aws_security_group.ecs_instance already permits all VPC traffic; if you ever scope it down, add an explicit ingress rule for 9100 from the Prometheus SG, as sketched below.
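Should that scoping ever happen, the rule belongs in Terraform, but the CLI equivalent for a quick test looks like this (both SG IDs are placeholders):

```
aws ec2 authorize-security-group-ingress \
  --group-id <ecs-instance-sg-id> \
  --protocol tcp --port 9100 \
  --source-group <prometheus-task-sg-id>
```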