Host your own email and enhance your privacy
Fired by:
BlackboxTLSCertExpiringSoonWarning (<21 days) and BlackboxTLSCertExpiringSoonCritical (<7 days), both sourced from probe_ssl_earliest_cert_expiry, which is measured on the live TLS handshake against each endpoint the blackbox-tls scrape job targets.

The alert’s instance label identifies which endpoint is expiring. Two distinct certs live behind two distinct endpoints:
| instance | Cert | Termination point | Renewal path |
|---|---|---|---|
| imap.<control-domain>:993 | ACM *.<control-domain> (wildcard) | NLB | AWS auto-renewal (DNS validation) |
| smtp-out.<control-domain>:465 | Let’s Encrypt (per-host) | smtp-out container | cabal-certbot-renewal Lambda |
(There used to be a parallel pair of CertExpiringSoon{Warning,Critical} rules sourced from aws_certificatemanager_days_to_expiry_minimum. cloudwatch_exporter v0.16.0 silently dropped that metric under every configuration we tried; the blackbox path measures the same cert from a more honest place — what’s actually on the wire — so the CloudWatch source was removed. The runbook below covers both renewal pipelines because either alert can fire from a different cause.)
The TLS certificate serving the endpoint named in the alert is approaching expiry. ACM normally auto-renews around T-30; the certbot Lambda runs daily and renews when the remaining days drop below a configured threshold. Fewer than 21 days remaining on either cert means that its renewal pipeline is stuck.
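To see what the probe sees, the expiry can be read straight off the handshake. The following is a minimal sketch, not part of the deployed tooling: the function names are made up for illustration, GNU `date` is assumed for parsing the `notAfter` string, and the thresholds simply mirror the two alert rules above.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of the deployment.

# Print the whole days left on the leaf cert served at host:port.
# Works for implicit-TLS ports (993, 465). Assumes GNU date.
cert_days_left() (
  host="$1"; port="$2"
  not_after=$(openssl s_client -connect "${host}:${port}" -servername "${host}" \
                </dev/null 2>/dev/null \
              | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
  [ -n "${not_after}" ] || { echo "no certificate read from ${host}:${port}" >&2; exit 1; }
  end_epoch=$(date -d "${not_after}" +%s) || exit 1
  echo $(( (end_epoch - $(date +%s)) / 86400 ))
)

# Map days remaining to the severity the alert rules would report
# (<7 days critical, <21 days warning).
severity_for_days() (
  days="$1"
  if [ "${days}" -lt 7 ]; then echo critical
  elif [ "${days}" -lt 21 ]; then echo warning
  else echo ok
  fi
)
```

For example, `severity_for_days "$(cert_days_left imap.<control-domain> 993)"` with the real control domain substituted. Port 587 uses STARTTLS, so checking it would need `-starttls smtp` added to the `s_client` call.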
When a cert actually expires:
- ACM wildcard cert (:993 and everything else fronted by the wildcard): affects the admin app (CloudFront), API Gateway, the monitoring ALB (Kuma, ntfy, Healthchecks, Grafana), and the IMAP listener on the NLB. Mail-domain entries in TF_VAR_MAIL_DOMAINS have no certs by design (they are address namespaces only); only the control domain has one.
- Let’s Encrypt cert (:465/:587): SMTP submission to smtp-out fails the TLS handshake. Outbound delivery via port 25 to peer MXes continues since most peers don’t validate, but customer-side submission stops.

Which pipeline to investigate depends on which endpoint the alert names.
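The routing from instance label to pipeline is mechanical and can be sketched as a small helper. The function name and the `acm` / `letsencrypt` labels are illustrative assumptions; only the port-to-pipeline mapping comes from this runbook.

```shell
#!/bin/sh
# Sketch: pick the renewal pipeline from the alert's instance label.
# Prints "acm", "letsencrypt", or "unknown" (with a non-zero exit status).
pipeline_for_instance() (
  case "$1" in
    *:993)        echo "acm" ;;           # wildcard cert terminated on the NLB
    *:465|*:587)  echo "letsencrypt" ;;   # per-host cert in the smtp-out container
    *)            echo "unknown"; exit 1 ;;
  esac
)
```

Called as `pipeline_for_instance "smtp-out.<control-domain>:465"`, this would print `letsencrypt`, pointing at the certbot Lambda steps rather than the ACM ones.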
instance ends in :993 (ACM cert)

1. Check the certificate’s renewal status:
aws acm describe-certificate --certificate-arn <arn> \
  --query 'Certificate.{status:Status,renewalStatus:RenewalSummary.RenewalStatus,reason:RenewalSummary.RenewalStatusReason,validations:RenewalSummary.DomainValidationOptions}'
RenewalStatusReason usually identifies the issue (CAA_ERROR, DOMAIN_NOT_ALLOWED, DOMAIN_VALIDATION_DENIED, etc.); a missing validation CNAME shows up in the validations output.
2. Compare the _<random>.<control-domain> records in the public hosted zone against the ResourceRecord entries from step 1. Re-add any that are missing.
3. Check the certificate ARN attached to the NLB listener against Certificate.Arn. A mismatch means the listener is pinned to an old ARN.

instance ends in :465 or :587 (Let’s Encrypt cert)

1. When did cabal-certbot-renewal last run? Check the Lambda’s last invocation and any errors:
aws lambda invoke --function-name cabal-certbot-renewal /tmp/out.json && cat /tmp/out.json
aws logs tail /aws/lambda/cabal-certbot-renewal --since 7d
The Lambda runs on a schedule and can also be invoked manually to force a renewal attempt, which is what the invoke command above does.
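2. If the renewal succeeded but the endpoint still serves the old cert, force an smtp-out redeploy so the container picks up the renewed one:
aws ecs update-service --cluster cabal-mail --service cabal-smtp-out --force-new-deployment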
3. Check the Lambda logs for Let’s Encrypt 429 or rateLimited responses; if the limit was hit, wait until the window clears before re-running.

If ACM renewal is stuck and the cert has <14 days remaining:
- Request a replacement cert: aws acm request-certificate --domain-name '*.<control-domain>' --validation-method DNS. This issues a new ARN, so you’ll need to update every place the old one is referenced (CloudFront, ALB listeners, NLB listener). Don’t do this lightly: it is disruptive to live traffic during the cutover.

If the Let’s Encrypt renewal is stuck and the cert has <7 days remaining:
- Run certbot certonly --manual from a workstation, publishing the validation TXT records it asks for, then upload the resulting PEM bundle to wherever cabal-certbot-renewal writes its output (SSM Parameter Store under /cabal/letsencrypt/*), then force an smtp-out redeploy.
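Before uploading a manually issued bundle, it may be worth confirming that the PEM really is a fresh cert and not a stale copy. This is a minimal sketch built on `openssl x509 -checkend`; the function name and the example file name are assumptions, and the check is a suggested extra rather than part of the existing pipeline.

```shell
#!/bin/sh
# Sketch: succeed only if the first cert in a PEM file stays valid
# for at least the given number of days.
pem_valid_for_days() (
  pem="$1"; days="$2"
  # -checkend exits 0 when the cert will NOT expire within the given seconds.
  openssl x509 -in "${pem}" -noout -checkend "$(( days * 86400 ))" >/dev/null
)
```

For example, `pem_valid_for_days fullchain.pem 21 || echo "cert would still alert"` flags a bundle that would leave the warning rule firing after the redeploy.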