Host your own email and enhance your privacy
Cabalmail’s backups, audit trails, and DNS-integrity story have grown additively. They are not absent — DynamoDB tables have PITR (most of them), EFS and DynamoDB are in an AWS Backup plan, S3 versioning is on the React-bundle and message-cache buckets — but the posture stops well short of “could survive a determined ransomware incident or an accidental admin destroy.” This plan closes the headline gaps without proposing a multi-region active-active rebuild (which is its own initiative).
Five themes:
aws_backup_vault_lock_configuration. An admin who can call aws backup delete-backup-vault (which our deploy IAM principal can) can wipe the entire backup history. Backups are single-region and single-account: a regional outage or account compromise loses everything.cabal-counter table has neither PITR nor SSE explicitly.terraform/dns) nor the mail-domain zones (terraform/infra/modules/domains) have DNSSEC. Mail-flow hijack via DNS spoofing — including downgrade of DMARC, MX, or BIMI records — has no on-protocol detection beyond client-side TLS at the SMTP layer.TLSv1.2_2021, two AWS-published policies behind current.The plan ships in five phases. Phase 1 (backup hardening) is the highest-leverage because it raises the floor for every other failure mode. Phase 4 (DNSSEC) is the most operationally sensitive — DS records have to be added at the registrar, and a botched DS record is a self-inflicted DNS outage. Phase 4 goes last accordingly.
deletion_protection = true.TLSv1.2_2023 or newer at the time of the PR. Both distributions front-door and admin keep functioning behaviour-identically.state-encryption-plan.md. Application buckets are covered situationally; full SSE-CMK uplift is its own posture decision.terraform/infra/modules/backup/main.tf:5-10:
resource "aws_backup_vault" "backup" {
name = "cabal-backup"
lifecycle {
prevent_destroy = false
}
}
No aws_backup_vault_lock_configuration. The prevent_destroy = false setting allows Terraform to destroy the vault on apply if var.backup flips from true to false. The module docstring claims it “enforces prevent_destroy”; the code disagrees — a known doc-drift.
terraform/infra/modules/backup/main.tf:35-42:
resource "aws_backup_plan" "backup" {
rule {
rule_name = "cabal-backup-plan-rule"
target_vault_name = aws_backup_vault.backup.name
schedule = "cron(0 0 * * ? *)"
}
}
No copy_action. No lifecycle (no defined retention or transition-to-cold-storage). The Backup Plan defaults to “keep forever,” which is fine until billing is the issue or compliance demands deletion.
Audit-relevant tables and their posture:
| Table | Defined in | PITR | SSE | Deletion protection |
|---|---|---|---|---|
cabal-addresses |
terraform/infra/modules/table/main.tf:6 |
✓ | ✓ (default key) | ✗ |
cabal-user-preferences |
terraform/infra/modules/table/main.tf:30 |
✓ | ✓ (default key) | ✗ |
cabal-user-domain-access |
terraform/infra/modules/table/main.tf:58 |
✓ | ✓ (default key) | ✗ |
cabal-counter |
terraform/infra/modules/user_pool/counter.tf:135 |
✗ | ✗ (default-on bucket-level) | ✗ |
cabal-dmarc-reports |
terraform/infra/modules/app/dmarc.tf |
(verify) | (verify) | ✗ |
The counter table is the standout: it stores the auto-incrementing OS user ID. Wiping it would mean new signups collide with existing UIDs (because the counter restarts) and existing users lose mailbox association. PITR-restore is the only realistic recovery; without PITR, the entire user-pool→counter mapping has to be rebuilt by scanning Cognito attributes.
terraform/infra/modules/elb/main.tf:5-14:
resource "aws_lb" "mail" {
name = "cabal-mail"
internal = false
load_balancer_type = "network"
...
}
No access_logs { ... } block. Incident response for SMTP-IN abuse (spam delivery, brute-force) relies on container log lines, which are downstream of any NAT/proxy and lossy under pressure.
API Gateway access logs are configured in terraform/infra/modules/app/main.tf and write to CloudWatch — that part is fine.
resource "aws_route53_zone" "control" {
name = var.control_domain
}
No DNSSEC. No KSK resource (aws_route53_key_signing_key), no aws_route53_hosted_zone_dnssec.
terraform/infra/modules/domains/main.tf:5-10:
resource "aws_route53_zone" "mail" {
for_each = toset(var.mail_domains)
name = each.value
force_destroy = true
}
Same — no DNSSEC. The force_destroy = true allows terraform destroy to nuke a mail zone with active records. A misclick on a destroy workflow loses the entire DNS history for that apex.
terraform/infra/modules/app/cloudfront.tf:76 and terraform/infra/modules/front_door/main.tf:116: both set minimum_protocol_version = "TLSv1.2_2021". AWS has released TLSv1.2_2023 since.
terraform/infra/modules/s3/main.tf:20-22 and terraform/infra/modules/app/cloudfront.tf:8-10: both use aws_cloudfront_origin_access_identity (OAI), the legacy mechanism. AWS recommends aws_cloudfront_origin_access_control (OAC), which supports SSE-KMS and avoids exposing the canonical-user shape in bucket policies.
terraform/infra/modules/backup/main.tf gains:
resource "aws_backup_vault" "backup" {
name = "cabal-backup"
lifecycle {
prevent_destroy = true
}
}
resource "aws_backup_vault_lock_configuration" "backup" {
backup_vault_name = aws_backup_vault.backup.name
min_retention_days = 30
max_retention_days = 365
# changeable_for_days NOT set -> governance mode (admin can disable with one
# extra API call, leaves audit trail). Compliance mode is one-way; we do
# not need compliance mode for our threat model.
}
resource "aws_backup_vault" "backup_dr" {
provider = aws.dr_region # second region, configured at root
name = "cabal-backup-dr"
lifecycle {
prevent_destroy = true
}
}
resource "aws_backup_vault_lock_configuration" "backup_dr" {
provider = aws.dr_region
backup_vault_name = aws_backup_vault.backup_dr.name
min_retention_days = 30
max_retention_days = 365
}
resource "aws_backup_plan" "backup" {
...
rule {
rule_name = "cabal-backup-plan-rule"
target_vault_name = aws_backup_vault.backup.name
schedule = "cron(0 0 * * ? *)"
lifecycle {
cold_storage_after = 30
delete_after = 365
}
copy_action {
destination_vault_arn = aws_backup_vault.backup_dr.arn
lifecycle {
delete_after = 365
}
}
}
}
The DR-region provider alias is added in terraform/infra/main.tf:
provider "aws" {
alias = "dr_region"
region = var.dr_region # defaults to "us-west-2" if primary is "us-east-1"
}
Governance mode means the deploy principal can call DisableVaultLock if needed (e.g., to legitimately destroy a dev environment), but the action shows up in CloudTrail with a 24-hour delay before deletion is allowed. That delay is the ransomware mitigation we are buying.
The fix to the docstring drift: update the module docstring to match the prevent_destroy = true reality.
Cross-account copy (the strongest ransomware mitigation) is deferred to a follow-up; setting up the second-account vault and its trust policy is a separate Phase 1.5 that depends on the cross-account IAM work flagged in identity-iam-hardening-plan.md Non-goals.
terraform/infra/modules/user_pool/counter.tf:135:
resource "aws_dynamodb_table" "counter" {
name = "cabal-counter"
billing_mode = "PAY_PER_REQUEST"
hash_key = "counter"
deletion_protection_enabled = true
attribute {
name = "counter"
type = "S"
}
server_side_encryption {
enabled = true
}
point_in_time_recovery {
enabled = true
}
}
terraform/infra/modules/table/main.tf and terraform/infra/modules/app/dmarc.tf: deletion_protection_enabled = true on every table.
This is a small change but has an important rollback caveat: once deletion_protection_enabled = true, a Terraform plan to destroy the table fails until protection is disabled in a prior apply. That is the point.
A small new S3 bucket for NLB logs:
resource "aws_s3_bucket" "nlb_access_logs" {
bucket = "cabal-nlb-access-logs-${data.aws_caller_identity.current.account_id}"
}
resource "aws_s3_bucket_versioning" "nlb_access_logs" {
bucket = aws_s3_bucket.nlb_access_logs.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_lifecycle_configuration" "nlb_access_logs" {
bucket = aws_s3_bucket.nlb_access_logs.id
rule {
id = "expire-old"
status = "Enabled"
expiration {
days = 180
}
noncurrent_version_expiration {
noncurrent_days = 30
}
abort_incomplete_multipart_upload {
days_after_initiation = 7
}
}
}
resource "aws_s3_bucket_public_access_block" "nlb_access_logs" {
bucket = aws_s3_bucket.nlb_access_logs.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_policy" "nlb_access_logs" {
bucket = aws_s3_bucket.nlb_access_logs.id
policy = data.aws_iam_policy_document.nlb_logs_policy.json
}
The bucket policy grants s3:PutObject to the regional ELB account ID (AWS publishes a per-region list; encode in a local).
terraform/infra/modules/elb/main.tf:5-14:
resource "aws_lb" "mail" {
name = "cabal-mail"
internal = false
load_balancer_type = "network"
...
access_logs {
bucket = aws_s3_bucket.nlb_access_logs.id
enabled = true
prefix = "mail-nlb"
}
}
NLB access logs land in S3 as gzipped log lines per-AZ-per-5-minute. Athena view over the bucket (created in Terraform) makes them queryable: SELECT timestamp, client_ip, listener_port, action FROM mail_nlb_logs WHERE ....
resource "aws_kms_key" "dnssec" {
customer_master_key_spec = "ECC_NIST_P256"
deletion_window_in_days = 7
key_usage = "SIGN_VERIFY"
policy = data.aws_iam_policy_document.dnssec_key_policy.json
}
resource "aws_kms_alias" "dnssec" {
name = "alias/cabal-dnssec-${var.environment}"
target_key_id = aws_kms_key.dnssec.id
}
resource "aws_route53_key_signing_key" "control" {
hosted_zone_id = aws_route53_zone.control.id
key_management_service_arn = aws_kms_key.dnssec.arn
name = "cabal-${var.environment}-control"
}
resource "aws_route53_hosted_zone_dnssec" "control" {
hosted_zone_id = aws_route53_zone.control.id
depends_on = [aws_route53_key_signing_key.control]
}
Same pattern for each mail-domain zone in terraform/infra/modules/domains/main.tf. The KSK is per-zone; the KMS key can be shared per environment.
force_destroy on the mail zones drops to false. The operator workflow for retiring a mail apex becomes “first delete records, then terraform destroy of just that zone.” The claude.md runbook captures the procedure.
Registrar DS records: after aws_route53_hosted_zone_dnssec activates, Route 53 publishes a DS-record value (ds_record) as an output. The operator adds that DS record at the domain registrar. There is no Terraform-AWS-only path for this — the registrar is out-of-band.
Rollout sequence per zone (cribbed from AWS’s documented DNSSEC enablement procedure):
aws_route53_hosted_zone_dnssec which sets signing_status = "SIGNING". Zone now serves signed records.The Terraform-driven sequence has to be two applies with manual operator action in between. The runbook captures this; the plan deliberately splits Phase 4 into Phase 4a (KSK creation, no signing) and Phase 4b (signing activation) accordingly.
KSK rotation cadence: yearly per AWS best practice. The KSK is a KMS-backed key, so rotation is mechanical; the DS record at the registrar has to update too. Document the rotation procedure in docs/operations.md.
terraform/infra/modules/app/cloudfront.tf and terraform/infra/modules/front_door/main.tf:
resource "aws_cloudfront_origin_access_control" "admin" {
name = "cabal-admin-oac"
description = "OAC for the admin app bucket"
origin_access_control_origin_type = "s3"
signing_behavior = "always"
signing_protocol = "sigv4"
}
resource "aws_cloudfront_distribution" "admin" {
...
origin {
domain_name = aws_s3_bucket.admin.bucket_regional_domain_name
origin_id = "admin-s3"
origin_access_control_id = aws_cloudfront_origin_access_control.admin.id
# remove `s3_origin_config { origin_access_identity = ... }`
}
viewer_certificate {
acm_certificate_arn = ...
ssl_support_method = "sni-only"
minimum_protocol_version = "TLSv1.2_2023" # was 2021
}
}
S3 bucket policy on the admin bucket replaces the OAI canonical-user grant with an OAC SourceArn condition:
data "aws_iam_policy_document" "admin_bucket" {
statement {
actions = ["s3:GetObject"]
resources = ["${aws_s3_bucket.admin.arn}/*"]
principals {
type = "Service"
identifiers = ["cloudfront.amazonaws.com"]
}
condition {
test = "StringEquals"
variable = "AWS:SourceArn"
values = [aws_cloudfront_distribution.admin.arn]
}
}
}
The migration is per-distribution. The OAI resources stay until the new bucket policy + OAC reference is in place; remove OAI in a follow-on apply once verified.
| Phase | Scope | Reversible | Risk |
|---|---|---|---|
| 1 | Backup vault lock (governance mode), cross-region copy, plan retention | Yes (vault lock can be disabled with 24 h delay) | Low. Additive. New DR-region resources are cheap. |
| 2 | DynamoDB PITR + SSE + deletion protection on counter; deletion_protection on others | Yes (flip the flag back, redeploy) | Low. |
| 3 | NLB access logs + S3 bucket | Yes | Low. Adds storage cost (small). |
| 4a | DNSSEC: KSK creation (no signing) | Yes (delete KSK resource before signing activation) | Low. |
| 4b | DNSSEC: signing activation (after registrar DS record published) | Risky (zone has to roll back through “disabled” state with another delay) | High. Botched DS record = self-inflicted DNS outage. Test in dev. Practice once before stage. Schedule outside production-critical windows. |
| 5 | CloudFront OAC + TLS policy bump | Yes (revert OAI grant, restore policy) | Medium. Brief blip on the distribution-config rollout (Cloudfront propagation is 5-15 minutes). Schedule during low-traffic window. |
Per-environment ordering: dev → stage → prod for each phase. Phase 4 is the longest because of the wait-for-DS-propagation interlock.
copy_action. Drop the DR-region resources.deletion_protection_enabled = false on each table; the PITR/SSE flags can stay (they are no-cost).access_logs block on the NLB. The bucket can stay (cheap).aws_route53_key_signing_key. Zone is unchanged.signing_status = "INACTIVE" via aws_route53_hosted_zone_dnssec. Wait 24 hours for the previously-signed responses to age out of resolver caches. Then delete the DS record at the registrar. Then delete the KSK. Out-of-order rollback breaks resolution..github/workflows/infra.yml consumes a new dr_region provider; Terraform plan output gains the cross-region resources.terraform/infra/main.tf adds the aws provider with alias = "dr_region".var.dr_region (default "us-west-2").var.dnssec_enabled (default true once Phase 4 lands; false during the transitional period to allow per-environment opt-in).docs/operations.md.aws backup describe-backup-vault --backup-vault-name cabal-backup returns LockDate: <future timestamp> and MinRetentionDays: 30.aws backup list-recovery-points-by-backup-vault --backup-vault-name cabal-backup-dr --region <dr_region> shows recovery points from the last 24 hours.terraform destroy against the prod environment refuses to drop aws_dynamodb_table.counter until deletion_protection_enabled = false is set in code.aws s3 ls s3://cabal-nlb-access-logs-<account>/mail-nlb/AWSLogs/<account>/elasticloadbalancing/<region>/<yyyy>/<mm>/<dd>/ shows gzipped log objects from the last hour.dig +dnssec @<resolver> mail-admin.<first-apex> shows the AD flag set and an RRSIG record alongside the answer.whois <control-apex> at the registrar shows a DS record matching aws_route53_key_signing_key.control.ds_record.aws cloudfront get-distribution-config --id <admin-dist-id> shows OriginAccessControlId set and no OriginAccessIdentity. MinimumProtocolVersion: TLSv1.2_2023.curl --tlsv1.2 https://admin.<control-domain> succeeds; curl --tlsv1.1 fails on protocol negotiation.identity-iam-hardening-plan.md. Schedule as Phase 1.5 once that work lands.aws_kms_key.dnssec) is supported broadly; RSA-SHA-256 is more universal but produces larger signatures. Recommendation: stick with ECDSA. Revisit only if we observe resolver-side compatibility issues.TLSv1.2_2023 includes TLS 1.2 ciphers; a stricter TLSv1.3 floor would lock out older clients. The webmail user base is modern browsers; the mail-protocol traffic does not go through CloudFront. Plausible to raise the floor in a follow-up.