Today, Cabalmail’s Terraform state lives in the cabal-tf-backend S3 bucket. The bucket has SSE-S3 enabled at the bucket level (AWS default since 2023), so the state file is encrypted at rest — but any IAM principal with s3:GetObject on the bucket can read fully-decrypted state. This is the standard backend posture, and it has been adequate while the only secrets in state were resource ARNs and IDs.
The 0.7.0 monitoring work surfaced a concrete case where this posture starts to chafe: the alert_sink Lambda needs Pushover credentials and an ntfy publisher token. The Phase 1 implementation works around the issue by writing placeholder values via Terraform and using ignore_changes = [value] so the operator can aws ssm put-parameter the real values out-of-band. That keeps secrets out of state, but at the cost of:
- terraform.tfstate claims a value the operator immediately overwrote.

Terraform 1.10 (Nov 2024) added first-class state and plan file encryption via a top-level encryption block. With KMS-backed encryption enabled, S3 read access alone is no longer sufficient to read secret values from state — the reader also needs kms:Decrypt on the configured key. That changes the calculus enough that we can comfortably manage secrets through Terraform.
This plan migrates both Terraform stacks (terraform/dns, terraform/infra) to encrypted state, then folds the Phase 1 monitoring secrets into the standard pattern.
Goals:

- State and plan files encrypted with per-environment KMS keys, so that reading secret values requires kms:Decrypt on those keys.
- Monitoring secrets managed through Terraform, with no out-of-band aws ssm put-parameter.

Not in scope: terraform.tfvars files generated by CI. They are written to a runner's working directory, not persisted; the protection comes from masking the underlying GitHub secrets, not from file-level encryption.

Current state:

- State storage: the cabal-tf-backend S3 bucket. make-terraform.sh writes a backend block with bucket, key, region only — no encrypt, no kms_key_id, no dynamodb_table (state locking).
- State keys: dev-bootstrap, stage-bootstrap, prod-bootstrap (DNS stack); dev, stage, prod (infra stack).
- Version floor: >= 1.1.2 in module versions.tf files; CI uses hashicorp/setup-terraform@v2 without a version, which resolves to latest stable. Both are compatible with the 1.10 encryption block once we bump the floor.

One CMK and one alias per environment:
| Environment | Key alias | Purpose |
|---|---|---|
| dev | alias/cabal-tf-state-dev | Encrypts dev infra + DNS state |
| stage | alias/cabal-tf-state-stage | Encrypts stage state |
| prod | alias/cabal-tf-state-prod | Encrypts prod state |
One key per environment (not per stack) keeps the surface small. infra and dns for the same environment share a key; cross-environment isolation is preserved.
Key policy: deploy principal gets Encrypt, Decrypt, GenerateDataKey, DescribeKey. Account root keeps full admin (per AWS best practice — never lock yourself out of your own key). Deletion window: 30 days (max). Automatic rotation: on (annual). Multi-region: false (state lives in one region).
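A minimal sketch of one environment's key, assuming it lives in a small dedicated stack and that var.deploy_role_arn is a hypothetical input for the CI deploy principal; dev shown, stage and prod repeat the pattern:

```hcl
# Sketch only: per-environment state-encryption CMK (dev).
# var.deploy_role_arn is an assumed input, not an existing name.
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "state_key" {
  # Account root keeps full control so the key can never be orphaned.
  statement {
    sid       = "RootAdmin"
    actions   = ["kms:*"]
    resources = ["*"]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }
  }

  # Deploy principal gets only what state encryption needs.
  statement {
    sid = "DeployUse"
    actions = [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:GenerateDataKey",
      "kms:DescribeKey",
    ]
    resources = ["*"]
    principals {
      type        = "AWS"
      identifiers = [var.deploy_role_arn]
    }
  }
}

resource "aws_kms_key" "tf_state_dev" {
  description             = "Cabalmail Terraform state encryption (dev)"
  deletion_window_in_days = 30
  enable_key_rotation     = true
  multi_region            = false
  policy                  = data.aws_iam_policy_document.state_key.json
}

resource "aws_kms_alias" "tf_state_dev" {
  name          = "alias/cabal-tf-state-dev"
  target_key_id = aws_kms_key.tf_state_dev.key_id
}
```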
Update make-terraform.sh to emit:
```hcl
terraform {
  backend "s3" {
    bucket       = "cabal-tf-backend"
    key          = "<env>"
    region       = "<region>"
    encrypt      = true
    kms_key_id   = "alias/cabal-tf-state-<env>"
    use_lockfile = true # S3-native locking, GA in TF 1.10; no DynamoDB needed.
  }
}
```
The backend's encrypt = true and kms_key_id settings are independent of the client-side encryption block: they protect different layers (server-side encryption of the object in the S3 store vs. the state payload itself). Both should be on.
A top-level encryption block per stack root (terraform/dns/main.tf and terraform/infra/main.tf):
```hcl
terraform {
  encryption {
    key_provider "aws_kms" "state" {
      kms_key_id = "alias/cabal-tf-state-<env>"
      region     = "<region>"
      key_spec   = "AES_256"
    }

    method "aes_gcm" "state" {
      keys = key_provider.aws_kms.state
    }

    state {
      method   = method.aes_gcm.state
      enforced = true
    }

    plan {
      method   = method.aes_gcm.state
      enforced = true
    }
  }
}
```
enforced = true is a one-way door: once set, every operator who runs terraform plan or terraform apply against this stack must have kms:Decrypt on the key. For our CI-only apply model with a single deploy principal, that is fine — and is the whole point. The migration step below uses enforced = false exactly once per stack/env, then flips it on.
The <env> and <region> placeholders mean the encryption block has to be templated like the backend block. Extend make-terraform.sh to write it alongside backend.tf, or inline both into a single generated _generated.tf file.
Bump every versions.tf’s required_version to >= 1.10. Pin setup-terraform to terraform_version: "~1.10" in the workflows so a future TF 2.x release doesn’t surprise us.
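As a sketch, the floor bump in each module's versions.tf looks like this (the provider pin is shown only for shape; keep whatever each module already declares):

```hcl
# versions.tf — raise the floor so the encryption and use_lockfile
# features are available on every runner.
terraform {
  required_version = ">= 1.10"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # illustrative; keep the module's existing pin
    }
  }
}
```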
Per stack (dns, then infra) per environment (dev → stage → prod):
1. KMS bootstrap: a small terraform/state-keys/ stack (or a one-shot aws kms create-key via the console — fine for a one-time bootstrap) creates the three CMKs and aliases. Output the key ARNs to a non-secret file (docs/0.9.0/key-arns.md) for reference.
2. TF version bump: raise the versions.tf floors and the setup-terraform version in CI. Confirm terraform plan still no-ops on every environment.
3. Backend encrypt + kms_key_id: add both settings to the backend block emitted by make-terraform.sh. The next terraform init migrates the state file (S3 PutObject with SSE-KMS). Test on dev first; this is reversible by stripping the lines and re-init'ing with -migrate-state.
4. Client-side encryption (migration apply): add the encryption block with enforced = false and a one-shot migration block:
```hcl
terraform {
  encryption {
    # ... key_provider, method as above ...

    state {
      method = method.aes_gcm.state
      # No enforced flag yet.
    }

    state {
      unencrypted = true
    }
  }
}
```
On the next apply, TF reads the still-unencrypted state and writes encrypted. One apply per stack per environment. Confirm by downloading the state file from S3 and observing it is now an opaque blob with "encryption": {...} metadata at the top.
5. enforced = true: remove the state { unencrypted = true } block and flip enforced = true. From this point forward, anyone without kms:Decrypt cannot read state. Same for plan files.
6. Secret-management switch:
   - Remove ignore_changes = [value] and the placeholder strings on aws_ssm_parameter.pushover_user_key, aws_ssm_parameter.pushover_app_token, aws_ssm_parameter.ntfy_publisher_token.
   - Add variable "pushover_user_key", variable "pushover_app_token", variable "ntfy_publisher_token" to the monitoring module and the root, all sensitive = true (sketched below).
   - Feed the values from GitHub environment secrets (secrets.PUSHOVER_USER_KEY etc.), not vars. Set them as TF_VAR_* env on the apply step rather than writing them to terraform.tfvars on disk.
   - Rotating a credential becomes a secret update plus a workflow re-run: no aws ssm put-parameter, no terraform taint.

Roll out on dev first, end-to-end, with the migration steps spaced out by at least one CI run each so that any breakage shows up cheaply. Then stage, then prod. The whole sequence per stack should fit in one PR per environment if we want clean rollback boundaries; bundling all three is also acceptable once we've done dev.
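A minimal sketch of the switched-over pattern for one of the three parameters; the variable and resource names come from the list above, but the parameter path and other attributes are assumptions about the existing module:

```hcl
variable "pushover_app_token" {
  description = "Pushover application token for the alert_sink Lambda"
  type        = string
  sensitive   = true
}

resource "aws_ssm_parameter" "pushover_app_token" {
  name  = "/cabalmail/monitoring/pushover_app_token" # assumed path
  type  = "SecureString"
  value = var.pushover_app_token

  # No ignore_changes and no placeholder: Terraform owns the real value,
  # which is acceptable now that state is KMS-encrypted.
}
```

The sensitive flag keeps the value out of CLI output; the encrypted state and plan files protect it at rest.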
| Step | Rollback |
|---|---|
| KMS bootstrap | Disable + schedule deletion (30-day window). |
| TF version bump | Revert the workflow + versions.tf change; no state implications. |
| Backend encrypt + kms_key_id | Remove the lines, run terraform init -migrate-state. New state writes drop SSE-KMS. |
| Client-side encryption (migration apply) | Restore the prior state version from S3 versioning + remove the encryption block. |
| enforced = true | Revert to enforced = false and re-add the state { unencrypted = true } migration block; one apply restores readability with the old toolchain. |
| Secret-management switch | Revert the ignore_changes removal; re-add placeholders. The real values are still in SSM; nothing breaks at runtime. |
Workflows (terraform.yml, bootstrap.yml):
- The deploy principal needs kms:Encrypt, kms:Decrypt, kms:GenerateDataKey, kms:DescribeKey on the per-environment CMK. Either inline that into the existing deploy policy or add a kms policy attachment (a sketch follows the workflow snippet below).
- New GitHub environment secrets: PUSHOVER_USER_KEY, PUSHOVER_APP_TOKEN, NTFY_PUBLISHER_TOKEN (per environment).
- In the apply job (and plan if we want secret-aware diffs there), pass them through as env vars rather than writing to terraform.tfvars:
```yaml
- name: apply-terraform
  env:
    # TF_VAR_ names are case-sensitive and must match the lowercase variable names.
    TF_VAR_pushover_user_key: ${{ secrets.PUSHOVER_USER_KEY }}
    TF_VAR_pushover_app_token: ${{ secrets.PUSHOVER_APP_TOKEN }}
    TF_VAR_ntfy_publisher_token: ${{ secrets.NTFY_PUBLISHER_TOKEN }}
  run: terraform apply ...
```
GitHub auto-masks secret values in logs. They never touch the runner’s filesystem.
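For the kms policy attachment mentioned above, a minimal sketch; var.deploy_role_name and var.state_kms_key_arn are assumed inputs for illustration, not names that exist in the repo today:

```hcl
# Sketch: grant the CI deploy role use of the per-environment state CMK.
variable "deploy_role_name" {
  type = string
}

variable "state_kms_key_arn" {
  type = string
}

resource "aws_iam_role_policy" "deploy_state_kms" {
  name = "tf-state-kms"
  role = var.deploy_role_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "UseStateKey"
      Effect = "Allow"
      Action = [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey",
      ]
      Resource = var.state_kms_key_arn
    }]
  })
}
```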
Verification:

- Run terraform plan to confirm it still reflects reality.
- The state object shows SSE-KMS (aws:kms ServerSideEncryption in the S3 console) and is opaque when downloaded directly (no readable JSON, no plaintext secrets).
- A principal with s3:GetObject but without kms:Decrypt gets access-denied on the underlying object.
- terraform plan still produces no diff in steady state.
- Rotate the PUSHOVER_APP_TOKEN secret in the prod GitHub environment, re-run the terraform workflow, observe the SSM SecureString updated; trigger a Kuma test alert and confirm Pushover delivery still works.
- Manual aws ssm put-parameter instructions are reduced to the ntfy first-boot bootstrap only.

Open items:

- plan-terraform.sh produces a plan output; with plan { enforced = true }, the artifact uploaded between the plan and apply jobs is encrypted at rest (already true in GitHub Actions) and additionally encrypted client-side. The apply job needs kms:Decrypt to consume it — confirm the existing apply principal has it.
- The encryption block syntax is identical, but the key provider names differ slightly (aws_kms is the same). One-day spike to confirm.