Cabalmail

Host your own email and enhance your privacy

IaC Quality Gates Plan

Note (post-simplification update): this plan was originally written against the pre-infra.yml workflow layout. Phase 6 of build-deploy-simplification-plan.md (shipped in 0.9.5) replaced terraform.yml + bootstrap.yml with a single infra.yml that owns both the terraform/infra stage and the terraform/dns (bootstrap) stage as gated jobs. References below have been updated to the post-simplification names; the scanner configurations (broken tflint loop, soft-failing Checkov, deprecated tfsec, no scanners on the DNS stack) carried over verbatim from the legacy workflow, so the substantive findings here are unchanged.

Context

The infra.yml workflow has run three Terraform-targeted scanners on every push to terraform/infra/** for years: Checkov (broad cloud-config policy), tflint (Terraform-specific lint + provider rules), and tfsec (security-focused HCL scanner). The intent at the time was sound: catch misconfigurations before they land. The execution has drifted in ways that mean today these scanners are mostly noise generators, not gates:

This plan does two separable things, in order:

  1. Replace tfsec with Trivy. A like-for-like swap that costs us nothing and unblocks the rest of the work on a maintained tool.
  2. Convert all three scanners from background noise into staged quality gates. Net-new findings fail CI; the existing pile is captured as a baseline and walked down deliberately.

The two halves can ship independently, but combining them in one rollout per stack keeps the operator-facing churn to a single inflection point.

Goals

Non-goals

Tool currency assessment

| Tool | Status | Verdict |
| --- | --- | --- |
| Checkov | Actively maintained by Prisma Cloud | Keep. Broadest rule pack; strong AWS coverage; supports baselining natively. |
| tflint | Actively maintained | Keep. Only tool here that catches Terraform-language lint (unused vars, deprecated syntax, provider-arg shape). The AWS ruleset complements security scanners. |
| tfsec | Merged into Trivy; maintenance mode | Replace with Trivy IaC. Same Aqua engine, same finding IDs (AVD-AWS-*), actively maintained, single binary that we can later reuse for image scanning. |

Other tools considered and rejected:

Current state (audit)

A short investigation step to bake into Phase 0: collect a current finding count from each tool against the current main to set expectations for the size of the initial baseline.

Target state

Tool inventory (post-migration)

| Tool | Action / Binary | Version pin | Scope |
| --- | --- | --- | --- |
| Checkov | bridgecrewio/checkov-action@<sha> | Tagged release, by SHA | both stacks |
| tflint | terraform-linters/setup-tflint@<sha> + tflint --recursive | Tagged release, by SHA | both stacks |
| Trivy IaC | aquasecurity/trivy-action@<sha> with scan-type: config | Tagged release, by SHA | both stacks |

All three are pinned to commit SHA, not floating tags, per the standard supply-chain hardening posture for third-party actions. The Renovate/Dependabot rule (covered below) bumps these as a normal PR with clear diffs.
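A check along the following lines could keep the pinning from regressing. It is a minimal Python sketch, not something that exists in the repository: the name unpinned_actions and both regular expressions are assumptions, and it scans the workflow as plain text instead of parsing YAML.

```python
import re

# `uses: owner/repo[/path]@ref` lines; local actions (./path) carry no @ref and never match.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#@]+)@([^\s#]+)", re.MULTILINE)
# A full commit SHA is exactly 40 lowercase hex characters.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return 'action@ref' strings whose ref is not a full commit SHA."""
    offenders = []
    for action, ref in USES_RE.findall(workflow_text):
        if action.startswith("docker://"):
            continue  # container references are pinned by digest, not by git SHA
        if not SHA_RE.fullmatch(ref):
            offenders.append(f"{action}@{ref}")
    return offenders
```

Run against infra.yml, an empty list would mean every third-party action is pinned; any entry returned is a floating tag or branch.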

tflint loop fix and config

Replace the broken loop with the supported --recursive flag (tflint 0.50+):

- name: run-linter
  working-directory: ./terraform/infra
  run: tflint --recursive --format compact

Bump .tflint.hcl to current AWS ruleset and enable the bundled terraform_* rules:

plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

plugin "aws" {
  enabled = true
  version = "0.40.0"   # or whatever's latest at PR time
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

The recommended preset enables terraform_required_version, terraform_module_pinned_source, terraform_unused_declarations, and a handful of similar lints we should already be passing. If we’re not, those become baseline entries.

Baseline files (per tool, per stack)

Each tool gets a per-stack baseline checked into the stack root:

terraform/infra/.checkov.baseline
terraform/infra/.tflint.baseline      # or `.tflint-ignore` per project convention
terraform/infra/.trivy.baseline
terraform/dns/.checkov.baseline
terraform/dns/.tflint.baseline
terraform/dns/.trivy.baseline

Format: each tool’s native baseline format. Checkov has --create-baseline / --baseline. Trivy has --ignorefile .trivyignore plus misconfig.exceptions in trivy.yaml. tflint has no baseline file of its own: findings can be exported with --format=json and rules switched off wholesale with --disable-rule or narrowed with --filter, so the practical pattern is a per-resource # tflint-ignore: <rule_name> # justification annotation on, or directly above, the offending line (see “Suppressions” below).

The baseline files are first-class, reviewed artifacts. Each entry must include a one-line justification recorded either in the file itself (Checkov supports JSON metadata) or in a sibling BASELINE.md per stack (terraform/infra/BASELINE.md, terraform/dns/BASELINE.md) cross-referencing the IDs. The presence of a justification is verified by a small CI check (Phase 4).
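The justification check could look roughly like this. It is a sketch under two assumptions: that the Checkov baseline JSON keeps its IDs under failed_checks, findings, check_ids, and that BASELINE.md mentions each ID on a line together with its reason. Both function names are hypothetical.

```python
import json
import re


def baseline_check_ids(baseline_json: str) -> set[str]:
    """Collect every check ID recorded in a Checkov-style baseline document."""
    ids: set[str] = set()
    for file_entry in json.loads(baseline_json).get("failed_checks", []):
        for finding in file_entry.get("findings", []):
            ids.update(finding.get("check_ids", []))
    return ids


def unjustified_ids(baseline_json: str, baseline_md: str) -> list[str]:
    """Return baseline IDs with no line in BASELINE.md that names them and adds text."""
    missing = []
    for check_id in sorted(baseline_check_ids(baseline_json)):
        id_re = re.compile(rf"\b{re.escape(check_id)}\b")
        justified = False
        for line in baseline_md.splitlines():
            if id_re.search(line):
                # Drop the ID and list/table punctuation; anything left is the reason.
                rest = re.sub(r"[\s`|*:\-]+", "", id_re.sub("", line))
                if rest:
                    justified = True
                    break
        if not justified:
            missing.append(check_id)
    return missing
```

A non-empty result would fail the CI step and name the entries that still need a reason.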

Suppressions in code

Inline suppressions are preferred over baseline entries when the finding is local to one resource and we genuinely don’t intend to fix it:

A pre-commit + CI check rejects suppressions without a justification text after the colon/hash. Without that guard, the rationale-capture goal degrades into shrug-emoji silencing.
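A guard of that kind can be quite small. The sketch below assumes two comment shapes: Checkov's #checkov:skip=<ID>:<reason> and the # tflint-ignore: <rule_name> # justification form described earlier in this plan. Trivy is left out because its suppression format is not spelled out here. The function name and patterns are illustrative only.

```python
import re

# Checkov inline skip: the reason follows a second colon.
CHECKOV_RE = re.compile(r"#\s*checkov:skip=([A-Za-z0-9_]+)(?::(.*))?\s*$")
# tflint annotation: the reason follows a trailing '#'.
TFLINT_RE = re.compile(r"#\s*tflint-ignore:\s*([A-Za-z0-9_, ]+?)\s*(?:#(.*))?$")


def bare_suppressions(hcl_text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule) for each suppression that carries no reason text."""
    bare = []
    for lineno, line in enumerate(hcl_text.splitlines(), start=1):
        for pattern in (CHECKOV_RE, TFLINT_RE):
            match = pattern.search(line)
            if match and not (match.group(2) or "").strip():
                bare.append((lineno, match.group(1).strip()))
    return bare
```

The same function can back both the pre-commit hook and the CI step, so the two cannot disagree about what counts as a justification.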

Drift detection

Once baselines exist, the gate has two failure modes:

  1. New finding not in the baseline. Standard fail-the-PR behaviour.
  2. Baseline entry that no longer matches any real finding (stale entry — the underlying code was fixed but the entry was never removed). This drifts the baseline upward and re-creates the run-and-ignore problem we’re solving.

Checkov has a --baseline mode but does not emit “stale entry” diagnostics out of the box. The pragmatic workaround is a small Python helper, run after each scanner, that diffs current_findings ∪ inline_suppressions against the baseline and fails if any baseline entry is unaccounted for. The same pattern applies to Trivy and tflint. The helper lives in .github/scripts/baseline-diff.py.
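The core of that helper is a pair of set differences. The sketch below shows only the comparison, assuming each scanner's output has already been normalised into (rule ID, resource) pairs and that inline-suppressed findings are not reported among the current ones. The Finding and DriftReport types are hypothetical, not the shipped script.

```python
from typing import Iterable, NamedTuple


class Finding(NamedTuple):
    rule_id: str   # e.g. "CKV_AWS_18" or "AVD-AWS-0086"
    resource: str  # e.g. "aws_s3_bucket.logs"


class DriftReport(NamedTuple):
    new: set[Finding]    # reported now, absent from the baseline: fail the PR
    stale: set[Finding]  # in the baseline, neither reported nor suppressed: remove the entry


def diff_baseline(
    current: Iterable[Finding],
    inline_suppressed: Iterable[Finding],
    baseline: Iterable[Finding],
) -> DriftReport:
    """Compare one scanner run against its baseline in both directions."""
    current, inline_suppressed, baseline = set(current), set(inline_suppressed), set(baseline)
    accounted_for = current | inline_suppressed
    return DriftReport(new=current - baseline, stale=baseline - accounted_for)
```

The step would exit non-zero when either set is non-empty, which covers both failure modes listed above.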

Rollout posture per finding class

Decisions to make per finding type, applied during the baselining phase:

The classification work happens in Phase 2; the actual code remediation for the first bucket happens in Phase 2.5; ongoing decay for the third bucket happens in Phase 4. This split keeps “establish what we have” and “fix what must be fixed before gating” as separate, individually-rollback-able phases — rather than bundling them into the gate-flip PR where a slip on either side blocks the other.

CI: surfacing findings

Checkov, Trivy, and tflint can all emit SARIF; uploading SARIF via github/codeql-action/upload-sarif populates the GitHub Security → Code scanning tab with deduplicated, navigable findings tied to file lines. This replaces “scroll through job logs” as the operator’s first stop and gives us deltas per PR for free.

PR comment posture: leave the SARIF surface as the canonical view; do not spam PRs with one-comment-per-finding bots. Optional: a single summary comment per PR with counts and a link to the Security tab.
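For the optional summary comment, the counts can be read straight from the SARIF logs the scanners already produce. The following is a sketch that relies only on the standard SARIF 2.1.0 layout (runs, tool.driver.name, results, level); the function names and the output wording are assumptions.

```python
import json
from collections import Counter


def sarif_counts(sarif_text: str) -> Counter:
    """Count results per (tool name, level) across all runs in one SARIF log."""
    counts: Counter = Counter()
    for run in json.loads(sarif_text).get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        for result in run.get("results", []):
            # SARIF treats a missing level as "warning".
            counts[(tool, result.get("level", "warning"))] += 1
    return counts


def summary_line(counts: Counter) -> str:
    """Render the counts as one line suitable for a single PR comment."""
    parts = [f"{tool}: {n} {level}" for (tool, level), n in sorted(counts.items())]
    return "; ".join(parts) if parts else "no findings"
```

Posting the comment itself is left to whatever mechanism the workflow already uses; this only builds the text.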

Local runner

A Makefile (or .github/scripts/scan-local.sh) at the repo root that runs all three tools against both stacks:

# Run every scanner against both stacks, mirroring the CI invocations.
.PHONY: scan scan-infra scan-dns
scan: scan-infra scan-dns
# tflint rejects --recursive combined with --chdir, so cd into the stack first.
scan-infra:
	checkov -d terraform/infra --baseline terraform/infra/.checkov.baseline
	cd terraform/infra && tflint --recursive
	trivy config terraform/infra --ignorefile terraform/infra/.trivy.baseline
scan-dns:
	checkov -d terraform/dns --baseline terraform/dns/.checkov.baseline
	tflint --chdir terraform/dns
	trivy config terraform/dns --ignorefile terraform/dns/.trivy.baseline

The point is parity between local and CI: if it passes locally, it passes in CI, and vice versa. Same tool versions via mise/asdf/pre-commit-hooks, picked at PR time.

Migration sequence

One PR per phase. Each phase is independently deployable; each phase’s rollback is the previous phase.

Phase 0 — Baseline measurement (no behavioural change)

Run each tool against current main for both stacks; record the finding counts and severity breakdown in docs/0.10.x/iac-baseline-snapshot.md (created as part of this phase). The output sizes the baseline work in Phase 2 and the fix sweep in Phase 2.5, and lets us track decay.

This phase doesn’t touch CI. Pure reconnaissance.
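The severity breakdown for the snapshot can be tallied from scanner output with very little code. The sketch below handles Trivy only and assumes the JSON report produced by trivy config --format json keeps findings under Results and Misconfigurations, each with Severity and Status fields. The function names and the row layout are hypothetical.

```python
import json
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW", "UNKNOWN"]


def severity_breakdown(trivy_json: str) -> Counter:
    """Count failed misconfiguration findings per severity in one Trivy report."""
    counts: Counter = Counter()
    for result in json.loads(trivy_json).get("Results") or []:
        for item in result.get("Misconfigurations") or []:
            if item.get("Status", "FAIL") == "FAIL":
                counts[item.get("Severity", "UNKNOWN")] += 1
    return counts


def snapshot_rows(stack: str, counts: Counter) -> list[str]:
    """Format non-zero counts as rows for the snapshot document."""
    return [f"{stack} | trivy | {sev} | {counts[sev]}" for sev in SEVERITY_ORDER if counts[sev]]
```

Equivalent tallies for Checkov and tflint would need their own parsers, since each tool emits a different JSON shape.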

Phase 1 — Replace tfsec with Trivy IaC, scan both stacks, still soft-fail

Single PR against infra.yml:

After this phase merges, the Security → Code scanning tab has a real findings list against current main. Phase 0’s snapshot validates that the new Trivy findings are a superset of what tfsec reported (they should be; same engine, newer rules).

Phase 2 — Establish baselines

Single PR per stack (so two PRs):

The PR includes the baseline files but CI is still soft-fail. The baseline is established but not yet enforcing.

Phase 2.5 — Fix sweep (high-severity, code-localised)

Not a single PR — a fan-out of small focused PRs, one per finding cluster, sized so that each is reviewable independently. The phase is bounded by classification (which findings are in scope) and by an exit criterion (all of them are landed or reclassified), not by calendar time.

The actual size of this phase is unknown until Phase 0’s snapshot is in hand. The plausible range, given Cabalmail’s footprint, is somewhere between five and forty fixes split across both stacks; the realistic calendar cost is one to four weeks of part-time work, dominated by review latency rather than implementation effort. Do not promise a Phase 3 date until Phase 0 has reported.

What lands in this phase:

What does not land in this phase:

Mid-phase classification corrections are fine and expected; the boundary between “fix” and “suppress” is sometimes only obvious once you’ve started writing the fix. What matters is that every high-severity finding from Phase 0 has either been fixed, suppressed-with-reason, or explicitly reclassified by the time Phase 3 starts.

Exit criterion (verifiable, not vibes-based): re-run all three scanners against the candidate Phase 3 branch; the count of HIGH/CRITICAL findings not in the baseline and not inline-suppressed is zero. This is the same check Phase 3’s gate will perform, just run pre-flip so we know it passes.

If Phase 2.5 stretches past two calendar weeks without clear daylight, that is a signal that the severity threshold is wrong (too aggressive) or that some “code-localised” findings are actually design-driven and should be reclassified. Re-run the Phase 2 classification on the residual rather than grinding through.

Phase 3 — Flip the gate

Single PR:

After this phase, every new PR passes only if no new findings appear and no baseline entry has gone stale. The first PR after this lands is the test of whether the baseline + drift-detection setup is right; expect at least one false positive that needs a small fix to the diff script.

Phase 4 — Decay

Ongoing, but with structure rather than vibes:

Per-stack ordering

terraform/dns goes through Phases 1–3 first (it is a small, DNS-only stack with fewer findings and lower stakes), then terraform/infra. The two stacks can share Phase 1 (a single CI rewrite PR) but must split Phase 2 (one baseline PR per stack, because they are different code).

Rollback

| Phase | Rollback |
| --- | --- |
| 0 | None needed: pure measurement. |
| 1 | Revert the infra.yml changes. tfsec returns; trivy/checkov/tflint scope on dns disappears. |
| 2 | Delete the baseline files and BASELINE.md. CI continues to soft-fail (no behavioural change since Phase 1 was already soft-fail). |
| 2.5 | Each fix PR is independently revertable. Reverting them all returns the codebase to its Phase 2 state with the original baseline. Generally we wouldn’t roll these back wholesale, since the fixes are correctness improvements regardless of whether the gate flips. |
| 3 | Re-add soft_fail: true (or equivalent per tool) and remove the drift-detection step. The baseline files stay; gates revert to noise. |
| 4 | n/a (continuous). Individual tool-version bumps revert as normal PRs. |

The window of weakened security posture is Phase 1–3, during which tfsec’s hard gate is gone but the new gates aren’t yet enforcing. Bound this to ~2 weeks of calendar time and avoid scheduling it across a code-freeze or release-cut window.

CI changes

Acceptance

Open questions

Out of scope for 0.9.0