Host your own email and enhance your privacy
Note (post-simplification update): this plan was originally written against the pre-`infra.yml` workflow layout. Phase 6 of `build-deploy-simplification-plan.md` (shipped in 0.9.5) replaced `terraform.yml` + `bootstrap.yml` with a single `infra.yml` that owns both the `terraform/infra` stage and the `terraform/dns` (bootstrap) stage as gated jobs. References below have been updated to the post-simplification names; the scanner configurations (broken tflint loop, soft-failing Checkov, deprecated tfsec, no scanners on the DNS stack) carried over verbatim from the legacy workflow, so the substantive findings here are unchanged.
The infra.yml workflow has run three Terraform-targeted scanners on every push to terraform/infra/** for years: Checkov (broad cloud-config policy), tflint (Terraform-specific lint + provider rules), and tfsec (security-focused HCL scanner). The intent at the time was sound: catch misconfigurations before they land. The execution has drifted in ways that mean today these scanners are mostly noise generators, not gates:
- Checkov runs with `soft_fail: true` (in the `chekov` job of `infra.yml`). The job always succeeds. Findings are visible only by reading the raw log of a successful workflow, which no one does.
- tfsec runs with `soft_fail: false` (in the `tfsec` job of `infra.yml`) and is a real gate. But tfsec itself was merged into Trivy in 2023 and the standalone project is in maintenance mode; the action we use (`aquasecurity/tfsec-action@v1.0.0`) is the same code under a deprecated wrapper.
- The tflint loop `for i in ./ modules/* modules/*/modules/* ; do tflint ; done` (in the `tflint` job of `infra.yml`) never `cd`s into `$i`, so tflint is invoked N times in the same root directory and the modules are never scanned. The pinned AWS ruleset version in `terraform/.tflint.hcl` is 0.20.0; the current ruleset is at the 0.40 line.
- The `terraform/dns` (bootstrap) stage of `infra.yml` (the `bootstrap_build` / `bootstrap_plan` / `bootstrap_apply` jobs) has no scanners at all.
- Version pinning is loose: the Checkov action is at `@master`, the tflint installer is `curl ... master/install_linux.sh`, and the tfsec action is at `v1.0.0`. Gate strictness drifts silently as upstream ships new rules.

This plan does two separable things, in order:
The two halves can ship independently, but combining them in one rollout per stack keeps the operator-facing churn to a single inflection point.
- A push to `terraform/infra/**` or `terraform/dns/**` runs Checkov, tflint, and Trivy IaC against both stacks. New findings fail the workflow.
- The same scans are runnable locally (`make scan` or equivalent), so feedback is reachable before pushing.
- […] `cabal:owner` tag”).
- […] `pylint` already), the React app (separate concern), or the Docker images. Container image scanning with Trivy is a natural follow-on but is a separate posture decision.
- […] `terraform/dns`) to use the same baseline file as `terraform/infra`. They get parallel-but-separate baselines.

| Tool | Status | Verdict |
|---|---|---|
| Checkov | Actively maintained by Prisma Cloud | Keep. Broadest rule pack; strong AWS coverage; supports baselining natively. |
| tflint | Actively maintained | Keep. Only tool here that catches Terraform-language lint (unused vars, deprecated syntax, provider-arg shape). The AWS ruleset complements security scanners. |
| tfsec | Merged into Trivy; maintenance mode | Replace with Trivy IaC. Same Aqua engine, same finding IDs (AVD-AWS-*), actively maintained, single binary that we can later reuse for image scanning. |
Other tools considered and rejected:
- `infra.yml` infra-stage jobs: `chekov` (sic, the job name has a typo), `tflint`, `tfsec`. All three depend on `build` (which writes `backend.tf`) and feed into `apply` via `needs`. The `apply` job won’t run if tflint or tfsec fails; Checkov is bypassed because of `soft_fail: true`.
- `infra.yml` bootstrap-stage jobs: `bootstrap_build` → `bootstrap_plan` → `bootstrap_apply`, gated on a path filter for `terraform/dns/**`. No scanner jobs at all.
- No baseline or ignore files: `find . -maxdepth 4 -name '.checkov*' -o -name '.tfsec*'` is empty.
- No `Makefile` or developer-side runner for these scans.
- `.tflint.hcl` lives at `terraform/.tflint.hcl` (one level above both stacks) and only enables the AWS plugin. No `terraform_*` rules, no `terraform_module_pinned_source`, no `terraform_required_version` enforcement.
- `infra.yml`’s push trigger covers `terraform/dns/**` and `terraform/infra/**` plus its helper scripts; the bootstrap and infra stages are gated downstream by a `dorny/paths-filter@v3` step rather than by separate workflow triggers, so adding scanners against `terraform/dns` is a new sibling job rather than a new workflow.

A short investigation step to bake into Phase 0: collect a current finding count from each tool against the current `main` to set expectations for the size of the initial baseline.

| Tool | Action / Binary | Version pin | Scope |
|---|---|---|---|
| Checkov | `bridgecrewio/checkov-action@<sha>` | Tagged release, by SHA | both stacks |
| tflint | `terraform-linters/setup-tflint@<sha>` + `tflint --recursive` | Tagged release, by SHA | both stacks |
| Trivy IaC | `aquasecurity/trivy-action@<sha>` with `scan-type: config` | Tagged release, by SHA | both stacks |

All three are pinned to commit SHA, not floating tags, per the standard supply-chain hardening posture for third-party actions. The Renovate/Dependabot rule (covered below) bumps these as a normal PR with clear diffs.
Replace the broken loop with the supported `--recursive` flag (tflint 0.50+):

    - name: run-linter
      working-directory: ./terraform/infra
      run: tflint --recursive --format compact

Bump `.tflint.hcl` to the current AWS ruleset and enable the bundled `terraform_*` rules:

    plugin "terraform" {
      enabled = true
      preset  = "recommended"
    }

    plugin "aws" {
      enabled = true
      version = "0.40.0" # or whatever's latest at PR time
      source  = "github.com/terraform-linters/tflint-ruleset-aws"
    }

The recommended preset enables terraform_required_version, terraform_module_pinned_source, terraform_unused_declarations, and a handful of similar lints we should already be passing. If we’re not, those become baseline entries.
Each tool gets a per-stack baseline checked into the stack root:

    terraform/infra/.checkov.baseline
    terraform/infra/.tflint.baseline      # or `.tflint-ignore` per project convention
    terraform/infra/.trivy.baseline
    terraform/dns/.checkov.baseline
    terraform/dns/.tflint.baseline
    terraform/dns/.trivy.baseline

Format: each tool’s native baseline format. Checkov has `--create-baseline` / `--baseline`. Trivy has `--ignorefile .trivyignore` plus `misconfig.exceptions` in `trivy.yaml`. tflint has no baseline file of its own: findings can be exported with `--format=json` and mapped onto an `--ignore-rule`/`--filter` allowlist, but the practical pattern is a per-resource `# tflint-ignore: <rule_name> # justification` annotation on the offending line (see “Suppressions” below).
The baseline files are first-class, reviewed artifacts. Each entry must include a one-line justification recorded either in the file itself (Checkov supports JSON metadata) or in a sibling BASELINE.md per stack (terraform/infra/BASELINE.md, terraform/dns/BASELINE.md) cross-referencing the IDs. The presence of a justification is verified by a small CI check (Phase 4).
Inline suppressions are preferred over baseline entries when the finding is local to one resource and we genuinely don’t intend to fix it:
- `# checkov:skip=CKV_AWS_<n>: <one-sentence justification>`
- `# trivy:ignore:AVD-AWS-<n> # <justification>`
- `# tflint-ignore: <rule_name> # <justification>`

A pre-commit + CI check rejects suppressions without a justification text after the colon/hash. Without that guard, the rationale-capture goal degrades into shrug-emoji silencing.
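The justification guard could be sketched roughly as follows. This is a Python illustration rather than the grep-based shell helper the plan names; the regular expressions and the function name `find_unjustified` are assumptions derived from the three comment forms above, not the tools' own parsers.

```python
import re

# (pattern, separator) pairs. Each pattern captures everything after the
# rule ID as "rest"; the separator is the character that must introduce
# the justification. These regexes are illustrative assumptions.
SUPPRESSIONS = [
    (re.compile(r"checkov:skip=[A-Z0-9_]+(?P<rest>.*)$"), ":"),
    (re.compile(r"trivy:ignore:[A-Za-z0-9-]+(?P<rest>.*)$"), "#"),
    (re.compile(r"tflint-ignore:\s*[a-z0-9_]+(?:\s*,\s*[a-z0-9_]+)*(?P<rest>.*)$"), "#"),
]


def find_unjustified(text):
    """Return (line_number, line) pairs for suppressions lacking a reason."""
    offenders = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, separator in SUPPRESSIONS:
            match = pattern.search(line.rstrip())
            if match is None:
                continue
            rest = match.group("rest").strip()
            # Justified only if the separator is present AND followed by text.
            justified = rest.startswith(separator) and rest[len(separator):].strip() != ""
            if not justified:
                offenders.append((number, line.strip()))
            break  # a line is classified by the first pattern that matches
    return offenders
```

A CI step would read each `.tf` file, call `find_unjustified`, and exit non-zero if the returned list is non-empty.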
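Once baselines exist, the gate has two failure modes:

- A new finding that is neither in the baseline nor suppressed inline.
- A baseline entry that no longer matches any current finding (a stale entry).
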
Checkov has a --baseline mode but does not emit “stale entry” diagnostics out of the box. The pragmatic workaround: a small Python helper run after each scanner that diffs current_findings ∪ inline_suppressions against the baseline, fails if anything in the baseline is unaccounted for. Same pattern for Trivy and tflint. Lives in .github/scripts/baseline-diff.py.
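The core of that helper is a pair of set differences. The sketch below assumes each finding has already been reduced to an opaque string key (for example `"CKV_AWS_20:aws_s3_bucket.logs"`); that key format and the function names are hypothetical, and each tool's JSON output would need its own adapter to produce the keys.

```python
def diff_against_baseline(current_findings, inline_suppressions, baseline):
    """Return (new, stale) lists of finding keys.

    All three arguments are iterables of opaque finding keys. The key
    format is an assumption of this sketch.
    """
    current = set(current_findings)
    suppressed = set(inline_suppressions)
    known = set(baseline)

    # New: reported now, but neither baselined nor suppressed inline.
    new = current - known - suppressed
    # Stale: baselined, but accounted for by neither a current finding
    # nor an inline suppression.
    stale = known - (current | suppressed)
    return sorted(new), sorted(stale)


def exit_code(new, stale):
    """Non-zero when either failure mode is present."""
    return 1 if new or stale else 0
```

Returning sorted lists keeps the CI log stable between runs, which makes a drift report easier to compare across PRs.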
Decisions to make per finding type, applied during the baselining phase:
- […] `BASELINE.md` and a target version to clear them by. These are the decay-candidates.
- […] `.checkov.yaml` with a top-of-file rationale. Use sparingly.

The classification work happens in Phase 2; the actual code remediation for the first bucket happens in Phase 2.5; ongoing decay for the third bucket happens in Phase 4. This split keeps “establish what we have” and “fix what must be fixed before gating” as separate, individually-rollback-able phases, rather than bundling them into the gate-flip PR where a slip on either side blocks the other.
Checkov, Trivy, and tflint can all emit SARIF; uploading SARIF via github/codeql-action/upload-sarif populates the GitHub Security → Code scanning tab with deduplicated, navigable findings tied to file lines. This replaces “scroll through job logs” as the operator’s first stop and gives us deltas per PR for free.
PR comment posture: leave the SARIF surface as the canonical view; do not spam PRs with one-comment-per-finding bots. Optional: a single summary comment per PR with counts and a link to the Security tab.
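The optional summary comment only needs per-tool counts, which can be read straight from the SARIF files the scanners already emit. A minimal sketch, assuming SARIF 2.1.0 input; the function names are hypothetical, and treating a missing `level` as `warning` is a simplification.

```python
import json
from collections import Counter


def summarize_sarif(sarif_text):
    """Count results per (tool name, level) in a SARIF 2.1.0 document."""
    document = json.loads(sarif_text)
    counts = Counter()
    for run in document.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        for result in run.get("results", []):
            # Simplification: SARIF lets a rule's default configuration
            # supply the level when the result omits it.
            counts[(tool, result.get("level", "warning"))] += 1
    return counts


def format_summary(counts):
    """Render one line per (tool, level), or a fixed message when empty."""
    lines = [
        f"{tool}: {count} {level}"
        for (tool, level), count in sorted(counts.items())
    ]
    return "\n".join(lines) if lines else "No findings."
```

The formatted text would be posted as one comment per PR alongside the link to the Security tab.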
A `Makefile` (or `.github/scripts/scan-local.sh`) at the repo root that runs all three tools against both stacks:

    .PHONY: scan scan-infra scan-dns

    scan: scan-infra scan-dns

    # Recipe lines are tab-indented. The infra stack uses `cd` rather than
    # --chdir because tflint may refuse --chdir combined with --recursive.
    scan-infra:
    	checkov -d terraform/infra --baseline terraform/infra/.checkov.baseline
    	cd terraform/infra && tflint --recursive
    	trivy config terraform/infra --ignorefile terraform/infra/.trivy.baseline

    scan-dns:
    	checkov -d terraform/dns --baseline terraform/dns/.checkov.baseline
    	tflint --chdir terraform/dns
    	trivy config terraform/dns --ignorefile terraform/dns/.trivy.baseline

The point is parity between local and CI: if it passes locally, it passes in CI, and vice versa. Same tool versions via mise/asdf/pre-commit-hooks, picked at PR time.
One PR per phase. Each phase is independently deployable; each phase’s rollback is the previous phase.
Run each tool against current main for both stacks; record the finding counts and severity breakdown in docs/0.10.x/iac-baseline-snapshot.md (created as part of this phase). Output sets the size of the work in Phase 3 and lets us track decay.
This phase doesn’t touch CI. Pure reconnaissance.
Single PR against infra.yml:
- Remove the `tfsec` job; add a `trivy` job using `aquasecurity/trivy-action@<sha>` with `scan-type: config` on `terraform/infra`.
- Add sibling `*-dns` scanner jobs (`checkov-dns`, `tflint-dns`, `trivy-dns`) that run against `terraform/dns`. Gate them on `needs.bootstrap_apply` (or, more precisely, on a `!cancelled() && needs.changes.outputs.dns == 'true'` condition that mirrors how the existing infra-stage scanners are reached only when the infra stage is exercised) and wire them into `bootstrap_apply`’s `needs` so the bootstrap apply blocks on them in the same way the infra apply blocks on the infra-stage scanners.
- Fix the tflint loop (use `--recursive`).
- Upload SARIF from each scanner via `github/codeql-action/upload-sarif`.

After this phase merges, the Security → Code scanning tab has a real findings list against current `main`. Phase 0’s snapshot validates that the new Trivy findings are a superset of what tfsec reported (they should be; same engine, newer rules).
Single PR per stack (so two PRs):
- Generate the per-tool baseline files (`.checkov.baseline`, `.tflint.baseline`, `.trivy.baseline`).
- Add a `BASELINE.md` cross-referencing IDs to one-paragraph rationales and target removal versions where applicable.

The PR includes the baseline files but CI is still soft-fail. The baseline is established but not yet enforcing.
Not a single PR — a fan-out of small focused PRs, one per finding cluster, sized so that each is reviewable independently. The phase is bounded by classification (which findings are in scope) and by an exit criterion (all of them are landed or reclassified), not by calendar time.
The actual size of this phase is unknown until Phase 0’s snapshot is in hand. The plausible range, given Cabalmail’s footprint, is somewhere between five and forty fixes split across both stacks; the realistic calendar cost is one to four weeks of part-time work, dominated by review latency rather than implementation effort. Do not promise a Phase 3 date until Phase 0 has reported.
What lands in this phase:
What does not land in this phase:
- […] `.checkov.yaml` with rationale, not a code change.

Mid-phase classification corrections are fine and expected; the boundary between “fix” and “suppress” is sometimes only obvious once you’ve started writing the fix. What matters is that every high-severity finding from Phase 0 has either been fixed, suppressed-with-reason, or explicitly reclassified by the time Phase 3 starts.
Exit criterion (verifiable, not vibes-based): re-run all three scanners against the candidate Phase 3 branch; the count of HIGH/CRITICAL findings not in the baseline and not inline-suppressed is zero. This is the same check Phase 3’s gate will perform, just run pre-flip so we know it passes.
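The exit criterion reduces to a filter over normalized findings. The sketch below assumes each finding is a dict with `id` and `severity` keys and that severities have already been mapped onto a common scale; that shape is hypothetical, since each scanner reports severity differently (tflint, for instance, uses its own severity names rather than HIGH/CRITICAL).

```python
GATING_SEVERITIES = {"HIGH", "CRITICAL"}


def ungated_findings(findings, baseline_ids, suppressed_ids):
    """Return HIGH/CRITICAL findings that are neither baselined nor suppressed.

    `findings` is an iterable of dicts with "id" and "severity" keys; that
    shape is an assumption of this sketch, not a scanner's native output.
    """
    covered = set(baseline_ids) | set(suppressed_ids)
    return [
        finding
        for finding in findings
        if finding["severity"].upper() in GATING_SEVERITIES
        and finding["id"] not in covered
    ]
```

Phase 2.5 is done when this returns an empty list for both stacks on the candidate branch.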
If Phase 2.5 stretches past two calendar weeks without clear daylight, that is a signal that the severity threshold is wrong (too aggressive) or that some “code-localised” findings are actually design-driven and should be reclassified. Re-run the Phase 2 classification on the residual rather than grinding through.
Single PR:
- Remove `soft_fail: true` from Checkov.
- Wire each tool to its baseline: Trivy `--ignorefile`, Checkov `--baseline`; tflint inline `# tflint-ignore` is already in code from Phase 2.
- Add the `baseline-diff.py` drift-detection script and call it from each scanner job.
- Add the suppression-justification check (a `grep`-class CI step that flags any `checkov:skip`, `trivy:ignore`, `tflint-ignore` comment without a `:` or `#` justification suffix).

After this phase, every new PR passes only if no new findings appear and no baseline entry has gone stale. The first PR after this lands is the test of whether the baseline + drift-detection setup is right; expect at least one false positive that needs a small fix to the diff script.
Ongoing, but with structure rather than vibes:
- Each baseline entry carries a target version in `BASELINE.md` (e.g. “clear by 0.9.3”). Entries without a target version are treated as suppressions-disguised-as-baselines and reclassified.
- […] `medium`/`low` severity entries, total count is below an agreed-upon ceiling, and no entry is more than one minor version past its target.” Pick the ceiling at the end of Phase 2 once the initial baseline size is known; encoding the number now would be guessing.

Run `terraform/dns` first through Phases 1–3 (it is small, has fewer findings, and is lower-stakes as a DNS-only stack), then `terraform/infra`. The two stacks can share Phase 1 (single CI rewrite PR) but must split Phase 2 (one baseline PR per stack; they are different code).

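The decay rules for baseline entries (every entry has a target version, and none is more than one minor version past it) could be checked mechanically at each release. A minimal sketch, assuming `BASELINE.md` has already been parsed into a mapping of finding ID to target version string; the parsing step, the function names, and the choice to treat any major-version bump as overdue are assumptions.

```python
def parse_version(text):
    """Split "MAJOR.MINOR.PATCH" into (major, minor); the patch is ignored."""
    major, minor, _patch = (int(part) for part in text.split("."))
    return major, minor


def decay_report(entries, current_version, grace_minor=1):
    """Return (missing_target, overdue) lists of finding IDs.

    `entries` maps a finding ID to its target version string, or None when
    the entry has no target. An entry is overdue once the current release
    is more than `grace_minor` minor versions past the target.
    """
    current = parse_version(current_version)
    missing_target, overdue = [], []
    for finding_id, target in sorted(entries.items()):
        if not target:
            missing_target.append(finding_id)
            continue
        target_major, target_minor = parse_version(target)
        if current > (target_major, target_minor + grace_minor):
            overdue.append(finding_id)
    return missing_target, overdue
```

Entries in `missing_target` are the ones the plan reclassifies as suppressions; entries in `overdue` are candidates for a fix or an explicit re-targeting.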
| Phase | Rollback |
|---|---|
| 0 | None needed — pure measurement. |
| 1 | Revert the infra.yml changes. tfsec returns; trivy/checkov/tflint scope on dns disappears. |
| 2 | Delete the baseline files and BASELINE.md. CI continues to soft-fail (no behavioural change since Phase 1 was already soft-fail). |
| 2.5 | Each fix PR is independently revertable. Reverting them all returns the codebase to its Phase 2 state with the original baseline. Generally we wouldn’t roll these back wholesale — the fixes are correctness improvements regardless of whether the gate flips. |
| 3 | Re-add soft_fail: true (or equivalent per tool) and remove the drift-detection step. The baseline files stay; gates revert to noise. |
| 4 | n/a (continuous). Individual tool-version bumps revert as normal PRs. |
The window of weakened security posture is Phase 1–3, during which tfsec’s hard gate is gone but the new gates aren’t yet enforcing. Bound this to ~2 weeks of calendar time and avoid scheduling it across a code-freeze or release-cut window.
- `.github/workflows/infra.yml`: Trivy job replaces tfsec on the infra stage; Checkov and tflint pinned and corrected; SARIF uploads added on every scanner; drift-detection step added in Phase 3. Sibling `checkov-dns` / `tflint-dns` / `trivy-dns` jobs added against the bootstrap stage and wired into `bootstrap_apply`’s `needs` so the bootstrap apply gates on them the same way the infra apply gates on the infra-stage scanners.
- `.github/scripts/baseline-diff.py`: new helper, one place per tool. Reads the tool’s current JSON output, reads the baseline, diffs, exits non-zero on either “new finding not in baseline” or “baseline entry not in current findings.”
- `.github/scripts/check-suppression-justifications.sh`: new helper, grep-based.
- `.github/dependabot.yml` (or Renovate config if we adopt it): pin-bump rules for `bridgecrewio/checkov-action`, `aquasecurity/trivy-action`, `terraform-linters/setup-tflint`, and the tflint AWS ruleset version inside `.tflint.hcl`.
- `Makefile` (new) or `.github/scripts/scan-local.sh`: local parity runner.

- `BASELINE.md` shows decreasing row counts at each tagged release through 1.0.0.
- A developer with `mise` (or whatever we standardise on) installed runs `make scan` in under 60 seconds and produces the same pass/fail verdict as CI for the same revision.

- […] `# tflint-ignore` comments rather than a baseline file. For findings on generated code or where we don’t want comment churn, we may need a JSON wrapper around `tflint --format json` filtered by a checked-in allowlist. Decide during Phase 2.
- […] `--severity HIGH,CRITICAL`. Do we let MEDIUM findings through silently, baseline them, or fail on them? Recommendation: fail on all severities so the gate works as a hygiene tool, not just a security tripwire, but acknowledge this widens the Phase 2 baseline considerably. Revisit after Phase 0’s snapshot.
- `trivy image` against the three docker tier images is a follow-on with its own findings/baseline cycle. Not in scope here, but worth flagging to schedule for 0.9.x once IaC gates have soaked.
- […] `npm audit` / Dependabot already cover this).