Cabalmail

Host your own email and enhance your privacy

View the Project on GitHub cabalmail/cabal-infra

SMTP Sinkhole Test Harness Plan

Context

Cabalmail has no controllable way to produce a transient SMTP error against smtp-out on demand. The natural sources of 4xx responses - greylisting from real remote MTAs, transient DNS failures, rate-limited send paths - are either unreliable or only reproducible by accident. This becomes acute when validating queue-persistence behavior (see smtp-out-queue-persistence-plan.md): the phase 3 acceptance criteria require a queued message that survives task replacement, and the test sequence assumed in that plan does not exist as written.

Sending to a non-existent recipient on a live mail domain produces 550 5.1.1 User unknown - a permanent failure that bounces immediately rather than queueing. Sending to a domain with no MX produces 5.1.2 host not found - also permanent. Both are the wrong shape for queue-persistence testing, which needs a deferred message that sits in /var/spool/mqueue long enough for an operator to force a task replacement and observe handoff.
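The distinction that matters here is the first digit of the SMTP reply code: under RFC 5321, a 4yz reply is a transient failure that the sending MTA queues and retries, while a 5yz reply is a permanent failure that bounces. The following is a minimal sketch of that classification. The function name `classify_reply` is hypothetical and not part of Cabalmail, and the 451 line in the usage loop is an illustrative greylisting-style reply rather than one quoted from a real server.

```python
def classify_reply(reply: str) -> str:
    """Classify an SMTP reply line by the first digit of its three-digit code."""
    code = reply.strip()[:3]
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not an SMTP reply line: {reply!r}")
    kinds = {
        "2": "success",
        "3": "intermediate",
        "4": "transient",   # sender keeps the message queued and retries
        "5": "permanent",   # sender gives up and bounces
    }
    if code[0] not in kinds:
        raise ValueError(f"unexpected reply class: {reply!r}")
    return kinds[code[0]]


if __name__ == "__main__":
    for line in ("550 5.1.1 User unknown", "451 4.7.1 Please try again later"):
        print(line, "->", classify_reply(line))
```

Only a reply in the "transient" class leaves a message in the queue, which is why neither of the two failures above is usable for this test.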

This plan adds a feature-flagged sinkhole ECS tier: a tiny SMTP listener that returns operator-configurable responses to every RCPT TO, reachable from smtp-out via Cloud Map plus a sendmail mailertable override. It is permanent infrastructure in the dev and stage environments, never enabled in prod, and is the test fixture that makes deferred-retry scenarios reproducible.
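To make the listener's role concrete, here is a rough sketch of the per-connection logic such a sinkhole might implement: a line-at-a-time state machine whose RCPT TO reply depends on the current mode. It is an illustration under stated assumptions, not the listener.py this plan will add. The class name `Session`, the reply texts, and the fall-back-to-defer behavior are all hypothetical; only the mode names defer and accept come from this plan. Socket handling, TLS, and timeouts are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed reply text; only the mode names "defer" and "accept" come from the plan.
RCPT_REPLIES = {
    "defer": "451 4.3.0 Sinkhole is deferring; try again later",
    "accept": "250 2.1.5 Recipient OK",
}


@dataclass
class Session:
    """Per-connection SMTP state; feed it one client line at a time."""

    mode: str
    accepted_rcpts: int = 0
    in_data: bool = False

    def handle(self, line: str) -> Optional[str]:
        if self.in_data:
            if line.rstrip("\r\n") == ".":
                self.in_data = False
                self.accepted_rcpts = 0
                return "250 2.0.0 Message accepted and discarded"
            return None  # message body lines get no reply
        verb = line.strip().upper()
        if verb.startswith(("HELO", "EHLO")):
            return "250 sinkhole"
        if verb.startswith("MAIL FROM:"):
            self.accepted_rcpts = 0
            return "250 2.1.0 Sender OK"
        if verb.startswith("RCPT TO:"):
            # Unknown modes fall back to "defer" (an assumption: fail toward queueing).
            reply = RCPT_REPLIES.get(self.mode, RCPT_REPLIES["defer"])
            if reply.startswith("2"):
                self.accepted_rcpts += 1
            return reply
        if verb == "DATA":
            if self.accepted_rcpts == 0:
                return "554 5.5.1 No valid recipients"
            self.in_data = True
            return "354 End data with <CR><LF>.<CR><LF>"
        if verb == "RSET":
            self.accepted_rcpts = 0
            return "250 2.0.0 OK"
        if verb == "QUIT":
            return "221 2.0.0 Bye"
        return "502 5.5.1 Command not implemented"
```

In defer mode every recipient gets a 4xx reply, so the sending side never reaches DATA and the message stays in its queue. In accept mode the same message is accepted and discarded on the next retry.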

The first concrete use is queue-persistence phase 3 validation. Subsequent uses (DSN handling, large-message timeouts, STARTTLS-fallback behavior, multi-runner coordination once confMIN_QUEUE_AGE ships in phase 4) reuse the same fixture by toggling its response mode at runtime.

Goals

Non-goals

Current state (audit)

Target state

New tier: sinkhole

smtp-out integration

A single mailertable line is activated when SINKHOLE_ENABLED=true is present in the smtp-out task’s environment:

sinkhole.test       smtp:[sinkhole.cabal.internal]:25

.test is reserved by RFC 2606 and never resolves on the public internet, so a leaked packet has no escape path. The mailertable bracket-and-port syntax skips MX lookup entirely - smtp-out connects directly to the Cloud Map name. Cloud Map resolves it to the sinkhole task’s ENI IP via the private VPC resolver.

generate-config.sh is extended with one conditional block that appends the line when the env var is set. The line is regenerated on every SNS-triggered reconfigure for free; no Docker rebuild needed if the flag flips at runtime.

When var.sinkhole = true, Terraform sets SINKHOLE_ENABLED=true in the smtp_out task definition’s environment list. When false, the env var is omitted and generate-config.sh skips the mailertable line.

Behavior when the flag is off

var.sinkhole = false is the default. With the flag off:

Prod safety

Two independent guardrails:

  1. GitHub Actions variable. vars.TF_VAR_SINKHOLE is set per environment. Prod’s value is fixed at false and never changed.
  2. Terraform precondition. The sinkhole task definition carries a precondition that fails the plan if the resolved environment is prod and the flag is true:

    precondition {
      condition     = !(var.sinkhole && var.environment == "prod")
      error_message = "Sinkhole tier must never run in prod."
    }
    

    The var.environment value is derived from the branch in the existing infra.yml flow; the precondition fires at plan time, before any resource is touched.

Belt and suspenders: even an accidental TF_VAR_SINKHOLE=true override on a prod apply fails the plan.

Behavior toggling

Operator workflow during a test session:

aws ssm put-parameter --name /cabal/sinkhole_mode --value defer  --overwrite
# (drive smtp-out, observe queue)
aws ssm put-parameter --name /cabal/sinkhole_mode --value accept --overwrite
# (next retry delivers; queue drains)

The listener consults the parameter on every connection through a 30 s cache, so a mode change takes effect within 30 s. No task restart, no reconfigure SQS hop, no DNS change. The 30 s cache exists only to bound the SSM API call rate; for a test session that wants tighter coupling, restart the sinkhole task to clear the cache.
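The caching behavior described above can be sketched as a small time-to-live wrapper around whatever function fetches the parameter. This is an assumption about how the listener could be structured, not its actual code. `CachedMode` and its arguments are hypothetical names, and the fetch function is injected so the sketch runs without AWS access. In a real listener it would presumably wrap a boto3 SSM `get_parameter` call for /cabal/sinkhole_mode.

```python
import time
from typing import Callable, Optional


class CachedMode:
    """Return a fetched value, refetching only after `ttl` seconds have passed."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # e.g. a function that reads the SSM parameter
        self._ttl = ttl
        self._clock = clock      # injectable so tests can control time
        self._value: Optional[str] = None
        self._expires = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value
```

Calling `get()` once per connection keeps the SSM call rate at no more than one request per TTL window, regardless of how many connections arrive. A freshly started process has an empty cache, which is why restarting the sinkhole task picks up a new mode immediately.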

Quiesce behavior

The sinkhole tier wires into var.quiesced the same way other tiers do: when quiesced, desired count goes to zero; coming out of quiesce, it goes back to 1. The Cloud Map service itself persists while quiesced, but with no instances registered, so the mailertable entry on smtp-out points at a name that returns no answers, which surfaces as a connection failure on the smtp-out side. That is the right shape for “sinkhole offline” test scenarios.

Migration sequence

One PR per phase, in order. Each phase is independently apply-able and each phase’s rollback is the previous phase.

  1. Plan doc. This PR. No code, no infrastructure.
  2. Image. Add docker/sinkhole/ with a Dockerfile and listener.py. Wire into the app.yml docker matrix gated on vars.TF_VAR_SINKHOLE. The ECR repo is not created until phase 3 (Terraform-side), so phase 2 alone cannot push: this PR introduces the buildable image and leaves it dormant.
  3. ECR + flag. Add var.sinkhole and var.environment variables to terraform/infra/variables.tf. Add the cabal-sinkhole ECR repo with prevent_destroy. No tier resources yet. Plumb vars.TF_VAR_SINKHOLE and vars.TF_VAR_ENVIRONMENT through infra.yml. Phase 3 is safe to apply with the flag off everywhere.
  4. Tier. Add the sinkhole task definition, ECS service, Cloud Map registration, SSM parameter, and orphan-reconciliation terraform_data to terraform/infra/modules/ecs/. Gated on var.sinkhole. Carry the prod-refusal precondition. Apply in dev first; flag stays off in stage and prod.
  5. smtp-out integration. Add the conditional mailertable line in generate-config.sh. Add the SINKHOLE_ENABLED env var to the smtp_out task definition, gated on var.sinkhole. After this lands and var.sinkhole = true is set in stage, an smtp-out reconfigure activates the route. Verify by sending a test message addressed to nobody@sinkhole.test from inside stage and observing the queued result.
  6. Runbook. Add docs/testing/queue-persistence.md with the step-by-step test sequence: enable in stage, set mode defer, send a message, force-replace an smtp-out task, observe handoff, set mode accept, wait for delivery, then set mode back to defer (or scale sinkhole to zero) when done. Document the SSM commands, the mailq inspection via aws ecs execute-command, and the cleanup checklist.

Per-environment ordering

Take dev end-to-end through phase 6, then stage. prod does not progress past phase 3 (the ECR repo is harmless to create; the tier itself is permanently gated off).

Rollback

| Step | Rollback |
| --- | --- |
| Plan doc (1) | None needed - text only. |
| Image (2) | Delete the image directory. No infrastructure was created. |
| ECR + flag (3) | Terraform-side: delete the ECR resource (the prevent_destroy will block; remove the lifecycle clause first, then delete). The variables themselves can stay - they default to safe values. |
| Tier (4) | Set var.sinkhole = false in the target environment’s GitHub variables and apply. Terraform destroys the tier cleanly; Cloud Map registration drains via the orphan-reconciliation pattern. Image and ECR repo persist. |
| smtp-out integration (5) | Revert generate-config.sh and remove the env var from the task def. A subsequent smtp-out reconfigure drops the mailertable line. |
| Runbook (6) | Delete the runbook file. |

Operational considerations

Acceptance

Open questions

Out of scope