Cabalmail

Host your own email and enhance your privacy

SMTP-OUT Queue Persistence Plan

Context

The smtp-out tier accepts authenticated submissions on 465/587, signs with DKIM, and hands off to remote MTAs via sendmail. When sendmail can’t deliver immediately — the most common cause is a remote 4xx (greylisting, rate limit, transient DNS, recipient deferral) — the message lands in /var/spool/mqueue/ and the in-process queue runner retries on a -q15m cadence with a confTO_QUEUERETURN bounce horizon of 4 days (see out-sendmail.mc:9).

Today that queue lives on the container’s ephemeral writable layer. ECS replaces tasks for ordinary reasons — image deploys, host draining, scale-in events, EC2 instance recycling — and any queued message in a replaced task is silently lost. The user never sees a bounce, the recipient never sees the mail, and the only signal is the absence of the eventual delivery. Greylisting in particular guarantees a deferral on first contact with most well-configured remote MTAs, so the window of exposure is not hypothetical: it overlaps with every deploy.

This plan persists the sendmail MTA queue on a new EFS access point on the existing mailstore filesystem, mounted by every smtp-out task. With the queue on shared storage, a replaced task hands off its in-flight retries to whichever sibling task next scans the queue, and a freshly-launched task picks up where its predecessor left off. Sendmail’s classic shared-NFS queue pattern (multiple MTAs running queue runners against one spool, coordinated by fcntl locks on each qf* file) provides the correctness guarantee.

Goals

Non-goals

Current state (audit)

Target state

EFS access point

A new access point on the existing mailstore filesystem, scoped to /smtp-queue:

resource "aws_efs_access_point" "smtp_queue" {
  file_system_id = aws_efs_file_system.mailstore.id

  root_directory {
    path = "/smtp-queue"
    creation_info {
      owner_uid   = 0      # root
      owner_gid   = 12     # mail group on AL2023 sendmail packaging
      permissions = "0700"
    }
  }

  tags = {
    Name = "cabal-smtp-queue"
  }
}

No POSIX user override. Sendmail manages ownership across the qf (control), df (data), xf (transcript), and tf (temp) files itself; an enforced uid/gid on the access point would break the privilege drops between the listener (root) and queue runner. The access point only enforces the root-directory boundary and the initial creation owner; sendmail’s own perms govern the rest.

The owner_gid = 12 matches the AL2023 sendmail rpm default for /var/spool/mqueue (root:mail, mode 0700). Verified in image: getent group mail yields mail:x:12:, getent group smmsp yields smmsp:x:51:, and /var/spool/mqueue ships owned root:mail mode 0700. smmsp owns the client submission queue (/var/spool/clientmqueue), which we do not persist - see Non-goals. An earlier draft of this plan used 25 (smmsp’s historic gid under older RPM packaging) and 0750; both are corrected here.

ECS task-definition changes

In task-definitions.tf, the smtp_out task definition gains a volume block and the container gains a mountPoints entry:

resource "aws_ecs_task_definition" "smtp_out" {
  # ... existing fields ...

  container_definitions = jsonencode([{
    # ... existing fields ...
    stopTimeout = 120

    mountPoints = [{
      sourceVolume  = "smtp-queue"
      containerPath = "/var/spool/mqueue"
    }]
  }])

  volume {
    name = "smtp-queue"
    efs_volume_configuration {
      file_system_id     = var.efs_id
      transit_encryption = "ENABLED"
      authorization_config {
        access_point_id = var.smtp_queue_access_point_id
        iam             = "DISABLED"
      }
    }
  }
}

iam = "DISABLED" matches the IMAP mount’s posture today (no IAM auth on EFS). The access point itself is the path/uid boundary; IAM auth is a defense-in-depth layer we can add later for both mounts in one pass. transit_encryption = "ENABLED" is the safe default and has negligible perf impact on small files.

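stopTimeout = 120 is the per-container grace window ECS allows between SIGTERM and SIGKILL. 120 seconds is the maximum AWS documents for Fargate; on the EC2 launch type an unset stopTimeout falls back to the agent’s ECS_CONTAINER_STOP_TIMEOUT, so setting it explicitly keeps the window predictable. Combined with the supervisord change below, this gives sendmail up to ~110 seconds to finish an in-flight delivery before SIGKILL.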

The efs module exposes the new access point id as an output (smtp_queue_access_point_id); the root module wires it through to the ecs module.
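
The wiring described above might look roughly like the following. This is a sketch only: the output name and the resource address come from this plan, while the module label "ecs" and the variable description are assumptions for illustration.

```hcl
# In the efs module: expose the access point id to callers.
output "smtp_queue_access_point_id" {
  description = "ID of the EFS access point backing the smtp-out sendmail queue"
  value       = aws_efs_access_point.smtp_queue.id
}

# In the root module: pass the id through to the ecs module.
# The module labels "efs" and "ecs" are assumed names.
module "ecs" {
  # ... existing source and arguments ...
  smtp_queue_access_point_id = module.efs.smtp_queue_access_point_id
}

# In the ecs module: declare the input consumed by the volume block.
variable "smtp_queue_access_point_id" {
  type        = string
  description = "EFS access point id mounted at /var/spool/mqueue by smtp-out"
}
```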

One-shot replacement. The smtp-out task definition already carries lifecycle { ignore_changes = [container_definitions] } (added in phase 1 of docs/0.9.x/build-deploy-simplification-plan.md to protect out-of-band image-tag updates from topology-only Terraform applies). Adding mountPoints and stopTimeout inside container_definitions to a steady-state task def would be silently ignored. Phase 3 of this plan introduces a small marker resource (terraform_data.smtp_out_taskdef_revision_marker with input = "smtp-queue-mount-v1") and adds replace_triggered_by = [terraform_data.smtp_out_taskdef_revision_marker] to the task-def’s lifecycle block. Replacement forces a fresh create, which is not governed by ignore_changes, so the new revision picks up the full configured container_definitions (mountPoints, stopTimeout) plus the new volume block. The marker stays in state after the first apply; subsequent applies behave as before. If we ever need to push another topology-only change through the same gate, bump the input string (e.g. smtp-queue-mount-v2).
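
A minimal sketch of that gate, assuming the marker lives in the same module as the task definition. The resource name and input string are the ones named above; the elided fields are unchanged existing configuration.

```hcl
# Marker resource: a change to `input` is a planned update, which is
# enough for replace_triggered_by to force replacement of the task def.
# terraform_data is built in from Terraform 1.4 onward.
resource "terraform_data" "smtp_out_taskdef_revision_marker" {
  # Bump this string (e.g. "smtp-queue-mount-v2") to push another
  # topology-only change through the ignore_changes gate.
  input = "smtp-queue-mount-v1"
}

resource "aws_ecs_task_definition" "smtp_out" {
  # ... existing fields, container_definitions, and volume block ...

  lifecycle {
    ignore_changes       = [container_definitions]
    replace_triggered_by = [terraform_data.smtp_out_taskdef_revision_marker]
  }
}
```

Keeping the trigger in a dedicated marker, rather than dropping ignore_changes, leaves out-of-band image-tag updates protected on every apply that does not bump the string.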

Container runtime changes

Sendmail .mc change

In out-sendmail.mc, add:

define(`confMIN_QUEUE_AGE', `5m')dnl

This sets the minimum age before a queued message is eligible for a fresh delivery attempt by any queue runner. With multiple smtp-out tasks each running -q15m, a freshly-enqueued message would otherwise be eligible for a second attempt within seconds of acceptance. 5 minutes is conservative — enough to avoid thundering-herd retries against a remote MTA that just deferred us, short enough that a real outbound after a transient blip still goes out promptly.

confTO_QUEUERETURN=4d is left as-is. The bounce horizon was already chosen for “messages can sit deferred for days”; persistent queue doesn’t change the rationale, it just makes the existing 4-day window meaningful where today it’s effectively capped at the deploy cadence.

Concurrency and locking

Sendmail’s queue-runner concurrency is per-qf file: each candidate message is locked via fcntl(F_SETLK) on its control file before delivery is attempted. EFS supports NFSv4 byte-range locks, so this works across mount points and across hosts. The “shared NFS mqueue” pattern was the canonical way to scale sendmail before everyone moved to commercial MTAs, and it’s documented in the sendmail op.me operations guide.

Three failure modes worth naming explicitly, with how each is handled:

  1. Two tasks pick the same message simultaneously. Both fcntl the qf, one wins, the loser logs lost lock and moves on. No double-delivery. Standard.
  2. A task dies mid-delivery, holding a lock. NFSv4 lock state is lease-based: once the dead task’s client stops renewing its lease, the server expires the lease and releases the locks that client held. A surviving task picks up the orphaned qf on a later scan. Worst case: the message is delivered twice if the dying task already handed the message off to the remote MTA but didn’t get to delete the qf. This is identical to the failure mode of any persistent queue under host loss, and is bounded by the same idempotency the recipient MTA already needs (Message-ID-based dedup).
  3. A task dies mid-write, leaving a partial tf or df. Sendmail’s startup queue scan ignores files that don’t pair (qf without df, or tf not yet renamed to qf); they age out via the temp-file cleanup or get picked up on the next full scan. No corruption.

Migration sequence

One PR per phase, in order. Each phase is independently apply-able and each phase’s rollback is the previous phase.

  1. Verify the smmsp / mail gid. Done out-of-band against an amazonlinux:2023 image with the same dnf install sendmail invocation the smtp-out Dockerfile uses. Result: smmsp is gid 51, mail is gid 12, and the rpm ships /var/spool/mqueue as root:mail mode 0700. The MTA queue (the one we persist) belongs to the mail group, not smmsp; smmsp owns the client submission queue (/var/spool/clientmqueue) which is out of scope here.
  2. Add the EFS access point. Terraform-only PR: new aws_efs_access_point.smtp_queue resource and module output. No mount yet, no behavioural change. The access point creates /smtp-queue on the filesystem with the correct ownership.
  3. Mount the queue and bump timeouts. PR with three coordinated changes:
    • task-definitions.tf: add the volume and mountPoints blocks, add stopTimeout = 120.
    • smtp-out/supervisord.conf: stopwaitsecs=15 → 110.
    • shared/sendmail-wrapper.sh: defensive chown/chmod.

    On apply, ECS rolls the smtp-out service. Each new task mounts the (empty) shared queue. The first deploy effectively starts the persistent-queue era with a clean slate; any messages already queued in the previous task’s ephemeral mqueue are lost in this one transition — same failure mode as any deploy today, no worse.

  4. Add confMIN_QUEUE_AGE. Single-line .mc change, triggers a docker rebuild and a fresh service rollout. With the persistent queue already in place, the MIN_QUEUE_AGE is the last bit of multi-runner coordination tuning.
  5. Soak. One week minimum at the new posture, watching for: mailq depth on each task agreeing (proves shared mount works), no lost lock storms in CloudWatch Logs (proves fcntl semantics work over EFS), no perms errors on qf writes (proves the access-point creation_info matched the root:mail ownership sendmail expects).
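
One way to make the lost lock watch item in the soak phase observable is a CloudWatch Logs metric filter. The sketch below is illustrative, not part of the plan: the variable, filter name, metric name, and namespace are all hypothetical, and the pattern assumes the phrase appears in the logs exactly as "lost lock".

```hcl
# Hypothetical input: the CloudWatch log group the smtp-out tasks write to.
variable "smtp_out_log_group_name" {
  type = string
}

# Emits a metric value of 1 for each log event containing the phrase
# "lost lock", so a storm shows up as a metric spike during the soak.
resource "aws_cloudwatch_log_metric_filter" "smtp_out_lost_lock" {
  name           = "smtp-out-lost-lock"       # hypothetical name
  log_group_name = var.smtp_out_log_group_name
  pattern        = "\"lost lock\""            # quoted phrase match

  metric_transformation {
    name      = "SmtpOutLostLock"             # hypothetical metric name
    namespace = "Cabal/SmtpOut"               # hypothetical namespace
    value     = "1"
  }
}
```

Occasional matches are expected whenever two runners race for the same qf; the signal worth alarming on is a sustained rate.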

Per-environment ordering

Run dev end-to-end through phase 5, then stage, then prod. The access point is cheap to create in advance across all three (phase 2 can fan out), but the mount/timeout change (phase 3) is the breakable one and should bake on dev for at least a few deploys before promotion.

Rollback

Step | Rollback
Verify gid (1) | None needed: read-only.
Access point (2) | Delete the aws_efs_access_point resource. The /smtp-queue directory remains on the filesystem; harmless.
Mount + timeouts (3) | Revert the task-definition, supervisord, and wrapper changes, AND bump terraform_data.smtp_out_taskdef_revision_marker.input (e.g. smtp-queue-mount-v1 -> smtp-queue-rollback-v1). The bump is required: removing the volume block alone would otherwise leave mountPoints stranded inside the ignored container_definitions, registering a revision that references a non-existent volume. Forcing replacement makes Terraform rebuild the resource from the rolled-back config in full. ECS rolls back to ephemeral queue. Any messages in the persistent queue at rollback time are stranded - manually copy them out of the EFS mount on a one-off basis if necessary, or let the new ephemeral queue accept replacements as users retry sending.
MIN_QUEUE_AGE (4) | Single-line revert. No state implication.

Operational considerations

Acceptance

Open questions

Out of scope for 0.9.0