Cabalmail runs its mail tiers (and the Image Builder build instances) in private subnets. Their only path to the internet and to AWS service APIs is through the VPC’s NAT. This document covers the two supported NAT modes, how to stand either one up in a new environment, and how to diagnose the one failure mode that takes the whole data plane down with it.
There are no VPC endpoints. Every call a private-subnet container makes - DynamoDB, S3, SSM, ECR, Cognito, SQS/SNS, CloudWatch Logs, and outbound SMTP on port 25 - egresses through the NAT. If NAT egress breaks, all of the following break at once, even though the instances keep “running”:

- The /send Lambda hangs and times out (its submission to smtp-out blocks on
  smtp-out’s now-unreachable outbound delivery).
- Logs stop shipping (the awslogs driver cannot reach the Logs endpoint), so
  the tiers go silent in CloudWatch.

Treat any “NAT instance replaced” line in a Terraform plan as a brief egress outage and apply it in a maintenance window.
NAT runs in one of two first-class, indefinitely supported modes, selected per
environment by use_nat_instance (a GitHub Environment variable,
TF_VAR_USE_NAT_INSTANCE, echoed into tfvars by infra.yml; defaults to
true):

| Mode | use_nat_instance | What | Who it’s for |
|---|---|---|---|
| NAT instances | true (default) | One EC2 instance per AZ from a custom AL2023 AMI baked by EC2 Image Builder | Cheapest; small / personal / family deployments |
| NAT Gateway | false | One AWS-managed NAT Gateway per AZ; no AMI, no OS | Commercial / at-scale operators, or anyone preferring managed over cheap |

Approximate us-east-1 monthly cost (the reason instances are the small-scale default): 2x t3.micro instances ~$15 (no per-GB); 1 NAT Gateway ~$33 + ~$0.045/GB; 2 NAT Gateways (per-AZ HA) ~$65 + per-GB. At four-user scale a gateway is roughly half the run-rate; at commercial volume the managed reliability wins.
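
The arithmetic behind those figures can be sketched as a small shell function. This is an illustration only: `nat_monthly_cost` is a hypothetical helper, the ~$7.50 per-instance figure is an assumption derived by halving the ~$15 quoted for two instances, and the rates are the approximate ones from the paragraph above, not live AWS pricing.

```shell
# Rough monthly NAT cost in USD from the approximate us-east-1 figures above.
# Usage: nat_monthly_cost <instance|gateway> <count> <gb_per_month>
# Assumption: ~$7.50 per t3.micro (half of the ~$15 quoted for two).
nat_monthly_cost() {
  local mode="$1" count="$2" gb="$3"
  awk -v mode="$mode" -v n="$count" -v gb="$gb" 'BEGIN {
    if (mode == "instance") {
      cost = n * 7.5                 # flat, no per-GB charge
    } else if (mode == "gateway") {
      cost = n * 33 + gb * 0.045     # ~$33 per gateway plus ~$0.045 per GB
    } else {
      exit 1                         # unknown mode
    }
    printf "%.2f\n", cost
  }'
}

nat_monthly_cost instance 2 50   # prints 15.00
nat_monthly_cost gateway 1 50    # prints 35.25
nat_monthly_cost gateway 2 50    # prints 68.25
```

The two-gateway result lands slightly above the ~$65 quoted above because the inputs are rounded; treat the output as an order-of-magnitude comparison.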
Both modes reuse the same Elastic IPs (aws_eip.nat_eip, one per AZ).
These are the stable outbound source IPs for mail: nat.tf maintains a
public smtp.<control-domain> A record over them (the forward record that
the EIPs’ reverse DNS is validated against - see aws_eip_domain_name),
they are what you allow-list for the port-25 block (see below), and they are
what your SPF records authorize. Switching modes does not change them, and
they are preserved across quiesce, so allow-lists never need re-issuing.
The resources live in terraform/infra/modules/vpc/nat.tf:

- One Elastic IP per AZ (aws_eip.nat_eip), shared by both modes.
- Instance mode: one EC2 instance per AZ from the cabal-nat-al2023-* AMI, with
  source_dest_check = false, in the public subnets, each the 0.0.0.0/0 target
  of its AZ’s private route table.
- Gateway mode: one aws_nat_gateway per AZ in the public subnets, holding the
  same EIPs, each the 0.0.0.0/0 target of its AZ’s private route table.

| Variable | Where | Default | Purpose |
|---|---|---|---|
| use_nat_instance | root + vpc module var (TF_VAR_USE_NAT_INSTANCE) | true | NAT instances vs. NAT Gateways. See “The two modes”. |
| build_nat_ami | root + vpc module var (TF_VAR_BUILD_NAT_AMI) | true | Whether the Image Builder pipeline that bakes the NAT AMI exists. Independent of the egress mode; set false only in a pure-gateway environment that will never run instances. |
| nat_instance_type | vpc module var | t3.micro | NAT instance size. x86_64 - the custom-AMI pipeline matches this arch. |
| region | vpc module var (from var.aws_region) | n/a | Used to build the Image Builder managed-image ARN. |
| quiesced | root + vpc module var | false | Scales NAT (instances or gateways) to zero (non-prod cost saving). EIPs are kept. See quiesce.md. |

A NAT instance needs a userspace firewall tool to install the masquerade
(SNAT) rule that makes it a NAT, and AL2023’s base AMI ships none (neither
nftables nor iptables) - and a boot-time install is fragile: if it fails,
the instance forwards without SNAT and all private-subnet egress silently
breaks. So instance mode
always launches from a custom AMI: an EC2 Image Builder pipeline
(nat_ami.tf +
nat-nftables-component.yaml)
bakes nftables, the masquerade ruleset, ip_forward, and an enabled
nftables.service into an image named cabal-nat-al2023-*. Instances launched
from it come up as a working NAT with no boot-time install.
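
The three things the bake is described as providing (ip_forward, a masquerade rule, an enabled nftables.service) can be spot-checked on a running NAT instance. The following is a minimal sketch, not part of the repository: `nat_instance_check` is a hypothetical name, and it assumes a shell on the instance with sudo available.

```shell
# Run on a NAT instance. Returns 0 only if all three NAT prerequisites hold.
# Hypothetical helper for illustration; not shipped with the AMI.
nat_instance_check() {
  local failed=0
  # Kernel must forward packets between interfaces.
  if [ "$(sysctl -n net.ipv4.ip_forward)" = "1" ]; then
    echo "ok: ip_forward enabled"
  else
    echo "FAIL: ip_forward disabled"; failed=1
  fi
  # A masquerade (SNAT) rule must be loaded, or traffic is forwarded un-NATed.
  if sudo nft list ruleset | grep -q masquerade; then
    echo "ok: masquerade rule loaded"
  else
    echo "FAIL: no masquerade rule (forwarding without SNAT)"; failed=1
  fi
  # The service must be enabled so the ruleset survives a reboot.
  if systemctl is-enabled --quiet nftables; then
    echo "ok: nftables.service enabled"
  else
    echo "FAIL: nftables.service not enabled"; failed=1
  fi
  return "$failed"
}
```

Calling `nat_instance_check` prints one line per check and exits non-zero if any of them fails, which makes it usable as a quick gate after replacing an instance.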
The chicken-and-egg this creates - the pipeline’s build instance needs egress, but instance-mode egress needs the AMI the pipeline produces - is resolved by bootstrapping a new instance-mode environment through a NAT Gateway (below).
data.aws_ami.custom_nat (the lookup the NAT instances launch from) hard-fails
when no cabal-nat-al2023-* AMI exists. That error is deliberate: it is the
guard that stops you flipping an environment to instance mode before the first
AMI has been built.
A “new environment” is a new AWS account / GitHub Environment / branch with its own
infra Terraform state.
Set TF_VAR_USE_NAT_INSTANCE = false on the GitHub Environment and apply.
There is no step two; the gateways and routes come up in the first apply.
Optionally set TF_VAR_BUILD_NAT_AMI = false as well if the environment will
never run NAT instances, to skip building an AMI it will never use.
Then clear the port-25 block (step 3 below).
A fresh environment has no custom NAT AMI yet, so it cannot start on instances. Bootstrap is a deliberate double-apply:

1. Set TF_VAR_USE_NAT_INSTANCE = false and let infra.yml apply. NAT Gateways
   provide egress; the Image Builder pipeline (present because build_nat_ami
   defaults to true) can now reach the internet through them.
2. Run the “Build NAT AMI” workflow (nat_ami_build.yml) from the environment’s
   branch - it triggers the pipeline and waits for the image. Or, with local
   AWS credentials:

       aws imagebuilder start-image-pipeline-execution \
         --image-pipeline-arn "$(aws imagebuilder list-image-pipelines \
           --query "imagePipelineList[?name=='cabal-nat-al2023'].arn | [0]" --output text)"

   Wait ~15-20 min, then confirm an AMI named cabal-nat-al2023-* is
   available and carries the Role=cabal-nat tag (the tag is applied
   only after the build’s test stage passes, so it is the signal that the
   image is actually usable):

       aws ec2 describe-images --owners self \
         --filters "Name=name,Values=cabal-nat-al2023-*" "Name=tag:Role,Values=cabal-nat" \
         --query 'reverse(sort_by(Images,&CreationDate))[].[Name,ImageId,State]' --output table

3. Set TF_VAR_USE_NAT_INSTANCE = true and apply. Terraform creates the NAT
   instances from the new AMI, repoints the private routes at them, and
   destroys the gateways. Expect a brief per-AZ egress blip while the EIPs
   move; for a bootstrap (nothing running yet) this is a non-event, but for a
   mode switch on a live environment do it in a window. Expect to apply
   twice - see “The gateway-to-instance cutover takes two applies” below.

The relay_ips output lists your NAT EIPs. AWS blocks outbound port 25 by
default; request removal via the rdns-limits form.
See Post-Automation Steps in setup.md. Because
both modes use the same EIPs, this never needs redoing - including across mode
switches.
See “Verifying egress” below. Confirm private-subnet egress works before relying on anything else - the rest of the mail system depends on it.
Either direction is a single variable flip (TF_VAR_USE_NAT_INSTANCE) and an
apply in a maintenance window; expect a few minutes of egress loss per AZ while
the EIPs and routes move. Instances -> gateway needs nothing else. Gateway ->
instances requires a cabal-nat-al2023-* AMI to exist (build one through the
gateway first, exactly like bootstrap step 2); the data.aws_ami.custom_nat
hard-error stops the apply if you forget.
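
Before flipping gateway -> instances, the AMI precondition can be checked up front rather than discovered at apply time. A minimal sketch, assuming local AWS credentials for the environment’s account and region; `nat_ami_ready` is a hypothetical helper, and the name and tag filters are the ones used in the bootstrap check.

```shell
# Prints the newest usable NAT AMI ID and returns 0, or returns non-zero
# if none exists. Hypothetical helper; assumes AWS credentials are configured.
nat_ami_ready() {
  local ami
  ami="$(aws ec2 describe-images --owners self \
    --filters "Name=name,Values=cabal-nat-al2023-*" \
              "Name=tag:Role,Values=cabal-nat" \
              "Name=state,Values=available" \
    --query 'reverse(sort_by(Images,&CreationDate))[0].ImageId' \
    --output text)" || return 2          # the AWS call itself failed
  case "$ami" in
    ami-*) echo "$ami" ;;                # newest matching image
    *)     echo "no usable cabal-nat-al2023-* AMI found" >&2; return 1 ;;
  esac
}
```

For example, `nat_ami_ready && echo "safe to set TF_VAR_USE_NAT_INSTANCE = true"` only prints the go-ahead when a tagged, available image exists.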
Gateway mode is also the rollback path if the NAT instances themselves are misbehaving (e.g. a bad AMI build): flip to the gateway, fix or rebuild the AMI, flip back.
When flipping gateway -> instances (bootstrap step 3, or a live mode switch),
the first apply reliably fails at aws_eip_association.nat with a misleading
AuthFailure: You do not have permission to access the specified resource.
This is not an IAM problem. EC2 returns that error when associating an EIP
that a deleting NAT gateway still holds: Terraform starts the association as
soon as the NAT instance exists, while the gateway (whose deletion frees the
EIP) takes a few minutes to go away in parallel. There is no clean way to
order a create after an unrelated destroy in Terraform - depends_on orders
dependent creates before the dependency’s destroy, which would make the
failure deterministic rather than racy - so the retry is accepted as the cost
of a rare, deliberate operation. Everything else in the cutover (instance,
route repoint, gateway deletion) completes on the first apply; private-subnet
egress is down in the gap because the route already points at an instance
that does not have its public IP yet. Re-run the apply once the gateway
shows deleted; the association is the only remaining change and the second
apply converges in seconds. The instance -> gateway direction does not have
this race (the association is destroyed before the EIP is handed to the new
gateway, in correct dependency order).
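
One way to time the second apply is to poll until no NAT gateway is still deleting. This is a sketch under stated assumptions: `wait_for_nat_gateways_gone` is a hypothetical helper, the 15-second interval and 40 attempts are arbitrary defaults, and the query is region-wide, so an unrelated gateway being deleted in the same account and region would also hold it up.

```shell
# Poll until no NAT gateway in the current region is in the "deleting" state.
# Usage: wait_for_nat_gateways_gone [interval_seconds] [max_attempts]
# Hypothetical helper; defaults are arbitrary.
wait_for_nat_gateways_gone() {
  local interval="${1:-15}" attempts="${2:-40}" remaining i=0
  while [ "$i" -lt "$attempts" ]; do
    remaining="$(aws ec2 describe-nat-gateways \
      --filter "Name=state,Values=deleting" \
      --query 'length(NatGateways)' --output text)" || return 2
    if [ "$remaining" = "0" ]; then
      echo "no NAT gateways still deleting; safe to re-run the apply"
      return 0
    fi
    echo "waiting: ${remaining} NAT gateway(s) still deleting"
    sleep "$interval"
    i=$(( i + 1 ))
  done
  echo "timed out waiting for NAT gateway deletion" >&2
  return 1
}
```

Run it after the first apply fails at aws_eip_association.nat, then re-run the apply once it returns 0.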

1. In instance mode, the NAT instances are running, one per AZ:

       aws ec2 describe-instances --filters "Name=tag:Name,Values=cabal-nat-*" \
         "Name=instance-state-name,Values=running" \
         --query 'Reservations[].Instances[].[InstanceId,Placement.AvailabilityZone,State.Name]' --output table

   In gateway mode, the gateways are available:

       aws ec2 describe-nat-gateways --filter "Name=state,Values=available" \
         --query 'NatGateways[].[NatGatewayId,SubnetId,State]' --output table

2. Each private route table’s 0.0.0.0/0 points at a NAT instance ENI (instance
   mode) or a NAT gateway (gateway mode): aws ec2 describe-route-tables.
3. The smtp-out tier is still shipping logs:

       aws logs describe-log-streams --log-group-name /ecs/cabal-smtp-out \
         --order-by LastEventTime --descending --max-items 1 \
         --query 'logStreams[0].lastEventTimestamp'

   A timestamp within the last few minutes means egress is healthy. A timestamp frozen at some point in the past is the classic broken-egress symptom.
4. The imap/smtp-in/smtp-out tiers are healthy, and a test send completes in
   ~2-3 s (not 30-60 s).

The pipeline (cabal-nat-al2023) checks daily and builds only when the AL2023
base image actually has an update
(EXPRESSION_MATCH_AND_DEPENDENCY_UPDATES_AVAILABLE), so it tracks AL2023
security patches without churning no-op images. Builds are asynchronous and do
not roll the NAT on their own: nat.tf reads the latest AMI via
data.aws_ami.custom_nat (owners = ["self"], most_recent = true), so a fresh
build appears as a NAT replacement in the next plan and is adopted only when you
deliberately apply it. To force an off-schedule rebuild (e.g. an urgent CVE),
run the “Build NAT AMI” workflow (nat_ami_build.yml) from the environment’s
branch, or the start-image-pipeline-execution command from the bootstrap
steps above.
The build and test instances run in a private subnet, so a rebuild needs healthy egress (either mode) - if egress is down the build fails and the last-good AMI stays in place, a safe no-op.
Changing the bootstrap itself (the component YAML) requires bumping the
version on aws_imagebuilder_component.nat_nftables and
aws_imagebuilder_image_recipe.nat in nat_ami.tf - Image Builder component and
recipe versions are immutable.
Symptoms: sends time out at the /send Lambda; outbound mail queues instead
of delivering; the mail tiers go silent in CloudWatch (logs stop shipping
because the awslogs driver can’t reach the Logs endpoint); private-subnet
API calls hang.
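
The “silent in CloudWatch” symptom can be tested mechanically by comparing the newest log event timestamp against the current time. A minimal sketch: `egress_log_status` is a hypothetical helper, and the 600-second threshold is an assumed reading of “the last few minutes”, not a value from the repository.

```shell
# Classify a CloudWatch lastEventTimestamp (epoch milliseconds) as fresh or stale.
# Usage: egress_log_status <last_event_ms> [max_age_seconds] [now_epoch_seconds]
# Hypothetical helper; the 600 s default threshold is an assumption.
egress_log_status() {
  local last_ms="$1" max_age="${2:-600}" now="${3:-$(date +%s)}"
  # An empty value or "None" means the query found no stream or no events.
  case "$last_ms" in
    ''|None|null) echo "unknown"; return 2 ;;
  esac
  local age=$(( now - last_ms / 1000 ))
  if [ "$age" -le "$max_age" ]; then
    echo "fresh (${age}s old)"
  else
    echo "stale (${age}s old)"
    return 1
  fi
}

# Example wiring (requires AWS credentials, so it is left as a comment):
#   ts="$(aws logs describe-log-streams --log-group-name /ecs/cabal-smtp-out \
#     --order-by LastEventTime --descending --max-items 1 \
#     --query 'logStreams[0].lastEventTimestamp' --output text)"
#   egress_log_status "$ts"
```

A stale result on a tier that should be busy points at egress; a stale result on an idle tier may simply mean there was nothing to log, so read it alongside the other symptoms.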

1. Check the NAT instance’s console output:

       aws ec2 get-console-output --instance-id <nat-instance-id> --latest \
         --query 'Output' --output text | grep -iE "nftables|forward|fail|error"

   Unit file nftables.service does not exist or a missing masquerade rule
   means the AMI bake is bad - the instance is forwarding without SNAT.
2. To repair a running instance in place, install nftables and load the
   ruleset by hand:

       sudo dnf install -y nftables
       sudo nft -f /etc/nftables/cabal-nat.nft
       grep -q cabal-nat.nft /etc/sysconfig/nftables.conf || \
         echo 'include "/etc/nftables/cabal-nat.nft"' | sudo tee -a /etc/sysconfig/nftables.conf
       sudo systemctl enable --now nftables && sudo nft list ruleset

3. Or switch to gateway mode (TF_VAR_USE_NAT_INSTANCE = false) and apply:
   managed gateways restore egress on the same EIPs with no AMI in the path.
   Rebuild or fix the AMI, then flip back in a window.