Cabalmail’s identity and authorisation story is layered: Cognito authenticates users into the admin app and into IMAP/SMTP (via master-user + per-user OS account); API Gateway brokers Lambda invocations on Cognito JWTs; per-Lambda IAM roles bound the blast radius of any single function compromise. The layering is sound. The current configuration of each layer has drifted from defensible defaults over the project’s lifetime — small omissions that individually look harmless and together leave the system more permissive than the design intent.
This audit pass found three clusters:
- data_trace_enabled = true with logging_level = "INFO" means CloudWatch receives full request and response bodies, including the contents of every email body fetched via /fetch_message, every attachment metadata payload, every preference write. The default authorizer cache (300 s) means a revoked token stays usable for up to 5 minutes after invalidation. Method-level cache TTL on some user-personalised endpoints exceeds the authorizer cache.
- arn:aws:<service>:REGION:ACCOUNT:<resource-type>/${local.wildcard} pattern, where local.wildcard = "*". The wildcard satisfies the IaC scanners (the literal string "*" is what they look for) but the effect is account-wide. Notably the assign_osid Lambda can call AdminUpdateUserAttributes on any user pool in the account, and the certbot_renewal Lambda has route53:ListHostedZones on Resource: "*".

The plan addresses each cluster in its own phase. The Cognito work is the most user-visible (adding TOTP, requiring email verification, enabling adaptive auth) and gets the most operator-side documentation cost. The API Gateway change is the highest-impact security fix per LOC: turning off the data trace silently removes the leakiest log stream in the system.
- risk = High.
- route53:ListHostedZones which has no resource-level grammar).
- scan_on_push = true and image_tag_mutability = "IMMUTABLE". Scan findings populate the GitHub Security tab via the same SARIF surface as the IaC scanners.
- application-surface-hardening-plan.md.
- application-surface-hardening-plan.md Phase 5 covers the high-leverage attack patterns; WAF is a follow-on once we have per-endpoint signal.
- docs/0.6.x/ where applicable.

terraform/infra/modules/user_pool/main.tf:5-36:
- auto_verified_attributes = ["phone_number"]. Email is not auto-verified.
- sms_configuration is set; no software_token_mfa_configuration. SMS is the only second-factor option, and it is not required.
- account_recovery_setting lists only verified_phone_number.
- No mfa_configuration attribute, so it defaults to "OFF". MFA cannot even be enabled by a user who wants it; the pool refuses.
- No user_pool_add_ons { advanced_security_mode = ... }. Adaptive authentication is off.

terraform/infra/modules/user_pool/main.tf:74-80:
- access_token_validity = 12 (hours). Reasonable.
- id_token_validity = 12 (hours). Reasonable.
- refresh_token_validity is not set; it defaults to 30 days.
- token_validity_units is not set; defaults are hours for access/id and days for refresh.
- enable_token_revocation is not set; it defaults to true per AWS docs, but explicit is better.

terraform/infra/modules/app/main.tf:134-146:
- data_trace_enabled = true. The AWS docs explicitly say “not recommended for production.” Logs include full request bodies (passwords, JWTs, email contents, attachments: anything passing through the gateway).
- logging_level = "INFO". Combined with data_trace_enabled, every API call writes a multi-KB log line.
- throttling_rate_limit = 100, throttling_burst_limit = 50. Stage-wide; no per-method overrides.

terraform/infra/modules/app/main.tf:148-... for aws_api_gateway_method_settings.cache_settings:
- caching_enabled = each.value.cache and cache_ttl = 3600 (1 hour) on cached methods. Authorizer cache TTL is not set (defaults to 300 s). Any caller whose token is invalidated mid-cache-window can still hit the cached method response for up to 1 hour.
- Authorizer config: [terraform/infra/modules/app/main.tf around the aws_api_gateway_authorizer resource]. No authorizer_result_ttl_in_seconds override. Default is 300 s.
terraform/infra/modules/user_pool/counter.tf:7: wildcard = "*". Used at line 44 to form arn:aws:cognito-idp:REGION:ACCOUNT:userpool/*. The Lambda can call AdminUpdateUserAttributes on any user in any pool in the account.
Similar pattern at:
- terraform/infra/modules/user_pool/check_invite.tf:76 - logs group wildcard, low impact.
- terraform/infra/modules/certbot_renewal/iam.tf:55-60 - route53:ListHostedZones* on Resource: "*". Service-level required by AWS (no resource grammar for List ops).
- terraform/infra/modules/ecs/iam.tf:115-121 - ssmmessages:* on Resource: "*". Required by the service for ECS Exec.
- terraform/infra/modules/user_pool/iam.tf:22 - sns:Publish on Resource: "*". The Cognito-SMS role; SNS Publish for SMS goes through the platform endpoint, no specific topic ARN.

The four “service requires wildcard” cases are fine; the userpool/* case is the one that should be specific.
terraform/infra/modules/ecr/main.tf: repositories created without image_scanning_configuration, without image_tag_mutability, without a repository policy. Defaults are scan-off, tag-mutable, any-principal-in-account.
terraform/infra/modules/user_pool/main.tf gains:
resource "aws_cognito_user_pool" "users" {
  ...
  mfa_configuration = "OPTIONAL"
  software_token_mfa_configuration {
    enabled = true
  }
  account_recovery_setting {
    recovery_mechanism {
      name     = "verified_email"
      priority = 1
    }
    recovery_mechanism {
      name     = "verified_phone_number"
      priority = 2
    }
  }
  auto_verified_attributes = ["email", "phone_number"]
}
For admins specifically (the cabal-admin user group): a Lambda pre-token-generation trigger refuses to issue tokens to admin-group members whose MFAEnabled Cognito attribute is false. Mechanically that’s a second small Lambda (require_admin_mfa) hung off the pool’s pre_token_generation slot.
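The require_admin_mfa trigger described above could look roughly like the Python sketch below. It assumes the pre-token-generation event carries group names under request.groupConfiguration.groupsToOverride and user attributes under request.userAttributes. The attribute name custom:mfa_enabled is a hypothetical stand-in for whatever attribute the pool uses to record MFA enrolment, and the function names are illustrative, not the project's.

```python
"""Sketch of a Cognito pre-token-generation trigger that blocks admins without MFA."""

ADMIN_GROUP = "cabal-admin"           # group name taken from the plan
MFA_ATTRIBUTE = "custom:mfa_enabled"  # hypothetical attribute name; adjust to the pool's schema


def is_blocked(event: dict) -> bool:
    """Return True when the user is in the admin group and has no MFA flag set."""
    request = event.get("request", {})
    groups = request.get("groupConfiguration", {}).get("groupsToOverride") or []
    if ADMIN_GROUP not in groups:
        return False
    attributes = request.get("userAttributes", {})
    # Attribute values arrive as strings, so compare case-insensitively.
    return str(attributes.get(MFA_ATTRIBUTE, "false")).lower() != "true"


def handler(event: dict, context: object) -> dict:
    """Lambda entry point: raising fails the sign-in, returning the event lets it proceed."""
    if is_blocked(event):
        raise PermissionError("Please enrol MFA before continuing")
    return event
```

Non-admin users pass through untouched, so the trigger only changes behaviour for members of the admin group. How the error surfaces to the React and Apple clients depends on how each client handles trigger failures and would need checking against the real sign-in flow.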
Updated React signup flow: collect email at signup time, send a verification email, gate the welcome screen on email confirmation. The Apple client likewise.
User-facing migration: existing users get an email on next signin saying “please verify your email address” with a one-click link. SMS-only users keep working; they just gain a fallback.
resource "aws_cognito_user_pool" "users" {
  ...
  user_pool_add_ons {
    advanced_security_mode = "AUDIT" # promoting to "ENFORCED" is Phase 2.5
  }
}

resource "aws_cognito_user_pool_client" "users" {
  ...
  refresh_token_validity = 7
  access_token_validity  = 12
  id_token_validity      = 12
  token_validity_units {
    refresh_token = "days"
    access_token  = "hours"
    id_token      = "hours"
  }
  enable_token_revocation = true
}
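Advanced security in AUDIT mode scores sign-ins without blocking them and emits risk-score events as CloudWatch metrics. Check the cost against current AWS pricing before enabling it: advanced security features are billed per monthly active user, and that may apply in AUDIT mode as well. Phase 2.5 promotes to ENFORCED after a soak period during which we collect false-positive rates. ENFORCED is the right end state for a primary mailbox; do not skip the audit phase.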
The 7-day refresh-token lifetime bounds the exposure from a stolen laptop to 7 days rather than 30. React and Apple clients both transparently re-auth at refresh expiry, so no UX work is needed.
terraform/infra/modules/app/main.tf:
resource "aws_api_gateway_method_settings" "general_settings" {
  ...
  settings {
    metrics_enabled        = true
    data_trace_enabled     = false   # was true
    logging_level          = "ERROR" # was INFO
    throttling_rate_limit  = 100
    throttling_burst_limit = 50
  }
}

resource "aws_api_gateway_authorizer" "cognito" {
  ...
  authorizer_result_ttl_in_seconds = 60
}
The access log format (separate setting on the stage) is left intact — it captures source IP, caller identity, method, path, status, latency. That is the right amount of information to keep.
Method-level cache TTLs on user-personalised endpoints (/fetch_message, /list_envelopes, /fetch_attachment, /fetch_inline_image) drop from 3600 to 0 (disabled). The Lambda S3-cache layer in _shared/helper.py already covers the dominant cache concern (body re-fetches); API Gateway caching atop it provides little additional benefit and creates the cache-vs-authz drift documented in the audit. Non-personalised cached endpoints (/list_my_domains, BIMI fetches) keep their cache TTL.
The CloudWatch log group for API Gateway access logs has its retention set explicitly in Terraform (it is today; verify the value is reasonable — 14 days is the project default and matches the rest of the log groups). The execution-logs group, now emitting only ERROR, can stay at 14 days as well.
terraform/infra/modules/user_pool/counter.tf:44:
{
  Effect   = "Allow"
  Action   = "cognito-idp:AdminUpdateUserAttributes"
  Resource = aws_cognito_user_pool.users.arn # was userpool/${local.wildcard}
}
The change is name-only; the Lambda already accesses only this pool at runtime. Wildcard was a Terraform shape, not a runtime intent.
Same audit pass against every Lambda in terraform/infra/modules/ and terraform/infra/modules/app/modules/. The pattern is mechanical: where local.wildcard resolves to a resource ID, replace with the specific ID; where it resolves to the service grammar’s required * (logs streams, SSM Messages, SNS Publish, Route53 List*), keep with a code comment saying so.
A small CI helper script (.github/scripts/check-iam-resource-scope.py) flags any Resource = "*" or Resource containing the literal string local.wildcard without an accompanying tfsec:ignore or checkov:skip comment with rationale. Picked up by the Phase 3 suppression-justification check in iac-quality-gates-plan.md.
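A minimal sketch of what such a check could look like follows. It is not the project's script: the regular expressions, the rule that a suppression comment must sit on the same line or the line directly above, and the definition of "rationale" as at least one word after the rule ID are all assumptions made for illustration.

```python
"""Sketch of a wildcard-scope check for Terraform IAM statements."""
import re
from pathlib import Path

# Resource (or resources) assigned a bare "*", optionally inside a list.
BARE_STAR = re.compile(r'\bresources?\b\s*[=:]\s*(?:\[\s*)?"\*"', re.IGNORECASE)
# Resource (or resources) whose value mentions local.wildcard.
INTERPOLATED = re.compile(r'\bresources?\b\s*[=:].*local\.wildcard', re.IGNORECASE)
# A suppression marker followed by at least one more word, treated as the rationale.
SUPPRESSION = re.compile(r'#\s*(?:tfsec:ignore|checkov:skip)\S*\s+\S+')


def find_violations(text: str) -> list[int]:
    """Return 1-based line numbers of wildcard resources lacking a justified suppression."""
    lines = text.splitlines()
    violations = []
    for index, line in enumerate(lines):
        if not (BARE_STAR.search(line) or INTERPOLATED.search(line)):
            continue
        previous = lines[index - 1] if index > 0 else ""
        if SUPPRESSION.search(line) or SUPPRESSION.search(previous):
            continue
        violations.append(index + 1)
    return violations


def main(root: str) -> int:
    """Scan every .tf file under root; return 1 if any violation is found, else 0."""
    failed = False
    for path in sorted(Path(root).rglob("*.tf")):
        for line_number in find_violations(path.read_text(encoding="utf-8")):
            print(f"{path}:{line_number}: wildcard Resource without a justified suppression")
            failed = True
    return 1 if failed else 0
```

A CI step would call main on the Terraform root and use the return value as the exit code. A line-based scan like this is deliberately crude; it misses multi-line Resource lists, which a real implementation would need to handle.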
terraform/infra/modules/ecr/main.tf: each aws_ecr_repository resource gains:
image_tag_mutability = "IMMUTABLE"
image_scanning_configuration {
  scan_on_push = true
}
A repository policy that restricts pull to the ECS execution role and the CI deploy principal:
resource "aws_ecr_repository_policy" "tier" {
  for_each   = aws_ecr_repository.tier
  repository = each.value.name
  policy     = data.aws_iam_policy_document.ecr_repo[each.key].json
}
scan_on_push = true produces findings; the rollout pattern mirrors iac-quality-gates-plan.md Phase 2 (baseline current findings, accept them as known, fail on new). Image-scan output uploads to GitHub Code Scanning via SARIF.
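The baseline-then-fail-on-new pattern can be sketched in a few lines of Python. The sketch assumes the input is the JSON printed by aws ecr describe-image-scan-findings, with findings under imageScanFindings.findings, and that a finding is identified by its name plus the package_name and package_version attributes. That identity rule and the function names are illustrative choices, not something the plan specifies.

```python
"""Sketch of baseline-then-fail-on-new handling for image scan findings."""
import json


def parse_findings(raw: str) -> list:
    """Extract the findings list from describe-image-scan-findings JSON output."""
    return json.loads(raw).get("imageScanFindings", {}).get("findings", [])


def finding_key(finding: dict) -> tuple:
    """Identify a finding by name plus affected package (an assumed identity rule)."""
    attributes = {item.get("key"): item.get("value") for item in finding.get("attributes", [])}
    return (
        finding.get("name"),
        attributes.get("package_name"),
        attributes.get("package_version"),
    )


def new_findings(current: list, baseline: list) -> list:
    """Return findings present in the current scan but absent from the accepted baseline."""
    known = {finding_key(item) for item in baseline}
    return [item for item in current if finding_key(item) not in known]
```

A CI job would fail when new_findings returns a non-empty list and would otherwise pass, leaving the accepted baseline untouched until someone deliberately refreshes it.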
image_tag_mutability = "IMMUTABLE" means re-tagging a SHA-tagged image fails. The deploy script in container-runtime-hardening-plan.md Phase 3 moves to digest references, sidestepping the mutability concern at the consumer side.
| Phase | Scope | User-visible | Risk |
|---|---|---|---|
| 1 — Cognito MFA + email recovery | user_pool module, react admin, apple kit | Yes | Medium. UX change; needs a clear migration message to existing users. Test by going through the signup flow end-to-end with TOTP enabled. |
| 2 — Advanced security + 7-day refresh | user_pool module | Mostly no | Low. Refresh-token TTL change forces a re-auth after 7 days instead of 30 — invisible to anyone using the app weekly. |
| 3 — API Gateway logging + authz cache | app module | No | Low. The data_trace change is purely subtractive (less logging). The authorizer cache TTL change shortens the window between Cognito revoke and API refuse — opposite of regression. |
| 4 — IAM resource narrowing | many modules | No | Low. The wildcard-to-specific mapping is mechanical; each individual change is reversible. A regression would surface as a runtime AccessDenied error in CloudWatch. |
| 5 — ECR scan-on-push + immutable tags | ecr module, deploy script | No | Medium. Tag immutability requires the digest-pin work from container-runtime-hardening-plan.md Phase 3 to land first; sequence accordingly. |
Per-environment ordering: dev → stage → prod for each phase, with at least one normal deploy cycle between flips so issues surface cheaply.
Phase 1 and Phase 5 have inter-plan dependencies (Apple/React client work; container-plan digest work). Phases 2, 3, and 4 are independent of each other and of the other plans; they can ship in any order.
| Phase | Rollback |
|---|---|
| 1 | Set mfa_configuration = "OFF", drop software_token_mfa_configuration, restore SMS-only recovery. Existing TOTP secrets in the user pool remain but are not consulted. No data loss. |
| 2 | Set advanced_security_mode = "OFF", drop refresh_token_validity (returns to default 30 d). |
| 3 | Set data_trace_enabled = true, logging_level = "INFO", authorizer_result_ttl_in_seconds = 300. Per-method cache TTLs restored from git. |
| 4 | Revert the specific Resource ARNs to the wildcard shape. The CI scope check stays in place but only fails on new wildcards. |
| 5 | image_tag_mutability = "MUTABLE", scan_on_push = false, remove the repository policy. Existing images remain. |
- .github/workflows/infra.yml gains the check-iam-resource-scope.py step (Phase 4). Wired into the same scanner job structure as the Phase 3 Checkov/tflint/Trivy work in iac-quality-gates-plan.md.
- .github/workflows/app.yml consumes the ECR scan-on-push findings via aws ecr describe-image-scan-findings and uploads SARIF (Phase 5).
- react/admin/ and is built/deployed by the existing app.yml React-bundle path.
- apple/CabalmailKit/ and is exercised by swift test.

- require_admin_mfa trigger returns an error). The React admin app displays a “Please enrol MFA before continuing” screen.
- NotAuthorizedException. The client transparently re-auths.
- Risk = High events for impossible-travel scenarios (synthesize via VPN test). An alarm fires.
- aws apigateway get-stage --rest-api-id <id> --stage-name <stage> returns dataTraceEnabled: false and loggingLevel: ERROR.
- aws cognito-idp admin-user-global-sign-out is rejected by the API within 60 seconds (verified by hammering an endpoint with a previously-valid token and observing the 401).
- assign_osid Lambda’s IAM policy contains the specific aws_cognito_user_pool.users.arn value, no * segment.
- describe-repositories shows imageTagMutability: IMMUTABLE and imageScanningConfiguration: {scanOnPush: true} for every Cabalmail repository.
- ImageTagAlreadyExistsException.

- ENFORCED vs AUDIT for advanced security. Phase 2 lands AUDIT; the promotion to ENFORCED (Phase 2.5) needs at least a week of data to calibrate the risk thresholds. Schedule the promotion as a separate PR after the soak.
- timeout = 30. Per-Lambda overrides would tighten the budget for endpoints that should be fast (/folder_status, /list_my_domains) and relax it for slow ones (/process_dmarc). Not strictly identity/IAM; flag for follow-up.
- amazon-cognito-identity-js. Migrating to Hosted UI would centralise MFA UX, password-reset UX, and email/phone verification UX in Cognito’s first-party flow. Larger change; flag for separate consideration.
- exp checks and proactive refresh would smooth the UX after the 7-day refresh-window change. Out of scope here; track in the Apple client’s own backlog.