Runbook: Email Module (Runtime)

This runbook covers operating the running ShopAccess/Email module (Phase 5b): the drift assertions, the configuration lifecycle controls, and recurring operational tasks. It is distinct from the Postmark External Resource Provisioning runbook, which covers the one-time manual Postmark account / signature setup (Phase 1), and from the Encryption-Key Rotation runbook.

For the module’s behavior and design, see Functional → Email; for the partition infrastructure it consumes, see Runtime → Partition Mail Topology.

Module config lives under modules.v1.shop-access.email.extras.* (HOCON), with deploy-time values injected from CloudFormation exports. Operationally relevant keys:

  • verification.pollingTimeoutHours (default 24) / verification.pollingIntervalSeconds (default 30) — the verification poll window and cadence.
  • drift.checkIntervalMinutes (default 15) — drift assertion cadence (rides the registry-refresh tick).
  • drift.postmarkInventory.enabled / drift.registry.enabled / drift.dns.enabled — the three drift assertions, all default off. Enable per partition.
  • Per-route Helm disable flags — individual routes (configuration, job, job/postmark-events) can be disabled at the chart level.

Rotating ARDA_API_KEY affects every route, including the inbound webhook, because Postmark presents the key as its bearer token. After rotation, update the webhook bearer stored in Postmark; otherwise callbacks fail authentication. Watch email.webhook.rejected.total{reason=auth_mismatch}.

Three default-off assertions compare the module’s database against external state and emit one Sentry event per discrepancy — no auto-remediation. They run on the registry-refresh tick once enabled.

Assertion | Compares | Sentry tag
drift.postmarkInventory | DB ↔ Postmark server inventory | email.drift.postmark_orphan_db_row, email.drift.postmark_orphan_server
drift.registry | DB ↔ encryption-key registry | email.drift.registry_missing_key
drift.dns | DB ↔ live Route 53 records | email.drift.dns_dkim_mismatch, email.drift.dns_return_path_mismatch

The partition is not a per-event context field — every Sentry event already carries it via the native environment dimension (SENTRY_ENVIRONMENT = {infrastructure}-{purpose}, e.g. Alpha002-dev; see Sentry Observability). The drift event context carries the actionable identifiers (eId, postmarkServerId, keyVersionId, record names).
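As an illustration of "one event per discrepancy", the Postmark inventory comparison can be sketched as two set differences. This is a minimal sketch, not the module's implementation: the function name, the plain-dict event shape, and the reading of the two tags (a database row without a Postmark server versus a Postmark server without a database row) are assumptions inferred from the tag names.

```python
from typing import Dict, List, Set


def postmark_inventory_drift(
    db_server_ids: Set[int], postmark_server_ids: Set[int]
) -> List[Dict[str, object]]:
    """Return one event per discrepancy between the DB and Postmark (hypothetical helper)."""
    events: List[Dict[str, object]] = []
    # Rows the database knows about that Postmark does not (assumed meaning of the tag).
    for server_id in sorted(db_server_ids - postmark_server_ids):
        events.append(
            {"tag": "email.drift.postmark_orphan_db_row", "postmarkServerId": server_id}
        )
    # Servers that exist in Postmark with no matching database row.
    for server_id in sorted(postmark_server_ids - db_server_ids):
        events.append(
            {"tag": "email.drift.postmark_orphan_server", "postmarkServerId": server_id}
        )
    return events


# Example with made-up server ids: 101 exists only in the DB, 103 only in Postmark.
drift_events = postmark_inventory_drift({101, 102}, {102, 103})
for event in drift_events:
    print(event["tag"], event["postmarkServerId"])
```

Nothing in the sketch repairs either side; it only reports, matching the no-auto-remediation behavior described above.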

The module emits five structured-log telemetry families (counters plus a small number of gauges): email.configuration.lifecycle.*, email.send.*, email.webhook.*, email.drift.*, and email.material_registry.refresh.*. The Logback appender forwards them to Sentry. No plaintext server tokens are logged at any level.

Known false positives:

  • postmark_orphan_server: a Server created by hand in Postmark’s UI for testing. Differentiate by name: provisioned servers match ardamails-{partition}-{slug}, where {slug} is the sending domain with its labels reversed and the TLD dropped. A name that doesn’t match is a manual test resource.
  • dns_dkim_mismatch shortly after a re-provision: the new record hasn’t propagated past the old TTL. Wait one TTL (300–3600 s) and re-check.
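The name check for orphan servers can be scripted when triaging a batch of events. The sketch below is an assumption-laden helper, not part of the module: it verifies only the ardamails-{partition}- prefix and a non-empty slug, because the exact slug formatting is not spelled out here, and the partition value and server names are made up.

```python
import re


def is_provisioned_server(name: str, partition: str) -> bool:
    """True when `name` has the shape ardamails-{partition}-{slug} (hypothetical helper)."""
    # Only the prefix and a non-empty slug are checked; the slug itself is not validated.
    pattern = rf"ardamails-{re.escape(partition)}-.+"
    return re.fullmatch(pattern, name) is not None


# Made-up names for illustration; "alpha002-dev" is a hypothetical partition value.
servers = ["ardamails-alpha002-dev-example", "scratch-test-server"]
manual = [s for s in servers if not is_provisioned_server(s, "alpha002-dev")]
print(manual)  # ['scratch-test-server']
```

A name flagged as manual by this check is a candidate false positive, but it still deserves a quick look in the Postmark UI before the event is dismissed.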

Operators drive an EmailConfiguration through signal routes (see Lifecycle):

  • unlock — make a verified configuration sendable (configurations are created locked-by-default).
  • lock — graceful: waits for in-flight sends to drain (UNLOCKED → LOCKING → LOCKED).
  • force-lock — immediate: cancels in-flight sends and goes straight to LOCKED. Use for incident containment.
  • re-verify — restart the verification poll with a fresh window (use when DNS propagation exceeded the timeout).
  • re-provision — from MISSING: cleans partial Postmark state, then re-runs provisioning. A createServer 422 after cleanup signals a Postmark drift/orphan condition (see the orphan-server false-positive above) — investigate before retrying.

A LOCKING configuration stranded by a crashed pod is completed automatically by the startup reconciliation pass.
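The difference between lock and force-lock is easiest to see as a small state machine. The following is a simplified model for reasoning about the transitions, not the module's code: the class, its in-flight counter, and the method names are hypothetical, and the verification precondition for unlock is left out.

```python
from dataclasses import dataclass
from enum import Enum


class LockState(Enum):
    LOCKED = "LOCKED"
    UNLOCKED = "UNLOCKED"
    LOCKING = "LOCKING"


@dataclass
class LockModel:
    """Toy model of the lock lifecycle (hypothetical; verification is not modeled)."""

    state: LockState = LockState.LOCKED  # configurations start locked
    in_flight: int = 0

    def unlock(self) -> None:
        if self.state is not LockState.LOCKED:
            raise ValueError(f"cannot unlock from {self.state.value}")
        self.state = LockState.UNLOCKED

    def lock(self) -> None:
        # Graceful: enter LOCKING and finish only once in-flight sends drain.
        if self.state is not LockState.UNLOCKED:
            raise ValueError(f"cannot lock from {self.state.value}")
        self.state = LockState.LOCKING
        self._finish_if_drained()

    def force_lock(self) -> None:
        # Immediate: in-flight sends are cancelled and the state jumps to LOCKED.
        self.in_flight = 0
        self.state = LockState.LOCKED

    def send_completed(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)
        self._finish_if_drained()

    def _finish_if_drained(self) -> None:
        if self.state is LockState.LOCKING and self.in_flight == 0:
            self.state = LockState.LOCKED


cfg = LockModel()
cfg.unlock()
cfg.in_flight = 2
cfg.lock()
print(cfg.state.value)  # LOCKING
cfg.send_completed()
cfg.send_completed()
print(cfg.state.value)  # LOCKED
```

In this model a configuration left in LOCKING with nothing in flight is exactly the stranded case that a reconciliation pass would need to complete.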

A configuration stuck in AWAITING_VERIFICATION past the timeout moves to UNVERIFIED. The gate is DKIMVerified ∧ ReturnPathDomainVerified (SPF verification is deprecated and not checked). Common causes: DNS propagation slower than the timeout in a fresh environment (increase verification.pollingTimeoutHours, then re-verify); or the DKIM/Return-Path records were not written (check the partition zone with dig, and the provisioning logs for a writeRecords failure).
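To make the gate and the poll window concrete, the sketch below encodes the two-flag check and works out how many polls fit in the window at the default settings. The class and function names are hypothetical; the flag names mirror DKIMVerified and ReturnPathDomainVerified, and the defaults are the documented 24 hours and 30 seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationStatus:
    """Hypothetical snapshot of the flags the gate looks at."""

    dkim_verified: bool
    return_path_domain_verified: bool
    spf_verified: bool = False  # deprecated; deliberately ignored by the gate


def passes_gate(status: VerificationStatus) -> bool:
    # Gate: DKIMVerified AND ReturnPathDomainVerified. SPF is not consulted.
    return status.dkim_verified and status.return_path_domain_verified


def max_polls(timeout_hours: int = 24, interval_seconds: int = 30) -> int:
    # Number of poll attempts that fit inside the verification window.
    return (timeout_hours * 3600) // interval_seconds


print(passes_gate(VerificationStatus(True, False, spf_verified=True)))  # False
print(max_polls())                  # 2880
print(max_polls(timeout_hours=48))  # 5760
```

Doubling verification.pollingTimeoutHours doubles the number of attempts at the same cadence, which is why raising the timeout followed by re-verify is the usual fix for slow propagation.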

Suppressions are per-tenant and are created automatically on a hard bounce, a spam complaint, or a subscription change made in the Postmark console. A send to a suppressed recipient is rejected with a 403 before any Postmark call. To inspect, query the tenant’s active suppression_entry rows via the configuration service (the module never reads another service’s tables).
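The ordering matters when debugging a rejected send: the suppression check happens first, so a 403 implies Postmark was never contacted. A minimal sketch of that ordering follows, with a hypothetical function, an in-memory per-tenant suppression map, and a stand-in callable in place of the real Postmark client; case-insensitive address matching is also an assumption.

```python
from typing import Callable, Dict, List, Optional, Set


def send_email(
    tenant_id: str,
    recipient: str,
    suppressions: Dict[str, Set[str]],
    call_postmark: Callable[[str], None],
) -> Optional[int]:
    """Return 403 when the recipient is suppressed for this tenant, else None (hypothetical)."""
    # Suppressions are per-tenant: only this tenant's entries are consulted.
    if recipient.lower() in suppressions.get(tenant_id, set()):
        return 403  # rejected before any Postmark call
    call_postmark(recipient)
    return None


calls: List[str] = []
suppressions = {"tenant-a": {"bounced@example.com"}}

print(send_email("tenant-a", "bounced@example.com", suppressions, calls.append))  # 403
print(send_email("tenant-b", "bounced@example.com", suppressions, calls.append))  # None
print(calls)  # ['bounced@example.com']
```

The second call goes through because the suppression belongs to a different tenant.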