Email Integration -- Cross-Cutting Design (Revision 1)
Security and operations concerns that span the whole email-integration project: authentication, authorisation, secret management, drift detection, OAM (operations / administration / management), and compliance. Aligned with the Revision 1 goal.md, architecture-overview.md, and phases.md.
Revision 1 note. This document supersedes the prior cross-cutting design. Substantive changes:
- Authentication for the email service is performed in-component (Ktor server configuration of each route), not at the API Gateway. API-gateway-level authorisers for the email service are out of scope.
- Postmark Console is acknowledged as the primary OAM surface for the email service; Arda-side metrics, logs, and runbooks supplement it but do not replace it.
- Credential resolution uses 1Password as the system of record. CI and operator runs resolve credentials via the 1Password SDK; runtime pods receive partition-scoped credentials via Secrets Manager + ESO.
- Drift detection is added as a routine, scheduled assertion of declared-vs-observed state for external resources.
- Corporate Resource Group is treated as a peer security domain to Application Runtime tenants, with its own credential scope, sending zone, and operational surface.
1. Threat model summary
The email integration adds defence-in-depth on top of the existing platform-level security posture. This layer specifically protects against:
- DB exposure with read-only privilege (backups leaving the trust boundary, SQL-injection-style read leaks, analyst sessions): per-tenant Postmark server tokens are application-encrypted (AES-256-GCM) before persistence, so a SELECT * on email_configuration yields ciphertext, not usable tokens.
- Provisioning replay attacks: pre-flight checkAvailability plus the persist-first lifecycle (DQ-205) prevent silent orphan creation.
- Webhook spoofing: in-component Bearer-token validation on the postmark-events route (DQ-011).
- Free Kanban Tool sending integrity: the Corporate consumer’s Postmark server token is held in 1Password (Free-Kanban-Generator-Postmark-Server in the Arda-CorporateOAM vault, distinct from the Arda-SystemsOAM vault that holds deploy-time / OAM credentials), never persisted to the infrastructure repository, and never transmitted through CDK context or CI environment variables.
- Drift between declared and observed state: scheduled drift-detection asserts that external resources match the declarations in the infrastructure repository; surfaced divergences trigger an automated investigation issue.
Out of scope at this layer (covered elsewhere or accepted):
- Pod / process compromise — an attacker with pod memory has both the encryption key and the DB connection; the application-layer encryption does not defend against this. Platform-level controls (IRSA, network segmentation, container hardening) own this.
- Postmark account compromise — an attacker holding the Postmark account-level token can read tenant tokens directly from Postmark and send arbitrary email. Platform-level secret management owns this; the project mitigates by sourcing the account token only from 1Password and never from a long-lived environment variable.
- 1Password compromise (service-account or operator) — an attacker holding OP_SERVICE_ACCOUNT_TOKEN (CI) or operator credentials (local-dev) reads every Postmark and tenant credential reachable from Arda-SystemsOAM. The token is scoped read-only to that vault; this project does not introduce write paths to 1Password from automated systems.
- Insider with both extras.email.encryptionKey HOCON access and DB write privilege — equivalent to pod compromise.
- DDoS / rate-limit abuse — handled by API Gateway and Postmark’s own controls. The API Gateway remains a forwarding layer for inbound traffic even though it does not authenticate the email-service routes.
2. Authentication
2.1 Inbound HTTP — email-configuration and email-job
The email-configuration and email-job HTTP endpoints reuse the existing operations-component in-component authentication pipeline: Ktor server configuration of each route validates ARDA_API_KEY plus context headers (X-Tenant-Id, X-Author) forwarded by the BFF. No JWT path; no email-specific authentication mechanism. The API Gateway is a passthrough for these routes; no gateway-level authoriser is attached.
2.2 Inbound HTTP — Postmark webhook
The postmark-events endpoint is an inbound webhook from Postmark’s infrastructure. Like other email-service routes it reaches the platform via API Gateway (forwarding only), and authentication is performed in-component by the receiving Ktor route: the route validates the ARDA_API_KEY value Postmark sends in the Authorization: Bearer ... header per DQ-011. Reusing the existing ARDA_API_KEY avoids introducing a new credential type.
The header is configured on each per-tenant Postmark webhook at creation time via POST /webhooks, whose HttpHeaders field sets Authorization: Bearer <ARDA_API_KEY>. Every inbound webhook request then carries the header.
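As a rough illustration of the in-component check, the sketch below validates the Bearer header with a constant-time comparison and builds the matching header entry for the webhook-creation request. It is written in TypeScript for brevity; the real route is Ktor (Kotlin), and the function names, the constant-time comparison, and the `{ Name, Value }` shape of an HttpHeaders entry are assumptions, not confirmed details of the implementation.

```typescript
// Illustrative sketch only. All names are hypothetical.

/** Compares two strings without exiting early on the first mismatch. */
function constantTimeEquals(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` maps NaN to 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

/** True only when the header is "Bearer <expectedKey>" and the key is non-empty. */
function isAuthorisedWebhook(authorization: string | undefined, expectedKey: string): boolean {
  const prefix = "Bearer ";
  if (expectedKey.length === 0 || authorization === undefined) return false;
  if (authorization.slice(0, prefix.length) !== prefix) return false;
  return constantTimeEquals(authorization.slice(prefix.length), expectedKey);
}

/** Header entry to place in the HttpHeaders field (assumed shape). */
function webhookAuthHeader(apiKey: string): { Name: string; Value: string } {
  return { Name: "Authorization", Value: `Bearer ${apiKey}` };
}
```

A request with a missing header, a non-Bearer scheme, or a mismatched key is rejected before any payload parsing.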
If and when the platform later adopts gateway-level authorisation as a cross-cutting concern, the email-service routes (consumer-facing and webhook) migrate alongside the rest. Until then, in-component validation is the contract.
Optional defence-in-depth: Postmark publishes a small set of webhook source IPs; these can be allowlisted at the network layer. IPs may change; treat as supplementary, not authoritative.
2.3 Outbound — Postmark Account API (Application Runtime, runtime)
The Postmark Account API (server / domain / webhook CRUD, DKIM / Return-Path verification) is authenticated with the X-Postmark-Account-Token header. For Application Runtime tenants, the runtime path is:
- The token is created in the Postmark console (manual, one-time per account; partition-pair scoped) and stored in 1Password: global-utility items Postmark-Prod and Postmark-NonProd in Arda-SystemsOAM for cross-partition tooling, and per-partition items Postmark in each Arda-{Env}OAM vault for partition-scoped deploy tooling (Phase 4 prerequisite: platform/postmark-service.ts must expose postmarkCredentialOpReference(partition: PartitionId): string so Phase 4 deploy tooling resolves from the partition vault rather than the global item).
- A per-partition Secrets Manager entry ({fqn}-I-EmailPostmarkAccountToken) holds the runtime copy. amm.sh reads the value from the partition’s 1Password vault at deploy time (op read "$(postmarkCredentialOpReference <partition>)") and passes it to cdk deploy as a NoEcho CFN parameter; the CDK partition-email stack declares the parameter and creates the SM secret via SecretValue.cfnParameter(). CDK has no 1Password dependency. See current-system/oam/security/secret-delivery-pattern.md for the canonical pattern.
- ESO synchronises the secret into the partition’s Kubernetes namespace; the operations pod consumes it as the HOCON property extras.email.postmarkAccountToken at startup.
- Read by postmarkAccountProxy only, in-process; never logged.
Two Postmark accounts: PostmarkProd (used by prod / demo partitions, real delivery) and PostmarkNonProd (used by dev / stage partitions, sandbox delivery).
2.4 Outbound — Postmark Server API (per-tenant, runtime)
The Postmark Server API (send email, configure webhook) is authenticated with the X-Postmark-Server-Token header. The token is per-tenant (issued by Postmark when the tenant’s server is created):
- Captured at provisioning time from the POST /servers response.
- Encrypted application-side (AES-256-GCM versioned envelope per DQ-202) before persistence.
- Stored in email_configuration.server_token_encrypted (text column, base64 envelope).
- Decrypted on demand by EmailConfigurationService.getUnlockedConfiguration().
- Passed by reference (in-memory only) through L2 (EmailSender) and L1 (postmarkServerProxy) as a method argument; never persisted again, never logged.
2.5 Outbound — Postmark Server API (Free Kanban Tool, deploy-time and runtime)
The Free Kanban Tool’s Postmark server is provisioned once by the Corporate CLI at deploy time. Its token follows a different lifecycle than the per-tenant Application Runtime tokens:
- The server is created in PostmarkProd (the Corporate consumer is bound to the production-grade Postmark account).
- Phase A of the Corporate CLI (per phases.md Phase 3 § J1 interim mechanism) writes the resulting Server API token directly into the Free-Kanban-Generator-Postmark-Server 1Password item in the Arda-CorporateOAM vault. Canonical reference: op://Arda-CorporateOAM/Free-Kanban-Generator-Postmark-Server/credential.
- The token never enters CDK context, file artefacts, environment variables in the deploy pipeline, or GitHub Actions secrets. The CDK Stack composes the DNS records using only the public DKIM and Return-Path values surfaced through cdk.context.json.
- The Free Kanban Tool itself reads its server token from the 1Password item via its own resolution path (out of scope of this project’s runtime).
This isolation gives the Free Kanban Tool a bounded blast radius: a leak of CDK context, a cloud-runtime credential, a CI secret, or even OP_SERVICE_ACCOUNT_TOKEN does not yield the Free Kanban Tool’s sending credential. OP_SERVICE_ACCOUNT_TOKEN is scoped read-only to Arda-SystemsOAM; the Free Kanban server token lives in a separate vault (Arda-CorporateOAM) reachable only by the Free Kanban Tool’s own runtime resolution path. See DQ-R1-007.
2.6 Outbound — Postmark API (deploy-time, thin-wrappers)
The Postmark thin-wrapper constructs at src/main/cdk/platform/constructs/postmark/ (Phase 3 deliverables) make Postmark Account API calls during operator-driven deploys (Phase A of the Corporate CLI) and during drift-detection runs in CI. The resolution of the account token at these moments uses the 1Password SDK rather than runtime Secrets Manager:
- Local-dev operator: DesktopAuth biometric unlock against the operator’s 1Password app. No service-account token leaves the operator’s machine.
- CI: the OP_SERVICE_ACCOUNT_TOKEN GitHub Actions secret authenticates the SDK, which then resolves op://Arda-SystemsOAM/Postmark-Prod/credential (or …/Postmark-NonProd/credential).
The same dual-auth path is used by every operator-facing tool that needs Postmark account access at deploy time.
2.7 Outbound — AWS (runtime)
Route53 access traverses an STS role chain settled in DQ-204:
- The pod’s IRSA service-account role (existing operations-component pattern) provides base AWS credentials.
- The AWS SDK’s StsAssumeRoleCredentialsProvider (configured at module startup with 15-minute session duration) auto-chains an AssumeRole call to the partition’s EmailDnsProvisioningRole.
- The provider caches credentials lazily and refreshes on first use after expiry. The pod possesses Route53 write credentials for at most ~15 minutes after the last call.
- The L1 route53ZoneProxy makes Route53 SDK calls without per-call AssumeRole; STS is handled by the credentials provider transparently.
Aurora PostgreSQL access uses the existing operations-component DB credential pattern; no email-specific change.
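The lazy caching behaviour described for the STS credentials provider can be modelled in a few lines. The sketch below is a simplified TypeScript model, not the AWS SDK's implementation: the class, the `AssumeRole` callback type, and the injectable clock are hypothetical names introduced only to show "refresh on first use after expiry" with a 15-minute session.

```typescript
// Simplified model of lazy credential caching. All names are hypothetical.
interface TemporaryCredentials {
  accessKeyId: string;
  expiresAtMs: number;
}

type AssumeRole = (nowMs: number, durationMs: number) => TemporaryCredentials;

class LazyCredentialCache {
  private cached: TemporaryCredentials | undefined = undefined;
  private readonly assumeRole: AssumeRole;
  private readonly now: () => number;
  private readonly sessionDurationMs: number;

  constructor(assumeRole: AssumeRole, now: () => number, sessionDurationMs: number = 15 * 60 * 1000) {
    this.assumeRole = assumeRole;
    this.now = now;
    this.sessionDurationMs = sessionDurationMs;
  }

  /** Returns cached credentials; calls assumeRole only when none exist or they have expired. */
  get(): TemporaryCredentials {
    const nowMs = this.now();
    if (this.cached === undefined || nowMs >= this.cached.expiresAtMs) {
      this.cached = this.assumeRole(nowMs, this.sessionDurationMs);
    }
    return this.cached;
  }
}
```

Because nothing refreshes in the background, an idle pod stops holding valid Route53 credentials once the last session expires, which is the property the 15-minute bound relies on.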
2.8 Outbound — 1Password (deploy-time, CI)
The 1Password service-account token (OP_SERVICE_ACCOUNT_TOKEN) is the one and only GitHub Actions repository secret provisioned for this project. CI workflows resolve every other downstream credential at runtime via the 1Password SDK. The token is:
- Issued from the 1Password admin console as a service-account token.
- Scoped read-only to the Arda-SystemsOAM vault. The token cannot read other vaults and cannot write anywhere.
- Provisioned into the Arda-cards/infrastructure repository’s secrets via the tools/gha-secret.ts operator utility (libsodium-encrypted upload via Octokit).
- Rotatable through the same operator utility; the new token replaces the old in a single API call.
The token is never logged, never echoed in CI output (workflow steps mask it via the standard GitHub Actions secret-masking), and never carried into pod runtime.
2.9 Outbound — GitHub (deploy-time, operator tools)
The tools/gha-secret.ts operator CLI provisions GitHub Actions repository secrets. It is operator-driven (not part of any automated pipeline), takes --repo, --name, and --op-ref as inputs, and is the only outbound-to-GitHub credential-write surface in this project. Authentication to GitHub uses an operator-supplied PAT or OAuth token; encryption of the secret value uses libsodium against the repository’s public key.
This tool is a transition-state utility (scratch.md N1+ context). When declarative GHA-secret management is adopted as a wider repo pattern, the tool is retired.
3. Authorisation
3.1 email-configuration endpoint
The email-configuration endpoint is CS-only in v1. It is accessed by CS scripts directly using ARDA_API_KEY, validated in-component. The endpoint is not exposed through the BFF in v1.
Future state: a CS administration UI may surface this endpoint behind a privileged role check. v1 accepts that any caller with ARDA_API_KEY can invoke the endpoint, consistent with the existing platform pattern. The Postmark Console covers operational visibility in the meantime (§ 5.1).
3.2 email-job endpoint
The email-job endpoint is tenant-scoped via the X-Tenant-Id header forwarded by the BFF after JWT validation. The L3 EmailJobService:
- Trusts the X-Tenant-Id header (the BFF is the trust boundary, per existing platform conventions).
- Verifies that the requested emailConfigurationId belongs to the asserted tenant (otherwise 403).
The trust-boundary delegation matches the existing pattern in other modules; revalidating the tenant on every request would duplicate work the BFF already performs.
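The ownership check performed by the service can be pictured as follows. This is a minimal TypeScript sketch with hypothetical types; the real EmailJobService is a Kotlin component, and answering an unknown configuration id with 403 as well is an assumption made here for illustration only.

```typescript
// Hypothetical shapes; the real service types are not shown in this document.
interface EmailConfigurationRef {
  id: string;
  tenantId: string;
}

type Decision = { status: 200 } | { status: 403; reason: string };

/**
 * Trusts the tenant id asserted by the BFF and checks only that the
 * requested configuration belongs to that tenant.
 */
function authoriseEmailJob(
  assertedTenantId: string,
  emailConfigurationId: string,
  configurations: ReadonlyMap<string, EmailConfigurationRef>,
): Decision {
  const config = configurations.get(emailConfigurationId);
  // Assumption: unknown ids are treated like foreign ids.
  if (config === undefined || config.tenantId !== assertedTenantId) {
    return { status: 403, reason: "configuration does not belong to the asserted tenant" };
  }
  return { status: 200 };
}
```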
3.3 postmark-events webhook
The postmark-events route authenticates via Bearer token only (in-component validation). There is no caller-identity concept beyond “the Bearer token is valid”; the route trusts that a request with a valid Bearer token comes from Postmark (Postmark IP allowlisting is the optional second factor).
Tenant scoping at the webhook is implicit: the inbound payload contains a MessageID that maps to an EmailJob row, which carries the tenant via email_configuration_id. No header-level tenant assertion.
3.4 Free Kanban Tool sending
The Free Kanban Tool’s send operations are governed by the application running on the Free Kanban Tool’s own infrastructure (out of scope of this project). The Postmark server-token-based authorisation to the Postmark Server API matches the per-tenant model: possession of the server token authorises sending from freekanban.arda.ardamails.com. Token issuance and lifecycle are owned by the Corporate Resource Group (§ 4.1).
4. Secret management
4.1 Secret inventory and lifecycle
| Secret | Issued by | Stored in | Delivered as | Used by | Rotation |
|---|---|---|---|---|---|
| Postmark account token (per Postmark account) | Postmark console (manual) | 1Password: global-utility items Postmark-Prod / Postmark-NonProd in Arda-SystemsOAM (qualified names — both accounts in one vault); per-partition copies under item title Postmark in each Arda-{Env}OAM vault (e.g., op://Arda-ProdOAM/Postmark/credential); per-partition runtime copy in AWS Secrets Manager ({fqn}-I-EmailPostmarkAccountToken) | ESO → HOCON extras.email.postmarkAccountToken (runtime); 1Password SDK (deploy-time / CI / drift) | postmarkAccountProxy (runtime); tools/corporate-cli.ts, drift workflows (deploy-time) | Manual: regenerate in Postmark console → update 1Password (both the Arda-SystemsOAM item and the affected Arda-{Env}OAM item) → rerun deploy-time tooling to refresh Secrets Manager → refresh ESO |
| 1Password service-account token | 1Password admin console (manual) | Local: operator’s 1Password client. CI: OP_SERVICE_ACCOUNT_TOKEN repository secret in Arda-cards/infrastructure | Environment variable to CI workflows; read-only access to Arda-SystemsOAM | All deploy-time and drift-detection workflows | Manual: regenerate in 1Password admin → rerun tools/gha-secret.ts to update the GitHub Actions secret |
| Per-partition encryption key | CDK GeneratedSecret (passwordLength: 64, DQ-203.c) | AWS Secrets Manager ({fqn}-I-EmailEncryptionKey per partition); native SM versioning carries rotation history (AWSCURRENT, AWSPREVIOUS, historical versionIds) per DQ-R1-019 | ESO → two ExternalSecret resources (AWSCURRENT and AWSPREVIOUS) → HOCON extras.email.encryptionKeys (list keyed by SM versionId) → HKDF derivation in TokenCipher. Rare rows older than AWSPREVIOUS fall back to a direct AWS SM SDK fetch via the EmailEncryptionKeyFallbackRole, which the operations pod assumes via STS from its IRSA-bound pod role; the pod role itself does not carry secretsmanager:GetSecretValue (DQ-R1-020). | EmailConfigurationService (only) | aws secretsmanager put-secret-value creates a new versionId → ESO refreshes both mounts → pod’s TokenCipher holds both AWSCURRENT and AWSPREVIOUS keys → first non-up-to-date read synchronously re-encrypts its own row and launches a per-pod coroutine to mop up the rest of the partition → operator verifies completion (SELECT COUNT(*) FROM email_configuration WHERE server_token_encrypted NOT LIKE 'a1.k${currentVersionId}:%' returns zero) → optionally retires the prior SM version (DQ-R1-019; full design in 4-runtime-platform-updates/design/email-server-key-encryption.md). |
| Per-tenant Postmark server token | Postmark API (POST /servers response) | DB email_configuration.server_token_encrypted (AES-256-GCM versioned envelope) | Decrypted on demand by EmailConfigurationService.getUnlockedConfiguration(); passed in-memory through L2 / L1 | postmarkServerProxy (via method argument) | Per-tenant via Postmark POST /servers/{id}/rotateToken; deferred to v2 |
| Free Kanban Tool Postmark server token | Postmark API (Phase A of Corporate CLI; one-time, idempotent) | 1Password (Free-Kanban-Generator-Postmark-Server in Arda-CorporateOAM; ref op://Arda-CorporateOAM/Free-Kanban-Generator-Postmark-Server/credential) | 1Password SDK at the Free Kanban Tool’s runtime resolution (out of scope of this project) | Free Kanban Tool runtime (out of scope) | Manual: regenerate in Postmark console → update 1Password item |
| ARDA_API_KEY | Existing platform mechanism | Existing platform store | Existing platform delivery | Inbound HTTP auth (in-component) + outbound Postmark webhook Bearer header | Existing platform rotation procedure |
Note: the encryption-key secret is HKDF-derived in the application before use (DQ-203). The HKDF info string is "arda.email.serverToken.a{N}" where a{N} is the algorithm version (the first axis of the two-axis envelope per DQ-R1-019). v1 of the algorithm uses info = "arda.email.serverToken.a1". Future algorithm changes bump a{N} (rare, code-released); secret-material rotations are tracked independently by AWS SM’s native versionId mechanism and surface in the envelope as the k{...} axis.
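To make the two-axis envelope concrete, the sketch below encrypts and decrypts a token as a1.k{versionId}:<base64> using Node's built-in crypto module. Only the AES-256-GCM cipher, the HKDF info string, and the a1.k{versionId}: prefix come from this document; the HKDF hash (SHA-256), the empty salt, the 12-byte nonce, and the nonce-ciphertext-tag payload layout are assumptions, and the function names are hypothetical. The real TokenCipher is a Kotlin class.

```typescript
import { createCipheriv, createDecipheriv, hkdfSync, randomBytes } from "node:crypto";

const HKDF_INFO = "arda.email.serverToken.a1"; // algorithm axis a1, per the note above

/** Derives a 32-byte AES key. SHA-256 and the empty salt are assumptions. */
function deriveKey(secretMaterial: string): Buffer {
  return Buffer.from(hkdfSync("sha256", secretMaterial, Buffer.alloc(0), HKDF_INFO, 32));
}

/** Produces "a1.k{versionId}:<base64(nonce || ciphertext || tag)>" (assumed payload layout). */
function encryptToken(plaintext: string, versionId: string, secretMaterial: string): string {
  const nonce = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", deriveKey(secretMaterial), nonce);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const payload = Buffer.concat([nonce, ciphertext, cipher.getAuthTag()]);
  return `a1.k${versionId}:${payload.toString("base64")}`;
}

/** Looks up the key by the envelope's versionId, then verifies and decrypts. */
function decryptToken(envelope: string, secretsByVersionId: ReadonlyMap<string, string>): string {
  const match = /^a1\.k([^:]+):(.+)$/.exec(envelope);
  if (match === null) throw new Error("unrecognised envelope");
  const versionId = match[1] ?? "";
  const secretMaterial = secretsByVersionId.get(versionId);
  if (secretMaterial === undefined) throw new Error(`no key material for version ${versionId}`);
  const payload = Buffer.from(match[2] ?? "", "base64");
  const nonce = payload.subarray(0, 12);
  const tag = payload.subarray(payload.length - 16);
  const ciphertext = payload.subarray(12, payload.length - 16);
  const decipher = createDecipheriv("aes-256-gcm", deriveKey(secretMaterial), nonce);
  decipher.setAuthTag(tag);
  // final() throws if the tag does not verify (wrong key or tampered data).
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Keeping the versionId in the clear prefix is what lets a reader pick the right key, and lets an operator count un-migrated rows with a simple LIKE filter.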
4.2 Logging and redaction
Three classes of values must never appear in logs:
- The per-partition encryption key (raw secret value or HKDF-derived key bytes).
- Per-tenant Postmark server tokens (plaintext or any decrypted form).
- Email body content (HTML / text body).
Plus, in deploy-time tooling:
- The 1Password service-account token (OP_SERVICE_ACCOUNT_TOKEN).
- Postmark account-level tokens.
- The Free Kanban Tool’s Postmark server token.
Redaction is enforced at:
- L1 proxies: per-surface application logging contract pinned in DQ-220.h. Highlights: postmarkServerProxy.sendEmail logs only recipient, subject, and MessageID; postmarkAccountProxy.* logs only operation name, resource ids, and HTTP status.
- Transport layer (Ktor Logging plugin): common-lib httpClient (in common-module) installs sanitizeHeader { ... } covering Authorization, X-Postmark-Account-Token, X-Postmark-Server-Token, and api_key. One-line addition that benefits every proxy in the platform.
- L3 services: structured-log frameworks must mark relevant fields as redacted; entity field-by-field log helpers exclude serverTokenEncrypted and serverTokenPlaintext.
- Deploy-time tooling: a cross-cutting redact() utility under src/main/cdk/utils/logging.ts is consumed by the Postmark thin-wrapper constructs and the Corporate CLI. The same utility is the basis for drift-workflow log scrubbing.
What may be logged (acceptable): Postmark resource IDs (serverId, domainId, webhookId), MessageID, sending domain FQDN, recipient email addresses (these are PII; see § 6.2), bounce diagnostics.
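A deploy-time redaction helper of the kind described in this section might look like the sketch below. The signatures are hypothetical (the actual redact() in src/main/cdk/utils/logging.ts is not shown in this document); only the list of sensitive header names is taken from the text above.

```typescript
// Header names taken from the transport-layer bullet; compared case-insensitively.
const SENSITIVE_HEADERS = [
  "authorization",
  "x-postmark-account-token",
  "x-postmark-server-token",
  "api_key",
];

const MASK = "***";

/** Returns a copy of the headers with sensitive values masked. */
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const sensitive = SENSITIVE_HEADERS.indexOf(name.toLowerCase()) !== -1;
    out[name] = sensitive ? MASK : headers[name];
  }
  return out;
}

/** Masks every occurrence of each known secret value inside a log message. */
function redact(message: string, secrets: readonly string[]): string {
  let out = message;
  for (const secret of secrets) {
    // Skip empty strings: splitting on "" would mask between every character.
    if (secret.length > 0) out = out.split(secret).join(MASK);
  }
  return out;
}
```

Masking by header name covers the transport layer, while masking by value catches secrets that leak into free-form messages such as error strings.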
4.3 Drift detection
The project introduces scheduled drift detection that asserts external resources match their declared state:
- A monthly GitHub Actions workflow exercises the live Postmark Account API surface for each Postmark account, enumerates servers and sending domains, and asserts that the visible state matches what the infrastructure repository declares.
- The workflow authenticates to 1Password via OP_SERVICE_ACCOUNT_TOKEN and resolves Postmark account tokens at runtime; no Postmark token is persisted as a GitHub Actions secret.
- On any divergence, the workflow opens a labelled GitHub issue with the run URL and the observed-vs-declared diff. The repository’s existing on-call routing handles triage.
The same pattern applies to other external surfaces (1Password vault contents; GitHub repository configuration); the initial implementation covers the Postmark surface and serves as the template.
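The declared-vs-observed assertion at the heart of the workflow reduces to a set difference. The sketch below is a minimal TypeScript illustration with hypothetical names and an assumed issue-body layout; it compares resource names only, whereas a real check would presumably also compare resource attributes.

```typescript
interface DriftReport {
  missing: string[];    // declared in the repository but not observed in Postmark
  unexpected: string[]; // observed in Postmark but not declared in the repository
}

/** Compares declared and observed resource names, ignoring order and duplicates. */
function diffState(declared: readonly string[], observed: readonly string[]): DriftReport {
  const declaredSet = new Set(declared);
  const observedSet = new Set(observed);
  return {
    missing: Array.from(declaredSet).filter((name) => !observedSet.has(name)).sort(),
    unexpected: Array.from(observedSet).filter((name) => !declaredSet.has(name)).sort(),
  };
}

function hasDrift(report: DriftReport): boolean {
  return report.missing.length > 0 || report.unexpected.length > 0;
}

/** Renders the issue body: run URL plus the observed-vs-declared diff (layout is an assumption). */
function issueBody(runUrl: string, report: DriftReport): string {
  const lines = [`Run: ${runUrl}`];
  for (const name of report.missing) lines.push(`- missing: ${name}`);
  for (const name of report.unexpected) lines.push(`- unexpected: ${name}`);
  return lines.join("\n");
}
```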
5. OAM (operations, administration, management)
5.1 Postmark Console — the primary OAM surface
Aggregate operational management of the email service is performed through the Postmark Console, not through Arda-built tooling. The Postmark Console is the source of truth for:
- Server-by-server delivery / bounce / complaint statistics.
- Per-domain DKIM / SPF / DMARC verification status.
- Suppression-list management.
- Message-level diagnostics (search by recipient, by subject, by MessageID).
- Webhook configuration changes (post-provisioning).
Arda-side telemetry (§ 5.2, § 5.3) supplements the Postmark Console with information that is specific to Arda’s use of Postmark (per-tenant aggregates, polling-task health, lifecycle transitions). It does not replace the Postmark Console.
This project does not build an Arda-side OAM UI for the email service; that is explicitly out of scope (goal.md Out of scope).
5.2 Logging conventions
Structured JSON via the existing operations-component logging stack. Standard fields:
- Correlation ID (per request).
- Tenant ID (where relevant).
- Configuration ID, Job ID (where relevant).
- Layer marker (L1 / L2 / L3) for traceability across boundaries.
- Severity per the standard SLF4J levels.
Logging contract per layer:
- L1 proxies: INFO per remote call (request path + response status); WARN on non-2xx; ERROR on parse / connection failures. Tokens and email body content are excluded from any logged request or response body.
- L2 capability composers: INFO per capability operation (start / success); WARN on Result.failure(PartialProgress) with the failedAt step.
- L3 services: INFO on lifecycle transitions; WARN on retry-with-backoff (DQ-205.e step-9 path); ERROR on persistent UPDATE failures with diagnostic naming the orphans.
Deploy-time tooling (Corporate CLI, Postmark thin-wrappers, drift workflows) follows the same redaction contract via the cross-cutting src/main/cdk/utils/logging.ts utility.
5.3 Metrics
| Metric | Source | Use |
|---|---|---|
| email_send_total{tenant, status} | EmailJob transitions | Per-tenant send volume and outcome distribution |
| email_delivery_rate{tenant} | DELIVERED / total | Deliverability tracking |
| email_bounce_rate{tenant, type} | BOUNCED breakdown | Bounce-rate alerting (in v2; ESP-OOTB in v1 per DQ-006) |
| email_complaint_rate{tenant} | COMPLAINED / total | Spam-complaint alerting (v2) |
| email_provisioning_duration_seconds | provision call latency | Provisioning health |
| email_dns_verification_attempts_total{outcome} | bounded polling rounds | Verification health |
| email_polling_active_count{pod} | per-pod activePolling map size | In-flight verification visibility |
Drift-detection workflows do not emit pod-level metrics; their signal is the absence of a failing run plus the absence of an open drift issue in the repository.
v1 emits the metrics above via the existing operations metrics pipeline; no email-specific Prometheus exporter or CloudWatch namespace.
5.4 Operator alerts and runbooks
| Alert | Query | Threshold | Recipient | Runbook |
|---|---|---|---|---|
| email_configuration_pending_stale | count(*) WHERE status='PENDING_VERIFICATION' AND verification_started_at < now() - interval '15 minutes' | result > 0 | CS / on-call | (1) Inspect diagnostic_message. (2) Verify Postmark domain status via GET /domains/{id} (or in the Postmark Console). (3) Hit PUT /retry-verification if DNS is now ready, or DELETE if known-broken. (4) Confirm alert clears within ~15 min. (DQ-207.j) |
| email_configuration_provisioning_stuck | count(*) WHERE status='PROVISIONING' AND provisioning_started_at < now() - interval '5 minutes' | result > 0 | on-call | (1) Identify orphan external resources (server name pattern, sending-domain FQDN) — the Postmark Console is the source of truth here. (2) Manually transition row to PROVISIONING_FAILED with diagnostic. (3) Run DELETE to invoke best-effort decommission. (DQ-205.f) |
| Drift-detection workflow failure | Auto-issue opened by the drift workflow | issue created | on-call | (1) Read the run logs at the link in the issue body. (2) Compare observed Postmark state to the declared state in infrastructure. (3) Either reconcile manually (Postmark Console + tooling) or open a follow-up to update the declarations. (4) Close the issue when reconciled. |
| Future v2: email_bounce_rate_high | bounce rate > 5% per tenant per hour | exceeded | CS | Postmark Console for diagnostics; tenant outreach (DQ-006) |
| Future v2: email_complaint_rate_high | complaint rate > 0.1% per tenant per hour | exceeded | CS | Same |
5.5 Manual operations
CS / on-call workflows surfaced as endpoints, operational queries, or operator scripts:
- PUT /email-configuration/<configId>/retry-verification — kicks off a fresh bounded DNS verification round (DQ-207.b). Allowed from PENDING_VERIFICATION or VERIFICATION_FAILED; refreshes verification_started_at.
- PUT /email-configuration/<configId>/lock and /unlock — pure DB transitions; lock prevents new sends through that configuration (Scenario 6).
- DELETE /email-configuration/<configId> — runs best-effort decommission (R53 deletes first, then Postmark per DQ-205.k); deletes the row unconditionally (DQ-205.d).
- Manual stuck-row triage (no endpoint) — operator runs the stuck-row queries above; for PROVISIONING rows, manually transitions to PROVISIONING_FAILED then DELETE. The Postmark Console is the source of truth for whether orphans exist on the Postmark side.
- Postmark API change watch (continuous) — the #dev-team Slack channel is subscribed to https://postmarkapp.com/updates/type/api. Triage rule: any post mentioning Servers, Domains, Webhooks, Email, or transport-layer changes is reviewed against the L1 proxy implementations and the Postmark thin-wrapper constructs; if a contract change affects either surface, a follow-up issue is filed.
- First-deploy credential verification (per partition, per setup) — on the first deploy of the email module to a new partition (or after rotating any of the Postmark account token / encryption key / DNS provisioning role ARN), the operator runs an ad-hoc smoke test: provision a known-disposable test tenant via POST /email-configuration, observe the row reach PENDING_VERIFICATION, exercise /retry-verification, observe verification success, send a test email, observe the webhook firing, then DELETE the row. The runbook substitutes for live integration testing of the L1 proxies (per DQ-220.g).
- Corporate CLI operator runs — the Corporate CLI (tools/corporate-cli.ts) invokes Phase A (Postmark thin-wrapper calls) followed by Phase B (cdk deploy). On a Phase-A failure the operator can re-run Phase A; on a Phase-B failure the operator can re-run Phase B without re-issuing Phase A’s Postmark calls (idempotent reconcile).
5.6 Rotation procedures
Postmark account token — out-of-band, manual:
- Regenerate the token in the Postmark Console.
- Update both 1Password copies: the qualified item in Arda-SystemsOAM (Postmark-Prod or Postmark-NonProd) and the per-partition item Postmark in each affected Arda-{Env}OAM vault (e.g., Arda-ProdOAM/Postmark for a PostmarkProd rotation). The two stores are independent; both must be updated so CI drift-detection and partition runtime tooling resolve the same token.
- Re-run the per-partition deploy (or a dedicated reconcile command) to refresh the Secrets Manager copy in each affected partition.
- Refresh ESO sync (or wait for the 1h interval).
- Restart pods if needed (the new value is picked up on next pod startup; existing pods continue using the cached value until restart).
Drift-detection workflows continue to authenticate via the 1Password service-account token after the rotation (no separate update required).
1Password service-account token — manual:
- Regenerate the service-account token in the 1Password admin console.
- Run tools/gha-secret.ts with the new value to update the OP_SERVICE_ACCOUNT_TOKEN repository secret.
- Confirm the next CI run authenticates successfully.
Encryption key — hot-swap via SM-native versioning + lazy migration (DQ-R1-019; full design in 4-runtime-platform-updates/design/email-server-key-encryption.md):
- Generate the new key material and write it to the SM secret (the AWS CLI's `update-secret` subcommand does not accept `--generate-random-password`; random generation is a separate subcommand, `get-random-password`):

  ```bash
  # Generate 64 random characters, excluding ones that are awkward to quote.
  NEW=$(aws secretsmanager get-random-password \
    --password-length 64 --exclude-characters '"@/\' \
    --require-each-included-type \
    --output text --query RandomPassword)
  # Store the material as a new version of the encryption-key secret.
  aws secretsmanager put-secret-value \
    --secret-id "{fqn}-I-EmailEncryptionKey" --secret-string "$NEW"
  ```

  `put-secret-value` creates a new `versionId`, promotes it to `AWSCURRENT`, and demotes the prior version to `AWSPREVIOUS`.
- Within the ESO `refreshInterval` (~1 min), both `ExternalSecret` resources (`AWSCURRENT` and `AWSPREVIOUS`) refresh; the corresponding Kubernetes Secrets update with the new versionIds and material.
- The operations component pod picks up the change on its next refresh (rolling restart, or in-pod `TokenCipher.reload()` tick). The `TokenCipher` now holds both old and new derived keys; new writes encrypt as `a1.k{new-versionId}:…`.
- Migration runs automatically. The first `EmailConfigurationService.getUnlockedConfiguration()` call that encounters a row still tagged with the old versionId synchronously re-encrypts that row and launches a per-pod coroutine that mops up the rest. The migration is idempotent and self-healing across pod restarts.
- Operator verifies completion: `SELECT COUNT(*) FROM email_configuration WHERE server_token_encrypted NOT LIKE 'a1.k${currentVersionId}:%'` returns zero (an admin endpoint will expose this in Phase 5b).
- (Optional) `aws secretsmanager update-secret-version-stage --secret-id "{fqn}-I-EmailEncryptionKey" --version-stage AWSPREVIOUS --remove-from-version-id <old>` retires the `AWSPREVIOUS` label. The version stays in SM history (still SDK-fetchable via the `EmailEncryptionKeyFallbackRole` STS hop) until Secrets Manager cleans it up. Secrets Manager has no per-version delete: a version left without staging labels is deprecated and becomes eligible for automatic removal, while `delete-secret` removes the entire secret (after a recovery window of at least 7 days) and is therefore not a rotation step.
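The versioned envelope and lazy migration described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the `TokenCipher` implementation: the class name `TokenCipherSketch`, the SHA-256 key derivation, and the payload layout after the colon (base64 of IV, auth tag, ciphertext) are all hypothetical; only the `a1.k{versionId}:` prefix and the use of AES-256-GCM come from this document.

```typescript
import { createCipheriv, createDecipheriv, createHash, randomBytes } from "node:crypto";

// Assumption: stand-in key derivation. The real KDF is defined in the encryption design doc.
function deriveKey(material: string): Buffer {
  return createHash("sha256").update(material, "utf8").digest(); // 32 bytes for AES-256
}

class TokenCipherSketch {
  private readonly keys = new Map<string, Buffer>();
  private readonly currentVersionId: string;

  constructor(currentVersionId: string, materialByVersion: Record<string, string>) {
    this.currentVersionId = currentVersionId;
    for (const [versionId, material] of Object.entries(materialByVersion)) {
      this.keys.set(versionId, deriveKey(material));
    }
    if (!this.keys.has(currentVersionId)) throw new Error("current key material missing");
  }

  // New writes are always tagged with the current versionId.
  encrypt(plaintext: string): string {
    const key = this.keys.get(this.currentVersionId)!;
    const iv = randomBytes(12);
    const cipher = createCipheriv("aes-256-gcm", key, iv);
    const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
    // Assumed payload layout: 12-byte IV, 16-byte auth tag, then ciphertext.
    const payload = Buffer.concat([iv, cipher.getAuthTag(), body]).toString("base64");
    return `a1.k${this.currentVersionId}:${payload}`;
  }

  // Reads pick the key named in the envelope prefix, so old rows stay readable.
  decrypt(envelope: string): string {
    const match = /^a1\.k([^:]+):(.+)$/.exec(envelope);
    if (!match) throw new Error("unrecognised envelope");
    const key = this.keys.get(match[1]);
    if (!key) throw new Error(`no key loaded for version ${match[1]}`);
    const raw = Buffer.from(match[2], "base64");
    const decipher = createDecipheriv("aes-256-gcm", key, raw.subarray(0, 12));
    decipher.setAuthTag(raw.subarray(12, 28));
    return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString("utf8");
  }

  needsMigration(envelope: string): boolean {
    return !envelope.startsWith(`a1.k${this.currentVersionId}:`);
  }

  // Idempotent: rows already under the current version are returned unchanged.
  migrate(envelope: string): string {
    return this.needsMigration(envelope) ? this.encrypt(this.decrypt(envelope)) : envelope;
  }
}
```

Because `migrate` is a no-op for rows already tagged with the current versionId, running it repeatedly (for example after a pod restart) is safe, which is the property the mop-up coroutine relies on.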
A row encrypted under a version older than AWSPREVIOUS (rare; happens only after two consecutive rotations within an un-migrated window) triggers a one-off secretsmanager:GetSecretValue SDK call from the pod, cached for the pod’s lifetime. If the SM version has been deleted, TokenCipher throws RetiredSecretVersion; the operator runbook covers manual remediation.
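The fallback for versions older than `AWSPREVIOUS` can be pictured as a small lookup chain: projected material first, then a one-off fetch cached for the pod's lifetime, then `RetiredSecretVersion`. The sketch below assumes a hypothetical `VersionFetcher` callback that wraps the `GetSecretValue` call and resolves to `undefined` when the version no longer exists; `KeyResolverSketch` is likewise an illustrative name, not part of the design.

```typescript
class RetiredSecretVersion extends Error {
  constructor(versionId: string) {
    super(`secret version ${versionId} is no longer available`);
    this.name = "RetiredSecretVersion";
  }
}

// Assumption: wraps a GetSecretValue-by-versionId call; undefined means the version is gone.
type VersionFetcher = (versionId: string) => Promise<string | undefined>;

class KeyResolverSketch {
  // Holds the ESO-projected versions plus anything fetched on demand.
  private readonly cache = new Map<string, string>();
  private readonly fetchVersion: VersionFetcher;

  constructor(projected: Record<string, string>, fetchVersion: VersionFetcher) {
    for (const [versionId, material] of Object.entries(projected)) {
      this.cache.set(versionId, material);
    }
    this.fetchVersion = fetchVersion;
  }

  async resolve(versionId: string): Promise<string> {
    const cached = this.cache.get(versionId);
    if (cached !== undefined) return cached;

    // One-off SDK call for a version older than AWSPREVIOUS.
    const fetched = await this.fetchVersion(versionId);
    if (fetched === undefined) throw new RetiredSecretVersion(versionId);

    this.cache.set(versionId, fetched); // cached for the lifetime of this process
    return fetched;
  }
}
```

Caching in a plain in-memory map mirrors the "cached for the pod's lifetime" wording; a restart simply repeats the single fetch.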
Automated rotation (AWS SM Rotation Lambda) is enabled by this design and deferred to a future deliverable.
Free Kanban Tool Postmark server token — out-of-band, manual:
- Regenerate the server token in the Postmark Console (or via the Postmark Server API).
- Update the `Free-Kanban-Generator-Postmark-Server` 1Password item in the `Arda-CorporateOAM` vault.
- Restart / refresh the Free Kanban Tool runtime so it re-resolves the token.
Per-tenant Postmark server tokens — v2 only. Postmark’s POST /servers/{id}/rotateToken issues a new token and invalidates the old. v1 does not exercise this.
6. Compliance and audit
6.1 Bitemporal audit trail
Both email_configuration and email_job are persisted via Arda’s bitemporal DataAuthority pattern. Every state transition produces a new version with valid_from / valid_to and transaction_time columns. Operators can query “what was the state of this config at time T” or “what was the latest state as known at time T” without losing history.
The exact bitemporal application (what counts as a state-change event) is a substantive open decision — see DQ-240.a / DQ-250.c in phased-design-requirements.md.
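The two operator questions in § 6.1 differ only in which time axis is pinned. A minimal sketch, assuming a simplified in-memory row shape (`ConfigVersion`) and append-only history; the field names mirror the `valid_from` / `valid_to` / `transaction_time` columns, but the helper functions and their tie-breaking rules are hypothetical and do not describe the `DataAuthority` API.

```typescript
interface ConfigVersion {
  validFrom: number;       // valid_from (epoch millis), inclusive
  validTo: number;         // valid_to, exclusive; Infinity while open-ended
  transactionTime: number; // transaction_time: when this version was recorded
  state: string;
}

// "What was the state at time T?" Optionally restricted to what was recorded by `knownAt`.
function stateAt(history: ConfigVersion[], t: number, knownAt: number = Infinity): string | undefined {
  const candidates = history
    .filter((v) => v.transactionTime <= knownAt && v.validFrom <= t && t < v.validTo)
    .sort((a, b) => b.transactionTime - a.transactionTime); // most recently recorded first
  return candidates[0]?.state;
}

// "What was the latest state as known at time T?"
function latestKnownAt(history: ConfigVersion[], knownAt: number): string | undefined {
  const candidates = history
    .filter((v) => v.transactionTime <= knownAt)
    .sort((a, b) => b.transactionTime - a.transactionTime || b.validFrom - a.validFrom);
  return candidates[0]?.state;
}

// Example history: created PENDING at 100, activated at 200,
// then corrected at 300 to say activation really took effect at 150.
const history: ConfigVersion[] = [
  { validFrom: 100, validTo: Infinity, transactionTime: 100, state: "PENDING" },
  { validFrom: 100, validTo: 200, transactionTime: 200, state: "PENDING" },
  { validFrom: 200, validTo: Infinity, transactionTime: 200, state: "ACTIVE" },
  { validFrom: 100, validTo: 150, transactionTime: 300, state: "PENDING" },
  { validFrom: 150, validTo: Infinity, transactionTime: 300, state: "ACTIVE" },
];
```

With this history, the state at time 170 depends on when the question is asked: before the correction was recorded the answer is the pending state, afterwards it is the active one, and neither answer overwrites the other.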
6.2 PII handling
The system handles the following PII classes:
- Recipient email addresses (`to`, `cc` on EmailJob): stored plaintext in DB. Required for resend / audit.
- Reply-To address (often the user’s own email): stored plaintext.
- Email body (HTML / text): stored plaintext for resend support.
- Bounce / complaint diagnostics: may contain recipient address fragments and bounce reason codes.
GDPR-shaped data-subject rights flows (right to erasure, right to access) are out of scope for v1. v2 will address retention and erasure procedures; v1 retains all data indefinitely via the bitemporal history.
6.3 Data retention
- EmailJob: retained indefinitely in v1. Bitemporal history preserves all status transitions.
- EmailConfiguration: retained indefinitely while the row exists; on `DELETE`, the row is removed but bitemporal versions of the deleted row persist in history (this is the `DataAuthority` default).
- Postmark-side data: Postmark retains messages per its own retention policies; we do not control or override them.
A retention policy (e.g., archive jobs older than N years; purge after M years) is deferred to v2.
6.4 Free Kanban Tool data scope
The Free Kanban Tool’s send-side data resides in the Free Kanban Tool’s own runtime (out of scope of this project). On the Postmark side, the Free Kanban Tool’s server stores send history per Postmark’s retention policies. The infrastructure repository declares only the sending-domain DNS records and the Postmark server’s identifying metadata (server name, signing identity); no per-message data crosses into Arda’s email-integration scope.
7. Defence-in-depth posture
| Layer | Protection |
|---|---|
| Network | TLS everywhere (HTTPS for inbound, HTTPS / SDK-over-TLS for outbound). No plaintext within VPC. |
| Application — in-component authentication | Each route validates its own credential in Ktor (no API-gateway-level authoriser dependency for the email service). |
| Application — error pathway | All proxy methods return Result<T>; failures captured via runCatching; no exception propagation outside Result.failure. Token redaction at the transport and L1 layers. |
| Application — secret handling | Encryption key held only by L3 service. Per-tenant tokens encrypted at rest; plaintext lives only in the in-memory call stack during a send. |
| Storage — Aurora | KMS-backed volume encryption (existing platform default). |
| Storage — application-level encryption | AES-256-GCM versioned envelope on per-tenant tokens (DQ-202). Defence against DB dumps, SQL-injection-style read leaks, analyst sessions. |
| Credential resolution | 1Password as system of record; SDK-mediated reads at deploy time and CI; ESO-projected at runtime. No long-lived token environments outside 1Password. |
| IAM | Pod IRSA service-account role; STS-chained credentials with 15-min session duration; no long-lived AWS credentials in pods (DQ-204). |
| External-resource state | Drift-detection workflows assert declared-vs-observed state on a schedule; divergences raise an automated investigation issue. |
| Operational | Operator alerts on stale / stuck rows; manual triage workflow; Postmark Console as source of truth for delivery diagnostics; bitemporal audit trail for forensics. |
The combination addresses the threat model in § 1; gaps acknowledged in that section (pod compromise, Postmark account compromise, 1Password compromise) are out of scope at this layer.
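The "declared-vs-observed" assertion in the External-resource state row reduces to a keyed comparison. The sketch below is illustrative only: `detectDrift`, the flat string-map representation of resource state, and the example keys are assumptions made for the example, not the drift-detection workflow's actual shape.

```typescript
type ResourceState = Record<string, string>;

interface Divergence {
  key: string;
  declared?: string; // undefined when the attribute exists only on the observed side
  observed?: string; // undefined when the attribute exists only in the declaration
}

// Compare every key present on either side and report the ones that differ.
function detectDrift(declared: ResourceState, observed: ResourceState): Divergence[] {
  const keys = Array.from(new Set(Object.keys(declared).concat(Object.keys(observed)))).sort();
  return keys
    .filter((key) => declared[key] !== observed[key])
    .map((key) => ({ key, declared: declared[key], observed: observed[key] }));
}

// Hypothetical example: one changed attribute and one undeclared attribute.
const divergences = detectDrift(
  { serverName: "free-kanban", trackOpens: "false" },
  { serverName: "free-kanban", trackOpens: "true", inboundHook: "https://example.invalid/hook" },
);
```

A non-empty result is what would feed the automated investigation issue; an empty list means the scheduled assertion passed.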
7a. Message stream discipline
All Arda outbound email is transactional. Every Postmark server provisioned for Arda (the Free Kanban Tool server, future per-partition operations servers, future per-tenant servers) uses the default Transactional Message Stream and must not be repurposed for marketing, newsletter, or bulk/broadcast sends. Postmark’s policy (Best Practices for Bulk/Broadcast Sending) requires bulk traffic to flow through a dedicated Broadcast Message Stream; co-mingling transactional and broadcast on the same stream both violates that policy and risks degrading transactional reputation.
If a future use case requires bulk send (e.g., tenant announcements, product updates), provision a separate Broadcast stream on the relevant Postmark server, treat it as a distinct OAM surface (its own throttle, suppression list, reputation), and expose it through a separate L2 API in the operations component — do not route through the transactional EmailSender interface. Phase 5b’s EmailSender is explicitly transactional-only.
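One way to make the transactional-only rule hard to violate is to pin the stream when the request body is built. The sketch below assumes Postmark's default transactional stream ID `outbound` and the `MessageStream` field of the send-email request body; the `TransactionalEmail` type and `toPostmarkPayload` helper are hypothetical names and do not describe the Phase 5b `EmailSender` interface.

```typescript
// Postmark's default transactional stream ID.
const TRANSACTIONAL_STREAM = "outbound";

interface TransactionalEmail {
  from: string;
  to: string;
  subject: string;
  textBody: string;
}

// Builds the JSON body for a Postmark send; refuses any stream other than the transactional one.
function toPostmarkPayload(email: TransactionalEmail, stream: string = TRANSACTIONAL_STREAM) {
  if (stream !== TRANSACTIONAL_STREAM) {
    throw new Error(`stream "${stream}" is not allowed on the transactional path`);
  }
  return {
    From: email.from,
    To: email.to,
    Subject: email.subject,
    TextBody: email.textBody,
    MessageStream: stream,
  };
}
```

A future broadcast path would get its own builder and its own L2 API rather than a relaxed check here, which keeps the two OAM surfaces separate in code as well as in Postmark.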
8. References
- Project goal
- Architecture overview
- Phase structure
- Decision log: upstream decisions (`DQ-001` to `DQ-013`) and Revision-1 decisions (`DQ-R1-NNN`)
- Application-layer open decisions: `DQ-201` to `DQ-208`
- Application-layer functional design
- Application-layer decision log: DQ-220 series
Copyright: © Arda Systems 2025-2026, All rights reserved