
Email Integration -- Phased Design Requirements

For each implementation phase defined in dependency-analysis.md (other than Phase 1, which is owned by 1-infrastructure/), this document inventories the design decisions that must be settled before that phase can be implemented.

Decisions fall into one of three statuses:

| Status | Meaning |
| --- | --- |
| Resolved | Already settled in open-decision-analysis.md (DQ-201…DQ-208) or in decision-log.md (DQ-001…DQ-013). The phase consumes the decision; no further work needed. |
| Parked / Deferred | Explicitly settled to be resolved later (e.g., end-to-end pass, v2). The phase plans around the parked status. |
| New | Not yet identified as a discrete decision. Must be resolved during the per-phase planning artifacts (requirements / specification) before the phase implements. Provisional DQ identifier listed for traceability. |

The point of this inventory is to make per-phase planning straightforward: each phase’s planning pass starts with the rows marked New and produces formal requirements / specifications that close those decisions, plus references to the Resolved ones.

The numbering scheme for New decisions:

  • DQ-22x for Phase 2
  • DQ-23x for Phase 3
  • DQ-24x for Phase 4
  • DQ-25x for Phase 5
  • DQ-26x for Phase 6
  • DQ-20x for cross-cutting

These are provisional handles; final identifiers will be assigned when each phase’s decision-log.md is created.

Stylistic conventions covered by repo-level skills (kotlin-coding, unit-tests-backend, path-conventions, document-writing, etc.) are excluded — those are handled by skill loading at implementation time, not project-level decisions.


| # | Decision | Status | Reference / Note |
| --- | --- | --- | --- |
| 0.1 | PostmarkProd / PostmarkNonProd accounts on Platform plan | Resolved | REQ-PM-ACCT-001 / 002; supersedes Phase 1 REQ-PM-001 |
| 0.2 | Account-level Postmark tokens stored in 1Password (Arda-SystemsOAM) | Resolved | REQ-PM-ACCT-003; supersedes Phase 1 REQ-PM-002 |
| 0.3 | Free Prod Kanban Tool standalone Postmark server provisioning | Resolved | REQ-PM-FREE-001..004; Option A (fully scripted TypeScript orchestrator) |
| 0.4 | Free server domain free.platform.prod.ardamails.com; platform slug already reserved | Resolved | REQ-PM-FREE-002, REQ-PM-SLUG-001; list literal already includes platform (no code change needed) |
| 0.5 | Free server DNS records hosted in prod.ardamails.com partition zone (records-only; separate platform.prod.ardamails.com zone deferred) | Resolved | REQ-PM-FREE-005; TTL 300s for future-migration hygiene; recipe to be documented at current-system/oam/postmark-service/free-platform-server.md (page not yet authored) |
| 0.6 | TypeScript orchestrator (idempotent; reusable for future Platform-Engineering shared servers) | Resolved | REQ-PM-SCRIPT-001; 0-postmark-foundations/specification.md § 2 |
| 0.7 | CDK Route53 records construct (reads DKIM / Return-Path values from 1Password at synth time) | Resolved | REQ-PM-SCRIPT-002; 0-postmark-foundations/specification.md § 3 |
| 0.8 | Postmark API observations note feeds Phase 2 design | Resolved | REQ-PM-DOC-003 |
| 0.9 | Postmark service overview + Free-server integration guide in current-system/oam/postmark-service/ | Resolved | REQ-PM-DOC-001 / 002 |
| 0.10 | Token rotation, decommission, multi-server orchestration | Out of scope | Phase 0 deferred items; future runbook / v2 |

Phase 2 — Postmark API Clients (L1 Protocol Proxies + Route53)

| # | Decision | Status | Reference / Note |
| --- | --- | --- | --- |
| 1 | L1 proxy shape and location (per-credential, under shopaccess/email/servers/) | Resolved | DQ-201, DQ-201.b |
| 2 | Stateless L1, constructor-time validation | Resolved | DQ-201, DQ-205.h |
| 3 | Postmark Account API client method surface (server / domain / webhook CRUD + verifyDkim / verifyReturnPath + findServerByName / findDomainByName) | Resolved | DQ-201, DQ-205.i |
| 4 | Postmark Server API client method surface (sendEmail, createWebhook, deleteWebhook + per-call serverToken parameter) | Resolved | DQ-201 |
| 5 | Route53 proxy method surface (upsertTxtRecord, upsertCnameRecord, deleteRecord, findRecordsAt) | Resolved | DQ-201, DQ-205.m |
| 6 | STS AssumeRole strategy for Route53 (auto-chain, 15-min sessions) | Resolved | DQ-204 |
| 7 | Postmark name filter behavior (substring; client-side exact-match; pagination for /domains) | Resolved | DQ-205.i |
| 8 | Public contract: every method returns Result<T> via runCatching | Resolved | DQ-201 (clarified during discussion) |
| 9 | HTTP client library + engine choice for Postmark proxies (Ktor Java engine, per Documint precedent) | Resolved | DQ-220.a. Reuse common-lib httpClient(log). |
| 10 | HTTP client timeouts, connection pool, keep-alive policy | Resolved | DQ-220.b. Inherit common-lib defaults; revisit only on observed failures. |
| 11 | Retry policy on transient failures (5xx / network): none, or limited? | Resolved | DQ-220.c. No L1 retries; fail-fast per DQ-205.l. |
| 12 | JSON serialization config: reuse JsonConfig.standardJson from common-lib, or scoped variant? | Resolved | DQ-220.d. Reuse JsonConfig.standardJson (already wired by common-lib httpClient). |
| 13 | AppError.ExternalService shape used for Postmark / Route53 / STS failures (HTTP code, message, response body, structured error subtype if any) | Resolved | DQ-220.e. Reuse flat AppError.ExternalService(msg, code, description); no 4xx/5xx subtype split; Postmark ErrorCode preserved verbatim in description. |
| 14 | Postmark response wire-model code generation vs hand-written DTOs | Resolved | DQ-220.f. Hand-written @Serializable data classes per proxy Model.kt. |
| 15 | Integration test gating (env var to enable; CI vs local-only) | Resolved | DQ-220.g. No live integration tests at L1. Drift signal via npm postmark witness package + Dependabot; implicit coverage via dev deploys + Bruno api-test; human watch on Postmark API updates feed. Most of DQ-260.c becomes moot under this resolution and will be revisited at Phase 6 planning. |
| 16 | Logging level conventions for L1 calls (request / response logging, redaction of tokens) | Resolved | DQ-220.h. Per-surface application logging contract + transport-level sanitizeHeader added to common-lib httpClient (cross-repo: common-module). |
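Rows 2, 7, 8, and 11 together describe a small contract: validate at construction, treat Postmark's name filter as a substring match and re-check for an exact match on the client, page through results, return `Result<T>`, and do not retry. The sketch below illustrates that shape under stated assumptions. `DomainSummary`, `ListDomainsPage`, and `AccountApiSketch` are hypothetical names, and the paged lister is a stand-in for the real HTTP call rather than the actual proxy code.

```kotlin
// Hypothetical wire model; the real DTOs are hand-written per proxy (row 14).
data class DomainSummary(val id: Long, val name: String)

// Hypothetical stand-in for one paged "list domains" call. The name filter
// is assumed to be a substring match, so a page may contain near-misses.
typealias ListDomainsPage = (nameFilter: String, offset: Int, count: Int) -> List<DomainSummary>

class AccountApiSketch(
    private val listPage: ListDomainsPage,
    private val pageSize: Int = 100,
) {
    init {
        // Constructor-time validation: misconfiguration fails at wiring time.
        require(pageSize > 0) { "pageSize must be positive" }
    }

    // Public contract: failures surface as Result.failure, with no retry here.
    fun findDomainByName(name: String): Result<DomainSummary?> = runCatching { scan(name) }

    private fun scan(name: String): DomainSummary? {
        var offset = 0
        var page: List<DomainSummary>
        do {
            page = listPage(name, offset, pageSize)
            // Client-side exact match on top of the substring filter.
            val hit = page.firstOrNull { it.name == name }
            if (hit != null) return hit
            offset += pageSize
        } while (page.size == pageSize) // a short page means no more results
        return null
    }
}
```

A caller would distinguish "not found" (`Result.success(null)`) from "lookup failed" (`Result.failure`), which is what lets a pre-flight availability check fail fast on transport errors instead of treating them as an absent name.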

| # | Decision | Status | Reference / Note |
| --- | --- | --- | --- |
| 17 | Module location and Helm apis.system.shopAccess.email entry | Resolved | DQ-201.b implicitly; functional.md specifies the path |
| 18 | HOCON namespace email.* | Resolved | functional.md |
| 19 | HOCON keys for Postmark account token, encryption key, DNS role ARN, hosted zone ID, polling parameters | Resolved | functional.md, DQ-203, DQ-204, DQ-207.d |
| 20 | ESO ExternalSecret entries that materialize into secrets.properties | Resolved | DQ-203, infrastructure exports |
| 21 | Flyway migration location: shopaccess/email/database/migrations/ | Resolved | functional.md |
| 22 | Construct-time validation of all wired-in classes (TokenCipher round-trip, proxies, services) | Resolved | DQ-205.h |
| 23 | Module dependency wiring pattern (manual constructor injection in Module.kt, matching Documint / Image pattern) | Resolved | DQ-230.a. Manual constructor injection via Application.emailModule(cfgProvider, registry); matches pdfRenderService / kanban / orders precedent. No DI framework. |
| 24 | Coroutine scope used for kickOffBoundedPolling (Application scope vs scoped to module vs scoped to service) | Resolved | DQ-230.b. EmailConfigurationService.Impl(scope = this) from Application.emailModule; monitor.subscribe(ApplicationStopping) { service.stop() } cancels active polling. |
| 25 | Module bootstrap order: when does L3 service start polling existing PENDING_VERIFICATION rows on startup? Or never (per DQ-207’s “trigger-driven only”)? | Resolved | DQ-230.c. No startup sweep. Pod restart leaves stranded rows for trigger-driven recovery + email_configuration_pending_stale operator alert. |
| 26 | Logger names + log levels per package | Resolved | DQ-230.d. Standard operations precedent (LogEnabled by LogProvider(<Class>::class) + optional LoggerFactory.getLogger). No email-specific convention. Redaction inherited from DQ-220.h. |
| 26.1 | Module HOCON path casing | Resolved | DQ-230.e. system.shopAccess.email (matches system.shopAccess.pdfRender precedent). |
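Row 22's "TokenCipher round-trip" check can be illustrated with a small cipher that encrypts and decrypts a probe value in its own constructor, so a bad key aborts module wiring instead of the first request. This is a sketch under assumptions, not the project's TokenCipher: the class name `TokenCipherSketch` and the envelope layout (one version byte, a 12-byte IV, then ciphertext, all base64-encoded) are illustrative, and key derivation is skipped by assuming a ready 32-byte key.

```kotlin
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

// Assumed envelope: base64(version byte + 12-byte IV + AES-GCM ciphertext and tag).
class TokenCipherSketch(key: ByteArray) {
    init {
        require(key.size == 32) { "AES-256 needs a 32-byte key" }
    }

    private val secretKey = SecretKeySpec(key.copyOf(), "AES")
    private val random = SecureRandom()

    init {
        // Construct-time validation: a failed round-trip throws during wiring.
        val probe = "round-trip-probe"
        check(decrypt(encrypt(probe)).getOrNull() == probe) { "TokenCipher round-trip failed" }
    }

    fun encrypt(plain: String): String {
        val iv = ByteArray(IV_BYTES).also { random.nextBytes(it) } // fresh IV per call
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.ENCRYPT_MODE, secretKey, GCMParameterSpec(TAG_BITS, iv))
        val body = cipher.doFinal(plain.toByteArray(Charsets.UTF_8))
        return Base64.getEncoder().encodeToString(byteArrayOf(VERSION) + iv + body)
    }

    // Decryption failures (bad base64, unknown version, tampering) become Result.failure.
    fun decrypt(envelope: String): Result<String> = runCatching {
        val raw = Base64.getDecoder().decode(envelope)
        require(raw.size > 1 + IV_BYTES && raw[0] == VERSION) { "unsupported envelope" }
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.DECRYPT_MODE, secretKey, GCMParameterSpec(TAG_BITS, raw, 1, IV_BYTES))
        String(cipher.doFinal(raw, 1 + IV_BYTES, raw.size - 1 - IV_BYTES), Charsets.UTF_8)
    }

    private companion object {
        const val VERSION: Byte = 1
        const val IV_BYTES = 12
        const val TAG_BITS = 128
    }
}
```

With manual constructor injection (row 23), constructing this object inside the module function is enough to get the validation: no framework hook is needed, and the exception propagates out of module start-up.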

Phase 4 — EmailConfiguration Service and Persistence

| # | Decision | Status | Reference / Note |
| --- | --- | --- | --- |
| 27 | EmailConfiguration entity fields (incl. provisioningStartedAt, verificationStartedAt, provisionedAt, postmarkWebhookId, diagnosticMessage) | Resolved | functional.md, DQ-205.b, DQ-207.i |
| 28 | Lifecycle states + transitions (PROVISIONING entry, PROVISIONING_FAILED terminal, no auto VERIFICATION_FAILED in v1) | Resolved | DQ-205.a, DQ-207.h |
| 29 | Encryption: AES-256-GCM versioned envelope; HKDF over GeneratedSecret.password; column type text + base64 | Resolved | DQ-202, DQ-202.b, DQ-203 |
| 30 | TokenCipher class shape, decryption-failure behavior | Resolved | DQ-202.b, DQ-203 |
| 31 | Slug derivation algorithm | Resolved | DQ-206 |
| 32 | Pre-flight checkAvailability + INSERT inside one DataAuthority transaction; ROLLBACK on Conflict | Resolved | DQ-205.i, DQ-205.j |
| 33 | Persist-first lifecycle, partial-failure path, retry idempotency (409 if any row exists) | Resolved | DQ-205.c, DQ-205.g |
| 34 | DELETE runs best-effort decommission (Route53-first inverse) | Resolved | DQ-205.d, DQ-205.k |
| 35 | Step 9 (UPDATE after externals succeed): bounded retry with backoff (3 attempts; 100/500/2000 ms); persistent failure leaves row in PROVISIONING with diagnostic | Resolved | DQ-205.e |
| 36 | Bounded DNS verification mechanism + triggers + cancellation | Resolved | DQ-207, DQ-207.a–.k |
| 37 | Bitemporal pattern application: how EmailConfiguration uses Arda’s bitemporal abstractions; what counts as a state-change event | Resolved | DQ-240.a. Standard Arda bitemporal pattern; every persisted change = new version; valid-time = transaction-time except for EmailJob webhook updates (Postmark wire timestamp); standard BitemporalEntityClass “list latest per eId” view for current-state queries. |
| 38 | DataAuthority concrete repository methods | Resolved | DQ-240.b. No intermediate “repository-typed wrapper” layer between service and universe. Generic Universe<EP, M> API + the five custom Universe methods documented in functional-design.md § 5.5 cover every persistence operation across all eight scenarios. Original candidate method names mapped to standard universe.create / update / delete / findOne / read plus payload shaping in the service. |
| 39 | Concurrent UPDATE strategy when two pods both detect verification success and try to mark UNLOCKED | Resolved | DQ-240.c. Status-guard alone (WHERE status = '<source>'); no optimistic-lock version column; no advisory locks. Re-confirms DQ-207.e. |
| 40 | Database schema details: column types, indexes | Resolved (informative scope) | DQ-240.d. The DDL in information-model.md § 4 is informative; the column inventory + cross-Universe rule + uniqueness pattern + flattening convention are normative; SQL types/lengths/index expressions are implementation choices made when migrations are authored. |
| 41 | API request validation: what shape rejects with 400 vs 422? | Resolved | DQ-240.e. 400 for malformed; 422 for semantic violations. Cross-repo impact: common-module AppError → HttpCode mapping is upgraded to express the distinction cleanly. |
| 42 | Public response shape (does it include postmark_server_id, postmark_domain_id, postmark_webhook_id?) | Parked | DQ-201.e; explicitly parked for end-to-end pass |
| 43 | Endpoint authentication and tenant scoping | Resolved | DQ-240.f. Standard Arda X-Tenant-Id header captured in ApplicationContext. CS-only endpoints validate CS authority via ARDA_API_KEY at the gateway; tenant comes from header, not body. |
| 44 | Audit logging contract | Resolved | DQ-240.g. INFO on every state transition with structured fields (configId, eId, fromStatus, toStatus, author, correlationId, externalIds). Redaction inherits DQ-220.h. |
| 45 | Stub Postmark / Route53 implementation surface for unit tests | Resolved | DQ-240.h. Optional injected proxy parameters on Application.emailModule(...), matching pdfRender precedent. MockK at unit level; no interface extraction. |
| 46 | Operator-alert metric exposure | Resolved | DQ-240.i. Application emits gauges via existing operations metrics pipeline. Platform has no push-notification mechanism in v1; alerts are only visible by polling metrics or querying data-authority routes. |
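Row 35's bounded retry for Step 9 can be sketched as a loop over a fixed delay schedule. The table gives "3 attempts; 100/500/2000 ms" without saying how attempts and delays pair up. The sketch assumes one reading: an initial undelayed try, then up to three retries, each preceded by the next delay. `runStep9Update`, `Step9Outcome`, and the diagnostic text are hypothetical names, and a blocking sleep stands in for whatever suspension mechanism the service uses.

```kotlin
// Assumed reading of "3 attempts; 100/500/2000 ms": one initial try, then up
// to three retries, each preceded by the next delay in this schedule.
val STEP9_BACKOFF_MS: List<Long> = listOf(100L, 500L, 2000L)

sealed interface Step9Outcome {
    data class Updated(val attempts: Int) : Step9Outcome

    // The row stays in PROVISIONING; the text is what a diagnostic might hold.
    data class LeftInProvisioning(val attempts: Int, val diagnostic: String) : Step9Outcome
}

fun runStep9Update(
    backoffMs: List<Long> = STEP9_BACKOFF_MS,
    sleep: (Long) -> Unit = { Thread.sleep(it) }, // injectable so tests need not wait
    update: () -> Result<Unit>,
): Step9Outcome {
    var attempts = 0
    var lastError: Throwable? = null
    // A null delay marks the initial, undelayed attempt.
    val schedule: List<Long?> = listOf<Long?>(null) + backoffMs
    for (delayMs in schedule) {
        if (delayMs != null) sleep(delayMs)
        attempts++
        val result = update()
        if (result.isSuccess) return Step9Outcome.Updated(attempts)
        lastError = result.exceptionOrNull()
    }
    // Retries exhausted: report instead of throwing, so the caller can persist a diagnostic.
    return Step9Outcome.LeftInProvisioning(attempts, "UPDATE failed: ${lastError?.message}")
}
```

Returning an outcome value rather than throwing keeps the "persistent failure leaves row in PROVISIONING with diagnostic" branch explicit at the call site. If the intended reading is three attempts in total, only the schedule changes.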

Phase 5 — EmailJob Service and Persistence

| # | Decision | Status | Reference / Note |
| --- | --- | --- | --- |
| 47 | EmailJob lifecycle (NEW, QUEUED, SENT, DELIVERED, BOUNCED, COMPLAINED, FAILED, CANCELLED) and transitions | Resolved | functional.md (existing) |
| 48 | Out-of-order event acceptance (QUEUED → DELIVERED if SENT was missed) | Resolved | functional.md, architectural-scenarios.md Scenario 3 |
| 49 | Webhook auth: Bearer token (ARDA_API_KEY) | Resolved | DQ-011 |
| 50 | Webhook unknown MessageID: log + 200 OK (idempotent) | Resolved | functional.md / Scenario 3 |
| 51 | Resend creates new job with originalJobId reference; original retained | Resolved | architectural-scenarios.md Scenario 5 |
| 52 | Cancel allowed only from NEW (idempotent guard) | Resolved | architectural-scenarios.md Scenario 5 |
| 53 | Send-time precondition behavior on PENDING_VERIFICATION (kick-off polling, fail fast) | Resolved | DQ-207, Scenario 1b.3 |
| 54 | EmailJob entity fields and column types | Resolved | DQ-250.a. Finalised in information-model.md § 3.2 + § 4.2; field inventory normative; SQL details inherit DQ-240.d’s informative scope. |
| 55 | Attachment storage strategy (inline confirmed; column shape) | Resolved | DQ-250.b. Inline attachments JSONB column storing List<Attachment>. 10 MB Postmark cap is the operational ceiling. |
| 56 | Bitemporal pattern for EmailJob | Resolved | DQ-250.c. Same answer as DQ-240.a; webhook updates use Postmark wire timestamp as effective_as_of. |
| 57 | DataAuthority repository surface for EmailJob | Resolved | DQ-250.d. Same answer as DQ-240.b; generic Universe API + 3 custom methods on EmailJobUniverse. |
| 58 | Webhook MessageID lookup: indexed column on email_job | Resolved | DQ-250.e. Indexed via idx_email_job_message_id; EmailJobUniverse.findByMessageId is the canonical lookup. |
| 59 | Webhook idempotency on Postmark retry | Resolved | DQ-250.f. Naturally idempotent via validator’s status-guard (DQ-240.c); webhook returns 200 OK on retry. |
| 60 | Bounce / SpamComplaint diagnostic field capture | Resolved | DQ-250.g. EmailJob.diagnosticMessage typed JsonElement? and persisted to a JSONB column. Specific field subset is implementation-flexible (DQ-240.d informative scope). |
| 61 | Cancel-while-sending race window | Resolved | DQ-250.h. v1 accepts the race; status-guard makes a late cancel a 409. No SELECT FOR UPDATE, no advisory lock. |
| 62 | Lock-during-send race (Scenario 6’s note): same shape; allow the in-flight send to complete? | Resolved (in-text) | architectural-scenarios.md Scenario 6 narrative says v1 accepts the race; reaffirm in v1 plan. |
| 63 | Public response shape (status only? full body? originalJobId chain depth?) | Parked | DQ-201.e; to be resolved with API contract pass |
| 64 | Endpoint authentication and tenant scoping (EmailJob) | Resolved | DQ-250.i. Same answer as DQ-240.f; standard X-Tenant-Id header captured in ApplicationContext. |
| 65 | Authorization for bounce-management endpoints | Resolved (out of scope v1) | DQ-250.j. No bounce-management endpoints in v1. Re-evaluate in v2 alongside the broader bounce-management feature design. |
| 66 | Stub ESP implementation surface for unit tests | Resolved | DQ-250.k. Same answer as DQ-240.h; optional injected proxy parameters; MockK at unit level; no interface extraction. |
| 67 | Local Development Stub fidelity (synthesizes fake delivery events for testing the full lifecycle, per functional.md) | Resolved | DQ-250.l. WireMock + LocalStack containers in helmInstallToLocal; webhook event synthesis via WireMock post-serve actions. No application code changes; same decision as DQ-260.f. |
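Rows 48, 50, 52, 59, and 61 all lean on the same status-guard idea: an update applies only if the row is still in an expected source status. The sketch below models that guard with an in-memory map in place of a guarded SQL UPDATE. It is an illustration under assumptions: `JobStoreSketch`, `GuardResult`, both handler functions, and the 404 mapping are hypothetical, and only the transitions named in the table are encoded (the full transition table lives in functional.md). How a repeated cancel of an already cancelled job is answered is not stated in this inventory, so the sketch simply reports it as a conflict.

```kotlin
enum class JobStatus { NEW, QUEUED, SENT, DELIVERED, BOUNCED, COMPLAINED, FAILED, CANCELLED }

sealed interface GuardResult {
    object Applied : GuardResult
    object NotFound : GuardResult
    data class Conflict(val current: JobStatus) : GuardResult
}

// In-memory stand-in for an UPDATE guarded by "WHERE status IN (<sources>)".
class JobStoreSketch {
    private val rows = HashMap<String, JobStatus>()

    @Synchronized
    fun insert(key: String, status: JobStatus = JobStatus.NEW) { rows[key] = status }

    @Synchronized
    fun statusOf(key: String): JobStatus? = rows[key]

    @Synchronized
    fun guardedUpdate(key: String, from: Set<JobStatus>, to: JobStatus): GuardResult {
        val current = rows[key] ?: return GuardResult.NotFound
        if (current !in from) return GuardResult.Conflict(current) // guard rejects
        rows[key] = to
        return GuardResult.Applied
    }
}

// Cancel is allowed only from NEW; a late cancel surfaces as 409.
fun cancelHttpStatus(store: JobStoreSketch, jobKey: String): Int =
    when (store.guardedUpdate(jobKey, from = setOf(JobStatus.NEW), to = JobStatus.CANCELLED)) {
        GuardResult.Applied -> 200
        GuardResult.NotFound -> 404 // assumption: not stated in this document
        is GuardResult.Conflict -> 409
    }

// Delivery webhook: accepts QUEUED or SENT as source (out-of-order tolerance)
// and answers 200 even when the guard rejects or the key is unknown.
fun deliveredWebhookHttpStatus(store: JobStoreSketch, messageKey: String): Int {
    store.guardedUpdate(
        messageKey,
        from = setOf(JobStatus.QUEUED, JobStatus.SENT),
        to = JobStatus.DELIVERED,
    )
    return 200
}
```

Because the guard makes a replayed webhook a no-op, Postmark retries need no separate deduplication table in this model.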

| # | Decision | Status | Reference / Note |
| --- | --- | --- | --- |
| 68 | TenantProvisioner (L2) shape: provision, decommission, verifyDns, checkAvailability methods | Resolved | DQ-201, DQ-201.a, DQ-205.d, DQ-205.i |
| 69 | EmailSender (L2) shape: sendOne(serverToken, message) | Resolved | DQ-201 |
| 70 | Provisioning step ordering (Postmark first, then Route53) | Resolved | DQ-205.k |
| 71 | UPSERT semantics in Route53 proxy | Resolved | DQ-205.m |
| 72 | Decommission step ordering (Route53 first, then Postmark) | Resolved | DQ-205.d, DQ-205.k |
| 73 | ProvisionedExternals + PartialProgress structured Result types | Resolved | DQ-201.d |
| 74 | DnsVerificationLoop shape: in-process bounded coroutines spawned by L3 (no separate class) | Resolved (revised) | DQ-207, DQ-208 (superseded) |
| 75 | L2 error type taxonomy and mapping from L1 errors | Resolved | DQ-260.a. No bespoke L2 sealed hierarchy. Add AppError.Application<T> to common-module (cross-repo deliverable); TenantProvisioner.provision uses AppError.Application<PartialProgress>; other L2 methods use existing AppError subtypes directly. |
| 76 | Logging at L2 vs L1 boundary | Resolved | DQ-260.b. L1 logs the wire call; L2 logs the capability boundary; L3 logs the business event. Avoids triple-logging; correlation via CallId plugin. Redaction inherits DQ-220.h. |
| 77 | E2E test data strategy | Resolved (subsumed) | DQ-260.c. Subsumed by DQ-220.g; no live integration tests at L1, so per-tenant E2E hygiene is moot. |
| 78 | Async DNS verification observability metric | Resolved | DQ-260.d. Same answer as DQ-240.i; gauge emission via existing operations metrics pipeline; no push notifications in v1. |
| 79 | Stuck-PROVISIONING-row alert | Resolved | DQ-260.e. Same answer as DQ-240.i; same exposure pipeline. |
| 80 | Local Development Stub: stub for ESP + Route53 + DataAuthority interactions; how it ties into the dev cluster | Resolved | DQ-260.f; cross-reference to DQ-250.l. WireMock + LocalStack containers brought up by helmInstallToLocal; the wired-up L1+L2+L3+L4 surface runs unchanged against the stubbed endpoints. |
| 81 | Resource adoption on retry (Variant from DQ-205.n) | Deferred to v2 | DQ-205.n. Reaffirm in v1 plan. |
| 82 | Async reconciler (Strategy 3 from DQ-205) | Deferred to v2 | Reaffirm. |
| 83 | Watchdog for stuck PROVISIONING rows (DQ-205.f, DQ-207’s loop extension) | Deferred to v2 | Reaffirm. |
| 84 | Cross-pod coordination revisit (currently no coordination per DQ-207.e) | Deferred to v2 | If multi-pod issues observed in production, revisit. |
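Rows 70, 72, and 73 describe an ordering rule plus a structured failure report: provision Postmark before Route53, decommission in the inverse order on a best-effort basis, and report how far provisioning got when it fails. The sketch below shows one way that could look. The step names, `ProvisioningFailed`, and both functions are hypothetical, the real sequence has more steps, and an exception wrapper stands in for the AppError.Application<PartialProgress> type mentioned in row 75.

```kotlin
// Hypothetical, coarse-grained steps listed in provisioning order.
enum class ExternalStep { POSTMARK_SERVER, POSTMARK_DOMAIN, ROUTE53_RECORDS }

data class PartialProgress(
    val completed: List<ExternalStep>,
    val failedAt: ExternalStep,
    val cause: Throwable,
)

// Stand-in for a typed application error that carries PartialProgress.
class ProvisioningFailed(val progress: PartialProgress) :
    Exception("provisioning failed at ${progress.failedAt}", progress.cause)

val PROVISION_ORDER: List<ExternalStep> = listOf(
    ExternalStep.POSTMARK_SERVER,
    ExternalStep.POSTMARK_DOMAIN,
    ExternalStep.ROUTE53_RECORDS,
)

// Runs steps in order and stops at the first failure, reporting what finished.
fun provisionSketch(run: (ExternalStep) -> Result<Unit>): Result<List<ExternalStep>> {
    val done = mutableListOf<ExternalStep>()
    for (step in PROVISION_ORDER) {
        val outcome = run(step)
        val error = outcome.exceptionOrNull()
        if (error != null) {
            return Result.failure(ProvisioningFailed(PartialProgress(done.toList(), step, error)))
        }
        done.add(step)
    }
    return Result.success(done.toList())
}

// Best-effort inverse: Route53 first, keep going on errors, return what is left over.
fun decommissionSketch(undo: (ExternalStep) -> Result<Unit>): List<ExternalStep> {
    val leftovers = mutableListOf<ExternalStep>()
    for (step in PROVISION_ORDER.reversed()) {
        if (undo(step).isFailure) leftovers.add(step)
    }
    return leftovers
}
```

Keeping the order in a single list means the inverse order cannot drift from the forward order, which is the property rows 70 and 72 depend on.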

| # | Decision | Status | Reference / Note |
| --- | --- | --- | --- |
| 85 | Public exposure of Postmark / Route53 resource IDs in HTTP responses | Parked | DQ-201.e; explicitly to be resolved at API-contract pass |
| 86 | Postmark account-token rotation protocol | Deferred | DQ-200.a. Defer until after implementation is complete. Runbook captured in cross-cutting-design.md § 5.5; will be re-confirmed post-deploy when CS exercises it. |
| 87 | API contract versioning | Resolved | DQ-200.b. Standard Arda URL-path versioning (/v1/...); parallel /v2/... path on incompatible evolution; backward-compatible additions don’t require a bump. |

| Phase | Resolved | Parked / Deferred | New (need DQ before implementation) |
| --- | --- | --- | --- |
| Phase 0 | 10 (10 resolved REQ-PM-* per 0-postmark-foundations/requirements.md) | 5 (rotation, decommission, multi-server, AWS-Secrets copy, per-tenant DMARC; all out-of-scope deferrals) | 0 |
| Phase 2 | 16 (8 baseline + 8 resolved DQ-220.a..h) | 0 | 0 |
| Phase 3 | 11 (6 baseline + 5 resolved DQ-230.a..e) | 0 | 0 |
| Phase 4 | 20 (10 baseline + 10 resolved DQ-240.a..j) | 1 (DQ-201.e parked) | 0 |
| Phase 5 | 23 (7 baseline + 16 resolved DQ-250.a..p) | 1 (DQ-201.e parked) | 0 open |
| Phase 6 | 13 (7 baseline + 6 resolved DQ-260.a..f) | 4 (deferred to v2) | 0 open |
| Cross | 1 (1 resolved DQ-200.b) | 1 (DQ-201.e parked) + 1 (DQ-200.a deferred) | 0 |
| Cross | 0 | 1 | 2 (DQ-200.a–.b) |

Total new decisions identified across phases: ~44 — though many are mechanical (column types, validation contracts, repository method names) rather than architectural.

The architecturally-substantive ones to flag for actual design discussion before each phase begins:

  • DQ-240.a / DQ-250.c — bitemporal pattern application for both entities (non-trivial in Arda; affects schema and DataAuthority API).
  • DQ-250.b — attachment storage strategy in detail (inline confirmed, but column type and limits).
  • DQ-250.h — cancel-while-sending race window (accept or guard?).
  • DQ-260.a — L2 error type taxonomy (drives all callsite handling).
  • DQ-260.d / .e — observability metrics / alert pipeline operationalization.

The rest are mostly mechanical / contract-definition decisions that surface naturally during the per-phase planning artifacts (requirements, specification, exports) without needing further design discussion.


  1. At the start of each phase’s planning pass, take the phase’s New rows and convert each into a short open-question entry in that phase’s decision-log.md (or analogue). Each becomes a DQ to be resolved before the phase’s specification is frozen.
  2. The Resolved rows become the cross-link list at the top of the phase’s requirements.md: “this phase consumes the following decisions: <list>”.
  3. Parked rows are reflected in the phase’s requirements.md as explicit gaps with a pointer to the parking decision.
  4. Deferred rows are reflected in the phase’s future-work.md (or equivalent) — they are not v1 scope, but are tracked for v2.

The numbering scheme (DQ-22x, DQ-23x, etc.) is provisional. When each phase is opened, the planning agent assigns final identifiers and updates this document in-place to keep cross-references stable.