Email Integration -- Phased Design Requirements
For each implementation phase defined in dependency-analysis.md (other than Phase 1, which is owned by 1-infrastructure/), this document inventories the design decisions that must be settled before that phase can be implemented.
Decisions fall into one of three statuses:
| Status | Meaning |
|---|---|
| Resolved | Already settled in open-decision-analysis.md (DQ-201…DQ-208) or in decision-log.md (DQ-001…DQ-013). The phase consumes the decision; no further work needed. |
| Parked / Deferred | Explicitly settled to be resolved later (e.g., end-to-end pass, v2). The phase plans around the parked status. |
| New | Not yet identified as a discrete decision. Must be resolved during the per-phase planning artifacts (requirements / specification) before the phase implements. Provisional DQ identifier listed for traceability. |
The point of this inventory is to make per-phase planning straightforward: each phase’s planning pass starts with the New column and produces formal requirements / specifications that close those decisions, plus references to the Resolved ones.
The numbering scheme for New decisions:
- DQ-22x for Phase 2
- DQ-23x for Phase 3
- DQ-24x for Phase 4
- DQ-25x for Phase 5
- DQ-26x for Phase 6
- DQ-20x for cross-cutting
These are provisional handles; final identifiers will be assigned when each phase’s decision-log.md is created.
Stylistic conventions covered by repo-level skills (kotlin-coding, unit-tests-backend, path-conventions, document-writing, etc.) are excluded — those are handled by skill loading at implementation time, not project-level decisions.
Phase 0 — Postmark Foundations
| # | Decision | Status | Reference / Note |
|---|---|---|---|
| 0.1 | PostmarkProd / PostmarkNonProd accounts on Platform plan | Resolved | REQ-PM-ACCT-001 / 002; supersedes Phase 1 REQ-PM-001 |
| 0.2 | Account-level Postmark tokens stored in 1Password (Arda-SystemsOAM) | Resolved | REQ-PM-ACCT-003; supersedes Phase 1 REQ-PM-002 |
| 0.3 | Free Prod Kanban Tool standalone Postmark server provisioning | Resolved | REQ-PM-FREE-001..004 — Option A (fully scripted TypeScript orchestrator) |
| 0.4 | Free server domain free.platform.prod.ardamails.com; platform slug already reserved | Resolved | REQ-PM-FREE-002, REQ-PM-SLUG-001; list literal already includes platform (no code change needed) |
| 0.5 | Free server DNS records hosted in prod.ardamails.com partition zone (records-only; separate platform.prod.ardamails.com zone deferred) | Resolved | REQ-PM-FREE-005; TTL 300s for future-migration hygiene; recipe to be documented at current-system/oam/postmark-service/free-platform-server.md (page not yet authored) |
| 0.6 | TypeScript orchestrator (idempotent; reusable for future Platform-Engineering shared servers) | Resolved | REQ-PM-SCRIPT-001; 0-postmark-foundations/specification.md § 2 |
| 0.7 | CDK Route53 records construct (reads DKIM / Return-Path values from 1Password at synth time) | Resolved | REQ-PM-SCRIPT-002; 0-postmark-foundations/specification.md § 3 |
| 0.8 | Postmark API observations note feeds Phase 2 design | Resolved | REQ-PM-DOC-003 |
| 0.9 | Postmark service overview + Free-server integration guide in current-system/oam/postmark-service/ | Resolved | REQ-PM-DOC-001 / 002 |
| 0.10 | Token rotation, decommission, multi-server orchestration | Out of scope | Phase 0 deferred items; future runbook / v2 |
Phase 2 — Postmark API Clients (L1 Protocol Proxies + Route53)
| # | Decision | Status | Reference / Note |
|---|---|---|---|
| 1 | L1 proxy shape and location (per-credential, under shopaccess/email/servers/) | Resolved | DQ-201, DQ-201.b |
| 2 | Stateless L1, constructor-time validation | Resolved | DQ-201, DQ-205.h |
| 3 | Postmark Account API client method surface (server / domain / webhook CRUD + verifyDkim / verifyReturnPath + findServerByName / findDomainByName) | Resolved | DQ-201, DQ-205.i |
| 4 | Postmark Server API client method surface (sendEmail, createWebhook, deleteWebhook + per-call serverToken parameter) | Resolved | DQ-201 |
| 5 | Route53 proxy method surface (upsertTxtRecord, upsertCnameRecord, deleteRecord, findRecordsAt) | Resolved | DQ-201, DQ-205.m |
| 6 | STS AssumeRole strategy for Route53 (auto-chain, 15-min sessions) | Resolved | DQ-204 |
| 7 | Postmark name filter behavior (substring; client-side exact-match; pagination for /domains) | Resolved | DQ-205.i |
| 8 | Public contract: every method returns Result<T> via runCatching | Resolved | DQ-201 (clarified during discussion) |
| 9 | HTTP client library + engine choice for Postmark proxies (Ktor Java engine, per Documint precedent) | Resolved | DQ-220.a. Reuse common-lib httpClient(log). |
| 10 | HTTP client timeouts, connection pool, keep-alive policy | Resolved | DQ-220.b. Inherit common-lib defaults; revisit only on observed failures. |
| 11 | Retry policy on transient failures (5xx / network) — none, or limited? | Resolved | DQ-220.c. No L1 retries; fail-fast per DQ-205.l. |
| 12 | JSON serialization config: reuse JsonConfig.standardJson from common-lib, or scoped variant? | Resolved | DQ-220.d. Reuse JsonConfig.standardJson (already wired by common-lib httpClient). |
| 13 | AppError.ExternalService shape used for Postmark / Route53 / STS failures (HTTP code, message, response body, structured error subtype if any) | Resolved | DQ-220.e. Reuse flat AppError.ExternalService(msg, code, description); no 4xx/5xx subtype split; Postmark ErrorCode preserved verbatim in description. |
| 14 | Postmark response wire-model code generation vs hand-written DTOs | Resolved | DQ-220.f. Hand-written @Serializable data classes per proxy Model.kt. |
| 15 | Integration test gating (env var to enable; CI vs local-only) | Resolved | DQ-220.g. No live integration tests at L1. Drift signal via npm postmark witness package + Dependabot; implicit coverage via dev deploys + Bruno api-test; human watch on Postmark API updates feed. Most of DQ-260.c becomes moot under this resolution and will be revisited at Phase 6 planning. |
| 16 | Logging level conventions for L1 calls (request / response logging, redaction of tokens) | Resolved | DQ-220.h. Per-surface application logging contract + transport-level sanitizeHeader added to common-lib httpClient (cross-repo: common-module). |
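
Rows 7 and 8 can be illustrated together. The following is a minimal sketch, assuming a hypothetical `ServerLister` stand-in for the HTTP call and hypothetical `PostmarkServer` / `ExternalServiceError` types; none of these names come from the codebase. It shows the client-side exact-match pass over a substring-filtered result, with the public method returning `Result<T>` via `runCatching`.

```kotlin
// Hypothetical wire model; the real DTOs would live in each proxy's Model.kt.
data class PostmarkServer(val id: Long, val name: String)

// Hypothetical flat error carrying the (msg, code, description) triple from row 13.
class ExternalServiceError(
    msg: String,
    val code: Int,
    val description: String,
) : Exception(msg)

// Stand-in for the HTTP call. The upstream name filter matches substrings,
// so the returned list may contain more than the server that was asked for.
fun interface ServerLister {
    fun listByNameSubstring(name: String): List<PostmarkServer>
}

class AccountApiSketch(private val lister: ServerLister) {
    // Public contract: never throw; any exception becomes Result.failure.
    fun findServerByName(name: String): Result<PostmarkServer?> = runCatching {
        // Client-side exact match over the substring-filtered candidates.
        lister.listByNameSubstring(name).firstOrNull { it.name == name }
    }
}
```

A caller distinguishes "not found" (`Result.success(null)`) from "lookup failed" (`Result.failure`), which keeps the availability check and the error path separate.
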
Phase 3 — Module Skeleton
| # | Decision | Status | Reference / Note |
|---|---|---|---|
| 17 | Module location and Helm apis.system.shopAccess.email entry | Resolved | DQ-201.b implicitly; functional.md specifies the path |
| 18 | HOCON namespace email.* | Resolved | functional.md |
| 19 | HOCON keys for Postmark account token, encryption key, DNS role ARN, hosted zone ID, polling parameters | Resolved | functional.md, DQ-203, DQ-204, DQ-207.d |
| 20 | ESO ExternalSecret entries that materialize into secrets.properties | Resolved | DQ-203, infrastructure exports |
| 21 | Flyway migration location: shopaccess/email/database/migrations/ | Resolved | functional.md |
| 22 | Construct-time validation of all wired-in classes (TokenCipher round-trip, proxies, services) | Resolved | DQ-205.h |
| 23 | Module dependency wiring pattern (manual constructor injection in Module.kt, matching Documint / Image pattern) | Resolved | DQ-230.a. Manual constructor injection via Application.emailModule(cfgProvider, registry); matches pdfRenderService / kanban / orders precedent. No DI framework. |
| 24 | Coroutine scope used for kickOffBoundedPolling (Application scope vs scoped to module vs scoped to service) | Resolved | DQ-230.b. EmailConfigurationService.Impl(scope = this) from Application.emailModule; monitor.subscribe(ApplicationStopping) { service.stop() } cancels active polling. |
| 25 | Module bootstrap order — when does L3 service start polling existing PENDING_VERIFICATION rows on startup? Or never (per DQ-207’s “trigger-driven only”)? | Resolved | DQ-230.c. No startup sweep. Pod restart leaves stranded rows for trigger-driven recovery + email_configuration_pending_stale operator alert. |
| 26 | Logger names + log levels per package | Resolved | DQ-230.d. Standard operations precedent (LogEnabled by LogProvider(<Class>::class) + optional LoggerFactory.getLogger). No email-specific convention. Redaction inherited from DQ-220.h. |
| 26.1 | Module HOCON path casing | Resolved | DQ-230.e. system.shopAccess.email (matches system.shopAccess.pdfRender precedent). |
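
Row 22 (construct-time validation with a TokenCipher round-trip) can be sketched as below. This is an illustration only: the class name `TokenCipherSketch`, the `v1:` prefix, and the IV-then-ciphertext layout are assumptions, and the HKDF key derivation from DQ-202 is omitted (the sketch takes a ready 32-byte key). It uses only standard `javax.crypto` calls.

```kotlin
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

// Hypothetical cipher: AES-256-GCM with an assumed "v1:" + base64(iv || ciphertext) envelope.
class TokenCipherSketch(keyBytes: ByteArray) {
    private val key = SecretKeySpec(keyBytes, "AES")
    private val random = SecureRandom()

    init {
        // Fail at construction rather than on the first real token.
        require(keyBytes.size == 32) { "AES-256 needs a 32-byte key" }
        val probe = "construct-time-probe"
        check(decrypt(encrypt(probe)) == probe) { "TokenCipher round-trip failed" }
    }

    fun encrypt(plain: String): String {
        val iv = ByteArray(12).also { random.nextBytes(it) } // fresh 96-bit IV per call
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.ENCRYPT_MODE, key, GCMParameterSpec(128, iv))
        val body = cipher.doFinal(plain.toByteArray(Charsets.UTF_8))
        return "v1:" + Base64.getEncoder().encodeToString(iv + body)
    }

    fun decrypt(envelope: String): String {
        require(envelope.startsWith("v1:")) { "unknown envelope version" }
        val raw = Base64.getDecoder().decode(envelope.removePrefix("v1:"))
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.DECRYPT_MODE, key, GCMParameterSpec(128, raw, 0, 12))
        return String(cipher.doFinal(raw, 12, raw.size - 12), Charsets.UTF_8)
    }
}
```

With this shape, a misconfigured key surfaces when the module wires the class, not when the first tenant is provisioned.
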
Phase 4 — EmailConfiguration Service and Persistence
| # | Decision | Status | Reference / Note |
|---|---|---|---|
| 27 | EmailConfiguration entity fields (incl. provisioningStartedAt, verificationStartedAt, provisionedAt, postmarkWebhookId, diagnosticMessage) | Resolved | functional.md, DQ-205.b, DQ-207.i |
| 28 | Lifecycle states + transitions (PROVISIONING entry, PROVISIONING_FAILED terminal, no auto VERIFICATION_FAILED in v1) | Resolved | DQ-205.a, DQ-207.h |
| 29 | Encryption: AES-256-GCM versioned envelope; HKDF over GeneratedSecret.password; column type text + base64 | Resolved | DQ-202, DQ-202.b, DQ-203 |
| 30 | TokenCipher class shape, decryption-failure behavior | Resolved | DQ-202.b, DQ-203 |
| 31 | Slug derivation algorithm | Resolved | DQ-206 |
| 32 | Pre-flight checkAvailability + INSERT inside one DataAuthority transaction; ROLLBACK on Conflict | Resolved | DQ-205.i, DQ-205.j |
| 33 | Persist-first lifecycle, partial-failure path, retry idempotency (409 if any row exists) | Resolved | DQ-205.c, DQ-205.g |
| 34 | DELETE runs best-effort decommission (Route53-first inverse) | Resolved | DQ-205.d, DQ-205.k |
| 35 | Step 9 (UPDATE after externals succeed): bounded retry with backoff (3 attempts; 100/500/2000 ms); persistent failure leaves row in PROVISIONING with diagnostic | Resolved | DQ-205.e |
| 36 | Bounded DNS verification mechanism + triggers + cancellation | Resolved | DQ-207, DQ-207.a–.k |
| 37 | Bitemporal pattern application: how EmailConfiguration uses Arda’s bitemporal abstractions; what counts as a state-change event | Resolved | DQ-240.a. Standard Arda bitemporal pattern; every persisted change = new version; valid-time = transaction-time except for EmailJob webhook updates (Postmark wire timestamp); standard BitemporalEntityClass “list latest per eId” view for current-state queries. |
| 38 | DataAuthority concrete repository methods | Resolved | DQ-240.b. No intermediate “repository-typed wrapper” layer between service and universe. Generic Universe<EP, M> API + the five custom Universe methods documented in functional-design.md § 5.5 cover every persistence operation across all eight scenarios. Original candidate method names mapped to standard universe.create / update / delete / findOne / read plus payload shaping in the service. |
| 39 | Concurrent UPDATE strategy when two pods both detect verification success and try to mark UNLOCKED | Resolved | DQ-240.c. Status-guard alone (WHERE status = '<source>'); no optimistic-lock version column; no advisory locks. Re-confirms DQ-207.e. |
| 40 | Database schema details: column types, indexes | Resolved (informative scope) | DQ-240.d. The DDL in information-model.md § 4 is informative; the column inventory + cross-Universe rule + uniqueness pattern + flattening convention are normative; SQL types/lengths/index expressions are implementation choices made when migrations are authored. |
| 41 | API request validation — what shape rejects with 400 vs 422? | Resolved | DQ-240.e. 400 for malformed; 422 for semantic violations. Cross-repo impact: common-module AppError → HttpCode mapping is upgraded to express the distinction cleanly. |
| 42 | Public response shape (does it include postmark_server_id, postmark_domain_id, postmark_webhook_id?) | Parked | DQ-201.e — explicitly parked for end-to-end pass |
| 43 | Endpoint authentication and tenant scoping | Resolved | DQ-240.f. Standard Arda X-Tenant-Id header captured in ApplicationContext. CS-only endpoints validate CS authority via ARDA_API_KEY at the gateway; tenant comes from header, not body. |
| 44 | Audit logging contract | Resolved | DQ-240.g. INFO on every state transition with structured fields (configId, eId, fromStatus, toStatus, author, correlationId, externalIds). Redaction inherits DQ-220.h. |
| 45 | Stub Postmark / Route53 implementation surface for unit tests | Resolved | DQ-240.h. Optional injected proxy parameters on Application.emailModule(...), matching pdfRender precedent. MockK at unit level; no interface extraction. |
| 46 | Operator-alert metric exposure | Resolved | DQ-240.i. Application emits gauges via existing operations metrics pipeline. Platform has no push-notification mechanism in v1; alerts are only visible by polling metrics or querying data-authority routes. |
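
The bounded retry in row 35 can be sketched as follows. The table gives three delays (100/500/2000 ms); this sketch reads that as one initial try followed by up to three retries, each preceded by the listed wait. That reading, the function name `retryWithBackoff`, and the injectable `sleep` parameter are all assumptions made for illustration.

```kotlin
// Delays taken from row 35; how they map onto attempts is this sketch's assumption.
val backoffMs = listOf(100L, 500L, 2000L)

fun <T> retryWithBackoff(
    delaysMs: List<Long> = backoffMs,
    sleep: (Long) -> Unit = { Thread.sleep(it) }, // injectable so tests need not wait
    action: () -> T,
): Result<T> {
    var last = runCatching(action)
    for (delay in delaysMs) {
        if (last.isSuccess) return last
        sleep(delay)
        last = runCatching(action)
    }
    // Either the final retry succeeded or the last failure is handed back,
    // at which point the caller would leave the row in PROVISIONING with a diagnostic.
    return last
}
```

Returning the last `Result` instead of throwing keeps the persistent-failure path explicit at the call site.
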
Phase 5 — EmailJob Service and Persistence
| # | Decision | Status | Reference / Note |
|---|---|---|---|
| 47 | EmailJob lifecycle (NEW, QUEUED, SENT, DELIVERED, BOUNCED, COMPLAINED, FAILED, CANCELLED) and transitions | Resolved | functional.md (existing) |
| 48 | Out-of-order event acceptance (QUEUED → DELIVERED if SENT was missed) | Resolved | functional.md, architectural-scenarios.md Scenario 3 |
| 49 | Webhook auth: Bearer token (ARDA_API_KEY) | Resolved | DQ-011 |
| 50 | Webhook unknown MessageID: log + 200 OK (idempotent) | Resolved | functional.md / Scenario 3 |
| 51 | Resend creates new job with originalJobId reference; original retained | Resolved | architectural-scenarios.md Scenario 5 |
| 52 | Cancel allowed only from NEW (idempotent guard) | Resolved | architectural-scenarios.md Scenario 5 |
| 53 | Send-time precondition behavior on PENDING_VERIFICATION (kick-off polling, fail fast) | Resolved | DQ-207, Scenario 1b.3 |
| 54 | EmailJob entity fields and column types | Resolved | DQ-250.a. Finalised in information-model.md § 3.2 + § 4.2; field inventory normative; SQL details inherit DQ-240.d’s informative scope. |
| 55 | Attachment storage strategy (inline confirmed; column shape) | Resolved | DQ-250.b. Inline attachments JSONB column storing List<Attachment>. 10 MB Postmark cap is the operational ceiling. |
| 56 | Bitemporal pattern for EmailJob | Resolved | DQ-250.c. Same answer as DQ-240.a; webhook updates use Postmark wire timestamp as effective_as_of. |
| 57 | DataAuthority repository surface for EmailJob | Resolved | DQ-250.d. Same answer as DQ-240.b; generic Universe API + 3 custom methods on EmailJobUniverse. |
| 58 | Webhook MessageID lookup — indexed column on email_job | Resolved | DQ-250.e. Indexed via idx_email_job_message_id; EmailJobUniverse.findByMessageId is the canonical lookup. |
| 59 | Webhook idempotency on Postmark retry | Resolved | DQ-250.f. Naturally idempotent via validator’s status-guard (DQ-240.c); webhook returns 200 OK on retry. |
| 60 | Bounce / SpamComplaint diagnostic field capture | Resolved | DQ-250.g. EmailJob.diagnosticMessage typed JsonElement? and persisted to a JSONB column. Specific field subset is implementation-flexible (DQ-240.d informative scope). |
| 61 | Cancel-while-sending race window | Resolved | DQ-250.h. v1 accepts the race; status-guard makes a late cancel a 409. No SELECT FOR UPDATE, no advisory lock. |
| 62 | Lock-during-send race (Scenario 6’s note): same shape — allow the in-flight send to complete? | Resolved (in-text) | architectural-scenarios.md Scenario 6 narrative says v1 accepts the race; reaffirm in v1 plan. |
| 63 | Public response shape (status only? full body? originalJobId chain depth?) | Parked | DQ-201.e — to be resolved with API contract pass |
| 64 | Endpoint authentication and tenant scoping (EmailJob) | Resolved | DQ-250.i. Same answer as DQ-240.f; standard X-Tenant-Id header captured in ApplicationContext. |
| 65 | Authorization for bounce-management endpoints | Resolved (out of scope v1) | DQ-250.j. No bounce-management endpoints in v1. Re-evaluate in v2 alongside the broader bounce-management feature design. |
| 66 | Stub ESP implementation surface for unit tests | Resolved | DQ-250.k. Same answer as DQ-240.h; optional injected proxy parameters; MockK at unit level; no interface extraction. |
| 67 | Local Development Stub fidelity (synthesizes fake delivery events for testing the full lifecycle, per functional.md) | Resolved | DQ-250.l. WireMock + LocalStack containers in helmInstallToLocal; webhook event synthesis via WireMock post-serve actions. No application code changes; same decision as DQ-260.f. |
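
The status-guard behavior behind rows 48, 52, 59 and 61 can be sketched in memory. The names `JobRow`, `guardedUpdate`, `cancel` and `onDelivered` are hypothetical, and the set of source statuses accepted for DELIVERED is an assumption drawn from row 48. In the real service the guard would be a `WHERE status = ...` clause, not an in-memory check.

```kotlin
enum class JobStatus { NEW, QUEUED, SENT, DELIVERED, BOUNCED, COMPLAINED, FAILED, CANCELLED }

// Hypothetical in-memory stand-in for one email_job row.
class JobRow(var status: JobStatus)

// Mirrors an UPDATE guarded by the current status: write only when the
// row is still in one of the allowed source states, and report whether it matched.
fun guardedUpdate(row: JobRow, sources: Set<JobStatus>, target: JobStatus): Boolean {
    if (row.status !in sources) return false
    row.status = target
    return true
}

// Cancel is allowed only from NEW; `false` is what the endpoint would map to a 409.
fun cancel(row: JobRow): Boolean =
    guardedUpdate(row, setOf(JobStatus.NEW), JobStatus.CANCELLED)

// Delivery webhook: accepts QUEUED as a source in case SENT was missed.
// A Postmark retry finds the guard unmatched and is a no-op; the handler
// answers 200 either way.
fun onDelivered(row: JobRow): Int {
    guardedUpdate(row, setOf(JobStatus.QUEUED, JobStatus.SENT), JobStatus.DELIVERED)
    return 200
}
```

Because the guard is the only coordination mechanism, a cancel that arrives after the send has started simply fails to match, which is the race v1 accepts.
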
Phase 6 — Integration Wiring
| # | Decision | Status | Reference / Note |
|---|---|---|---|
| 68 | TenantProvisioner (L2) shape: provision, decommission, verifyDns, checkAvailability methods | Resolved | DQ-201, DQ-201.a, DQ-205.d, DQ-205.i |
| 69 | EmailSender (L2) shape: sendOne(serverToken, message) | Resolved | DQ-201 |
| 70 | Provisioning step ordering (Postmark first, then Route53) | Resolved | DQ-205.k |
| 71 | UPSERT semantics in Route53 proxy | Resolved | DQ-205.m |
| 72 | Decommission step ordering (Route53 first, then Postmark) | Resolved | DQ-205.d, DQ-205.k |
| 73 | ProvisionedExternals + PartialProgress structured Result types | Resolved | DQ-201.d |
| 74 | DnsVerificationLoop shape: in-process bounded coroutines spawned by L3 (no separate class) | Resolved (revised) | DQ-207, DQ-208 (superseded) |
| 75 | L2 error type taxonomy and mapping from L1 errors | Resolved | DQ-260.a. No bespoke L2 sealed hierarchy. Add AppError.Application<T> to common-module (cross-repo deliverable); TenantProvisioner.provision uses AppError.Application<PartialProgress>; other L2 methods use existing AppError subtypes directly. |
| 76 | Logging at L2 vs L1 boundary | Resolved | DQ-260.b. L1 logs the wire call; L2 logs the capability boundary; L3 logs the business event. Avoids triple-logging; correlation via CallId plugin. Redaction inherits DQ-220.h. |
| 77 | E2E test data strategy | Resolved (subsumed) | DQ-260.c. Subsumed by DQ-220.g; no live integration tests at L1, so per-tenant E2E hygiene is moot. |
| 78 | Async DNS verification observability metric | Resolved | DQ-260.d. Same answer as DQ-240.i; gauge emission via existing operations metrics pipeline; no push notifications in v1. |
| 79 | Stuck-PROVISIONING-row alert | Resolved | DQ-260.e. Same answer as DQ-240.i; same exposure pipeline. |
| 80 | Local Development Stub: stub for ESP + Route53 + DataAuthority interactions; how it ties into the dev cluster | Resolved | DQ-260.f — cross-reference to DQ-250.l. WireMock + LocalStack containers brought up by helmInstallToLocal; the wired-up L1+L2+L3+L4 surface runs unchanged against the stubbed endpoints. |
| 81 | Resource adoption on retry (Variant from DQ-205.n) | Deferred to v2 | DQ-205.n. Reaffirm in v1 plan. |
| 82 | Async reconciler (Strategy 3 from DQ-205) | Deferred to v2 | Reaffirm. |
| 83 | Watchdog for stuck PROVISIONING rows (DQ-205.f, DQ-207’s loop extension) | Deferred to v2 | Reaffirm. |
| 84 | Cross-pod coordination revisit (currently no coordination per DQ-207.e) | Deferred to v2 | If multi-pod issues observed in production, revisit. |
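
Rows 70, 72 and 73 describe an ordered provision, an inverse-order best-effort decommission, and a structured partial-progress result. The sketch below illustrates that shape under stated assumptions: `Step`, `PartialProgress` (its fields in particular) and `ProvisioningFailed` are hypothetical stand-ins, and the real `TenantProvisioner` would carry partial progress through `AppError.Application<PartialProgress>` and not through an exception class.

```kotlin
// Hypothetical step model: each external resource can be created and removed.
class Step(val name: String, val create: () -> Unit, val remove: () -> Unit)

// Assumed fields; the real PartialProgress type is defined by DQ-201.d.
data class PartialProgress(val completed: List<String>, val failedStep: String)

class ProvisioningFailed(val progress: PartialProgress, cause: Throwable) :
    Exception("provisioning failed at ${progress.failedStep}", cause)

// Runs steps in the given order (Postmark steps first, then Route53) and
// stops at the first failure, reporting what had already succeeded.
fun provision(steps: List<Step>): Result<List<String>> {
    val completed = mutableListOf<String>()
    for (step in steps) {
        val error = runCatching { step.create() }.exceptionOrNull()
        if (error != null) {
            val progress = PartialProgress(completed.toList(), step.name)
            return Result.failure(ProvisioningFailed(progress, error))
        }
        completed += step.name
    }
    return Result.success(completed.toList())
}

// Best-effort decommission: inverse order (Route53 first), keeps going on
// failure, and returns the names of steps whose removal failed.
fun decommission(steps: List<Step>): List<String> {
    val failed = mutableListOf<String>()
    for (step in steps.asReversed()) {
        runCatching { step.remove() }.onFailure { failed += step.name }
    }
    return failed
}
```

Passing the same ordered list to both functions keeps the two orderings mirror images of each other by construction.
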
Cross-cutting (apply to all phases)
| # | Decision | Status | Reference / Note |
|---|---|---|---|
| 85 | Public exposure of Postmark / Route53 resource IDs in HTTP responses | Parked | DQ-201.e — explicitly to be resolved at API-contract pass |
| 86 | Postmark account-token rotation protocol | Deferred | DQ-200.a. Defer until after implementation is complete. Runbook captured in cross-cutting-design.md § 5.5; will be re-confirmed post-deploy when CS exercises it. |
| 87 | API contract versioning | Resolved | DQ-200.b. Standard Arda URL-path versioning (/v1/...); parallel /v2/... path on incompatible evolution; backward-compatible additions don’t require a bump. |
Summary by phase
| Phase | Resolved | Parked / Deferred | New (need DQ before implementation) |
|---|---|---|---|
| Phase 0 | 10 (10 resolved REQ-PM-* per 0-postmark-foundations/requirements.md) | 5 (rotation, decommission, multi-server, AWS-Secrets copy, per-tenant DMARC — all out-of-scope deferrals) | 0 |
| Phase 2 | 16 (8 baseline + 8 resolved DQ-220.a..h) | 0 | 0 |
| Phase 3 | 11 (6 baseline + 5 resolved DQ-230.a..e) | 0 | 0 |
| Phase 4 | 20 (10 baseline + 10 resolved DQ-240.a..j) | 1 (DQ-201.e parked) | 0 |
| Phase 5 | 23 (7 baseline + 16 resolved DQ-250.a..p) | 1 (DQ-201.e parked) | 0 |
| Phase 6 | 13 (7 baseline + 6 resolved DQ-260.a..f) | 4 (deferred to v2) | 0 |
| Cross | 1 (1 resolved DQ-200.b) | 1 (DQ-201.e parked) + 1 (DQ-200.a deferred) | 0 |
Total new decisions identified across phases: ~44 — though many are mechanical (column types, validation contracts, repository method names) rather than architectural.
The architecturally-substantive ones to flag for actual design discussion before each phase begins:
- DQ-240.a / DQ-250.c — bitemporal pattern application for both entities (non-trivial in Arda; affects schema and DataAuthority API).
- DQ-250.b — attachment storage strategy in detail (inline confirmed, but column type and limits).
- DQ-250.h — cancel-while-sending race window (accept or guard?).
- DQ-260.a — L2 error type taxonomy (drives all callsite handling).
- DQ-260.d / .e — observability metrics / alert pipeline operationalization.
The rest are mostly mechanical / contract-definition decisions that surface naturally during the per-phase planning artifacts (requirements, specification, exports) without needing further design discussion.
How to use this document
- At the start of each phase’s planning pass, take the phase’s New rows and convert each into a short open-question entry in that phase’s decision-log.md (or analogue). Each becomes a DQ to be resolved before the phase’s specification is frozen.
- The Resolved column becomes the cross-link list at the top of the phase’s requirements.md: “this phase consumes the following decisions: <list>”.
- Parked rows are reflected in the phase’s requirements.md as explicit gaps with a pointer to the parking decision.
- Deferred rows are reflected in the phase’s future-work.md (or equivalent); they are not v1 scope, but are tracked for v2.
The numbering scheme (DQ-22x, DQ-23x, etc.) is provisional. When each phase is opened, the planning agent assigns final identifiers and updates this document in-place to keep cross-references stable.
Copyright: © Arda Systems 2025-2026, All rights reserved