Decision Log: Email Integration

Purpose

Tracks design decisions for the Arda Email Integration project, covering domain structure, sending model, tenant isolation, address handling, and subsystem responsibilities.

Decision Table

#	Question	Status	Decision	Round
DQ-001	Tenant sending domain structure	Decided	`<tenant>.<partition>.{mail-root-domain}` uniformly (see DQ-010)	R1
DQ-002	Multi-config domain strategy	Decided	Sub-subdomain, deferred to v2+	R1
DQ-003	Tenant slug source	Decided	From provisioning request (tenantEId, tenantName, tenantSlug); algorithm deferred	R1
DQ-004	Reply-To editability	Decided	Not user-editable	R1
DQ-005	Email order send paths	Decided	Copy-paste (existing) + system send (new)	R1
DQ-006	CS alerting scope in v1	Decided	ESP OOTB only; Arda-built is v2+	R1
DQ-007	Document generation responsibility	Decided	Calling feature, not email capability	R1
DQ-008	Send dialog interaction model	Decided	Single-step (no separate confirm)	R1
DQ-009	Mail root domain choice	Decided	`ardamails.com` (implementation parametric)	R1
DQ-010	Prod tenant zone placement	Decided	Own partition zone (`prod.{mail-root-domain}`), not root zone	R1
DQ-011	Webhook authentication mechanism	Decided	Bearer token via Postmark modern Webhooks API	R1
DQ-012	Per-tenant server token storage	Decided	Encrypted in DB (application-level), not Secrets Manager	R1
DQ-013	IAM role extraction from root stack	Decided	Do not extract; role stays in RootDnsStack	R2
DQ-R1-001	Drift workflow filename	Decided	`external-resources-drift.yml` — describes the asserted invariant	R1-Phase1
DQ-R1-002	Drift-check TypeScript location	Decided	`tools/drift-check.ts` — operator- and CI-runnable	R1-Phase1
DQ-R1-003	Operator runbook sign-off mechanism	Decided	Markdown “Operator Sign-off” section with name/date/deviations table	R1-Phase1
DQ-R1-004	Disposition of legacy parser-gated runbook	Decided	Delete in Phase 1 — no parser gate remains	R1-Phase1
DQ-R1-005	API-surface freshness cadence	Decided	At first drift-test failure attributable to surface drift, augmented by an annual review	R1-Phase1
DQ-R1-006	Locus of cross-zone NS-delegation writes	Decided	Child zone owner writes upstream via `WriteNSRecordsToUpstreamDns`; Root only owns the assume-role target	R1-Phase2
DQ-R1-007	Vault separation for Free Kanban Tool server token	Decided	Lives in `Arda-CorporateOAM` (separate vault), not `Arda-SystemsOAM`	R1-Phase1
DQ-R1-008	Adopt-vs-create the existing `ardamails.com` zone	Decided	Adopt via `cdk import` against `Z0721066239FWCD47EJDX`; CDK code mirrors the live zone’s AWS-default comment to keep the import read-only	R1-Phase2
DQ-R1-009	Postmark domain-verification target (parent vs leaf)	Decided	Verify at the Corporate-zone parent (`arda.ardamails.com`); leaf sub-domains inherit DKIM	R1-Phase3
DQ-R1-010	Locus of Corporate’s NS-delegation write (same-account)	Decided	Always go through `WriteNSRecordsToUpstreamDns` and assume the Root role even when same-account; preserves the pattern under future Corporate-account migration	R1-Phase3
DQ-R1-011	`route-53-hosted-zone.ts` → `dns-zone.ts` migration shape	Decided	Rename in place; existing callers updated in the same PR	R1-Phase3
DQ-R1-012	Corporate drift-workflow filename and scope	Decided	`corporate-drift.yml` — one workflow per instance group, exercising every asset listed in `instances/Corporate/`	R1-Phase3
DQ-R1-013	Phase A failure ordering for the Postmark server token	Decided	In-memory buffer + retries on the 1Password write; fail loud with redacted summary on permanent failure; manual operator action to recover	R1-Phase3
DQ-R1-014	`cdk.context.json` commit policy for Phase A’s outputs	Decided	Commit `cdk.context.json` — public values only, standard CDK convention, deterministic re-synth on a fresh checkout	R1-Phase3
DQ-R1-015	DMARC reporting mailbox (`rua` / `ruf`) for `_dmarc.arda.ardamails.com`	Decided	`dmarc-reports@arda.cards`; operator action to create the mailbox in Arda’s Google Workspace before Phase B deploy	R1-Phase3
DQ-R1-016	Reserved-name registry scope at `arda.ardamails.com`	Decided	Documentation-only; `corporate-cli.ts` enforces locally via a conflict-check at Phase A entry against pre-existing Postmark Sender Signatures, servers, and 1Password items	R1-Phase3
DQ-R1-017	Postmark Sender Signature granularity per partition	Decided	One Signature per partition sub-zone; leaves inherit DKIM; per-tenant Signatures deferred to Phase 5b	R1-Phase4
DQ-R1-018	`corporate-drift` rename and scope	Decided	Keep `corporate-drift`; add a parallel `runtime-platform-drift` workflow with shared reusable scripts	R1-Phase4
DQ-R1-019	Per-partition email server-token encryption key	Decided	Single SM secret per partition with native versioning; two-axis envelope `a{N}.k{SM-VERSION-ID}`; hot-swap via AWSCURRENT+AWSPREVIOUS mounts; lazy + coroutine migration; SDK fallback	R1-Phase4
DQ-R1-020	DNS-provisioning + SM-fallback IAM roles	Decided	Fresh per-purpose roles assumed via STS from the operations pod role (mirroring the image-asset-bucket `preSigningRole` pattern); trust policy = account principal + `ArnLike` on the partition role-name prefix	R1-Phase4
DQ-R1-021	Order of partition rollout	Decided	Partial order `dev` → (`stage` or `demo`, either order or in parallel) → `prod`; `kyle` excluded (partition suspended). Total order relaxed to partial order in 2026-05-13 amendment.	R1-Phase4
DQ-R1-022	Operator CLI shape for Phase 4	Decided	Integrate into `amm.sh`; extract reusable utilities shared with `corporate-cli` (no standalone `partition-mail-cli`)	R1-Phase4
DQ-R1-023	Per-tenant Postmark Sender Signature introduction (Phase 5b)	Open — TBC at Phase 5b planning	Four options (α status quo / β per-tenant v1 / γ hybrid opt-in / δ remediation-only). No Phase 4 dependency.	R1-Phase5b
DQ-R1-027	`AppError.Application` introduction	Decided	Add `sealed class Application` with `PreconditionFailed`, `PolicyRejected`, `ConflictingState` subtypes; `reportable() = emptyList()` at branch root; no HTTP-status hints.	R1-Phase5a
DQ-R1-028	`Internal.IncompatibleState` reclassification sweep	Decided	Discovery-then-classify methodology; sweep lands as the final Phase 5a PR (after the four additive minors); covers `common-module` only; major bump (10.0.0).	R1-Phase5a
DQ-R1-029	`sanitizeHeader` value-cleaning primitive	Decided	`sanitizeHeader(name, value): Result<String?>` in new `lib/api/headers/` package; composes downstream of `HeadersAllowList`; minor (`Added`; 9.3.0).	R1-Phase5a
DQ-R1-030	`TokenCipher` + `Hmac` cryptographic helpers	Decided	`companion operator fun invoke(info, materials, currentVersionId)` factory returning `Result<TokenCipher>`; AES-GCM auth-tag failure -> `Internal.IncompatibleState`; unknown `versionId` on decrypt -> `Transient.FailoverFailed` (bounded propagation lag handled by existing retry layers); `Hmac` extracted to share between `TokenCipher`, `OpaqueId`, `S3AssetService`. Minor (`Added`).	R1-Phase5a
DQ-R1-031	Idempotency helpers with native JsonElement + typed wrapper	Decided	`RawIdempotencyStore` (native `JsonElement`) + `IdempotencyStore<Req, Res>` via `inline fun typedAs()`; decode-failure -> `Result.failure(AppError.Internal.IncompatibleState)`; `Mismatch` carries `recordedRequest`; schema-evolution caller-controlled; `JSONB` storage. Minor (`Added`; 9.5.0).	R1-Phase5a

Round 1: Initial Design Decisions

DQ-001: Tenant Sending Domain Structure

Context: Each tenant needs an isolated sending domain for DKIM, SPF, and DMARC. The domain shape affects FQDN length, DNS zone management, and future extensibility. The choice of mail root domain itself is a separate decision (see DQ-009); this decision addresses the structure beneath whatever root is chosen.

Option	Description	Trade-offs
A	`<tenant>.{mail-root-domain}` (prod), `<tenant>.<partition>.{mail-root-domain}` (non-prod)	Short prod FQDNs but requires prod tenant records in the root zone (cross-account writes, mixed static/dynamic records).
B	`<tenant>.<partition>.{mail-root-domain}` uniformly for all partitions	One extra label in prod FQDNs. Consistent structure, clean IAM scoping, root zone stays static.
C	Full canonical: `<partition>.<infra>.{mail-root-domain}` per tenant	Consistent with existing Arda pattern but longest FQDNs; tenant identity buried in subdomain hierarchy.

Recommendation: Option A initially; revised to Option B after DQ-010.

Decision: Option B. Uniform <tenant>.<partition>.{mail-root-domain} across all partitions. The one-label cost in prod is outweighed by consistent zone structure, clean IAM, and a static root zone. See DQ-010 for the detailed rationale.
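As a minimal sketch of the uniform shape, the FQDN can be composed from three inputs. The helper name `tenantSendingDomain` and the `Partition` type are illustrative assumptions, not names from the codebase; the partition list is the one given in DQ-010.

```typescript
// Partitions named in DQ-010; treated as a closed set for this illustration.
type Partition = "dev" | "stage" | "demo" | "prod";

// Hypothetical helper: composes <tenant>.<partition>.<mail-root-domain>.
// The root domain is passed in rather than hard-coded, as DQ-009 requires.
function tenantSendingDomain(
  tenantSlug: string,
  partition: Partition,
  mailRootDomain: string,
): string {
  return [tenantSlug, partition, mailRootDomain].join(".");
}

const exampleFqdn = tenantSendingDomain("acme", "prod", "ardamails.com");
console.log(exampleFqdn); // acme.prod.ardamails.com
```

The same call with `"dev"` yields `acme.dev.ardamails.com`, so prod needs no special case.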

Applied to:

preliminary-exploration.md § Domain Structure (Working Assumption C) — note: exploration doc predates this revision
infrastructure.md § DNS, Tenant Domain Shape
DNS-structure diagram: see mail-dns-structure.drawio.svg in public/assets/diagrams/ (rendered inline in exploration/infrastructure.md § DNS).

DQ-002: Multi-Configuration Domain Strategy

Context: A tenant may eventually need multiple email configurations (e.g., separate sending domains for procurement vs. shipping). The v1 domain structure must not block this. Builds on DQ-001 and DQ-010, which fix the canonical Application-Runtime-tenant shape as <tenant-slug>.<partition>.{mail-root-domain}; this decision adds the <conf-slug> label.

Option	Description	Trade-offs
A	Sub-subdomain: `<conf-slug>.<tenant-slug>.<partition>.{mail-root-domain}`	Each config gets independent DKIM key and reputation. DMARC can apply at tenant level with subdomain policy. DNS hierarchy is explicit and parseable. Adds a label.
B	Composite slug: `<conf-slug>-<tenant-slug>.<partition>.{mail-root-domain}`	Flat structure at the conf-tenant boundary, shorter. But: hyphen boundary is ambiguous (parsing fragility), all configs share one DKIM key (defeats isolation purpose), no per-config DMARC override.

Recommendation: Option A — sub-subdomain preserves DKIM isolation and DNS hierarchy.

Decision: Option A. v1 provisions at <tenant-slug>.<partition>.{mail-root-domain} (single config, no <conf-slug> label; partition included per DQ-001 / DQ-010). Schema includes nullable config_slug field for v2+. Adding <conf-slug>.<tenant-slug>.<partition>.{mail-root-domain} later is additive — no migration of existing domains. Trade-off noted: if v2+ wants the default config to also live at a sub-subdomain, existing supplier address books would need updating, but this is opt-in, not forced.
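To illustrate why the sub-subdomain form is parseable and additive, the sketch below splits a sending FQDN into its labels relative to the root. `parseSendingDomain` and `SendingDomainParts` are hypothetical names; a two-label prefix is read as the v1 shape and a three-label prefix as the v2+ shape.

```typescript
interface SendingDomainParts {
  confSlug: string | null; // null for the v1 single-config shape
  tenantSlug: string;
  partition: string;
}

// Hypothetical parser. Returns null when the name is not under the root
// or when the prefix has an unexpected number of labels.
function parseSendingDomain(
  fqdn: string,
  mailRootDomain: string,
): SendingDomainParts | null {
  const suffix = "." + mailRootDomain;
  if (!fqdn.endsWith(suffix)) return null;
  const labels = fqdn.slice(0, -suffix.length).split(".");
  if (labels.length === 2) {
    const [tenantSlug, partition] = labels as [string, string];
    return { confSlug: null, tenantSlug, partition };
  }
  if (labels.length === 3) {
    const [confSlug, tenantSlug, partition] = labels as [string, string, string];
    return { confSlug, tenantSlug, partition };
  }
  return null;
}
```

Because label position alone identifies each part, no hyphen-splitting heuristic is needed, which is the fragility noted for Option B.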

Applied to:

(No surviving design artefact references this decision; recorded here for traceability.)

DQ-003: Tenant Slug Source

Context: The sending domain uses a tenant slug (<slug>.<partition>.{mail-root-domain}, per DQ-001). The slug must be DNS-safe (lowercase alphanumeric + hyphens), validated against reserved words, and permanent (changing it requires DKIM reputation re-warming and supplier address book updates).

Option	Description	Trade-offs
A	New field on Tenant entity	Explicit, decoupled from display name. Requires schema change and UI for CS to set it.
B	Derived from tenant name automatically	No new field. But: tenant names may contain spaces/special chars, derivation rules need defining, name changes would create inconsistency.
C	Provided by CS at provisioning time as separate input	No schema change on Tenant. Slug stored only in `tenant_email_config`. But: not visible in tenant management UI, potential for typos.

Recommendation: Option A — a permanent identifier deserves an explicit field with validation.

Decision: A variant of Option C rather than the recommended Option A. The tenant slug is provided as part of the provisioning request alongside tenantEId and tenantName. The slug and name may be null; the emailConfiguration service determines the final slug using a combination of the three inputs. The specific derivation algorithm is deferred to implementation. The slug is stored on the EmailConfiguration entity, not on the Tenant entity.
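The derivation algorithm is deferred, but the validation constraints from the Context can be sketched independently. In the sketch below, the reserved-word list is purely illustrative, and the 63-character limit and the ban on leading or trailing hyphens come from general DNS label rules rather than from this log.

```typescript
// Illustrative reserved words only; the real list is not specified in this log.
const RESERVED_SLUGS = new Set(["www", "mail", "admin"]);

// One DNS label: lowercase letters, digits, and hyphens, 1 to 63 characters,
// with no hyphen in the first or last position.
const DNS_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

// Hypothetical validator for a candidate tenant slug.
function isValidTenantSlug(slug: string): boolean {
  return DNS_LABEL.test(slug) && !RESERVED_SLUGS.has(slug);
}

console.log(isValidTenantSlug("acme-tools")); // true
console.log(isValidTenantSlug("Acme Corp")); // false
```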

Applied to:

functional.md § Service: emailConfiguration, ProvisionRequest data type
architectural-scenarios.md § Scenario 1

DQ-004: Reply-To Editability

Context: When sending an order by email, should the user be able to edit the Reply-To address in the send dialog?

Option	Description	Trade-offs
A	Editable (To, Cc, Reply-To all editable)	Maximum flexibility. Risk: user sets Reply-To to an address they don’t control, replies go to wrong person.
B	Read-only (To and Cc editable, Reply-To resolved by system)	Controlled. Reply-To is always the procurement contact or user’s own email. v2+: tenant-configured functional address.

Recommendation: Option B — Reply-To should be system-controlled to prevent misdirected replies.

Decision: Option B. Reply-To resolved in order: (1) procurement.email from order header, (2) user email from JWT/ApplicationContext. Displayed as read-only in send dialog. v2+: tenant may configure a functional Reply-To (e.g., “procurement inbox”).
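The resolution order can be sketched as a single function. `ReplyToInputs` and `resolveReplyTo` are hypothetical names, and treating a blank procurement email as absent is an assumption of this sketch.

```typescript
interface ReplyToInputs {
  procurementEmail?: string | null; // procurement.email from the order header
  userEmail: string; // user email from the JWT / ApplicationContext
}

// (1) procurement email when present, otherwise (2) the sending user's email.
// Assumption: a blank or whitespace-only procurement email counts as absent.
function resolveReplyTo(inputs: ReplyToInputs): string {
  const procurement = inputs.procurementEmail?.trim();
  return procurement ? procurement : inputs.userEmail;
}

console.log(
  resolveReplyTo({ procurementEmail: null, userEmail: "user@example.test" }),
); // user@example.test
```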

Applied to:

product/features/general-behaviors/email-communications.md (feature; not yet authored) § Sending Model
product/features/procurement/email-orders.md (feature; not yet authored) § Recipient Resolution, Requirements FR-0004
product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0001::0003
product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0001::0004

DQ-005: Email Order Send Paths

Context: Email orders currently use a copy-paste workflow (side panel renders text, user copies to their own client). The new email capability adds system-send. Should copy-paste be removed?

Option	Description	Trade-offs
A	Replace copy-paste with system send	Simpler UX, one path. But: breaks existing workflow, users who prefer their own client lose that option.
B	Both paths coexist	Backward compatible. Copy-paste preserved for email orders; system send added as new option. PO orders are system-send only (no existing copy-paste path for PO).

Recommendation: Option B — backward compatibility with no user disruption.

Decision: Option B. Copy-paste is the existing path that stays as-is. System send is a new parallel path. For orderMethod=PURCHASE_ORDER, only system send is available (PDF attachment requires system involvement).
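A sketch of how the available paths might be derived from the order method follows. Only `PURCHASE_ORDER` appears in this log; the `EMAIL` value and the `SendPath` names are assumptions made for illustration.

```typescript
type OrderMethod = "EMAIL" | "PURCHASE_ORDER"; // "EMAIL" is an assumed value name
type SendPath = "COPY_PASTE" | "SYSTEM_SEND"; // assumed path names

// Hypothetical helper: which send paths the UI may offer for an order.
function availableSendPaths(orderMethod: OrderMethod): SendPath[] {
  switch (orderMethod) {
    case "PURCHASE_ORDER":
      // The PDF attachment requires system involvement, so no copy-paste path.
      return ["SYSTEM_SEND"];
    case "EMAIL":
      // Existing copy-paste path stays; system send is added alongside it.
      return ["COPY_PASTE", "SYSTEM_SEND"];
  }
}
```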

Applied to:

product/features/procurement/email-orders.md (feature; not yet authored) § Overview, Requirements FR-0011, FR-0012
product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0002::0002

DQ-006: CS Alerting Scope in v1

Context: The feature specifies bounce rate > 5% and complaint rate > 0.1% thresholds triggering CS alerts. Should Arda build this alerting in v1?

Option	Description	Trade-offs
A	Arda-built alerting from day one	Full control, custom thresholds. Engineering cost in v1.
B	Rely on ESP’s built-in alerting in v1, Arda-built in v2+	Postmark provides bounce/complaint alerting OOTB via its console. No engineering cost. Less customizable.

Recommendation: Option B — Postmark’s console alerting is sufficient for v1 at 100-150 tenants.

Decision: Option B. v1 relies on Postmark’s built-in alerting. Arda-built alerting with configurable thresholds is v2+.

Applied to:

product/features/general-behaviors/email-communications.md (feature; not yet authored) § Administration
product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0004::0003

DQ-007: Document Generation Responsibility

Context: For PO-by-email, a PDF must be generated and attached. Should the general email capability generate documents, or receive them pre-generated?

Option	Description	Trade-offs
A	Email capability generates documents	Centralized, but couples email to PDF pipeline. Email capability needs to know about order rendering.
B	Calling feature generates document, passes Blob/URL to email capability	Clean separation. Email capability is document-agnostic. Calling feature handles generation errors before invoking email.

Recommendation: Option B — email capability should not know about document types.

Decision: Option B. The calling feature generates the PDF and passes it as a Blob or URL. If generation fails, the calling feature handles the error; email capability is never invoked.
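The boundary can be sketched as follows. All type and function names here (`Attachment`, `Outcome`, `sendPurchaseOrder`) are hypothetical; the point is only that the email capability receives an opaque attachment and is not called when generation fails.

```typescript
// The email capability sees only an opaque attachment: inline bytes or a URL.
type Attachment =
  | { kind: "blob"; filename: string; contentType: string; bytes: Uint8Array }
  | { kind: "url"; filename: string; contentType: string; url: string };

type Outcome<T> =
  | { status: "ok"; value: T }
  | { status: "error"; error: string };

// Hypothetical calling-feature flow: generate first, invoke email only on success.
function sendPurchaseOrder(
  generatePdf: () => Outcome<Attachment>,
  sendEmail: (attachment: Attachment) => void,
): Outcome<null> {
  const doc = generatePdf();
  if (doc.status === "error") {
    // The calling feature owns this error; the email capability is never invoked.
    return { status: "error", error: `document generation failed: ${doc.error}` };
  }
  sendEmail(doc.value);
  return { status: "ok", value: null };
}
```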

Applied to:

product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0002::0002
product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0003::0002

DQ-008: Send Dialog Interaction Model

Context: Should the send flow have separate “edit addresses” and “confirm send” steps, or a single combined dialog?

Option	Description	Trade-offs
A	Two steps: address resolution → confirmation dialog	Explicit separation. But: unnecessary friction if defaults are correct — user clicks through two dialogs to send.
B	Single-step dialog with editable fields + preview	One interaction: if defaults are correct, user just hits “Send.” Cancel with edits prompts for confirmation.

Recommendation: Option B — minimize friction for the happy path.

Decision: Option B. Single-step send dialog with To/Cc editable, Reply-To read-only, content preview. Cancel prompts if edits were made.

Applied to:

product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0001::0003 (merged from former 0003+0004)
product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0001::0001

DQ-009: Mail Root Domain Choice

Context: All tenant sending domains are subdomains of a root mail domain. The choice of root domain affects reputation separability from the app domain (arda.cards), DNS delegation mechanics, and FQDN length.

Option	Description	Trade-offs
A	`mail.arda.cards` (subdomain of app domain)	No new domain registration. Shorter FQDNs if tenants are already familiar with `arda.cards`. But: shares reputation baseline with `arda.cards` — a deliverability incident on the app domain could affect mail, and vice versa. NS delegation from GoDaddy apex.
B	Standalone domain (e.g., `arda-mail.com` or similar)	Fully independent reputation from app domain. Clean separation for compliance or brand reasons. But: requires new domain registration and management. Tenants see an unfamiliar domain.
C	Other subdomain of `arda.cards` (e.g., `email.arda.cards`, `send.arda.cards`)	Same trade-offs as Option A with a different label.

Recommendation: Option B — standalone domain for full reputation separation.

Decision: Option B. ardamails.com (already owned, registered with Route 53 in the platformRoot account). Implementation must be parametric on the root domain value so it can be changed later if needed. The {mail-root-domain} parameter in infrastructure.md resolves to ardamails.com.

Applied to:

infrastructure.md § Parameters (entire document parametrized)
All documents using {mail-root-domain} notation

DQ-010: Prod Tenant Zone Placement

Context: The original design (exploration doc, Working Assumption C) placed prod tenant records directly in the root zone ({mail-root-domain}) to achieve shorter prod FQDNs (3 labels: acme.ardamails.com). Non-prod partitions each had their own delegated zone. This creates an asymmetry where the root zone contains both static infrastructure records (SPF, DMARC, NS delegations) and runtime-provisioned tenant records, and the operations service in Alpha001 needs write access to a zone in platformRoot.

Option	Description	Trade-offs
A	Prod tenants in root zone (original)	Shorter prod FQDNs (3 labels). But: root zone mixes static and dynamic records, prod provisioning needs cross-account write access to platformRoot, IAM scoping is more complex, root zone is not CDK-only.
B	Prod gets its own partition zone (`prod.{mail-root-domain}`)	One extra label in prod FQDNs (4 labels: `acme.prod.ardamails.com`). Uniform structure across all partitions, clean IAM (Alpha001 writes to its own zones), root zone stays static/CDK-only, no cross-account writes for tenant records.

Recommendation: Option B — consistency, clean IAM boundaries, and a static root zone outweigh one label of FQDN length.

Decision: Option B. All partitions (dev, stage, demo, prod) get their own delegated zone under {mail-root-domain}. The root zone contains only NS delegations and parent SPF/DMARC records — no runtime-provisioned records. This supersedes the “Working Assumption C” FQDN shape from the exploration doc for prod.

Applied to:

DQ-001 — revised from Option A to Option B
infrastructure.md § DNS (Tenant Domain Shape table, zone tables, IAM scoping)
DNS-structure diagram: see mail-dns-structure.drawio.svg in public/assets/diagrams/ (rendered inline in exploration/infrastructure.md § DNS).

DQ-011: Webhook Authentication Mechanism

Context: Postmark sends delivery status events (Delivery, Bounce, SpamComplaint) to a webhook URL on the Arda backend. The endpoint must verify that incoming requests are genuinely from Postmark. Postmark does not sign webhook payloads (no HMAC/signature). Two authentication approaches are available.

Option	Description	Trade-offs
A	HTTP Basic Auth credentials embedded in the webhook URL	Supported via legacy server-level fields (`DeliveryWebhook` etc.) and modern API (`HttpAuth` field). Credentials appear in URL strings, which may be logged by proxies and access logs. Requires a new credential type separate from existing API auth.
B	Bearer token via `HttpHeaders` on the modern Webhooks API	Configured via `POST /webhooks` with `HttpHeaders: [{"Name": "Authorization", "Value": "Bearer <token>"}]`. Reuses the existing `ARDA_API_KEY` validation already implemented in the backend. No credentials in URL strings. Requires the modern Webhooks API (not the legacy server-level fields).

Recommendation: Option B — reuses existing auth infrastructure, cleaner security posture.

Decision: Option B. Use Bearer token authentication via the modern Postmark Webhooks API. The token can be the same ARDA_API_KEY already used for API authentication, validated by the same backend mechanism. Webhooks are configured per server during provisioning via POST /webhooks (Server Token), not via the legacy server-level URL fields.
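The two halves of this decision can be sketched together: the request body that registers the webhook, and the inbound check on the backend. The `HttpHeaders` shape is taken from the option table above; the `Triggers` field names reflect a reading of Postmark's Webhooks API and should be checked against the current reference. The function names and the hand-rolled constant-time comparison are assumptions of this sketch, not the backend's actual `ARDA_API_KEY` validation.

```typescript
// Builds the JSON body for POST /webhooks (sent with the per-server token).
// The URL and the set of enabled triggers are illustrative.
function buildWebhookRequest(webhookUrl: string, bearerToken: string) {
  return {
    Url: webhookUrl,
    HttpHeaders: [{ Name: "Authorization", Value: `Bearer ${bearerToken}` }],
    Triggers: {
      Delivery: { Enabled: true },
      Bounce: { Enabled: true, IncludeContent: false },
      SpamComplaint: { Enabled: true, IncludeContent: false },
    },
  };
}

// Compares two strings without exiting early on the first mismatch.
// Out-of-range charCodeAt yields NaN, which bitwise operators treat as 0.
function constantTimeEquals(a: string, b: string): boolean {
  const len = Math.max(a.length, b.length);
  let diff = a.length ^ b.length;
  for (let i = 0; i < len; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Hypothetical inbound check for the postmark-events endpoint.
function isAuthorizedWebhook(
  authorizationHeader: string | undefined,
  expectedToken: string,
): boolean {
  if (!authorizationHeader || expectedToken.length === 0) return false;
  return constantTimeEquals(authorizationHeader, `Bearer ${expectedToken}`);
}
```

Because the token travels in a header rather than in the URL, it stays out of the proxy and access logs that Option A would expose it to.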

Applied to:

postmark-service.md § Webhook Authentication, § Step 5: Configure Webhooks, § Provisioning Sequence, § Legacy vs Modern Webhook API
functional.md § postmark-events endpoint, § Tenant Provisioning

DQ-012: Per-Tenant Server Token Storage

Context: Each Postmark server has an API token used at runtime to send email. This token must be stored securely. The operations service follows the ESO pattern where secrets are delivered to the pod at startup via External Secrets Operator, not fetched from Secrets Manager at runtime.

Option	Description	Trade-offs
A	Per-tenant Secrets Manager secrets	Each provisioning creates a new SM secret. ESO would need to sync all per-tenant secrets, or the service would need runtime SM read access (breaking the ESO pattern). IAM write access needed during provisioning. Scales poorly (N secrets per N tenants).
B	Encrypted in database (Aurora volume encryption only)	Tokens stored as plaintext columns, encrypted at the storage layer by Aurora’s KMS-backed volume encryption. Sufficient against disk theft, but plaintext to any DB user with SELECT access. No additional key management.
C	Encrypted in database (application-level encryption)	Service encrypts tokens with a partition-wide symmetric key before INSERT, decrypts after SELECT. The encryption key is a single static secret delivered via ESO at startup. DB dumps and SQL injection do not expose raw tokens. Key rotation is one key, not N.

Recommendation: Option C — maintains the ESO pattern (one static secret), eliminates per-tenant SM writes, and adds defense-in-depth beyond Aurora volume encryption.

Decision: Option C. Per-tenant server tokens are encrypted with a partition-wide encryption key and stored in the serverTokenEncrypted column of tenant_email_config. The encryption key is created by CDK in Secrets Manager and delivered to the pod via ESO as extras.email.encryptionKey in HOCON config. Only the emailConfiguration service handles encryption/decryption; the emailJob service calls emailConfiguration.getActiveConfiguration() to receive the decrypted token as an in-memory value.
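Application-level encryption of this kind can be sketched with AES-256-GCM using Node's built-in `crypto` module. The function names and the storage layout (base64 of IV, auth tag, then ciphertext) are assumptions of this sketch; this log does not specify how the serverTokenEncrypted value is laid out.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypts a token with AES-256-GCM using a 32-byte partition-wide key.
// Assumed layout: base64(iv[12] || authTag[16] || ciphertext).
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh random IV per encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString("base64");
}

// Reverses encryptToken. Throws if the key is wrong or the value was tampered with.
function decryptToken(encoded: string, key: Buffer): string {
  const raw = Buffer.from(encoded, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Under this layout a database dump yields only opaque base64 values, and the authentication tag causes a modified row to fail decryption instead of returning a corrupted token.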

Applied to:

infrastructure.md § AWS Secrets Manager (SM-3, SM-4 added; per-tenant SM writes removed; IAM-3 removed)
functional.md § Email Configuration (Secret Storage section, internal service method)
postmark-service.md § Authentication, § Step 6, § Provisioning Sequence
architectural-scenarios.md § Scenario 1 (encrypt + persist), § Scenario 2 (getActiveConfiguration with decrypted token)

Round 2: Infrastructure Implementation Decisions

DQ-013: IAM Role Extraction from Root Stack

Context: The root CDK stack (RootConfiguration) contains both DNS hosted zones and the AllowCreatingNSRecordsRole IAM role used for cross-account NS delegation. As this project renames the stack class to RootDnsStack and adds new stacks to the root application, the question is whether to extract the IAM role to a dedicated RootSecurityStack for cleaner separation of concerns.

Option	Description	Trade-offs
A	Extract role to `RootSecurityStack` via two-step deploy	Cleaner separation. But: requires two-step deploy due to IAM physical name collision. Creates a 2-5 minute window where the role doesn’t exist. If step 2 fails, role is gone until manually recreated.
B	Extract role to `RootSecurityStack` via CloudFormation stack refactoring	Cleaner separation. No danger window — the role transfers ownership atomically without delete/recreate. But: requires a manual CloudFormation operation outside the CDK workflow, followed by CDK code realignment. Relatively new AWS feature; should be tested in non-production first.
C	Keep role in `RootDnsStack`	No migration work. Role is conceptually tied to DNS delegation (it enables writing NS records). A `RootSecurityStack` with a single resource doesn’t justify the extra work in this project’s scope.

Recommendation: Option C for this project. Option B is the viable future path when extraction is justified.

Decision: Option C. The AllowCreatingNSRecordsRole stays in RootDnsStack (CloudFormation name: RootConfiguration). The role is functionally tied to DNS delegation and is acceptable in the DNS stack. Extraction is known to be operationally safe via CloudFormation stack refactoring (Option B), but adds complexity for no immediate functional benefit. When a RootSecurityStack is needed for additional security resources, use stack refactoring to move the role atomically. See root-refactor-analysis.md for the full analysis.

Applied to:

infrastructure/root-refactor-analysis.md (full analysis)
infrastructure/specification.md § Task 3 (root stack rename only, no role extraction)
infrastructure/analysis.md § Root Configuration

Round R1-Phase1: External Resources Provisioning Decisions

DQ-R1-001 through DQ-R1-005 resolve the Open Questions in 1-external-resources/specification.md § 5. DQ-R1-007 is an additional Phase 1 decision captured in the same round (vault separation for the Free Kanban Tool server token). All entries follow the DQ-R1-NNN convention introduced in architecture-overview.md § 10.

DQ-R1-001: Drift Workflow Filename

Context: The CI workflow that asserts the live external-resource invariants needs a stable filename. Phase 1 originally raised three candidates (external-resources-drift.yml, phase-1-drift.yml, op-drift.yml).

Decision: external-resources-drift.yml. The filename describes the invariant asserted (drift of the external resources Arda consumes), not the phase that introduced the workflow. This keeps the filename stable across phases as the workflow evolves.

Applied to:

DQ-R1-002: Drift-Check TypeScript Module Location

Context: The drift-check module is dual-purpose: an operator runs it locally with 1Password DesktopAuth; CI runs it with OP_SERVICE_ACCOUNT_TOKEN. Two candidate locations existed in the infrastructure repo: scripts/drift-check.ts (alongside legacy script utilities) or tools/drift-check.ts (a fresh top-level convention).

Decision: tools/drift-check.ts. The module is operator-runnable in addition to CI-runnable, and the tools/ convention better matches the dual-purpose nature than scripts/ (which the prior implementation largely used for one-shot orchestrators). The tools/ convention is forward-compatible with the eventual move of scripts/gha-secrets/ to tools/gha-secret.ts (out of scope of this project but on the trajectory).
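The dual-purpose nature can be illustrated by how such a module might pick its 1Password auth mode. `OpAuth` and `selectOpAuth` are hypothetical names, not taken from tools/drift-check.ts, and the sketch stops short of constructing a 1Password client.

```typescript
type OpAuth =
  | { mode: "service-account"; token: string } // CI: OP_SERVICE_ACCOUNT_TOKEN is set
  | { mode: "desktop" }; // operator: authenticates through the 1Password desktop app

// Hypothetical selector: CI is recognized by the presence of the service-account token.
function selectOpAuth(env: Record<string, string | undefined>): OpAuth {
  const token = env["OP_SERVICE_ACCOUNT_TOKEN"];
  return token ? { mode: "service-account", token } : { mode: "desktop" };
}
```

A caller would pass `process.env`, so the same entry point runs unchanged on an operator laptop and in the workflow.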

Applied to:

DQ-R1-003: Operator Runbook Sign-Off Mechanism

Context: REQ-OPS-003 requires the runbook to capture sign-off (operator name, date, deviations) so the document is itself the audit record. Three encoding options were considered: a code block, a YAML frontmatter field, or a designated Markdown section with a small table.

Decision: A designated ## Operator Sign-Off section containing a Markdown table with columns Step / Operator / Date / Deviations / Notes, with one pre-populated empty row per REQ-EXT-NNN. The table is human-readable, diff-friendly under git, and does not require new tooling. YAML frontmatter would conflict with Starlight’s required frontmatter schema and would not naturally express per-step rows.
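For illustration, the section described above could be generated as below. `renderSignOffSection` is a hypothetical helper and the step IDs passed in are placeholders following the REQ-EXT-NNN pattern.

```typescript
// Renders the sign-off section with one empty row per step ID.
function renderSignOffSection(stepIds: string[]): string {
  const header = "| Step | Operator | Date | Deviations | Notes |";
  const divider = "| --- | --- | --- | --- | --- |";
  const rows = stepIds.map((id) => `| ${id} |  |  |  |  |`);
  return ["## Operator Sign-Off", "", header, divider, ...rows].join("\n");
}

console.log(renderSignOffSection(["REQ-EXT-001", "REQ-EXT-002"]));
```

Each operator edit then shows up as a one-line change to a single row, which is what makes the table diff-friendly under git.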

Applied to:

1-external-resources/specification.md § Task 5
current-system/oam/postmark-service/operator-runbook.md (Phase 1 deliverable)

DQ-R1-004: Disposition of Legacy Parser-Gated Runbook

Context: The prior Phase-0 implementation maintained a parser-gated operator runbook (HUMAN-STEPS.md) under infrastructure/scripts/postmark-foundations/, whose state was enforced by a TypeScript parser as a CI gate. REQ-OPS-004 retires the parser gate entirely; the runbook in the documentation repo becomes the canonical operator artefact.

Decision: Delete the parser-gated runbook and its parser code in Phase 1, gated on the canonical runbook (current-system/oam/postmark-service/operator-runbook.md) being merged. Two-step ordering preserves operator availability during the cut-over: docs land first, then the legacy artefact is removed in the infrastructure PR (T-C6 in the task plan).

Applied to:

DQ-R1-005: API-Surface Freshness Cadence

Context: The API observations note (postmark-api-observations.md) records observed Postmark API behaviour. Surface drift (Postmark adding/changing endpoints) would invalidate parts of the note. The question is when to refresh: annually, on every Postmark major-update post, or on first drift-test failure attributable to surface drift.

Decision: Refresh on first drift-test failure attributable to surface drift, augmented by an annual review. A scheduled-only cadence (annual) without the failure trigger would let regressions sit unnoticed for up to a year; a per-update cadence would create unnecessary documentation churn since most Postmark updates do not affect the small surface Arda uses. The combination keeps the note current where it matters and bounds staleness.
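The combined trigger reduces to a simple disjunction, sketched below. `refreshDue` is a hypothetical name, and approximating "annual" as 365 days is an assumption of this sketch.

```typescript
const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000; // "annual" approximated as 365 days

// A refresh is due on a drift-test failure attributed to surface drift,
// or once a year has passed since the last review, whichever comes first.
function refreshDue(
  lastReviewed: Date,
  now: Date,
  surfaceDriftFailure: boolean,
): boolean {
  const annualReviewDue = now.getTime() - lastReviewed.getTime() >= ONE_YEAR_MS;
  return surfaceDriftFailure || annualReviewDue;
}
```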

Applied to:

current-system/oam/postmark-service/postmark-api-observations.md (Phase 1 deliverable; freshness cadence noted in version-pin section)

DQ-R1-007: Vault Separation for Free Kanban Tool Server Token

Context: The Free Kanban Tool sends transactional email from freekanban.arda.ardamails.com. Its Postmark server token is the runtime sending credential — a leak yields the ability to send arbitrary email under that domain. The original cross-cutting design placed this item in Arda-SystemsOAM alongside the OAM-tier credentials (Postmark account tokens, IAC service-account tokens). OP_SERVICE_ACCOUNT_TOKEN — the GitHub Actions secret that authenticates CI to 1Password — is scoped read-only to Arda-SystemsOAM. So the Free Kanban server token sat in the same blast radius as every other OAM credential, contradicting the bounded-blast-radius framing in cross-cutting-design.md § 2.5.

Discovered: 2026-05-05, during the Phase 1 operator-walkthrough preparation. Re-running tools/drift-check.ts locally surfaced the placement and prompted a re-evaluation of vault scoping for runtime credentials.

Option	Description	Trade-offs
A	Keep the item in `Arda-SystemsOAM`.	One vault to manage. But: Free Kanban Tool’s runtime credential is reachable by `OP_SERVICE_ACCOUNT_TOKEN`, which expands the blast radius of any CI compromise to include the live sending key.
B	Move the item to a dedicated `Arda-CorporateOAM` vault. The Free Kanban Tool’s runtime resolves the credential via its own SDK auth path; `OP_SERVICE_ACCOUNT_TOKEN` does not have read access to this vault.	One additional vault to provision. The Free Kanban server token is now isolated from the OAM-tier credentials; a CI / `OP_SERVICE_ACCOUNT_TOKEN` compromise does not yield it. Matches the rev1 design intent: deploy-time / OAM credentials in `Arda-SystemsOAM`; runtime sending credentials in instance-group-scoped vaults.

Recommendation: Option B — bounded blast radius outweighs the single-vault simplicity.

Decision: Option B. The Free Kanban Tool’s Postmark server token lives at:

Field	Value
Vault	`Arda-CorporateOAM`
Item title	`Free-Kanban-Generator-Postmark-Server`
Field	`credential`
Canonical reference	`op://Arda-CorporateOAM/Free-Kanban-Generator-Postmark-Server/credential`

The vault was provisioned 2026-05-05 (operator action by Miguel). The 1Password item itself is created by Phase 3 (Corporate CLI Phase A writes the Postmark server token into the item the first time it runs). Phase 1 does not create or assert the existence of this item.

This decision establishes a vault-naming convention that future instance groups follow: Arda-<InstanceGroup>OAM for runtime sending credentials owned by that instance group. The existing partition-scoped vaults (Arda-DevOAM, Arda-StageOAM, Arda-DemoOAM, Arda-ProdOAM, Arda-SandboxKyle) already follow this pattern; Arda-CorporateOAM extends it to the new Corporate Resource Group.
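A minimal sketch of how the naming convention and the canonical reference relate. The helper names are hypothetical and are not the typed reference in platform/one-password.ts; only the vault, item title, and field values are taken from the table above.

```typescript
// Hypothetical helpers for illustration only.
interface OnePasswordItemRef {
  vault: string;
  item: string;
  field: string;
}

// Convention from this decision: Arda-<InstanceGroup>OAM.
function vaultForInstanceGroup(instanceGroup: string): string {
  return `Arda-${instanceGroup}OAM`;
}

// 1Password secret-reference syntax: op://<vault>/<item>/<field>.
function opReference(ref: OnePasswordItemRef): string {
  return `op://${ref.vault}/${ref.item}/${ref.field}`;
}

const freeKanbanPostmarkServer: OnePasswordItemRef = {
  vault: vaultForInstanceGroup("Corporate"),
  item: "Free-Kanban-Generator-Postmark-Server",
  field: "credential",
};
```

Under these assumptions, opReference(freeKanbanPostmarkServer) reproduces the canonical reference listed in the table.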

Clarification on item naming within partition vaults. The Arda-SystemsOAM vault holds both Postmark accounts (Postmark-Prod and Postmark-NonProd) and therefore uses qualified item names (the account suffix disambiguates the two within the single vault). In contrast, each per-partition vault (Arda-DevOAM, Arda-ProdOAM, etc.) holds only one Postmark account reference — the one relevant to that partition — so the service-name-only item title Postmark is used (the vault name itself carries the environment). This follows the workspace CLAUDE.md 1Password vault convention: vaults are scoped by usage; store independently even when the value is currently shared.

Consequences:

Phase 1: the typed reference FREE_KANBAN_POSTMARK_ITEM is removed from infrastructure/src/main/cdk/platform/one-password.ts. Phase 1 declares only the three items it creates (Postmark-Prod, Postmark-NonProd, IAC-SCRIPTS Service Account Token). tools/drift-check.ts and the Phase 1 V-PLAT-002 test surface shrink correspondingly.
Phase 3: Corporate Updates (re)introduces the typed reference with the new vault, item title, and field. Phase 3’s spec explicitly enumerates the SDK auth path the Free Kanban Tool’s runtime uses to read the credential (out of scope of this project’s IaC, but documented for the Free Kanban Tool team).
Threat model: cross-cutting-design.md § 2.1 line 39 (“attacker holding OP_SERVICE_ACCOUNT_TOKEN reads every credential reachable from Arda-SystemsOAM”) remains true; the Free Kanban server token is no longer in that set. § 2.5 is updated to explicitly call out the vault-separation guarantee.

Applied to:

cross-cutting-design.md § 1 (defended threats) — “Free Kanban Tool sending integrity” line updated to name the new vault.
cross-cutting-design.md § 2.5 — Phase A description, item name, and isolation note updated.
cross-cutting-design.md § 4.1 secret-inventory table — vault and reference columns updated.
cross-cutting-design.md § 4.5 rotation runbook — Free Kanban Tool token step updated with new vault.
phases.md Phase 3 J1 interim mechanism — item name and vault.
architecture-overview.md § 8 Postmark-server creation — interim-mechanism description updated.
infrastructure/src/main/cdk/platform/one-password.ts — FREE_KANBAN_POSTMARK_ITEM removed (Phase 1).
infrastructure/tools/drift-check.ts — import + ALL_OP_ITEMS entry removed.
infrastructure/src/main/cdk/platform/platform.test.ts, infrastructure/tools/drift-check.test.ts — test surface adjusted from 4 items to 3.
Phase 3 planning artefacts (when authored) — the typed reference is reintroduced under platform/one-password.ts with the new vault, title, and field.

Round R1-Phase2: Root Updates Decisions

This round captures decisions made while planning Phase 2 — Root Updates.

DQ-R1-006: Locus of Cross-Zone NS-Delegation Writes

Context: The Root account owns the ardamails.com mail-root zone (Phase 2 introduces it) and the four arda.cards family zones. Child zones (arda.ardamails.com for Corporate in Phase 3; {partition}.ardamails.com per partition in Phase 4) need NS-delegation records in the parent zone. The question is which stack writes those NS records:

Option	Description	Trade-offs
A	Root stack writes the per-child NS record set. The parent stack reads each child zone’s `hostedZoneNameServers` via cross-stack import or live API lookup and writes the NS record into the parent zone.	Centralises NS records in one stack. But: creates a Phase-2-on-Phase-3 (and Phase-2-on-Phase-4) deploy-order dependency; Root cannot complete its NS-delegation writes until every child zone has been provisioned. Inverts the natural “owner of a zone owns its delegation” intuition.
B	Child stack writes its own NS record into the parent zone using a cross-account assume-role pattern. Root owns only the assume-role IAM target (`AllowCreatingNSRecordsRole`); each child stack instantiates a `WriteNSRecordsToUpstreamDns` construct that runs a Lambda + Custom Resource in the child account, assumes the Root role, and writes the parent NS record.	Matches the existing `arda.cards` family pattern (every partition’s `IngressStack` already writes its own NS records into Root’s `arda.cards` family zones). Phase 2 is fully self-contained; Phase 3 / Phase 4 depend on Phase 2 only for the role and the parent zone existence. Slightly more constructs per child stack, but the constructs already exist.

Recommendation: Option B — consistency with the existing pattern, clean dependency direction, no joint-deploy requirements between phases.

Decision: Option B. The WriteNSRecordsToUpstreamDns construct (at src/main/cdk/constructs/xgress/write-ns-records-to-upstream-dns.ts) is owned and instantiated by the child zone stack. It internally creates a Lambda execution role in the child account, a NodejsFunction from constructs/inline-lambdas/write-platform-root-ns-record.ts, and a cdk.CustomResource that on stack lifecycle events assumes the Root role (AllowCreatingNSRecordsRole, deterministic name from aws-configuration.ALLOW_WRITE_NS_RECORDS_ROLE.name) and writes / updates / deletes the NS record set in the parent zone. The child zone’s own hostedZoneNameServers token is passed in as the nameServers property — no live cross-zone lookup is required.

Consequences:

Phase 2 does not write NS records for any child zone. Its scope is limited to: renaming the existing app/stack, declaring the ardamails.com zone, exporting the zone ID and the IAM role ARN, and adding the instances/Root/dns.ts declarative configuration.
Phase 3 (Corporate) instantiates WriteNSRecordsToUpstreamDns against the ardamails.com zone with subdomain: "arda" and nameServers: arda.hostedZoneNameServers.
Phase 4 (per-partition) does the same, once per partition, with subdomain: "<partition>" and nameServers: <partition>Zone.hostedZoneNameServers.
Phase 2 → Phase 3 / Phase 4 dependency reduces to deploy order (Root must deploy first because the child stacks’ lambdas assume the Root role at deploy time).
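To make the delegation write concrete, the following sketch builds the change payload a child-side Lambda might submit to the parent zone after assuming the Root role. It is not the code in write-platform-root-ns-record.ts: the function and type names are hypothetical, the payload mirrors the shape of Route53's ChangeResourceRecordSets request, and the 300-second default TTL is an assumption because this log does not state one.

```typescript
// Hypothetical input shape; subdomain and nameServers mirror the construct properties named above.
interface NsDelegationInput {
  subdomain: string;      // "arda" for Corporate, or a partition name
  parentZoneName: string; // "ardamails.com"
  nameServers: string[];  // the child zone's own name servers, resolved at deploy time
  ttlSeconds?: number;    // assumption: no TTL is specified in this log
}

interface NsChangeBatch {
  Changes: Array<{
    Action: "UPSERT" | "DELETE";
    ResourceRecordSet: {
      Name: string;
      Type: "NS";
      TTL: number;
      ResourceRecords: Array<{ Value: string }>;
    };
  }>;
}

function nsDelegationChange(
  input: NsDelegationInput,
  action: "UPSERT" | "DELETE" = "UPSERT",
): NsChangeBatch {
  if (input.nameServers.length === 0) {
    throw new Error("child zone has no name servers to delegate to");
  }
  return {
    Changes: [
      {
        Action: action,
        ResourceRecordSet: {
          // Fully qualified name of the delegated child zone in the parent zone.
          Name: `${input.subdomain}.${input.parentZoneName}.`,
          Type: "NS",
          TTL: input.ttlSeconds ?? 300,
          ResourceRecords: input.nameServers.map((ns) => ({ Value: ns })),
        },
      },
    ],
  };
}
```

Create and update events would map to UPSERT and delete events to DELETE, which matches the write / update / delete lifecycle described in the decision.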

Applied to:

2-root-updates/specification.md — Phase 2 scope explicitly excludes NS-delegation writes.
phases.md § Phase 2 — deliverables list updated; the “NS-delegation for arda.ardamails.com” row replaced with the ardamails.com zone declaration.
phases.md § Phase 3 — Corporate Email stack deliverable extended to mention the WriteNSRecordsToUpstreamDns instantiation.

DQ-R1-008: Adopt vs. Create the existing `ardamails.com` Hosted Zone

Context: When cdk diff was run against the deployed RootConfiguration stack to validate the Phase 2 implementation (Gate 3), it surfaced an additive-only result — as expected by design. But a separate AWS investigation (motivated by an offhand challenge from the operator: “is the zone already there?”) revealed that the ardamails.com hosted zone already existed in the Root account as Z0721066239FWCD47EJDX, with two records (apex NS and SOA) and the four AWS-assigned nameservers (ns-2046.awsdns-63.co.uk, ns-944.awsdns-54.net, ns-158.awsdns-19.com, ns-1497.awsdns-59.org). The zone was auto-created by AWS Route53 Domains when the ardamails.com domain was originally registered through the registrar service.

The original Phase 2 implementation declared a brand-new r53.PublicHostedZone(this, "ArdamailsZone", {...}). Deploying as written would have created a second hosted zone for ardamails.com with a different NS set; the registrar would still have pointed at the original four nameservers, so the new zone would have been orphaned at the DNS level. The deploy-as-coded path was unsafe.

Discovered: 2026-05-05, after Gate 3 cleared. The cdk diff against the deployed stack also reported additive-only: neither the live zone nor the newly declared one was present in the deployed stack, so the synthesized template simply added a new resource. The diff could not see the duplication risk because the pre-existing zone exists in Route53 but not in CloudFormation.

Option	Description	Trade-offs
A	`cdk import` the existing zone into `RootConfiguration` (logical ID `ArdamailsZone1DCDDC15`). Zone becomes CDK/CFN-managed; no duplicate created; registrar’s NS chain preserved.	One-time operator action; zone properties must match the import target exactly; CFN’s IMPORT change-set type doesn’t allow Output additions or other resource modifications, so the deploy is two-phase (import-only template, then full deploy).
B	Reference the existing zone via `r53.HostedZone.fromHostedZoneAttributes()` and export its ID via `CfnOutput` without trying to manage it.	Zone stays outside CDK control; future record additions (root-level SPF, DMARC) require ad-hoc tooling. Doesn’t match the “Phase 2 declares the `ardamails.com` zone” intent in `phases.md`.

Recommendation: Option A.

Decision: Option A. The CDK code at src/main/cdk/stacks/root/root-dns-stack.ts was extended in two ways:

The ArdamailsZone declaration now sets comment: "HostedZone created by Route53 Registrar" — the AWS-default comment string on the live zone — so the IMPORT change-set is read-only (CFN reports Scope: [], no property writes).
applyRemovalPolicy(cdk.RemovalPolicy.RETAIN) defends the imported zone against accidental cdk destroy of the production root stack.

The root-dns-stack.test.ts file’s V-ROOT-001 was extended with a strict-equality assertion that locks the synthesized resource block to the live zone’s properties (Name, HostedZoneConfig.Comment) plus the RETAIN retention policies. Future CDK code changes that drift from the import target fail at test time.
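The strict-match idea can be illustrated without the CDK assertions library. The sketch below is not V-ROOT-001 itself: it compares a plain-object resource block against a pinned import target, and the exact synthesized shape (for example the trailing dot on Name) is an assumption about CDK output rather than something quoted from the test file.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Order-insensitive deep equality over JSON-like values.
function deepEqual(a: Json, b: Json): boolean {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const left = a as { [key: string]: Json };
  const right = b as { [key: string]: Json };
  const keys = Object.keys(left);
  if (keys.length !== Object.keys(right).length) return false;
  return keys.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(right, k) &&
      deepEqual(left[k] as Json, right[k] as Json),
  );
}

// Assumed resource block for the imported zone; property values come from this decision.
const IMPORT_TARGET: Json = {
  Type: "AWS::Route53::HostedZone",
  Properties: {
    Name: "ardamails.com.",
    HostedZoneConfig: { Comment: "HostedZone created by Route53 Registrar" },
  },
  UpdateReplacePolicy: "Retain",
  DeletionPolicy: "Retain",
};

// True only when the synthesized block has exactly the pinned properties and policies.
function matchesImportTarget(synthesized: Json): boolean {
  return deepEqual(synthesized, IMPORT_TARGET);
}
```

Because the comparison is exact, an added property fails the check just as a changed one does, which is the behaviour the strict-equality assertion is meant to provide.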

The deployment proceeded in two CFN operations:

IMPORT change-set with a stripped template (deployed-state + just the ArdamailsZone resource added; no Outputs added, no other resource modifications). Executed cleanly: Action: Import, Replacement: null, Scope: []. Stack transitioned to IMPORT_COMPLETE.
Normal cdk deploy with the full synthesized template, adding the ardamailsZone Output (publishing the arda-ardamails-zone CFN export) and reconciling CDKMetadata. Stack transitioned to UPDATE_COMPLETE. Final cdk diff reported zero differences.

Forward implications:

Phase 3’s arda.ardamails.com zone is created fresh by the Corporate Email stack (no pre-existing zone in Route53); no IMPORT detour needed.
Phase 4’s per-partition {partition}.ardamails.com zones are created fresh in each partition’s AWS account (no pre-existing zone); no IMPORT detour needed.
Future zone-creation work in this project follows the standard cdk deploy flow.

Applied to:

infrastructure/src/main/cdk/stacks/root/root-dns-stack.ts — comment + retention policy on ArdamailsZone.
infrastructure/src/main/cdk/stacks/root/root-dns-stack.test.ts — V-ROOT-001 strict-match.
infrastructure/CHANGELOG.md [2.29.0] — Added entry refined to mention the import.
2-root-updates/implementation/learnings.md, alternatives.md, skipped.md — project-completion byproducts.

Round R1-Phase3: Corporate Updates Decisions

This round captures decisions made while planning Phase 3 — Corporate Updates. All decisions resolved during the Pass-1 analysis (3-corporate-updates/analysis.md) on 2026-05-06.

DQ-R1-009: Postmark Domain-Verification Target (Parent vs Leaf)

Context: The Free Kanban Tool sends from freekanban.arda.ardamails.com. Postmark verifies sending domains via DKIM + Return-Path records published in DNS. The verification can target either the leaf sub-domain (freekanban.arda.ardamails.com) or the Corporate-zone parent (arda.ardamails.com). Verifying the parent makes leaves inherit DKIM through the parent’s signing key, removing the need for a per-leaf verification click as future Corporate consumers (HubSpot, marketing) are added.

Option	Description	Trade-offs
A	Verify each leaf sub-domain individually as it is created.	Simple per-leaf isolation; failure of one leaf’s DKIM doesn’t affect siblings. But: each new Corporate consumer requires its own verification click (or API call) and its own DKIM rotation runbook.
B	Verify once at the Corporate-zone parent (`arda.ardamails.com`); leaves inherit DKIM via the parent’s signing key.	One verification step covers every current and future leaf under `arda.ardamails.com`. Single DKIM key rotation runbook. Aligns with Postmark’s parent-domain verification semantics.

Recommendation: Option B — parent verification. Pre-decided 2026-05-05 during the Phase 1 operator-walkthrough preparation.

Decision: Option B. Phase 3’s PostmarkSendingDomain thin-wrapper registers arda.ardamails.com as the Sender Signature in PostmarkProd. The Corporate CLI invokes verifyDkim and verifyReturnPath against this parent. Leaf sub-domains (freekanban.arda.ardamails.com, future siblings) do not receive their own Sender Signature.

Applied to:

3-corporate-updates/analysis.md § “Note on what becomes ‘known to Postmark’” and gaps G-1, G-7, G-8.
operator-domain-verification-checklist.md — the stub already pointed at this decision; the just-in-time expansion at implementation time formalizes the verification target.
Phase 3 specification (Pass 2) — the PostmarkSendingDomain configuration is arda.ardamails.com, not the leaf.

Implementation note (added post-Phase-3): The first implementation pass diverged from this decision — Phase A’s CLI honored it by accident while the CDK construct silently placed the DKIM TXT under the leaf sub-domain. Surfaced by Phase B post-deploy verification when Postmark’s DKIMPendingHost did not match the deployed FQDN. The root cause was that the decision was prose-only (this entry, a docstring, a runbook) with no value or function any code consumed. Resolved by Arda-cards/infrastructure PR #450 commit cd85527: a typed source-of-truth sendingDomainPlacement() function in platform/constructs/postmark/sending-domain.ts is now consumed identically by the CLI, the CDK construct, and the drift check; cross-seam assertions in tools/corporate-drift.ts verify Postmark’s reported state agrees with the placement function. Full narrative at 3-corporate-updates/implementation/dqr1009-divergence.md; the structural lesson is captured in 3-corporate-updates/implementation/learnings.md L-1.

The scope of this decision is the Corporate instance group: verification at the Corporate-zone parent (arda.ardamails.com); leaves under it inherit. Phase 4’s per-partition Sender Signatures apply the same “verify at the instance-group parent” pattern at their own level ({partition}.ardamails.com), with each partition having its own DKIM key for receiver-side reputation isolation. The ardamails.com apex is not a verification target. The Phase 4 granularity decision is pinned in DQ-R1-017 (Round R1-Phase4).
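The "verify at the instance-group parent" rule can be expressed as a single function that every consumer calls. The sketch below is a hypothetical shape, not the project's sendingDomainPlacement() (whose signature is not shown in this log); the DKIM host follows the standard `<selector>._domainkey.<domain>` layout, and the selector value is supplied by the caller.

```typescript
// Hypothetical result shape for illustration.
interface Placement {
  signatureDomain: string; // the domain registered as the Sender Signature
  dkimHost: string;        // FQDN at which the DKIM TXT record is published
}

function placementFor(
  senderFqdn: string,
  instanceGroupParent: string,
  dkimSelector: string,
): Placement {
  const sender = senderFqdn.toLowerCase();
  const parent = instanceGroupParent.toLowerCase();
  // A sender must be the parent itself or a leaf underneath it.
  if (sender !== parent && !sender.endsWith(`.${parent}`)) {
    throw new Error(`${senderFqdn} is not under ${instanceGroupParent}`);
  }
  // Placement depends only on the parent, never on the leaf.
  return {
    signatureDomain: parent,
    dkimHost: `${dkimSelector}._domainkey.${parent}`,
  };
}
```

Because the leaf never appears in the result, a CLI, a CDK construct, and a drift check that all call such a function cannot disagree about where the DKIM record belongs.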

DQ-R1-010: Locus of Corporate’s NS-Delegation Write (Same-Account Case)

Context: DQ-R1-006 settled that the child zone owner writes the NS-delegation record upstream into the parent zone. The construct (WriteNSRecordsToUpstreamDns) was designed for the cross-account case where Application-Runtime partitions (in Alpha001 / Alpha002) write into Root’s ardamails.com zone (in platformRoot). For Phase 3, Corporate currently lives in platformRoot — the same account as Root. The question is whether Corporate’s stack still uses the assume-role construct or writes Route53 directly.

Option	Description	Trade-offs
A	Always go through `WriteNSRecordsToUpstreamDns` and assume the role even when same-account; preserves the pattern uniformly across instance groups.	One extra STS `AssumeRole` call per deploy (~tens of milliseconds, negligible). Construct behavior is invariant under the future Corporate-account migration (architecture-overview § 6.4). DQ-R1-006’s “child writes upstream” intent is preserved.
B	Branch the construct so same-account writes skip the assume-role hop (direct Route53 write).	Slightly faster deploy; no STS call. But: introduces a same-account vs cross-account branch in the construct, expanding the test surface and creating a behavior change at the future Corporate-account migration moment.
C	Write the NS record from Root’s stack instead (revisits DQ-R1-006 for this case).	Simpler in the same-account case. But: re-opens DQ-R1-006 and breaks the “child owns its delegation” invariant.

Recommendation: Option A — uniform pattern.

Decision: Option A. Phase 3’s CorporateMailDns stack instantiates WriteNSRecordsToUpstreamDns exactly as a partition would, with targetAccountId set to platformRoot’s account ID. The assume-role hop fires; the role grants ChangeResourceRecordSets on ardamails.com (the only zone the role’s allowedParentHostedZoneIds whitelists). The construct’s behavior is identical between the same-account (Phase 3 today) and cross-account (future Corporate-account migration) cases.

Applied to:

3-corporate-updates/analysis.md gap G-5.
Phase 3 specification (Pass 2) — CorporateMailDns stack composition.

DQ-R1-011: `route-53-hosted-zone.ts` → `dns-zone.ts` Migration Shape

Context: The existing constructs/xgress/route-53-hosted-zone.ts is the arda.cards-shaped hosted-zone construct (its overrideDomainName defaults to arda.cards). Phase 3 needs a generalized DnsZone construct that supports any registrable domain (ardamails.com, arda.ardamails.com, future). Two construct names cannot survive long-term; the question is the migration shape.

Option	Description	Trade-offs
A	Rename in place: `dns-zone.ts` replaces `route-53-hosted-zone.ts`; existing callers updated in the same PR.	One PR, contained blast radius. The repo’s `validateProps` discipline catches missed callers at synth time.
B	Coexist for a transition window: `dns-zone.ts` is added; `route-53-hosted-zone.ts` becomes a thin re-export with a deprecation notice; a follow-up PR removes the old name.	Smaller per-PR diff, easier review. But: two PRs land in sequence; the deprecation alias outlives any actual deprecation period.
C	Leave the old construct, add the new one; the old continues to serve `arda.cards`-family callers.	No caller migration. But: construct sprawl — two near-identical constructs co-exist indefinitely.

Recommendation: Option A — rename in place.

Decision: Option A. The construct is renamed in the same Phase 3 PR; validateProps catches missed callers at synth, which is exercised by the repo’s CDK matrix in CI.

Applied to:

3-corporate-updates/analysis.md refactor R-1.
Phase 3 specification (Pass 2) — one task carries the rename + caller migration.

DQ-R1-012: Corporate Drift-Workflow Filename and Scope

Context: Phase 1 added external-resources-drift.yml (one workflow that exercises every external resource the platform consumes). Phase 3 introduces the first Corporate asset (Free Kanban Tool); future Corporate assets (HubSpot, marketing-site) follow. The question is whether to scope the drift workflow per asset or per instance group.

Option	Description	Trade-offs
A	`corporate-free-kanban-tool.yml` (asset-specific, one workflow per asset).	One failure isolates to one workflow run. But: workflow file count grows linearly with Corporate assets; each new asset requires a new workflow file.
B	`corporate-drift.yml` (instance-group-scoped, one workflow that exercises every Corporate asset).	Workflow count proportional to instance groups, not assets. The driver script enumerates `instances/Corporate/` and exercises each. New Corporate assets are picked up automatically.
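C	`<asset>-drift.yml` per asset with a shared `tools/corporate-drift.ts` driver.	Combines the drawbacks of A and B: the workflow file count still grows linearly with Corporate assets, while every workflow depends on the same shared driver.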

Recommendation: Option B — instance-group-scoped.

Decision: Option B. The workflow file is corporate-drift.yml. The driver enumerates instances/Corporate/ and exercises each asset’s Postmark server, DNS records, and 1Password item. Failures open one issue per failed asset (label includes the asset name).
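A rough sketch of the driver loop described in this decision. It is not tools/corporate-drift.ts: the types, label format, and issue title are hypothetical, and the checks are synchronous stand-ins for what would be network calls in practice.

```typescript
// A check returns human-readable drift findings; an empty list means no drift.
type Check = () => string[];

interface CorporateAsset {
  name: string; // assumption: derived from an entry under instances/Corporate/
  checks: { postmarkServer: Check; dnsRecords: Check; onePasswordItem: Check };
}

interface DriftIssue {
  title: string;
  labels: string[];
  findings: string[];
}

function runCorporateDrift(assets: CorporateAsset[]): DriftIssue[] {
  const issues: DriftIssue[] = [];
  for (const asset of assets) {
    const findings = [
      ...asset.checks.postmarkServer().map((f) => `postmark: ${f}`),
      ...asset.checks.dnsRecords().map((f) => `dns: ${f}`),
      ...asset.checks.onePasswordItem().map((f) => `1password: ${f}`),
    ];
    // One issue per failed asset; the label carries the asset name.
    if (findings.length > 0) {
      issues.push({
        title: `Corporate drift: ${asset.name}`,
        labels: ["drift", `asset:${asset.name}`],
        findings,
      });
    }
  }
  return issues;
}
```

Assets with no findings produce no issue, so a clean run returns an empty list and a new asset is covered as soon as it appears in the enumerated input.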

Applied to:

3-corporate-updates/analysis.md gap G-16.
Phase 3 specification (Pass 2) — workflow + driver.

DQ-R1-013: Phase A Failure Ordering for the Postmark Server Token

Context: Phase A of the Corporate CLI creates a Postmark server (which yields the Server API token), writes the token to 1Password, and writes public values to cdk.context.json. Postmark’s API surfaces the token once at server creation; it cannot be re-retrieved. If the 1Password write fails after the server is created, the token is unrecoverable from Postmark’s side.

Option	Description	Trade-offs
A	Write to 1P first, then `cdk.context.json`; on 1P-write failure, roll back by calling Postmark’s `delete-server` API.	Atomic-looking. But: `delete-server` is a destructive operation that runs against the live Postmark account; a botched rollback (e.g., after partial state was already created) destroys observable history. The rollback path is harder to test than the forward path.
B	Persist the token to a process-local secret-handling buffer immediately on receipt; write to 1P with retries (exponential backoff, finite). Fail loud on permanent 1P-write failure with the buffer’s redacted summary; manual operator action to recover.	Token is never persisted outside 1P. The 1P-write failure surfaces clearly with a redacted alert; the operator pastes the buffer summary into 1P via DesktopAuth or chooses to call `delete-server` deliberately as recovery. Forward path is the only tested path.

Recommendation: Option B — buffer + retries.

Decision: Option B. The Corporate CLI implements a process-local secret buffer for the freshly issued server token. The 1P write retries up to N times with exponential backoff (defaults TBD by implementer; the spec lists the parameter). On exhaustion, the CLI exits with a clearly redacted summary that allows the operator to either manually paste the token into 1P (DesktopAuth) or invoke delete-server to reset. cdk.context.json is written after the 1P write succeeds; a 1P-write failure leaves cdk.context.json untouched.
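The ordering guarantee (retry the 1P write, touch cdk.context.json only after it succeeds) can be sketched as follows. All names are hypothetical, the writers and the sleep function are injected so the sketch stays self-contained, and the redacted failure summary is deliberately not modeled because its format is left to the spec.

```typescript
type Sleep = (ms: number) => Promise<void>;

// Delay before the retry that follows failed attempt `attempt` (1-based): base * 2^(attempt-1).
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * 2 ** (attempt - 1);
}

interface PersistResult {
  stored: boolean;         // token written to 1Password
  contextWritten: boolean; // cdk.context.json updated
  attempts: number;
}

async function persistServerToken(
  token: string, // held in process memory only
  writeTo1Password: (token: string) => Promise<void>,
  writeContext: () => Promise<void>,
  opts: { maxAttempts: number; baseMs: number; sleep: Sleep },
): Promise<PersistResult> {
  let attempts = 0;
  let stored = false;
  while (attempts < opts.maxAttempts && !stored) {
    attempts++;
    try {
      await writeTo1Password(token);
      stored = true;
    } catch {
      // Finite retries: wait only if another attempt remains.
      if (attempts < opts.maxAttempts) {
        await opts.sleep(backoffDelayMs(attempts, opts.baseMs));
      }
    }
  }
  // On exhaustion the context file is left untouched; the caller fails loudly.
  if (!stored) return { stored: false, contextWritten: false, attempts };
  await writeContext();
  return { stored: true, contextWritten: true, attempts };
}
```

Keeping the context write outside the retry loop means a failure in that step never triggers a second 1P write.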

Applied to:

3-corporate-updates/analysis.md gap G-9.
Phase 3 specification (Pass 2) — corporate-cli.ts Phase A semantics.

DQ-R1-014: `cdk.context.json` Commit Policy for Phase A’s Outputs

Context: Phase A writes postmark.free-kanban.serverId, .dkimSelector, .dkimKey, .returnPathTarget into cdk.context.json. These are public values (DKIM selector and key are published in DNS; serverId is non-sensitive). Standard CDK practice is to commit cdk.context.json so synth is deterministic on a fresh checkout.

Option	Description	Trade-offs
A	Commit `cdk.context.json` — standard CDK practice; deterministic re-synth on a fresh checkout.	New developers / CI checkouts can `cdk synth` without re-running Phase A. The values are public and DNS-published; no leak surface.
B	Local-only with `.gitignore`; CI re-runs Phase A to repopulate.	Eliminates the commit-of-generated-values pattern. But: re-running Phase A in CI requires Postmark Account API credentials in CI’s environment, which is the opposite of the design intent (only `OP_SERVICE_ACCOUNT_TOKEN` should be in CI; everything else is resolved at runtime via the SDK).
C	Commit, but exclude the `postmark.*` keys via a custom serializer.	Adds tooling complexity for no benefit; the public values are not sensitive.

Recommendation: Option A — commit.

Decision: Option A. cdk.context.json is committed to the repo with the postmark.free-kanban.* keys populated by Phase A. The keys are DNS-public; commit is safe. Re-running Phase A is idempotent and updates the file when Postmark issues a new value (e.g., a DKIM-key rotation).
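The idempotent update can be sketched as a pure merge over the parsed file contents. This is an assumption-laden illustration: the function name is hypothetical, the four values are stored under flat dotted keys (the log does not show the exact on-disk layout), and serverId is typed as a number by assumption.

```typescript
type CdkContext = { [key: string]: unknown };

interface PhaseAOutputs {
  serverId: number; // assumption: numeric identifier
  dkimSelector: string;
  dkimKey: string;
  returnPathTarget: string;
}

const FIELDS: Array<keyof PhaseAOutputs> = [
  "serverId",
  "dkimSelector",
  "dkimKey",
  "returnPathTarget",
];

// Returns a new context object and whether anything changed; unrelated keys are preserved.
function mergePhaseAOutputs(
  context: CdkContext,
  outputs: PhaseAOutputs,
): { context: CdkContext; changed: boolean } {
  const next: CdkContext = { ...context };
  let changed = false;
  for (const field of FIELDS) {
    const key = `postmark.free-kanban.${field}`;
    if (next[key] !== outputs[field]) {
      next[key] = outputs[field];
      changed = true;
    }
  }
  return { context: next, changed };
}
```

A caller would rewrite the file only when changed is true, so a re-run with unchanged Postmark values produces no diff to commit.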

Long-term direction (status: interim): Using cdk.context.json as the channel through which a tool-side pre-deploy step hands DKIM / Return-Path values to CDK synth is the interim mechanism. It is consistent with CDK’s own provider-cache convention (flat key, structured value — see e.g. CDK’s auto-cached hosted-zone:account=...:region=... entries) and with how the Phase A Corporate CLI already populates this file. The target mechanism, when the platform adopts Lambda-backed Custom Resources more widely, is a CustomResource that calls Postmark inside CFN’s Create/Update lifecycle and returns DKIM values as resource attributes — removing the tool-side pre-deploy step and the operator handoff channel entirely. Migration is intentionally deferred: the Custom-Resource path is the long-term answer for all of Phase 3’s Corporate Signature, Phase 4’s per-partition Signatures, and any Phase 5b per-tenant Signatures, so the migration is a coordinated change rather than a per-phase one. Tracked outside this project; do not migrate piecemeal during Phase 4.

Applied to:

3-corporate-updates/analysis.md gap G-9 and refactor R-2.
Phase 3 specification (Pass 2) — the cdk.context.json task explicitly commits.
Phase 4’s per-partition extension reuses this decision; see DQ-R1-025 for the write strategy under the namespaced partitionMail:<infra>:<partition> key.

DQ-R1-015: DMARC Reporting Mailbox (`rua` / `ruf`) for `_dmarc.arda.ardamails.com`

Context: The DMARC record at _dmarc.arda.ardamails.com (per architecture-overview § 5.2) has an initial monitoring policy of p=quarantine; sp=quarantine. The aggregate-report destination (rua=mailto:...) and forensic-report destination (ruf=mailto:..., optional) need a reachable mailbox to be meaningful.

Option	Description	Trade-offs
A	`dmarc-reports@arda.cards` (existing `arda.cards`-family Google Workspace inbox).	Least operational cost; mailbox provisioning is one Google Workspace step. Reports aggregate over time and are reviewed periodically, not in real time.
B	A new `dmarc-reports@ardamails.com` mailbox, hosted independently.	Cleaner naming alignment with the mail-root domain. But: requires standing up MX records on `ardamails.com`, which is currently a sending-only domain; introduces inbound-mail handling that this project deliberately avoids.
C	No `rua` / `ruf` in v1; revisit when DMARC reporting becomes a routine input.	No mailbox to provision. But: DMARC monitoring (p=quarantine) is meaningless without a reporting destination; the policy effectively reduces to “do whatever your local rules say.”

Recommendation: Option A — dmarc-reports@arda.cards.

Decision: Option A. The DMARC record carries rua=mailto:dmarc-reports@arda.cards. The mailbox is provisioned by the operator in Arda’s Google Workspace before Phase B deploy; the operator companion (G-18 in the analysis) captures the step at implementation time. ruf is omitted in v1 (forensic reports are noisier and not actioned today).
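For reference, the record content implied by this decision can be assembled as below. The builder is a hypothetical helper, not project code; the policy values and mailbox in the usage note are the ones stated in this entry, and ruf is emitted only if a caller supplies it.

```typescript
type DmarcPolicy = "none" | "quarantine" | "reject";

interface DmarcInput {
  domain: string;
  policy: DmarcPolicy;          // p=
  subdomainPolicy: DmarcPolicy; // sp=
  rua: string;                  // aggregate-report mailbox
  ruf?: string;                 // forensic-report mailbox, omitted in v1
}

interface TxtRecord {
  name: string;
  value: string;
}

function dmarcRecord(input: DmarcInput): TxtRecord {
  // DMARC tags are semicolon-separated; v=DMARC1 must come first.
  const tags = [
    "v=DMARC1",
    `p=${input.policy}`,
    `sp=${input.subdomainPolicy}`,
    `rua=mailto:${input.rua}`,
  ];
  if (input.ruf !== undefined) tags.push(`ruf=mailto:${input.ruf}`);
  return { name: `_dmarc.${input.domain}`, value: tags.join("; ") };
}
```

Called with domain arda.ardamails.com, both policies as stated in the Context above, and rua set to dmarc-reports@arda.cards, this yields a record named _dmarc.arda.ardamails.com with no ruf tag.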

Applied to:

3-corporate-updates/analysis.md gaps G-6 and G-20.
Phase 3 specification (Pass 2) — the DMARC TXT record content; the operator companion captures the prerequisite mailbox step.
Operator companion at implementation time — explicit pre-deploy step.

DQ-R1-016: Reserved-Name Registry Scope at `arda.ardamails.com`

Context: Architecture-overview § 6.5 reserves arda at the ardamails.com level so future tenant slugs (in any partition) cannot collide. The question is whether to also reserve sub-domain slugs at the arda.ardamails.com level (freekanban, future hubspot, …): import them into a constants list and have partition validators reject them, or leave the arda.ardamails.com-level registry as documentation only.

Option	Description	Trade-offs
A	Register `freekanban` (and future Corporate slugs) in `platform/ari-configuration.ts`; partition validators import the constant.	Cross-instance-group collision detection is mechanical. But: Application-Runtime partitions and the Corporate instance group become coupled through a shared constants list; any change to the Corporate registry forces a re-deploy of every partition (or at least invalidates their lint).
B	Documentation-only registry at `arda.ardamails.com`; partition validators do not import; `corporate-cli.ts` enforces the registry locally on Phase A entry by listing pre-existing Postmark Sender Signatures, servers, and 1P items.	No cross-instance-group import coupling. The CLI’s Phase A is the only writer; it can enforce uniqueness against live Postmark + 1P state. Adds a conflict-check requirement to the CLI.

Recommendation: Option B — documentation-only with CLI-enforcement.

Decision: Option B. Partition validators do not import a Corporate slug list. corporate-cli.ts Phase A entry includes a conflict-check: it lists existing Postmark Sender Signatures (in the configured account), existing Postmark servers (in the configured account), and existing 1Password items (in Arda-CorporateOAM); if a name collision exists for the asset being created, the CLI exits before any state-mutating call. This catches both intra-Corporate collisions (two assets with overlapping names) and cross-instance-group collisions (a partition somehow registered an arda.ardamails.com slug).
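The conflict-check can be sketched as a pure comparison between the names an asset is about to claim and the live state listed on Phase A entry. Everything here is hypothetical: the type names, the case-insensitive comparison, and the way live state is passed in. How the real CLI reconciles this check with idempotent re-runs is not described in this log and is not modeled.

```typescript
// Names already present in the configured Postmark account and in Arda-CorporateOAM.
interface LiveState {
  senderSignatures: string[];
  postmarkServers: string[];
  onePasswordItems: string[];
}

// Names the asset being created would claim.
interface PlannedAsset {
  senderSignature: string;
  postmarkServer: string;
  onePasswordItem: string;
}

function findCollisions(planned: PlannedAsset, live: LiveState): string[] {
  // Assumption: names are compared case-insensitively.
  const taken = (existing: string[], name: string): boolean =>
    existing.some((e) => e.toLowerCase() === name.toLowerCase());

  const collisions: string[] = [];
  if (taken(live.senderSignatures, planned.senderSignature)) {
    collisions.push(`sender signature: ${planned.senderSignature}`);
  }
  if (taken(live.postmarkServers, planned.postmarkServer)) {
    collisions.push(`postmark server: ${planned.postmarkServer}`);
  }
  if (taken(live.onePasswordItems, planned.onePasswordItem)) {
    collisions.push(`1password item: ${planned.onePasswordItem}`);
  }
  return collisions;
}

// Call before any state-mutating request; throws so the CLI can exit early.
function assertNoCollisions(planned: PlannedAsset, live: LiveState): void {
  const collisions = findCollisions(planned, live);
  if (collisions.length > 0) {
    throw new Error(`name collision(s): ${collisions.join("; ")}`);
  }
}
```

Running the check before the first create call is what keeps a detected collision from leaving partial state behind.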

Applied to:

3-corporate-updates/analysis.md gaps G-15 and G-17.
Phase 3 specification (Pass 2) — the conflict-check is a corporate-cli.ts Phase A acceptance criterion.

Round R1-Phase4: Runtime Platform Updates Decisions

This round captures decisions made during Phase 4 — per-partition mail capability for the Application Runtime instance group. Decision IDs DQ-R1-017 through DQ-R1-022 are reserved for this round; all entries are resolved.

DQ-R1-017: Postmark Sender Signature Granularity per Partition

Context: Phase 4 brings per-partition mail capability online for the Application Runtime instance group across four active partitions (prod, demo, dev, stage; kyle excluded per DQ-R1-021). Each partition has its own mail sub-zone {partition}.ardamails.com. The question is whether each partition gets its own Postmark Sender Signature (with its own DKIM key, independent reputation), whether multiple partitions share a parent Signature in the spirit of DQ-R1-009 (which used parent verification for the Corporate instance group), and how per-tenant isolation fits in.

Option	Description	Trade-offs
A	One Signature at `ardamails.com` (root); all partitions and Corporate inherit via parent verification.	One Signature covers the entire tree. But: reputation pools across every environment and the Corporate consumer; abuse on `dev` taints `prod`. Defeats the per-partition isolation goal.
B	One Signature per partition sub-zone (`prod.ardamails.com`, etc.); each carries its own DKIM key; leaves under each partition (per-tenant sub-domains) inherit.	Per-partition reputation independence. Matches the Postmark account split (Prod vs NonProd). Future per-tenant Signatures (for stricter isolation) can be added in Phase 5b without changing this layer.
C	One Signature per tenant from day one.	Strictest isolation. But: thousands of Signatures to manage; per-tenant verification cost; premature when tenant volume is zero.

Recommendation: Option B — per-partition Signature, parent-verified at the partition sub-zone, leaves inherit.

Decision: Option B. Phase 4 registers one Postmark Sender Signature per active partition at the partition’s sub-zone ({partition}.ardamails.com). The Signature is anchored at the partition apex; per-tenant sub-domains within the partition inherit DKIM via the partition’s signing key. Production partitions (prod, demo) land on the PostmarkProd account; non-production partitions (dev, stage) on PostmarkNonProd. The first non-prod Signature (dev.ardamails.com) also satisfies Postmark Compliance’s pending approval for arda-nonprod. Per-tenant Signature granularity is deferred to Phase 5b when tenant volume exists.
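
As an illustration of the placement rule in this decision, the sketch below maps each active partition to its Postmark account and Signature domain, and checks whether a sending domain sits under a partition apex. The type and function names are hypothetical, and the tenant hostname in the usage note is invented.

```typescript
type Partition = "prod" | "demo" | "dev" | "stage";
type PostmarkAccount = "PostmarkProd" | "PostmarkNonProd";

// Account placement as stated in the decision text.
const ACCOUNT_BY_PARTITION: Record<Partition, PostmarkAccount> = {
  prod: "PostmarkProd",
  demo: "PostmarkProd",
  dev: "PostmarkNonProd",
  stage: "PostmarkNonProd",
};

// The Signature is anchored at the partition apex: {partition}.{mail-root-domain}.
function signatureDomain(partition: Partition, mailRootDomain = "ardamails.com"): string {
  return `${partition}.${mailRootDomain}`;
}

// True when the sending domain is the partition apex or any sub-domain beneath it.
function isUnderPartition(sendingDomain: string, partition: Partition, mailRootDomain = "ardamails.com"): boolean {
  const apex = signatureDomain(partition, mailRootDomain);
  const domain = sendingDomain.toLowerCase();
  return domain === apex || domain.endsWith(`.${apex}`);
}
```

For example, isUnderPartition("acme.dev.ardamails.com", "dev") holds, while the same domain checked against "prod" does not.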

Applied to:

phases.md § Phase 4 Scope and Deliverables (Postmark Sender Signature rows).
4-runtime-platform-updates/goal.md Success Criteria #5 (first non-prod Signature verified).
Phase 5b Email module design (whether to add per-tenant Signatures becomes a tractable choice once tenants exist).

DQ-R1-018: `corporate-drift` Rename and Scope

Context: Phase 3 introduced tools/corporate-drift.ts and .github/workflows/corporate-drift.yml — a scheduled drift check that asserts Postmark account state and DNS state for the Corporate instance group, with cross-seam Postmark↔placement-function assertions added by the DQ-R1-009 fix. Phase 4 adds per-partition Postmark Sender Signatures that need equivalent drift coverage. The question is whether corporate-drift is renamed and generalized (e.g., to mail-drift) to cover Corporate + every partition Signature, or kept as Corporate-only with a parallel new workflow added for the partition surfaces.

Option	Description	Trade-offs
A	Rename `corporate-drift` to `mail-drift`; one workflow asserts Corporate + every partition Signature.	One workflow to maintain. Single failure-issue stream. But: future runtime-platform drift checks unrelated to email (e.g., asserting CloudFront cache configuration, asserting Lambda function counts) would need their own naming; `mail-drift` is mail-centric.
B	Keep `corporate-drift` unchanged. Add a new `runtime-platform-drift` workflow in parallel, covering partition surfaces. Share logic via reusable shell scripts or GitHub Actions composite actions.	Names reflect scope (Corporate is one instance group; runtime-platform is another). Future non-mail runtime-platform drift checks plug into `runtime-platform-drift` without mail-centric naming. Two workflows to maintain, but shared logic minimizes drift between them.

Recommendation: Option B — parallel workflows with shared logic.

Decision: Option B. corporate-drift stays as-is. A new .github/workflows/runtime-platform-drift.yml and driver under tools/ (Phase 4 deliverable) asserts the cross-seam Postmark↔DNS↔placement-function invariants for every active partition Signature. The two workflows share reusable shell scripts or GitHub Actions composite actions so the drift-check logic doesn’t drift between them. Future runtime-platform drift checks unrelated to email plug into the same workflow without renaming.

Applied to:

phases.md § Phase 4 Scope and Deliverables.
4-runtime-platform-updates/goal.md Success Criteria #6.
3-corporate-updates/implementation/suggestions.md S-5 (originally suggested the mail-drift rename; this decision supersedes that recommendation).

DQ-R1-019: Per-Partition Email Server-Token Encryption Key

Context: DQ-012 decided that per-tenant Postmark server tokens are encrypted application-side with a partition-wide symmetric key before INSERT, with the key in AWS Secrets Manager and delivered via ESO. DQ-202 fixed the on-disk format as an AES-256-GCM versioned envelope; DQ-203 specified that the SM value is a 64-byte HKDF input. Phase 4 must close three open sub-questions: (1) how the SM secret is named and declared in CDK, (2) what the envelope’s version prefix tracks (algorithm version, secret material version, or both), (3) how rotation works.

The full design is documented in 4-runtime-platform-updates/design/email-server-key-encryption.md. This entry summarizes the three sub-decisions.

Option	Description	Trade-offs
A	Single-axis envelope `vN`, with `vN` coupling algorithm and secret material. Sibling SM secrets per rotation (`-v1`, `-v2`, …).	Operationally simple at the data-model layer. But: every rotation churns the code-side dispatch table, conflating algorithm cadence (rare) with material cadence (frequent).
B	Two-axis envelope `a{N}.k{SM-VERSION-ID}`. One SM secret per partition; rotation via `update-secret` (SM-native versioning). Hot-swap via two `ExternalSecret` mounts (AWSCURRENT + AWSPREVIOUS). Lazy + coroutine migration. SDK fallback for rare older versions.	Algorithm and material lifecycles cleanly separated. SM’s native versioning enables future AWS Rotation Lambdas natively. Operationally clean. But: the dispatch model is slightly more elaborate than Option A.

Recommendation: Option B.

Decision: Option B. Phase 4 deploys one aws_secretsmanager.Secret per partition named {fqn}-I-EmailEncryptionKey (the -I- marker matches the convention as practiced for intra-partition resources), passwordLength: 64, RemovalPolicy.RETAIN. The Phase 5b on-disk envelope is a{N}.k{SM-VERSION-ID}:<base64-payload>; a{N} is the algorithm version (code-indexed; bumps require a release; never retired); k{...} is the AWS SM versionId of the SM version used at write time (runtime-indexed via two ExternalSecret mounts for AWSCURRENT and AWSPREVIOUS, plus a SM SDK fallback for rare older versions). Rotation is aws secretsmanager update-secret; migration is lazy on the first non-up-to-date read + a per-pod coroutine mop-up for the rest of the partition. Automated rotation via AWS SM Rotation Lambdas is enabled by this design and deferred to a future deliverable.
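
A minimal sketch of the two-axis envelope follows, assuming details this log does not pin: the payload layout (12-byte IV, 16-byte GCM tag, ciphertext), HKDF-SHA256 with an empty salt and a made-up info label, and a single algorithm version a1. The materials map stands in for the AWSCURRENT and AWSPREVIOUS mounts. The canonical design remains 4-runtime-platform-updates/design/email-server-key-encryption.md.

```typescript
import { createCipheriv, createDecipheriv, hkdfSync, randomBytes } from "node:crypto";

type Envelope = { algorithmVersion: number; materialVersionId: string; payload: Buffer };

function formatEnvelope(e: Envelope): string {
  return `a${e.algorithmVersion}.k${e.materialVersionId}:${e.payload.toString("base64")}`;
}

function parseEnvelope(text: string): Envelope {
  const m = /^a(\d+)\.k([^:]+):([A-Za-z0-9+/]+={0,2})$/.exec(text);
  if (!m) throw new Error("Malformed envelope");
  return { algorithmVersion: Number(m[1]), materialVersionId: m[2], payload: Buffer.from(m[3], "base64") };
}

// Assumption: HKDF-SHA256, empty salt, hypothetical info label, 32-byte AES key.
function deriveKey(material: Buffer): Buffer {
  return Buffer.from(hkdfSync("sha256", material, Buffer.alloc(0), "email-server-token", 32));
}

function encryptToken(plaintext: string, versionId: string, material: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", deriveKey(material), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Assumed payload layout: iv || tag || ciphertext.
  const payload = Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
  return formatEnvelope({ algorithmVersion: 1, materialVersionId: versionId, payload });
}

function decryptToken(text: string, materials: Map<string, Buffer>): string {
  const env = parseEnvelope(text);
  if (env.algorithmVersion !== 1) throw new Error(`Unknown algorithm a${env.algorithmVersion}`);
  const material = materials.get(env.materialVersionId);
  // A real implementation would try the SM SDK fallback here before failing.
  if (!material) throw new Error(`No material for version ${env.materialVersionId}`);
  const decipher = createDecipheriv("aes-256-gcm", deriveKey(material), env.payload.subarray(0, 12));
  decipher.setAuthTag(env.payload.subarray(12, 28));
  return Buffer.concat([decipher.update(env.payload.subarray(28)), decipher.final()]).toString("utf8");
}

// Lazy migration trigger: the envelope was written under a non-current material version.
function needsMigration(text: string, currentVersionId: string): boolean {
  return parseEnvelope(text).materialVersionId !== currentVersionId;
}
```

Because the k{...} axis is read from the envelope at decrypt time, rotating the material changes only the contents of the materials map, not the dispatch code.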

Applied to:

cross-cutting-design.md Secret-handling table row for the Per-partition encryption key (line ~163) and the “Encryption key” rotation subsection (line ~293).
phases.md § Phase 4 Deliverables — explicit row for {fqn}-I-EmailEncryptionKey.
4-runtime-platform-updates/goal.md Open Design Questions table row 3.
4-runtime-platform-updates/design/email-server-key-encryption.md — the canonical design document for this decision.
Phase 5b email module (consumes the per-partition SM secret via two ExternalSecrets; implements TokenCipher with the dispatch + migration model).
An operator runbook (Phase 4 deliverable) documenting the rotation procedure end-to-end.

DQ-R1-020: DNS-Provisioning + SM-Fallback IAM Roles

Context: Phase 4 introduces two new AWS capabilities that the operations component’s pod must exercise at runtime in each partition:

Route53 ChangeResourceRecordSets on the partition’s mail sub-zone ({partition}.ardamails.com) — consumed by the Phase 5b Email module for per-tenant DKIM / Return-Path / DMARC record provisioning.
secretsmanager:GetSecretValue on {fqn}-I-EmailEncryptionKey — consumed by the Phase 5b TokenCipher SDK-fallback path (DQ-R1-019) for the rare case of decrypting envelopes whose k{SM-VERSION-ID} is older than AWSPREVIOUS.

Both permissions target partition-scoped resources and need to be available to the same workload (the operations pod). The decision is the IAM topology: which mechanism authenticates the pod to AWS, and where the permissions live.

Codebase precedent: A search of infrastructure/src/main/cdk/ shows IRSA is the sole adopted pod-identity mechanism (the partition EksStack already configures an OpenIdConnectProvider and exports {fqn}-EksPodRoleArn). Crucially, the exported pod role is never extended with workload-specific permissions anywhere in the codebase. The established pattern — exemplified by infrastructure/src/main/cdk/constructs/storage/image-asset-bucket.ts and public-upload-bucket.ts — is to create a fresh purpose-specific role with a trust policy that lets the pod role assume it via STS:

const preSigningRole = new iam.Role(this, "ImageUploadPreSigningRole", {
  roleName: `${fqn}-ImageUploadPreSigningRole`,
  assumedBy: new iam.AccountPrincipal(account).withConditions({
    ArnLike: { "aws:PrincipalArn": clientRoleArnPattern },  // e.g. `arn:aws:iam::<acct>:role/<fqn>-*`
  }),
});

preSigningRole.addToPolicy(new iam.PolicyStatement({ /* purpose-specific perms */ }));

The pod federates into the partition pod role via IRSA at pod startup; the application code then performs sts:AssumeRole into the purpose-specific role at the call site (DQ-204 STS chain). Permissions live on the purpose role, not the pod role.

Option	Description	Trade-offs
α	Extend the existing `{fqn}-EksPodRole` with the new Route53 and SM `GetSecretValue` statements.	Single role to audit per partition; one stack change. But: not how anything else in the codebase is structured — the pod role is treated as an STS-chain origin, not a permission accumulator. Adopting α here would diverge from established practice.
β	Create two fresh per-purpose roles (`{fqn}-EmailDnsProvisioningRole`, `{fqn}-EmailEncryptionKeyFallbackRole`) with trust policies that allow the partition pod role to assume them via STS. Mirrors `ImageUploadPreSigningRole`.	Aligns with codebase precedent. Cleanest least-privilege: each call path can only chain into the role it needs. Two new CDK roles and two new exports — a normal Phase 4 cost (Phase 4’s purpose is precisely to provision the partition infrastructure 5b needs).
γ	Adopt EKS Pod Identity (`pods.eks.amazonaws.com`) for these new roles.	Simpler trust-policy shape. But: not used anywhere else in Arda; introduces a second pod-identity mechanism alongside IRSA; no concrete benefit for this use case. Reject.
δ	Node-level instance profile / long-lived static keys.	Violates DQ-204; reject.

Recommendation: Option β.

Decision: Option β. Phase 4 declares two fresh per-purpose IAM roles in each partition:

{fqn}-EmailDnsProvisioningRole — permissions: route53:ChangeResourceRecordSets, route53:ListResourceRecordSets on the partition’s mail hosted-zone ARN ({partition}.ardamails.com). Exported as {fqn}-EmailDnsProvisioningRoleArn. (route53:GetChange is intentionally omitted: it requires arn:aws:route53:::change/* resource scope rather than the hosted-zone ARN, and the Email module does not wait on Route53 propagation — Postmark verification is API-driven via verifyDkim / verifyReturnPath, which probe DNS from Postmark’s side.)
{fqn}-EmailEncryptionKeyFallbackRole — permission: secretsmanager:GetSecretValue on ${encryptionKeySecret.secretArn}* (full SM-secret ARN; the trailing wildcard tolerates the SM-appended random 6-character suffix — SM versions are selected at API call time via VersionId/VersionStage, not encoded in the resource ARN). Exported as {fqn}-EmailEncryptionKeyFallbackRoleArn.

Both roles share the same trust-policy shape:

assumedBy: new iam.AccountPrincipal(account).withConditions({
  ArnLike: { "aws:PrincipalArn": `arn:aws:iam::${account}:role/${fqn}-*` },
}),

This mirrors ImageUploadPreSigningRole: any role in the partition that matches the {fqn}-* name prefix may assume the role. In practice, the partition’s pod role ({fqn}-EksPodRole) is the only such role that an operations-component pod can federate into; the ArnLike condition limits the blast radius to the partition without coupling the role declaration to the pod role’s exact name.
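
To make the blast-radius argument concrete, the sketch below approximates how an ArnLike condition evaluates the {fqn}-* pattern. It is an approximation of IAM's documented behaviour (each of the six colon-delimited ARN components is matched separately, with * and ? wildcards), not IAM's actual evaluator. The account ID and the fqn values in the usage note are invented.

```typescript
// Convert a glob with * (any run of characters) and ? (any single character) into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*").replace(/\?/g, ".")}$`);
}

// Split into the six ARN components; the resource part may itself contain colons.
function splitArn(arn: string): string[] | undefined {
  const parts = arn.split(":");
  if (parts.length < 6) return undefined;
  return [...parts.slice(0, 5), parts.slice(5).join(":")];
}

// Approximate ArnLike: case-sensitive, component-by-component glob match.
function arnLike(pattern: string, arn: string): boolean {
  const p = splitArn(pattern);
  const a = splitArn(arn);
  if (!p || !a) return false;
  return p.every((component, i) => globToRegExp(component).test(a[i]));
}
```

With a hypothetical fqn of Alpha002-dev, the pattern arn:aws:iam::123456789012:role/Alpha002-dev-* admits Alpha002-dev-EksPodRole but rejects roles from another partition or another account.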

The Phase 5b Email module performs sts:AssumeRole into these roles at the call site — same DQ-204 STS-chain pattern that operations already uses for the image-upload presign flow.

Implementation route: construct reuse with byte-identical Root output. The decision above pins the behavior of the DNS-provisioning role (STS-chained, account-principal + ArnLike trust, partition-scoped permissions). The implementation route was refined during analysis: rather than hand-rolling a fresh role, reuse the existing AllowCreatingNSRecordsRole construct (Phase 2; constructs/oam/allow-creating-ns-records-role.ts). Despite the name, the construct’s permissions are already generic Route53 record-set CRUD (ChangeResourceRecordSets, ListResourceRecordSets, ListHostedZonesByName) with allowedParentHostedZoneIds scope-tightening. What needs to change: the trust principal, today hard-coded to iam.ServicePrincipal("lambda.amazonaws.com") with an OrgID condition, must be parameterizable so the Phase-4 instantiation can supply iam.AccountPrincipal(account).withConditions({ ArnLike: ... }).

This generalization carries two hard constraints that must hold simultaneously:

Byte-identical Root-account output. The existing Root-account instantiation in root-dns-stack.ts must produce a CloudFormation template that is byte-identical before and after the construct change. Guarded by a CDK Template.fromStack() snapshot equality unit test (in root-dns-stack.test.ts or allow-creating-ns-records-role.test.ts) that pins the Root resource shape; fails closed if the generalization regresses Root output.
Verified zero drift in deployed Root. A post-deploy verification step (operator-driven; tracked as V-PART-NNN in verification.md) diffs the Root account’s currently-deployed CFN template against the synthesized output post-generalization. Expected diff is empty. Runs before any partition-mail deploy so the Root assertion holds with the construct-as-of-Phase-4 code.

The optional construct rename (e.g., AllowCreatingNSRecordsRole → AllowCreatingDnsRecordsRole) is name-only and reflects the construct’s already-generic Route53 record-set CRUD permissions (the “NSRecords” suffix is a Phase-2 historical artefact).

Update (2026-05-12, applied at design time): the rename can land in the same PR as the construct generalization, provided the CDK construct ID at the call site is preserved. CloudFormation logical IDs derive from the construct’s path (parent ID + construct ID), not from the class name. Concretely: the Root call site new AllowCreatingNSRecordsRole(this, "AllowCreatingNSRecordsRole", …) becomes new AllowCreatingDnsRecordsRole(this, "AllowCreatingNSRecordsRole", …) — the second argument (the construct ID string) stays unchanged, so the synthesized template’s logical IDs are unchanged, and the byte-identity guarantee holds.
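
The reasoning above can be illustrated with a toy model. The sketch below is a simplified stand-in for logical-ID allocation, not CDK's real algorithm: it only preserves the property that matters here, namely that the ID is a function of the construct path and never of the class name. ToyConstruct, toyLogicalId, and the "Resource" child ID are illustrative assumptions.

```typescript
import { createHash } from "node:crypto";

// Toy allocation: readable path components plus a short hash of the full path.
function toyLogicalId(path: string[]): string {
  const human = path.map((p) => p.replace(/[^A-Za-z0-9]/g, "")).join("");
  const hash = createHash("sha256").update(path.join("/")).digest("hex").slice(0, 8).toUpperCase();
  return human + hash;
}

abstract class ToyConstruct {
  readonly path: string[];
  constructor(scopePath: string[], id: string) {
    this.path = [...scopePath, id];
  }
  // The class name never enters the computation; only the path does.
  get logicalId(): string {
    return toyLogicalId([...this.path, "Resource"]);
  }
}

// Old and new class names; both are empty subclasses of the toy base.
class AllowCreatingNSRecordsRole extends ToyConstruct {}
class AllowCreatingDnsRecordsRole extends ToyConstruct {}
```

Instantiating either class with the same scope path and the same construct ID string yields the same logicalId in this model; changing the ID string changes it, which is the regression the byte-identity test is meant to catch.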

The earlier note (“If the rename is desired, it lands as a separate change after Phase 4’s role-reuse work is verified stable”) is superseded by this update. The Phase 4 design (analysis.md G-IAM-1 + specification.md T-I1 step 2) bundles the rename into PR #1 alongside the byte-identity guard (T-I2), with the call-site mitigation above documented inline. This has no cascading effect on Phase 4’s spec, requirements, or verification regime: the byte-identity test (T-I2 / V-IAC-002) catches any logical-ID regression regardless of whether it originates in the rename or elsewhere.

Open follow-ups (Phase 4 specification, not blocking the decision):

Confirm the arn:aws:iam::${account}:role/${fqn}-* ArnLike pattern matches the partition’s actual pod-role naming convention in every partition (Alpha001 + Alpha002; spot-check both exports during specification).
Decide whether the two roles live in the same partition-email stack or split (recommend: same stack — both are Phase 4 partition-mail deliverables, same lifecycle, same RemovalPolicy).
Confirm whether route53:ListResourceRecordSets is needed in addition to Change* for the Phase 5b idempotency / pre-check path (recommend: yes; the Email module checks existing records before issuing changes).

Applied to:

phases.md § Phase 4 Deliverables — IAM-role row split into two; trust-policy and permission scoping updated to the STS-chain pattern.
4-runtime-platform-updates/goal.md Success Criteria #4, Deliverables list, Open Design Questions row 4 (now Decided).
5b-email-module/pre-existing-decisions.md — Phase 5b consumes the role ARNs and implements the STS-chain calls at the L1 / L2 boundary.

DQ-R1-021: Order of Partition Rollout

Context: Phase 4 fans out across four active partitions. The questions are: in what order the partitions roll out; whether to include the kyle partition (which is suspended at Phase 4 start); and how the order aligns with Phase 5b’s deployment cadence.

Option	Description	Trade-offs
A	`dev`, `kyle` first; then `stage`, `demo`; then `prod`. Per the original `phases.md` Phase 5b recommendation.	Standard non-prod-first cascade. But: `kyle` is suspended at Phase 4 start; including it would mean provisioning a partition that has no operational use case.
B	`dev` → `stage` → `demo` → `prod`. Exclude `kyle` entirely.	Matches operational reality (kyle has no live use). `dev` first still satisfies the `arda-nonprod` Postmark account-approval prerequisite. Production lands last, after the non-prod wave validates the pattern.

Recommendation: Option B.

Decision: Option B. Phase 4 rolls out to dev, stage, demo, prod. The kyle partition is excluded from Phase 4 (suspended; the kyle.ardamails.com sub-zone is not provisioned). kyle stays reserved at the ardamails.com level so it cannot be appropriated as a tenant slug while the partition is suspended; the partition can be re-introduced later by replaying the per-partition deploy procedure if it resumes operation. Phase 5b inherits the same order.

Amendment (2026-05-13) — partial-ordering refinement: the original total order dev → stage → demo → prod is relaxed to the partial order dev → {stage || demo} → prod. dev must go first (it satisfies the arda-nonprod Postmark account-approval prerequisite and validates the design end-to-end in the lowest-blast-radius partition); prod must go last (production deploys after the non-prod wave validates the pattern); but stage and demo carry no technical dependency on each other and may be rolled out in parallel once dev is verified. Rationale: per-partition deploys are independent at the AWS level (separate CFN stacks, no shared resources — see goal.md Constraint #1), and the Postmark Compliance gate (arda-nonprod account approval) only blocks the second non-prod Sender Signature on the arda-nonprod account; it does not block demo or prod (which are on PostmarkProd). Phase 5b inherits the same partial order.
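
The partial order can be expressed as a prerequisite table. The sketch below is one reading of the amendment, assuming prod waits for both stage and demo to be verified. The names are hypothetical, and the decision stated elsewhere in this log is that gating is operator-enforced via amm.sh rather than coded in CI.

```typescript
type Partition = "dev" | "stage" | "demo" | "prod";

// dev → {stage || demo} → prod, read as: prod requires both stage and demo.
const PREREQUISITES: Record<Partition, Partition[]> = {
  dev: [],
  stage: ["dev"],
  demo: ["dev"],
  prod: ["stage", "demo"],
};

// A partition may deploy once every prerequisite partition is verified.
function canDeploy(target: Partition, verified: ReadonlySet<Partition>): boolean {
  return PREREQUISITES[target].every((p) => verified.has(p));
}

// Partitions that are not yet verified but are currently allowed to deploy.
function nextDeployable(verified: ReadonlySet<Partition>): Partition[] {
  return (Object.keys(PREREQUISITES) as Partition[]).filter(
    (p) => !verified.has(p) && canDeploy(p, verified),
  );
}
```

Once dev is verified, nextDeployable returns stage and demo together, which is the parallelism the amendment allows.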

Applied to:

phases.md § Phase 4 Recovery / partial-failure handling — recommended deploy order.
phases.md § Phase 5b Recovery / partial-failure handling — recommended order.
4-runtime-platform-updates/goal.md Scope, Success Criteria #1, and Open Design Questions row 5.

DQ-R1-022: Operator CLI Shape for Phase 4

Context: Phase 3 introduced tools/corporate-cli.ts (a TypeScript CLI for the Corporate instance group’s two-phase Postmark + DNS provisioning). Phase 4 needs an equivalent operator surface for per-partition mail provisioning. The question is whether to generalize corporate-cli over a partition argument, introduce a parallel partition-mail-cli, or integrate the Phase 4 work into the existing amm.sh operator script that already deploys partition-level resources.

Option	Description	Trade-offs
A	Generalize `corporate-cli` to take an asset+partition pair. Both Corporate and partition mail work flow through the same CLI.	One CLI surface. But: stretches `corporate-cli` beyond its Corporate-instance scope; the partition path mixes with the Corporate path in implementation.
B	Introduce a parallel `tools/partition-mail-cli.ts`. Each instance group has its own CLI.	Scope-aligned naming. But: duplicates `corporate-cli`’s structure (idempotency, retries, redaction, conflict checks); adds a maintenance surface.
C	Integrate the Phase 4 partition-mail work into `amm.sh` (the existing Application Runtime deploy script). Phase 4 work follows `amm.sh`’s rules (idempotency, security, pre-flight checks, partition selection). Extract reusable bash + TypeScript utilities from `corporate-cli` so both `amm.sh` and `corporate-cli` share logic.	Aligns with existing operator surface for partition deploys. Familiar workflow. Reusable utilities prevent duplication across the two scripts. Requires refactoring Phase 3 deliverables to extract the shared utilities.

Recommendation: Option C — amm.sh integration with shared utilities.

Decision: Option C. Phase 4 partition-mail provisioning is part of the product runtime platform deployment and is invoked through amm.sh (and its rules: idempotency, security, pre-flight checks, partition selection). Not a standalone partition-mail-cli. Reusable sub-scripts / utilities are extracted from corporate-cli so both amm.sh’s partition path and corporate-cli can share logic; this includes refactoring Phase 3 deliverables as needed to keep each script’s complexity bounded.

Implementation route — TypeScript helpers under tools/, invoked from amm.sh via ts-node. Phase 4 stays with Phase 3’s imperative-then-declarative (Phase A / Phase B) pattern:

The extracted utilities (Postmark Account API client, idempotent list-then-create, retry / backoff, output redaction, conflict-check) live as TypeScript modules under tools/lib/ (or equivalent shared location).
A new entry script — tools/register-partition-mail-signature.ts — composes these utilities into Phase 4’s partition-mail Phase-A flow: read the Postmark account-level token from the partition’s Arda-{Env}OAM 1P vault (using the Phase 1 1P SDK helper), call the Postmark Account API to register the {partition}.ardamails.com Sender Signature (idempotent: list-then-create), capture the DKIM selector / public key / Return-Path target, and write those values into cdk.context.json (committed). The same utilities back corporate-cli’s Phase-A flow.
amm.sh’s direct calls collapse to three: (i) op read the Postmark account-level token (bash; remains in amm.sh for GHA ::add-mask:: hygiene); (ii) npx ts-node tools/register-partition-mail-signature.ts <infrastructure> <partition> (Phase A — Postmark API + context write); (iii) cdk deploy ${infrastructure}-${partition}-Email --parameters PostmarkAccountToken=… (Phase B — declarative CDK deploy).
No bash reimplementation of Postmark / 1P logic. amm.sh stays a thin orchestrator; the TS scripts hold the imperative logic. corporate-cli retains its TS entry-point and its Corporate-specific responsibilities (Free Kanban Tool server provisioning, 1P writes for the server token); only the shared helpers move into tools/lib/.
CR Lambda migration explicitly deferred (the “future architecture” called out in Phase 3 — the PostmarkSendingDomain thin-wrapper’s public surface is designed to be invariant under that migration). Phase 4 does not pull it forward; doing so would materially expand scope without a forcing function. Future migration is a construct-internals change isolated to platform/constructs/postmark/.
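
The shared utilities named in the list above (idempotent list-then-create, retry / backoff, output redaction) might compose as sketched below. The SignatureClient interface is a hypothetical stand-in, not the Postmark Account API's real client surface, and the retry parameters are illustrative.

```typescript
// Hypothetical minimal client surface, injected so the logic is testable offline.
type Signature = { id: number; domain: string };
interface SignatureClient {
  list(): Promise<Signature[]>;
  create(domain: string): Promise<Signature>;
}

// Retry with exponential backoff; rethrows the last error once attempts are exhausted.
async function withRetry<T>(op: () => Promise<T>, attempts = 3, baseDelayMs = 200): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// The whole list-then-create step is retried, so a create that succeeded
// but whose response was lost is found by the next attempt's list call.
async function ensureSignature(
  client: SignatureClient,
  domain: string,
  attempts = 3,
  baseDelayMs = 200,
): Promise<{ signature: Signature; created: boolean }> {
  return withRetry(async () => {
    const existing = (await client.list()).find((s) => s.domain === domain);
    if (existing) return { signature: existing, created: false };
    return { signature: await client.create(domain), created: true };
  }, attempts, baseDelayMs);
}

// Replace every occurrence of each non-empty secret before text reaches logs.
function redact(text: string, secrets: string[]): string {
  return secrets.filter((s) => s.length > 0).reduce((out, s) => out.split(s).join("***"), text);
}
```

Retrying the combined step rather than create alone is a deliberate choice in this sketch: it keeps a second attempt from registering a duplicate.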

Applied to:

phases.md § Phase 4 Scope and Deliverables — “Operator surfaces integrated into amm.sh” bullet; “amm.sh-integrated partition-mail steps” deliverable row.
4-runtime-platform-updates/goal.md Open Design Questions row 6.
Phase 4 implementation work — includes refactoring Phase 3’s corporate-cli to extract reusable utilities consumed by both amm.sh and corporate-cli.

Pre-design follow-ups closed (Round R1-Phase4)

After DQ-R1-017..022 were resolved, planning surfaced eight smaller follow-ups (B1..B5, C1..C3) that needed pinning before Phase 4 design could start. Each is “pick the default and move on” rather than load-bearing; collectively they are recorded here for traceability without individual DQ-R1-NNN entries. Full text in 4-runtime-platform-updates/goal.md § Pre-design follow-ups.

ID	Item	Resolution
B1	Phase 5a `TokenCipher` location	Ships in `common-module` as a general-purpose encrypted-field utility (not Email-specific)
B2	Postmark account-token deploy-time delivery	δ.1 — `amm.sh` reads via `op`, passes to `cdk deploy` as `NoEcho` parameter; `partition-email` stack uses `SecretValue.cfnParameter()`. Mirrors `partitionSecrets.cfn.yaml`
B3	`amm.sh` extraction scope from `corporate-cli`	Minimal: extract only what `amm.sh`’s partition-mail steps need; backfill on demand
B4	`kyle` reservation registry	Extend the Phase 3 mechanism used to reserve `arda` at the `ardamails.com` level
B5	Cross-partition deploy gating in CI	Operator-enforced via `amm.sh`; no `tools/cdk-runner.js` matrix change
C1	CDK stack name	`${infrastructure}-${partition}-Email` (parallels existing `-Secrets`, `-Amplify` stacks); immutable — locked at first deploy
C2	Per-partition DMARC reporting mailbox	Reuse `dmarc-reports@arda.cards` for all partitions (DMARC report content already identifies the source domain)
C3	`runtime-platform-drift` schedule + labels	Daily cron; failure-issue labels `drift` + `runtime-platform`; mirrors `corporate-drift` shape

These resolutions also drive a new Phase 4 deliverable: current-system/oam/security/secret-delivery-pattern.md, documenting the canonical op → amm.sh → CFN NoEcho parameter → SM secret → consumer flow with partitionSecrets.cfn.yaml and the Phase 4 Postmark token as worked examples.

DQ-R1-023: Per-Tenant Postmark Sender Signature Introduction (Phase 5b)

Status: Open — to be confirmed at Phase 5b planning. No Phase 4 dependency; Phase 4 provisions the enabling infrastructure (EmailDnsProvisioningRole, partition mail sub-zone) regardless of which way this decision goes.

Context: DQ-R1-017 (Round R1-Phase4) decided that Phase 4 ships one Postmark Sender Signature per partition ({partition}.ardamails.com) and defers per-tenant Signatures to Phase 5b. The Phase 4 design works for sending — tenants sending from {config}.{tenant}.{partition}.ardamails.com use the partition’s DKIM key via Postmark sub-domain inheritance and DMARC relaxed alignment. The trade-off: all tenants in a partition share the partition’s DKIM-domain reputation at the receiver side (Gmail, Microsoft, Yahoo, etc., track reputation by the DKIM d= domain, not by the Postmark Server identifier).
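
The alignment claim above can be checked with a small sketch of DMARC's DKIM alignment rule: relaxed mode compares organizational domains, strict mode requires an exact match. The organizational-domain function here is deliberately naive (last two labels); real implementations consult the Public Suffix List. The tenant hostname in the usage note is invented.

```typescript
type AlignmentMode = "relaxed" | "strict";

// Naive organizational domain: the last two labels. Correct for ardamails.com,
// wrong for multi-label public suffixes such as co.uk.
function organizationalDomain(domain: string): string {
  return domain.toLowerCase().split(".").slice(-2).join(".");
}

// DKIM alignment between the From: domain and the signature's d= domain.
function dkimAligned(fromDomain: string, dkimDomain: string, mode: AlignmentMode): boolean {
  const from = fromDomain.toLowerCase();
  const signed = dkimDomain.toLowerCase();
  return mode === "strict"
    ? from === signed
    : organizationalDomain(from) === organizationalDomain(signed);
}
```

A message from orders.acme.dev.ardamails.com signed with d=dev.ardamails.com aligns under relaxed mode and fails under strict mode, which is why the partition-level Signature suffices for sending.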

The question is whether Phase 5b should introduce per-tenant Sender Signatures to give per-tenant reputation isolation, and if so, on what schedule.

Option	Description	Trade-offs
α	Status quo — all tenants in a partition share the partition Signature; per-tenant Servers exist for token / activity-log isolation but DKIM-domain reputation is shared.	No additional Phase 5b work for sending. But: one bad tenant degrades reputation for every tenant in that partition. No remediation path for tenants with persistent bounce / spam issues.
β	Per-tenant Signature from v1 — every tenant onboarded in Phase 5b gets its own Sender Signature registered via the Postmark Account API; per-tenant DKIM TXT + Return-Path CNAME records written at tenant onboarding via `EmailDnsProvisioningRole`.	Best reputation isolation. But: additional tenant-onboarding cost (Postmark API call + DNS write per tenant); operational surface grows linearly with tenant count.
γ	Hybrid — opt-in per-tenant Signature — Phase 5b ships with partition Signature as the default; tenants flagged as high-volume or reputation-sensitive (operator-driven or automated based on send volume) are migrated to per-tenant Signatures on demand.	Balances cost and isolation. But: introduces an operator decision per tenant; migration path needs design.
δ	Remediation-only per-tenant Signature — partition Signature is the default; per-tenant Signature is the remediation when a tenant generates a reputation incident.	Lowest operational cost. But: by the time remediation is needed, reputation damage has already affected siblings.

Recommendation: To be made at Phase 5b planning, informed by:

Actual tenant send volume and bounce / spam rates in Phase 5b’s pilot phase.
Postmark’s own guidance at the time (their best practices may evolve).
Compliance / contractual requirements specific to tenant cohorts (e.g., enterprise tenants may contractually require reputation isolation).

Phase 4 work that this affects: None. The EmailDnsProvisioningRole (G-IAM-3 in 4-runtime-platform-updates/design/analysis.md) is provisioned regardless — it is the explicit enabler for whichever way this decision goes. Phase 4 ships the infrastructure; Phase 5b decides when to exercise it.

Applied to:

5b-email-module/pre-existing-decisions.md — listed as a Phase 5b decision pending.
phases.md § Phase 5b — referenced in the Phase 5b open design questions.
4-runtime-platform-updates/design/analysis.md § 5.5 — explicit forward-reference from C-Postmark-Sending’s out-of-scope edges.

DQ-R1-024: EmailEncryptionKey initial-value generation mechanism

Status: Superseded by DQ-R1-032 (originally Resolved, Round R1-Phase4).

Superseded (Round R1-Phase5b). Option A (CFN-native GenerateSecretString) was implemented in Phase 4 but proved undeployable for the email module: it can only emit a flat {"key": "<random chars>"} value, whereas the operations module’s MaterialRegistryRefresher requires a UUID-keyed { "<versionId>": "<base64 64-byte key>" } registry and fails fast on anything else. DQ-R1-032 reverses this decision in favour of Option C (the δ.1 NoEcho-parameter path, sourced from 1Password). The original analysis below is retained for the audit trail.
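
The shape mismatch can be illustrated with a validator for the registry format described above. This is a sketch of the stated contract (UUID keys, base64 values decoding to 64 bytes), not the MaterialRegistryRefresher code; the function name and error messages are hypothetical.

```typescript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Parses { "<versionId>": "<base64 64-byte key>" } and fails fast on any other shape.
function parseMaterialRegistry(secretString: string): Map<string, Buffer> {
  const parsed: unknown = JSON.parse(secretString);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Registry must be a JSON object");
  }
  const entries = Object.entries(parsed as Record<string, unknown>);
  if (entries.length === 0) throw new Error("Registry must contain at least one entry");

  const registry = new Map<string, Buffer>();
  for (const [versionId, value] of entries) {
    if (!UUID_PATTERN.test(versionId)) throw new Error(`Registry key "${versionId}" is not a UUID`);
    if (typeof value !== "string") throw new Error(`Value for ${versionId} is not a string`);
    const key = Buffer.from(value, "base64");
    // Re-encoding must reproduce the input, which rejects non-canonical base64.
    if (key.length !== 64 || key.toString("base64") !== value) {
      throw new Error(`Value for ${versionId} is not a base64-encoded 64-byte key`);
    }
    registry.set(versionId, key);
  }
  return registry;
}
```

A GenerateSecretString value such as {"key": "..."} is rejected at the first check on the key name, since "key" is not a UUID.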

Context: DQ-R1-019 decided what the SM secret looks like (one aws_secretsmanager.Secret per partition, passwordLength: 64, RemovalPolicy.RETAIN, two-axis envelope a{N}.k{SM-VERSION-ID}). It did not pin how the 64-byte initial value gets generated at first deploy. Three mechanisms are available; the choice affects both implementation cost and the audit story for “immutable post-launch” (V-PART-016).

Option	Mechanism	Trade-offs
A	CFN-native `GenerateSecretString` on `AWS::SecretsManager::Secret`. CDK declares `new sm.Secret({ generateSecretString: { passwordLength: 64, excludePunctuation: true } })`. CFN generates the value at first deploy; re-deploys are no-ops because the `GenerateSecretString` block is identity-stable.	Zero custom code. Existing precedent in the repo (`partition-secrets.ts` uses it for `SentryScrubSalt` with identical shape). CFN guarantees the value never regenerates unless the operator explicitly forces it. `describe-secret` versionId before/after re-deploy is identical — V-PART-016 verifies via CFN’s own behaviour.
B	CDK Custom Resource (inline Lambda) that generates the random value at deploy time and writes it to SM via the SDK.	Maximum flexibility (custom entropy source, key-derivation steps). But: Lambda boilerplate, opaque error handling, no out-of-the-box immutability story — the Lambda’s behaviour on re-deploy is whatever the author writes. Precedent exists for asymmetric keys (`generate-signing-key.ts`) but is heavier than required here.
C	Pre-Deploy script + NoEcho CFN parameter (δ.1 pattern). The Pre-Deploy script generates entropy locally, `op write`s it into 1Password, then `amm.sh` reads it and passes via `--parameters EmailEncryptionKey=$value` (NoEcho).	Mirrors the EmailPostmarkAccountToken delivery path (δ.1). Cleanly separates “operator-supplied” from “CFN-generated” secrets at the operational boundary. But: introduces an out-of-band 1P write that V-PART-016 would have to assert idempotency for; mixes secret-delivery patterns inside the same stack.

Recommendation: Option A.

Decision: Option A — CFN-native GenerateSecretString. PartitionEmailStack declares the encryption-key secret with the same shape as partition-secrets.ts’s SentryScrubSalt:

new sm.Secret(this, "EmailEncryptionKey", {
  secretName: `${publishingPrefix}-I-EmailEncryptionKey`,
  generateSecretString: { secretStringTemplate: JSON.stringify({}), generateStringKey: "key", passwordLength: 64, excludePunctuation: true },
  removalPolicy: cdk.RemovalPolicy.RETAIN,
});

The δ.1 NoEcho pattern is reserved for the EmailPostmarkAccountToken (externally supplied from 1Password). EmailEncryptionKey is CDK-internal generation — no NoEcho parameter, no Custom Resource, no separate amm.sh step.

Applied to:

4-runtime-platform-updates/design/specification.md — T-I4 PartitionEmailStack body refers to this decision when declaring the secret.
4-runtime-platform-updates/design/verification.md — V-PART-014, V-PART-016 procedures assert via describe-secret versionId before/after re-deploy (CFN-native immutability).

DQ-R1-025: Pre-Deploy script’s `cdk.context.json` write strategy

Status: Resolved. Round R1-Phase4.

Context: T-I8 (tools/register-partition-mail-signature.ts) writes the DKIM selector, public key, and Return-Path target into cdk.context.json after the Postmark Sender Signature is registered. Those values are then consumed by PartitionEmailStack at synth time. DQ-R1-014 decided that cdk.context.json is committed (with the public values) but did not pin how the entry script writes to it. There is no precedent in this repo for a tool writing cdk.context.json: CDK normally writes it at runtime via its providers.

Option	Mechanism	Trade-offs
A	Hand-rolled atomic JSON merge in a new `tools/lib/context-store.ts`. Read `cdk.context.json`, set keys under a namespaced path (e.g. `partitionMail:<infrastructure>:<partition>`), `JSON.stringify(obj, null, 2)`, write to `.tmp`, atomic rename.	Simple, zero deps. Deterministic key ordering via insertion-order. Pure unit-testable. Namespaced keys avoid collisions with CDK provider entries. Atomic-write boilerplate is ~15 LoC.
B	CDK `ContextProvider` helpers. Bootstrap the CDK runtime in the CLI script and use `ContextProvider.getValue()` / `Stage.synth()` mutation paths.	“Official” CDK path. But: the provider API is read-only at the public surface; mutation requires reaching into private CDK internals. Heavyweight (bootstraps a CDK runtime in a CLI tool).
C	`npx cdk context --set` CLI wrap. Invoke the CDK CLI’s `context --set <key>=<value>` per field.	Uses CDK’s own mutation surface. But: writes flat key=value pairs (strings only) — forces flattening nested objects into `partitionMail.Alpha002.dev.dkimSelector=...`, which is ugly to grep and brittle. One CLI invocation per field inflates run time.

Recommendation: Option A.

Decision: Option A — hand-rolled atomic JSON merge. Phase 4 lands a tools/lib/context-store.ts helper with the signature:

export function writePartitionMailContext(
  infrastructure: string,
  partition: string,
  values: { dkimSelector: string; dkimPublicKey: string; returnPathTarget: string },
): void;

It reads cdk.context.json from the repo root, sets the path partitionMail:<infrastructure>:<partition> to the values object, and writes back atomically (.tmp + rename) with JSON.stringify(..., null, 2). The corresponding read accessor in PartitionEmailStack reads the same namespaced key at synth time. Pure unit-testable with a temp directory; no CDK runtime bootstrap.
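A minimal sketch of such a helper, assuming the namespaced path is stored as one flat top-level key and that the temp file sits on the same filesystem as cdk.context.json (so the rename is atomic on POSIX). The optional contextPath parameter and the partitionMailKey helper are not part of the decided signature; they are hypothetical additions that make the sketch testable against a temp file.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

export interface PartitionMailValues {
  dkimSelector: string;
  dkimPublicKey: string;
  returnPathTarget: string;
}

// Assumption: the namespaced path is a single flat key, not a nested object.
export function partitionMailKey(infrastructure: string, partition: string): string {
  return `partitionMail:${infrastructure}:${partition}`;
}

export function writePartitionMailContext(
  infrastructure: string,
  partition: string,
  values: PartitionMailValues,
  // Hypothetical extra parameter so tests can point at a fixture file.
  contextPath: string = path.resolve("cdk.context.json"),
): void {
  // Start from the existing file so entries written by CDK providers survive.
  const current: Record<string, unknown> = fs.existsSync(contextPath)
    ? JSON.parse(fs.readFileSync(contextPath, "utf8"))
    : {};
  current[partitionMailKey(infrastructure, partition)] = values;

  // Write to a sibling temp file, then rename over the original.
  const tmpPath = `${contextPath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(current, null, 2), "utf8");
  fs.renameSync(tmpPath, contextPath);
}
```

Because existing keys keep their insertion order and a re-written partition key keeps its original position, repeated runs with unchanged values produce a byte-identical file, which keeps the committed cdk.context.json diff-free.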

Applied to:

4-runtime-platform-updates/design/specification.md — T-I8 entry-script body references this helper instead of inlining the JSON merge.
4-runtime-platform-updates/design/verification.md — V-CLI-001 (entry-script tests) includes a context-write assertion against a temp cdk.context.json fixture.

DQ-R1-026: Consolidation of Per-Partition Rollout Runs into a Single Operator-Cascade Run

Status: Resolved. Round R1-Phase4, post-PR-462 amendment (2026-05-26).

Context: The original Phase 4 plan decomposed the partition rollout into four runs: Run-2 (dev, code + dev deploy), Run-3 (stage), Run-4 (demo), Run-5 (prod). After PR #462 (Run-2) entered review, two facts became clear:

PR #462 already ships the code for all four active partitions: platforms.ts carries the mail block for dev, stage, demo, and prod; apps/Al1x/partition.ts iterates every partition whose mail block is set; PartitionEmailStack, the Pre-Deploy CLI, and the amm.sh step are all parameterised by <infrastructure> <partition>. cdk synth against each of the four {infra}-{partition}-Email stacks succeeds today against the as-built Run-2 tree. No code diff is needed to deploy stage, demo, or prod.
Each remaining run’s PR diff is, by default, a single CHANGELOG line. Runs 3 / 4 / 5 are operator-driven deploys (run amm.sh against a partition; capture the resulting cdk.context.json values; verify), not code work. Three sequential CHANGELOG-only PRs would multiply review and approval steps without commensurate technical content per PR.

Option	Description	Trade-offs
A	Keep Runs 3 / 4 / 5 as separate CHANGELOG-only PRs. Each PR is one bullet under `### Added`; operator runs one partition deploy per PR; sign-off per PR.	Per-partition isolation in approval flow; partition-by-partition retreat path if anything diverges. But: three review cycles for what is mechanically the same shape; the meaningful artefact (the execution log, `cdk.context.json` updates, any code fixes) accumulates across all four partitions and is fragmented across three PRs.
B	Collapse into a single Run-3 (operator cascade): one PR captures the CHANGELOG entry reflecting the deployed-system state, the accumulated `cdk.context.json` updates from all four partition deploys, and any code fixes that surfaced during execution. The execution log becomes the run’s primary deliverable (an artefact document, not the PR body).	Single PR approval gate; the deliverable artefact (the execution log) is intact, not split. The operator still has full per-partition retreat: if `prod` fails to deploy, only its sub-section of the log is unpopulated; the run does not close until all four partitions are verified. Cost: a single PR review must cover four partition deploys; production deploy approval is part of the same gate as stage/demo.
C	Collapse 3 + 4 only; keep `prod` (Run-5) separate. Stage and demo land together; production is its own approval gate.	Splits the difference. But: stage and demo are on different Postmark accounts (`PostmarkNonProd` vs `PostmarkProd`); pairing them does not reduce risk asymmetry the way the natural split (`{dev, stage}` on NonProd vs `{demo, prod}` on Prod) might. The collapsed shape still leaves the worst review-churn case (three PRs in total: one code, one cascade, one prod) without the benefit of a clean cascade artefact.

Recommendation: Option B.

Decision: Option B. Runs 3, 4, and 5 collapse into a single Run-3 (operator cascade). The run’s deliverables are:

An execution log (4-runtime-platform-updates/plan/runs/run-3-operator-cascade/execution-log.md) — one section per partition (dev, stage, demo, prod), capturing pre-flight outcomes, amm.sh run output, post-deploy verification (dig checks, Postmark Console state, CFN export presence), and any anomalies. The log is written as the operator executes each partition; it is the primary artefact of the run.
A single infra PR opened from the existing phase-4/infrastructure-run-3 worktree, base = main (auto-retargets when PR #462 merges), containing:
- CHANGELOG.md — one new entry (### Added) describing the deployed-system state across all four partitions.
- cdk.context.json — populated with each partition’s partitionMail:<infra>:<partition> block (DKIM + Return-Path target) captured during the operator’s Pre-Deploy step.
- Any code fixes that emerged during partition deploys (e.g., a PostmarkProd-account quirk surfaced on demo).
Operator sign-off rows in 4-runtime-platform-updates/design/verification.md populated for the per-partition V-checks (V-OPS-005-dev, V-OPS-005-stage, V-OPS-005-demo, V-OPS-005-prod) as each partition lands.

Note on Run-2 boundary: PR #462’s operator step (running ./amm.sh Alpha002 dev after merge to deploy dev) is part of the new Run-3 cascade, not a post-action of Run-2. Run-2 closes when PR #462 merges; Run-3 begins immediately afterwards with the dev partition as its first cascade entry.

Note on Run-6 and Run-7 numbering: Runs 6 (drift workflow) and 7 (documentation) keep their original numbers. Renumbering them to 4 and 5 would churn specification.md, verification.md, phases.md, and PR #462’s existing CHANGELOG entry text for no operational benefit. The gap (runs 4 and 5 vacant) is documented here and in the phases.md Run table.

Note on retreat path: If a partition’s deploy fails mid-cascade (CFN rollback, Postmark error, DNS propagation gap), the operator captures the failure in the execution log, files any necessary code fix, and resumes from the failed partition once the underlying issue is addressed. The cascade does not roll back successfully deployed partitions: the per-partition isolation that drove the original Phase-4 decomposition (DQ-R1-021) still holds at the resource level.

Applied to:

4-runtime-platform-updates/plan/choreography.md — Run table, dependency diagram, hand-offs, deliverables, failure modes all collapsed to a single Run-3 row.
4-runtime-platform-updates/plan/evaluation.md — Run table reduced; working-directory count revised.
4-runtime-platform-updates/design/specification.md § 3 — worktree list collapses to a single phase-4/infrastructure-run-3 entry.
4-runtime-platform-updates/plan/runs/run-2-dev-rollout/project-plan.md — T-O4 reference updated to “unblocks Run-3 cascade (stage entry)”.
4-runtime-platform-updates/plan/runs/run-6-drift-workflow/project-plan.md — entry criterion now references “Run-3 cascade has at least dev partition live”.
plan/runs/run-3-stage-rollout/, plan/runs/run-4-demo-rollout/, plan/runs/run-5-prod-rollout/ — directories removed.
New plan/runs/run-3-operator-cascade/ — consolidated project plan, execution log skeleton, and validate-exit.sh.

Round R1-Phase5a: Component Library Updates Decisions

This round captures decisions made during Phase 5a — additive helpers to the common-module library consumed by the Phase 5b Email module. Decision IDs DQ-R1-027 through DQ-R1-031 are reserved for this round; all entries are resolved. Phase 5b’s consumer-side adoption work (lifting the new common-module version, applying the Email module’s classification of Internal.IncompatibleState sites, wiring the typed idempotency view) lands separately under Phase 5b decisions.

DQ-R1-027: `AppError.Application` Introduction

Status: Resolved. Round R1-Phase5a, 2026-05-28.

Context: The existing AppError hierarchy splits caller error (Invocation — ArgumentValidation, NullArgument, NotFound, Duplicate, Authorization; reportable() = emptyList()) from system error (Internal — Implementation, Infrastructure, IncompatibleState, InternalService, InternalTimeout, Transient; reportable() = listOf(this)). Neither captures the third real category: the call was well-formed, the system is healthy, but the application’s current state does not allow this operation right now. Today these get squeezed into Invocation.GeneralValidation (which misclassifies a domain-state outcome as caller error) or Internal.IncompatibleState (which misclassifies a recoverable application outcome as a bug-class signal). The misclassification has real cost: Internal.IncompatibleState pages on-call; the L4 mapping table can’t distinguish “caller passed bad input” from “system in unexpected state”.

Option	Description	Trade-offs
A	Keep the two-branch hierarchy. Document the convention that application-state outcomes use `Invocation.GeneralValidation`.	No new types. But: the type system stops carrying the information; reviewers must enforce the convention by inspection; `reportable()` still pages on-call for cases that aren’t bugs.
B	Add `Application` as a third top-level branch, peer to `Internal` and `Invocation`. Three concrete subtypes — `PreconditionFailed` (operation requires prior state the system doesn’t have), `PolicyRejected` (operation disallowed by policy), `ConflictingState` (operation race-lost or expectation drifted). `reportable()` returns empty list for the whole branch.	Three new types. Source-incompatibility for exhaustive-`when` consumers of `AppError`. But: the type system carries the classification; `reportable()` is correct by construction; the L4 mapping table dispatches on subtype.
C	Add the three outcomes as flat peer `data class`es directly under `AppError` (no enclosing `sealed class Application`).	Avoids the source-incompatibility from sealed-class nesting. But: loses the shared `reportable() = emptyList()` override; loses the typed-dispatch case at L4.

Recommendation: Option B.

Decision: Option B. sealed class Application is added as a third top-level branch under AppError, with three concrete subtypes (PreconditionFailed, PolicyRejected, ConflictingState). All three are data classes with message: String, context: LazyMessage? = null, cause: Throwable? = null. reportable() returns emptyList() at the Application level (single inheritance point). REST-status mapping is the single responsibility of the L4 mapping table (HttpErrorResponses.kt); Application subtypes do not carry HTTP-status hints. The companion Internal.IncompatibleState reclassification sweep is tracked separately (DQ-R1-028).

Applied to:

design/index.md § 1 — API surface and consumer guidance.
task-plan.md PR #1 — additive common-module minor release introducing the three subtypes (Added category; 9.2.0).
Phase 5b email module — L3 services in the Email module use Application.PreconditionFailed / Application.ConflictingState where appropriate.

DQ-R1-028: `Internal.IncompatibleState` Reclassification Sweep

Status: Resolved. Round R1-Phase5a, 2026-05-28.

Context: DQ-R1-027 introduces AppError.Application. The existing 62 construction sites of Internal.IncompatibleState in common-module/lib/src/main (and additional sites in operations) need case-by-case judgement: each site is one of (a) a genuine bug-class invariant violation that should keep Internal.IncompatibleState, (b) a recoverable application outcome that should move to Application.ConflictingState, or (c) a caller-error that should move to Invocation.GeneralValidation. The question is how to land the sweep — as a single bulk PR, as a series of per-area PRs, or as the last PR of Phase 5a after the other helpers are in.

Option	Description	Trade-offs
A	Sweep + `AppError.Application` introduction in a single PR. One review captures both the new types and the reclassification.	Locks the rationale to the same review. But: bigger PR, harder to atomically revert one decision without the other.
B	Sweep as one of several parallel PRs that each carry an additive helper. Each PR is `Added`-only; the sweep is `Changed`. Five PRs land in any order; consumers absorb when convenient.	Sweep can land mid-stream; would force a major bump (`Changed`) in the middle of the additive-minor sequence. Confusing release history.
C	Sweep as the last PR of Phase 5a, after the four `Added`-only helpers. PR #1, #3, #4, #5 each `Added` and minor (9.2.0 → 9.5.0); the sweep is PR #2 by number but lands last and consolidates the major bump (10.0.0).	Each minor PR is small and reviewable in isolation; the sweep accumulates the discovery and reclassification across all 62+ sites in one focused review; consumers absorb one major + four minors in their adoption bump. Closest match to OQ-V’s “consumers absorb everything at once”.

Recommendation: Option C.

Decision: Option C. The sweep lands as the final Phase 5a PR (numbered PR #2 by design topic, sequenced last by release order). Discovery is a separate planning step that builds the inventory of Internal.IncompatibleState construction sites in common-module and classifies each. The Phase 5a sweep covers common-module only; the matching sweep in operations is part of Phase 5b’s consumer adoption work. CHANGELOG category is Changed (consumers doing exhaustive when over AppError.Internal see reclassified sites move out); version bump is major (10.0.0 from the prior 9.x.x).

Applied to:

design/index.md § 2 — sweep methodology and per-bucket criteria.
task-plan.md PR #2 — sequenced last; Changed; major bump.
Phase 5b — the matching sweep within operations is part of consumer adoption.

DQ-R1-029: `sanitizeHeader` Value-Cleaning Primitive

Status: Resolved. Round R1-Phase5a, 2026-05-28.

Context: The Phase 5b Email module accepts inbound HTTP requests carrying headers that get persisted or routed into downstream operations (idempotency keys via Idempotency-Key, tenant correlation via X-Tenant-Id, etc.). The existing HeadersAllowList (in common-module’s lib/runtime/observability/) controls which headers are safe to log; it does not control what values are safe to read into business logic or persist. The two concerns are independent and compose: deny-by-name first (observability scoping), then per-value cleaning. Phase 5a needs a primitive that owns the value-cleaning step.

Option	Description	Trade-offs
A	Extend `HeadersAllowList` to also clean values.	One artefact. But: conflates two responsibilities (observability scoping vs. persistence safety); the allowlist’s existing scope (Sentry-payload-shaping) is not the same as L3 / persistence input cleaning.
B	Add `sanitizeHeader(name, value): Result<String?>` in a new package (`lib/api/headers/`). Composes downstream of `HeadersAllowList`. Reject by-value (control characters, length cap, charset) returning `Result.failure(AppError.Invocation.GeneralValidation)`; clean (trim, normalize) returning `Result.success(cleaned)`.	Two artefacts with crisp responsibilities. `sanitizeHeader` is callable from L4 inbound and (by package placement) L4 outbound when needed. Composes with the existing allowlist without coupling.

Recommendation: Option B.

Decision: Option B. sanitizeHeader(name: String, value: String): Result<String?> ships in lib/api/headers/ (a new package). Returns Result.success(cleanedValue) on accept, Result.success(null) on policy reject (header dropped silently, no error), Result.failure(AppError.Invocation.GeneralValidation) on hard-rejection (control characters, oversize, charset violation). The composition pattern at L4 inbound is: HeadersAllowList.filter(...) first (drop disallowed-by-name; observability scoping), then sanitizeHeader(name, value) per surviving header (clean / reject by value; in-transaction). HeadersAllowList is left unchanged.
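The three outcomes of the decision (accept with a cleaned value, silent policy drop, hard rejection) can be illustrated with a small sketch. The production primitive is Kotlin and returns Result<String?>; the TypeScript below only mirrors the shape. The length cap, the printable-ASCII charset, and the "empty after trim is dropped" policy are assumptions made for illustration; the log does not fix those rules.

```typescript
type HeaderOutcome =
  | { ok: true; value: string | null } // null = dropped by policy, no error
  | { ok: false; error: "GeneralValidation"; reason: string };

// Assumed cap; the real limit is not specified in this log.
const MAX_HEADER_LENGTH = 256;

function reject(name: string, reason: string): HeaderOutcome {
  return { ok: false, error: "GeneralValidation", reason: `${name}: ${reason}` };
}

function sanitizeHeader(name: string, value: string): HeaderOutcome {
  // Hard rejections run on the raw value, before any trimming could hide them.
  if (value.length > MAX_HEADER_LENGTH) return reject(name, "oversize");
  if (/[\u0000-\u001f\u007f]/.test(value)) return reject(name, "control characters");
  // Assumed charset: printable ASCII only.
  if (!/^[\u0020-\u007e]*$/.test(value)) return reject(name, "charset violation");

  const cleaned = value.trim();
  // Assumed policy: a value that is empty after trimming is dropped silently.
  if (cleaned.length === 0) return { ok: true, value: null };
  return { ok: true, value: cleaned };
}
```

In the composition described above, a function like this would run only on headers that survived HeadersAllowList.filter(...), so it never needs to know which names are loggable.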

Applied to:

design/index.md § 3 — API surface, test plan, composition pattern.
task-plan.md PR #3 — additive; 9.3.0.
Phase 5b L4 endpoints (email-configuration, email-job, postmark-events) — consume sanitizeHeader for inbound header handling.

DQ-R1-030: `TokenCipher` + `Hmac` Cryptographic Helpers

Status: Resolved. Round R1-Phase5a, 2026-05-29.

Context: DQ-R1-019 pinned the per-partition encryption-key design and named TokenCipher as the Phase 5a primitive that implements the two-axis envelope (a{N}.k{SM-VERSION-ID}:<base64-payload>). The decision left four implementation-level shape questions for Phase 5a: (1) factory shape (the canonical Arda companion inline operator fun <reified T> invoke(...) pattern doesn’t apply because TokenCipher is non-generic); (2) how the cipher resolves a versionId that is not present in the in-memory MaterialRegistry at decrypt time; (3) auth-tag-failure classification (AES-GCM tag-verification failure on decrypt — Internal.IncompatibleState vs Application.ConflictingState); (4) whether to extract an Hmac micro-helper or leave the two existing JDK-Mac callsites duplicated.

Option	Description	Trade-offs
Factory: A	Use plain primary constructor. Validation (non-empty `info`, registry contains `currentVersionId`) becomes the caller’s responsibility.	Total constructor. But: pushes validation up to every caller; loses an obvious place to centralize the contract.
Factory: B	`companion operator fun invoke(info: String, materials: MaterialRegistry, currentVersionId: UUID): Result<TokenCipher>`. Callers write `TokenCipher(info, materials, currentVersionId)` and get `Result<TokenCipher>` back.	Constructor-shaped call site; `Result<T>` carries validation failures; consistent with the workspace `kotlin-coding` standard. The Arda reified-inline pattern is for resolving type parameters and doesn’t apply here.
Resolution: A	Project every live key-material version into the in-memory `MaterialRegistry` from a single ESO-projected JSON map; cipher fails fast (transient) on a miss.	Single secret-delivery path (ESO); no application-side path to AWS Secrets Manager; bounded `common-module` surface (no SDK dependency, no fallback abstraction). Propagation lag covered by the existing transient-retry layer.
Resolution: B	Maintain a smaller in-memory registry (e.g., `AWSCURRENT` + `AWSPREVIOUS` only); fall back to a caller-supplied hook (typically AWS SDK direct call) for older versions.	Smaller registry, but introduces a parallel-to-ESO secret-delivery path with the future-drift concern that other features may start using the same path.
Auth-tag: A	Classify auth-tag mismatch as `Application.ConflictingState`. L3 caller handles re-read / retry from a different version.	Recoverable framing. But: misclassifies what is actually data corruption (or active tampering); doesn’t page on-call when it should.
Auth-tag: B	Classify auth-tag mismatch as `Internal.IncompatibleState`. Bug-worthy; pages on-call.	Pages on the rare-but-serious failure modes (storage corruption, key-material desync, active tampering). If operational reality reveals spurious tag failures, reclassify — but bug-worthy is the safer starting position.
Hmac DRY: A	Leave the two JDK-`Mac` callsites (`OpaqueId.kt:67`, `S3AssetService.kt:143`) inline-duplicated.	No new helper. But: two near-identical copies that drift independently; future HKDF wrapper inside `TokenCipher` needs the same pattern.
Hmac DRY: B	Extract `Hmac` micro-helper in `lib/crypto/` (the package `TokenCipher` lives in). Both existing sites migrate. `TokenCipher` uses it internally for HKDF derivation.	One helper, three callsites consistent. `TokenCipher`’s HKDF logic is testable in isolation. Internal refactor only; no external API change at the migrated sites.

Recommendation: Factory B + Resolution A + Auth-tag B + Hmac DRY B.

Decision: Factory B (companion operator fun invoke(info, materials, currentVersionId): Result<TokenCipher>), Resolution A (single ESO-projected registry; no application-side path to AWS Secrets Manager), Auth-tag B (Internal.IncompatibleState for AES-GCM tag mismatch), Hmac DRY B (extract Hmac to lib/crypto/). Public surface lives in cards.arda.common.lib.crypto:

TokenCipher — two-axis envelope a{N}.k{SM-VERSION-ID}:<base64>; HKDF-SHA256 key derivation; AES-256-GCM encrypt / decrypt; MaterialRegistry keyed by versionId; private constructor + companion operator fun invoke(info, materials, currentVersionId). The cipher does not consult any external system at runtime — the MaterialRegistry is populated by the caller from a single ESO-projected JSON map carrying every live key-material version, and the cipher reads only what is currently in the registry. The caller may mutate the registry at runtime in response to ESO refresh events.
MaterialRegistry — versioned 64-byte key-material store; its add and of methods enforce the material length.
Hmac — thin wrapper over javax.crypto.Mac for HmacSHA256. Used by TokenCipher for HKDF; OpaqueId.kt and S3AssetService.kt migrate to it as part of the same PR.
HKDF exposure as a standalone helper for non-TokenCipher consumers is deferred; HKDF stays internal to TokenCipher in v1.

Decrypt failure classification (two distinct modes):

Auth-tag mismatch → Result.failure(AppError.Internal.IncompatibleState(...)) — bug-worthy; pages on-call.
Unknown versionId → Result.failure(AppError.Transient.FailoverFailed(cause)) where cause is a synthetic Throwable whose message names the missing version. Bounded transient (propagation lag between AWS Secrets Manager and ESO’s projection into the pod); existing transient-retry layers (Postmark webhook, outbound idempotency, L4 client retries) fire after timescales that exceed ESO’s reconciliation interval and find the registry refreshed on the next attempt. Class name FailoverFailed is observability noise — diagnostic info lives in the cause’s message and structured logging at the catch site; a more specific subtype is intentionally not added to avoid the breaking change of extending the sealed Transient hierarchy.
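The envelope and the two decrypt failure modes can be illustrated with Node's crypto module. This is a sketch under stated assumptions, not the Kotlin TokenCipher: the algorithm axis is fixed at a1, the HKDF salt is empty, and the payload layout (12-byte nonce, ciphertext, 16-byte tag) is invented for the example. The names encryptToken, decryptToken, and DecryptOutcome are hypothetical. A malformed envelope simply throws here, since its classification is not covered above.

```typescript
import { createCipheriv, createDecipheriv, hkdfSync, randomBytes } from "node:crypto";

type DecryptOutcome =
  | { kind: "ok"; plaintext: string }
  | { kind: "unknown-version"; versionId: string } // transient: registry not yet refreshed
  | { kind: "auth-tag-mismatch" }; // bug-worthy: corruption, desync, or tampering

type MaterialMap = Map<string, Buffer>; // versionId -> 64 bytes of key material

const ENVELOPE = /^a(\d+)\.k([0-9a-fA-F-]{36}):([A-Za-z0-9+/]+={0,2})$/;
const NONCE_BYTES = 12; // assumed layout
const TAG_BYTES = 16;

// HKDF-SHA256 down to a 32-byte AES-256 key; the empty salt is an assumption.
function deriveKey(material: Buffer, info: string): Buffer {
  return Buffer.from(hkdfSync("sha256", material, Buffer.alloc(0), info, 32));
}

function encryptToken(plaintext: string, info: string, materials: MaterialMap, currentVersionId: string): string {
  const material = materials.get(currentVersionId);
  if (material === undefined) throw new Error(`no key material for ${currentVersionId}`);
  const nonce = randomBytes(NONCE_BYTES);
  const cipher = createCipheriv("aes-256-gcm", deriveKey(material, info), nonce);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const payload = Buffer.concat([nonce, body, cipher.getAuthTag()]);
  return `a1.k${currentVersionId}:${payload.toString("base64")}`;
}

function decryptToken(token: string, info: string, materials: MaterialMap): DecryptOutcome {
  const match = ENVELOPE.exec(token);
  if (match === null) throw new Error("malformed envelope");
  const versionId = match[2];
  const payload = Buffer.from(match[3], "base64");
  if (payload.length < NONCE_BYTES + TAG_BYTES) throw new Error("payload too short");

  // The cipher reads only the in-memory registry; a miss is reported, not resolved.
  const material = materials.get(versionId);
  if (material === undefined) return { kind: "unknown-version", versionId };

  const decipher = createDecipheriv("aes-256-gcm", deriveKey(material, info), payload.subarray(0, NONCE_BYTES));
  decipher.setAuthTag(payload.subarray(payload.length - TAG_BYTES));
  try {
    const body = payload.subarray(NONCE_BYTES, payload.length - TAG_BYTES);
    const plain = Buffer.concat([decipher.update(body), decipher.final()]);
    return { kind: "ok", plaintext: plain.toString("utf8") };
  } catch {
    // final() throws when GCM tag verification fails.
    return { kind: "auth-tag-mismatch" };
  }
}
```

Keeping the two failure kinds distinct in the return type is what lets a caller retry the unknown-version case while treating the tag mismatch as an incident.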

Rotation enablement (the JSON-map schema for the SM secret value, the disposition of the deployed EmailEncryptionKeyFallbackRole, the operator rotation script, and the future AWS SM Rotation Lambda) is tracked as a follow-up in PDEV-659.

Applied to:

design/index.md § 4 — envelope format, key-derivation, error classification, internal refactor sites.
task-plan.md PR #4 — additive common-module minor release (Added). The OpaqueId.kt / S3AssetService.kt migration is a private-call-site refactor with no external API change; CHANGELOG remains Added-only.
Phase 5b email module (consumes TokenCipher per DQ-R1-019).

DQ-R1-031: Idempotency Helpers with Native JsonElement + Typed Wrapper

Status: Resolved. Round R1-Phase5a, 2026-05-28.

Context: Phase 5b’s Email module needs idempotency at two seams: inbound HTTP requests (Intent (a) — caller supplies an Idempotency-Key header; the L3 service de-duplicates retries) and outbound side-effecting calls (Intent (b) — the L3 service generates a deterministic idempotency key from a natural key for the downstream API). Phase 5a ships the primitives; Phase 5b wires them.

Two structural questions dominate the design:

Type shape at the store boundary — whether the store is parameterised by the consumer’s Req/Res types (forcing the store to own kotlinx serialization via reified KSerializer<Req>/KSerializer<Res> references) or operates on JsonElement symmetrically with a typed wrapper on top.
Schema evolution — when a consumer changes the shape of Req in a serialisation-affecting way (field rename, type change), in-flight idempotency records hash differently after the change. The defensive mitigation can be a request_schema_version column on every row (store-side knob), procedural drain-before-deploy (operational), or the caller projects Req to a stable hash shape explicitly (caller-controlled).

Option	Description	Trade-offs
A	Typed store `IdempotencyStore<Req, Res>` with reified factory. Schema-evolution is procedural (drain-before-deploy).	Typed-by-default; naive callers get correctness for free. But: conflates request DTO with hash projection; schema-evolution mitigation is store-wide procedure; store API depends on `KSerializer<Req>`.
B	Native `RawIdempotencyStore` operating on `JsonElement` symmetrically (Req and Res). Typed wrapper `IdempotencyStore<Req, Res>` produced by an inline extension `fun <reified Req, reified Res> RawIdempotencyStore.typedAs(json: Json = JsonConfig.standardJson): IdempotencyStore<Req, Res>`. Caller chooses the layer; schema-evolution is the caller’s responsibility (custom adapter or stable JsonElement projection).	Explicit separation of request DTO from idempotency-hash projection. Schema-evolution is caller-controlled per consumer. Native store API surface is one type narrower (no `KSerializer<Req>`). One extra line of caller-side encoding boilerplate at the call site (folds into a per-service helper).

Recommendation: Option B.

Decision: Option B. The native interface is RawIdempotencyStore (operating on JsonElement symmetrically); the typed view is IdempotencyStore<Req, Res> produced by inline fun <reified Req, reified Res> RawIdempotencyStore.typedAs(json: Json = JsonConfig.standardJson): IdempotencyStore<Req, Res>. The typed wrapper holds resolved KSerializer<Req>/KSerializer<Res> references captured once at wiring time. On replay-time decode failure (Req or Res bytes no longer match the current type), the wrapper returns Result.failure(AppError.Internal.IncompatibleState(...)) — decode failures are bugs (the consumer changed schema without a coordinated drain or adapter), not normal operational outcomes. The native store always carries Mismatch.recordedRequest: JsonElement so consumers (typed or raw) can log the recorded request for debugging on hash collision. Schema-evolution defence is caller-controlled: a consumer that wants stable hashes across Req versions writes a custom KSerializer<Req> (passed via a refined Json to typedAs) or projects Req to a stable JsonElement shape before calling the native store. No store-wide request_schema_version column ships in v1.

Other implementation-level decisions:

error_payload projection (failure-path) — encoded via a well-structured @Serializable data class. The exact projection shape is fixed in the implementation PR; consumers see IdempotencyOutcome.PriorError(error: AppError) on replay (durable contract). The on-disk bytes are private to the store.
Mismatch.equals/hashCode override — kept. ByteArray content-equality is required for tests and consumer-side caches to compare Mismatch values meaningfully.
replayWindowOverride: Duration? = null parameter on begin() — kept. Operator-driven retry endpoints that want a shorter replay window than the per-store default pass it per-request. Default null means use store default.
result_payload / error_payload storage column type — JSONB in PostgreSQL. PG validates payloads at write time; SQL-level inspection produces readable output; JSONB has native indexing if ever needed. Switching to a binary format later would be a column-type migration, judged unlikely.
IdempotencyKeyMinter (Intent (b)) — separate helper taking parts: List<String>. Deterministic SHA-256-based; orthogonal to either store interface; not parameterised by Req/Res.
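As an illustration of the minter's contract (same parts in, same key out), a deterministic SHA-256 derivation might look like the sketch below. The length-prefix framing and the hex output are assumptions chosen for the example; the log only fixes that the helper takes a list of string parts and is SHA-256-based. The function name mintIdempotencyKey is hypothetical.

```typescript
import { createHash } from "node:crypto";

function mintIdempotencyKey(parts: readonly string[]): string {
  const hash = createHash("sha256");
  for (const part of parts) {
    const bytes = Buffer.from(part, "utf8");
    // Length prefix (assumed framing) so ["ab", "c"] and ["a", "bc"] differ.
    hash.update(`${bytes.length}:`);
    hash.update(bytes);
  }
  return hash.digest("hex");
}
```

Without some framing between parts, plain concatenation would let two different natural keys mint the same idempotency key, which is the failure this kind of helper exists to prevent.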

Applied to:

design/idempotency-design.md — full implementation-shaped design.
task-plan.md PR #5 — additive common-module minor release introducing the package cards.arda.common.lib.runtime.idempotency (Added; 9.5.0). The idempotency_record Flyway migration ships in Phase 5b’s consumer adoption (operations), not in common-module, since common-module has no production-side migrations.
Phase 5b email module — L3 services in the Email module consume the typed view (IdempotencyStore<EmailSendRequest, EmailJob>) and the IdempotencyKeyMinter for outbound Postmark retry safety.

Round R1-Phase5b: Email Module Decisions

DQ-R1-032: EmailEncryptionKey registry delivery and source of truth

Status: Resolved. Round R1-Phase5b. Supersedes DQ-R1-024.

Context: DQ-R1-024 chose CFN-native GenerateSecretString (Option A) to populate the per-partition EmailEncryptionKey secret. During Phase 5b email-module deployment this proved undeployable. GenerateSecretString can only emit a flat {"<fixedKey>": "<random characters>"} value, but the operations module’s MaterialRegistryRefresher reads the secret as a key registry: a non-empty JSON map { "<versionId-UUID>": "<base64 of exactly 64 bytes>" }, ordered newest-last, where the current encryption version is the last key. It fails fast (AppError.Infrastructure, pod dies) on any other shape, and the generated value breaks that contract on three counts: the key isn’t a UUID, the value is 64 characters (~48 bytes) rather than base64 of 64 bytes, and there’s no version map. CFN cannot generate a UUID key, base64-of-N-random-bytes, or an ordered map, so Option A can never satisfy the consumer. The registry must therefore originate outside CFN.

Option	Mechanism	Trade-offs
A (DQ-R1-024)	CFN-native `GenerateSecretString`.	Zero custom code, but structurally incapable of the registry shape. Rejected — undeployable.
B	CDK Custom Resource (Lambda) generates the registry at deploy time and writes it to SM.	Could produce the right shape, but adds Lambda boilerplate, an opaque immutability/rotation story, and a second secret-generation pattern in the stack.
C	Pre-Deploy script + NoEcho CFN parameter (δ.1 pattern). The registry JSON is operator-provisioned in the partition vault (`op://Arda-{Env}OAM/EmailEncryptionKey/registry`); `amm.sh` resolves it through the Pre-Deploy script (`--encryption-key-out`) and passes it as the NoEcho `EmailEncryptionKeyJson` parameter backing the secret.	Mirrors the `EmailPostmarkAccountToken` delivery path exactly. 1Password is a single, auditable source of truth; key material never appears in templates, change sets, or stack events. Cost: an out-of-band operator step to seed the registry, and rotation requires a redeploy to propagate.

Recommendation: Option C.

Decision: Option C. 1Password is the source of truth for the encryption-key registry. The flow is strictly one-directional:

op://Arda-{Env}OAM/EmailEncryptionKey/registry   (source of truth, per partition)
   → amm.sh Pre-Deploy resolves it → NoEcho CFN parameter EmailEncryptionKeyJson
   → {Infra}-{ns}-I-EmailEncryptionKey Secrets Manager secret (deploy-time projection)
   → ESO ExternalSecret sync → /app/secret/email/values.json (pod projection)
   → MaterialRegistryRefresher (current version = entries.keys.last())

Consequences that bind future work:

Rotation = appending a new {uuid: base64(64 bytes)} entry to the 1Password registry, newest-last, retaining prior entries so tokens encrypted under older versions still decrypt. Secrets Manager and the pod are downstream copies, refreshed on the next amm.sh deploy (--force, since a parameter-only change does not alter the synthesised template) and ESO sync. Do not hand-edit the Secrets Manager value — it is overwritten from 1Password on every deploy.
Order is significant: the module treats the last key in the JSON as the active version, so tooling must append (never re-sort).
In-pod live reload without restart remains the operations-side concern (email S09 / RotatingTokenCipher + watch loop); even then the material still originates from 1Password.
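
The append-only rotation rule above can be sketched as follows. This is an illustration under stated assumptions, not the tooling tracked as PDEV-881: `appendKeyVersion` is a hypothetical helper, the registry is assumed to be handled as a JSON string read from and written back to the 1Password field by some other step, and standard padded base64 is assumed for the key material.

```typescript
import { randomBytes, randomUUID } from "node:crypto";

// Hypothetical helper: returns a new registry JSON with one extra entry
// appended last. Prior entries are kept, in their original order, so tokens
// encrypted under older versions can still be decrypted.
function appendKeyVersion(registryJson: string): { registryJson: string; versionId: string } {
  const current = JSON.parse(registryJson) as Record<string, string>;
  const versionId = randomUUID();
  const material = randomBytes(64).toString("base64"); // base64 of exactly 64 bytes

  // Spread keeps existing insertion order; the new key is added after it.
  const next: Record<string, string> = { ...current, [versionId]: material };

  // JSON.stringify emits string keys in insertion order. Do not pass the
  // result through a key-sorting serializer, or the active version changes.
  return { registryJson: JSON.stringify(next), versionId };
}
```

After the updated JSON is stored in the vault, the Secrets Manager and pod copies would pick it up through the redeploy and ESO sync described above; the sketch deliberately performs no vault or AWS calls.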

The -API-EmailEncryptionKeyArn export, the secret name, and RemovalPolicy.RETAIN are unchanged from DQ-R1-024, so no consumer-facing identifier moves.

Applied to:

infrastructure — src/main/cdk/stacks/purpose/partition-email.ts (NoEcho EmailEncryptionKeyJson parameter + secretStringValue), src/main/cdk/platforms.ts (per-partition mail.encryptionKeyOpReference + helper), tools/lib/partition-mail-signature.ts (--encryption-key-out), amm.sh (parameter override). Shipped in infrastructure#485, CHANGELOG [3.4.3]. Tracked as PDEV-880.
operations — MaterialRegistryRefresher is the consumer whose contract drives the shape (no change required by this decision).
Operator prerequisite & rotation tooling: the registry must be seeded once per partition vault (item EmailEncryptionKey, field registry). Generation/rotation tooling is tracked as PDEV-881.

Summary

#	Summary	Status	Downstream Impact	Decision
DQ-001	Tenant sending domain shape	Resolved	DNS zone structure, CDK stacks, tenant provisioning scripts, supplier-facing FQDNs	`<tenant>.<partition>.{mail-root-domain}` uniformly (revised per DQ-010)
DQ-002	Multi-config domain strategy for v2+	Resolved	`tenant_email_config` schema (nullable `config_slug`), DNS record structure, v2+ provisioning	Sub-subdomain (`<conf>.<tenant>.<partition>.{mail-root-domain}`); v1 provisions at tenant level only
DQ-003	Tenant slug source	Resolved	Provisioning request shape, slug derivation logic	From request (tenantEId, tenantName, tenantSlug); derivation algorithm deferred
DQ-004	Reply-To editability in send dialog	Resolved	Send dialog UI, BFF route contract, GEN::EML and PRO::EML use cases	Read-only; system-resolved from procurement contact or user email
DQ-005	Email order send paths (copy-paste vs system)	Resolved	SPA side panel UX, backend submit signal handler, PRO::EML use cases	Both coexist; copy-paste preserved for email orders, system send added as new path
DQ-006	CS alerting scope in v1	Resolved	Observability infrastructure, GEN::EML::0004 use case scoping	ESP OOTB alerting in v1; Arda-built is v2+
DQ-007	Document generation responsibility	Resolved	Email service interface contract, PO submit workflow, GEN::EML::0002 use case	Calling feature generates document, passes Blob/URL to email capability
DQ-008	Send dialog interaction model	Resolved	SPA dialog component, GEN::EML::0001 scenario structure	Single-step dialog; cancel prompts if edits were made
DQ-009	Mail root domain choice	Resolved	DNS zone creation, registrar delegation, all tenant FQDNs, infrastructure.md parameter resolution	`ardamails.com` (standalone, already owned); implementation parametric
DQ-010	Prod tenant zone placement	Resolved	Root zone content, IAM scoping, cross-account access, DQ-001 FQDN shape	Own partition zone; root zone stays static/CDK-only
DQ-011	Webhook authentication mechanism	Resolved	Postmark-events endpoint auth, provisioning flow (Step 5), Webhooks API usage	Bearer token via modern Webhooks API; reuses existing ARDA_API_KEY validation
DQ-012	Per-tenant server token storage	Resolved	Secrets Manager scope, IAM roles, provisioning flow, emailConfiguration service, DB schema	Encrypted in DB with partition-wide key (via ESO); no per-tenant SM writes; emailConfiguration decrypts for emailJob
DQ-013	IAM role extraction from root stack	Resolved	Root CDK application structure, deployment procedure	Do not extract; role stays in RootDnsStack (CF name: RootConfiguration). Extraction deferred to future need.
DQ-R1-006	Locus of cross-zone NS-delegation writes	Resolved	Phase 2 / Phase 3 / Phase 4 ownership boundaries; deploy-order dependency between Root and child stacks	Child stack writes upstream via `WriteNSRecordsToUpstreamDns`; Root only owns the assume-role IAM target
DQ-R1-007	Vault separation for Free Kanban Tool server token	Resolved	Phase 1 typed surface (item removed); Phase 3 reintroduces with new location; threat model — credential out of `OP_SERVICE_ACCOUNT_TOKEN` blast radius	`op://Arda-CorporateOAM/Free-Kanban-Generator-Postmark-Server/credential` (separate vault from `Arda-SystemsOAM`)
DQ-R1-008	Adopt vs create the existing `ardamails.com` zone	Resolved	`RootConfiguration` stack composition; deployment workflow (IMPORT change-set + normal deploy); registrar NS chain preserved	Adopt via `cdk import` against `Z0721066239FWCD47EJDX`; CDK code mirrors the live AWS-default comment to keep the import read-only; `RemovalPolicy.RETAIN` defends against accidental destroy
DQ-R1-009	Postmark domain-verification target (parent vs leaf)	Resolved	`PostmarkSendingDomain` configuration; operator companion; future Corporate-consumer onboarding	Verify at the Corporate-zone parent (`arda.ardamails.com`); leaves inherit DKIM
DQ-R1-010	Locus of Corporate’s NS-delegation write (same-account)	Resolved	`CorporateMailDns` stack composition; behavior under future Corporate-account migration	Always go through `WriteNSRecordsToUpstreamDns` and assume the Root role even when same-account
DQ-R1-011	`route-53-hosted-zone.ts` → `dns-zone.ts` migration shape	Resolved	Construct catalogue; Phase 3 PR scope; existing callers (partitions + Root)	Rename in place; existing callers updated in the same PR
DQ-R1-012	Corporate drift-workflow filename and scope	Resolved	`.github/workflows/` shape; `tools/corporate-drift.ts` driver design; future Corporate-asset onboarding	`corporate-drift.yml` — one workflow per instance group, exercising every asset listed in `instances/Corporate/`
DQ-R1-013	Phase A failure ordering for the Postmark server token	Resolved	`corporate-cli.ts` Phase A semantics; recovery path on 1P-write failure; testability	In-memory buffer + retries on the 1P write; fail loud with redacted summary on permanent failure; manual operator action to recover
DQ-R1-014	`cdk.context.json` commit policy	Resolved	Repo `.gitignore`; CI re-synth determinism	Commit `cdk.context.json` with the `postmark.free-kanban.*` keys (public values)
DQ-R1-015	DMARC reporting mailbox	Resolved	DMARC TXT record content at `_dmarc.arda.ardamails.com`; operator companion (mailbox provisioning prerequisite)	`rua=mailto:dmarc-reports@arda.cards`; operator provisions the mailbox in Arda’s Google Workspace before Phase B deploy
DQ-R1-016	Reserved-name registry scope at `arda.ardamails.com`	Resolved	Cross-instance-group import coupling; `corporate-cli.ts` Phase A acceptance criteria	Documentation-only registry; CLI enforces locally via a Phase-A conflict-check against pre-existing Sender Signatures, servers, and 1P items
DQ-R1-017	Postmark Sender Signature granularity per partition	Resolved	Phase 4 partition-email stacks; Postmark account split; per-tenant deferral to Phase 5b	One Signature per partition sub-zone; leaves inherit DKIM; per-tenant Signatures deferred
DQ-R1-018	`corporate-drift` rename and scope	Resolved	`.github/workflows/` shape; future runtime-platform drift checks unrelated to email	Keep `corporate-drift`; add parallel `runtime-platform-drift` with shared reusable scripts
DQ-R1-019	Per-partition email server-token encryption key	Resolved	Phase 4 SM secret; Phase 5b `TokenCipher` + Helm `ExternalSecret` mounts; future AWS Rotation Lambda	Single SM secret per partition with native versioning; two-axis envelope `a{N}.k{SM-VERSION-ID}`; hot-swap dual-mount; lazy + coroutine migration; SDK fallback
DQ-R1-020	DNS-provisioning + SM-fallback IAM roles	Resolved	Phase 4 partition-email stack IAM declarations; Phase 5b STS-chain call sites in the Email module; `AllowCreatingNSRecordsRole` construct generalization (R-4) with Root no-drift guard	Two per-purpose roles per partition: DNS-records role via reuse of the existing `AllowCreatingNSRecordsRole` construct (generalized for a configurable trust principal; Root output byte-identical, guarded by unit test + verification); `EmailEncryptionKeyFallbackRole` fresh. Both STS-chained from the partition pod role; trust policy = account principal + `ArnLike` on `{fqn}-*`; mirrors the `ImageUploadPreSigningRole` pattern
DQ-R1-021	Order of partition rollout	Resolved	Phase 4 + Phase 5b deploy order; `kyle` suspension	Partial order `dev → {stage || demo} → prod`; `kyle` excluded (per 2026-05-13 amendment)
DQ-R1-022	Operator CLI shape for Phase 4	Resolved	Phase 4 operator surface; refactoring of Phase 3 `corporate-cli` to extract shared utilities	Integrate into `amm.sh`; share utilities with `corporate-cli`; no standalone `partition-mail-cli`
DQ-R1-023	Per-tenant Postmark Sender Signature introduction (Phase 5b)	Open — TBC at Phase 5b planning	Phase 5b tenant-onboarding flow; per-tenant reputation isolation strategy; whether `EmailDnsProvisioningRole` is exercised per-tenant or held in reserve	Four options (α / β / γ / δ); no Phase 4 dependency. To be confirmed when Phase 5b sees pilot data on tenant send volume and bounce / spam rates.
DQ-R1-024	EmailEncryptionKey initial-value generation mechanism	Superseded by DQ-R1-032	Phase 4 `PartitionEmailStack` secret declaration; V-PART-016 immutability test shape	~~CFN-native `GenerateSecretString`~~ — undeployable for the module (cannot emit the UUID-keyed registry shape); reversed by DQ-R1-032.
DQ-R1-025	Pre-Deploy script’s `cdk.context.json` write strategy	Resolved	T-I8 entry-script implementation; `tools/lib/context-store.ts` helper; V-CLI-001 test shape	Hand-rolled atomic JSON merge namespaced under `partitionMail:<infrastructure>:<partition>`. No CDK runtime bootstrap; no `cdk context --set` per field.
DQ-R1-026	Consolidation of per-partition rollout runs	Resolved	Run table in choreography / evaluation / specification; Run-3 deliverables (execution log + single infra PR); retirement of run-3-stage / run-4-demo / run-5-prod plans	Collapse Runs 3 / 4 / 5 into a single Run-3 operator cascade. The cascade walks the partial order from DQ-R1-021 — `dev` first, then `stage` and `demo` in either order or in parallel, then `prod` last — inside the single PR. One PR captures CHANGELOG + accumulated `cdk.context.json` + any code fixes. Run-6 / Run-7 numbering retained (gap intentional).
DQ-R1-027	`AppError.Application` introduction	Resolved	`common-module` `AppError` hierarchy gains a third top-level branch; L4 error-mapping table updated; Phase 5b L3 services in the Email module use `Application.PreconditionFailed` / `ConflictingState`	Add `sealed class Application` with three subtypes (`PreconditionFailed`, `PolicyRejected`, `ConflictingState`); `reportable() = emptyList()` at the branch root; no HTTP-status hints on subtypes
DQ-R1-028	`Internal.IncompatibleState` reclassification sweep	Resolved	`common-module` (62+ construction sites) reclassified per Phase 5a methodology; `operations` sweep is Phase 5b’s responsibility; CHANGELOG `Changed` -> major bump (10.0.0)	Discovery-then-classify methodology; sweep lands as the final Phase 5a PR (after the four additive minors); covers `common-module` only
DQ-R1-029	`sanitizeHeader` value-cleaning primitive	Resolved	New `lib/api/headers/` package in `common-module`; Phase 5b L4 endpoints consume it; composes downstream of the existing `HeadersAllowList`	`sanitizeHeader(name, value): Result<String?>`; accept / silent-drop / hard-reject; separate from `HeadersAllowList` (observability scoping); minor release (`Added`; 9.3.0)
DQ-R1-030	`TokenCipher` + `Hmac` cryptographic helpers	Resolved	New `lib/crypto/` package in `common-module` with two-axis envelope per DQ-R1-019; `MaterialRegistry` populated by the caller from a single ESO-projected JSON map (no application-side path to AWS Secrets Manager); `OpaqueId.kt` and `S3AssetService.kt` internally refactored to use the new `Hmac` helper (no external API change at those sites); Phase 5b consumes `TokenCipher` for at-rest server-token encryption; rotation enablement tracked as PDEV-659	`companion operator fun invoke(info, materials, currentVersionId)` factory returning `Result<TokenCipher>`; AES-GCM auth-tag failure classified as `Internal.IncompatibleState` (bug-worthy); unknown `versionId` on decrypt classified as `Transient.FailoverFailed` (bounded propagation lag handled by existing retry layers); `Hmac` extracted and shared; `Added`
DQ-R1-031	Idempotency helpers with native JsonElement + typed wrapper	Resolved	New `lib/runtime/idempotency/` package in `common-module`; Phase 5b consumer wires `IdempotencyStore<EmailSendRequest, EmailJob>` and `IdempotencyKeyMinter`; the `idempotency_record` Flyway migration ships in Phase 5b’s consumer adoption, not in `common-module`	Two-tier API: `RawIdempotencyStore` (native `JsonElement`, symmetric Req/Res) + `IdempotencyStore<Req, Res>` typed wrapper via `inline fun typedAs()`. Decode-failure -> `Result.failure(AppError.Internal.IncompatibleState)`; `Mismatch` carries `recordedRequest: JsonElement`; schema-evolution is caller-controlled; `JSONB` storage columns; `Added`; 9.5.0
DQ-R1-032	EmailEncryptionKey registry delivery and source of truth	Resolved (supersedes DQ-R1-024)	`PartitionEmailStack` secret declaration; `platforms.ts` op-references; `amm.sh` + Pre-Deploy script; operator vault-seeding step; key-rotation tooling (PDEV-881)	Deliver the `{versionId: base64(64-byte key)}` registry via a NoEcho CFN parameter (δ.1) sourced from `op://Arda-{Env}OAM/EmailEncryptionKey/registry`. 1Password is the source of truth; SM + pod are deploy-time projections; rotation appends entries newest-last and redeploys.
