Decision Log: Email Integration
Purpose
Tracks design decisions for the Arda Email Integration project, covering domain structure, sending model, tenant isolation, address handling, and subsystem responsibilities.
Decision Table
| # | Question | Status | Decision | Round |
|---|---|---|---|---|
| DQ-001 | Tenant sending domain structure | Decided | <tenant>.<partition>.{mail-root-domain} uniformly (see DQ-010) | R1 |
| DQ-002 | Multi-config domain strategy | Decided | Sub-subdomain, deferred to v2+ | R1 |
| DQ-003 | Tenant slug source | Decided | From provisioning request (tenantEId, tenantName, tenantSlug); algorithm deferred | R1 |
| DQ-004 | Reply-To editability | Decided | Not user-editable | R1 |
| DQ-005 | Email order send paths | Decided | Copy-paste (existing) + system send (new) | R1 |
| DQ-006 | CS alerting scope in v1 | Decided | ESP OOTB only; Arda-built is v2+ | R1 |
| DQ-007 | Document generation responsibility | Decided | Calling feature, not email capability | R1 |
| DQ-008 | Send dialog interaction model | Decided | Single-step (no separate confirm) | R1 |
| DQ-009 | Mail root domain choice | Decided | ardamails.com (implementation parametric) | R1 |
| DQ-010 | Prod tenant zone placement | Decided | Own partition zone (prod.{mail-root-domain}), not root zone | R1 |
| DQ-011 | Webhook authentication mechanism | Decided | Bearer token via Postmark modern Webhooks API | R1 |
| DQ-012 | Per-tenant server token storage | Decided | Encrypted in DB (application-level), not Secrets Manager | R1 |
| DQ-013 | IAM role extraction from root stack | Decided | Do not extract; role stays in RootDnsStack | R2 |
| DQ-R1-001 | Drift workflow filename | Decided | external-resources-drift.yml — describes the asserted invariant | R1-Phase1 |
| DQ-R1-002 | Drift-check TypeScript location | Decided | tools/drift-check.ts — operator- and CI-runnable | R1-Phase1 |
| DQ-R1-003 | Operator runbook sign-off mechanism | Decided | Markdown “Operator Sign-off” section with name/date/deviations table | R1-Phase1 |
| DQ-R1-004 | Disposition of legacy parser-gated runbook | Decided | Delete in Phase 1 — no parser gate remains | R1-Phase1 |
| DQ-R1-005 | API-surface freshness cadence | Decided | At first drift-test failure attributable to surface drift, augmented by an annual review | R1-Phase1 |
| DQ-R1-006 | Locus of cross-zone NS-delegation writes | Decided | Child zone owner writes upstream via WriteNSRecordsToUpstreamDns; Root only owns the assume-role target | R1-Phase2 |
| DQ-R1-007 | Vault separation for Free Kanban Tool server token | Decided | Lives in Arda-CorporateOAM (separate vault), not Arda-SystemsOAM | R1-Phase1 |
| DQ-R1-008 | Adopt-vs-create the existing ardamails.com zone | Decided | Adopt via cdk import against Z0721066239FWCD47EJDX; CDK code mirrors the live zone’s AWS-default comment to keep the import read-only | R1-Phase2 |
| DQ-R1-009 | Postmark domain-verification target (parent vs leaf) | Decided | Verify at the Corporate-zone parent (arda.ardamails.com); leaf sub-domains inherit DKIM | R1-Phase3 |
| DQ-R1-010 | Locus of Corporate’s NS-delegation write (same-account) | Decided | Always go through WriteNSRecordsToUpstreamDns and assume the Root role even when same-account; preserves the pattern under future Corporate-account migration | R1-Phase3 |
| DQ-R1-011 | route-53-hosted-zone.ts → dns-zone.ts migration shape | Decided | Rename in place; existing callers updated in the same PR | R1-Phase3 |
| DQ-R1-012 | Corporate drift-workflow filename and scope | Decided | corporate-drift.yml — one workflow per instance group, exercising every asset listed in instances/Corporate/ | R1-Phase3 |
| DQ-R1-013 | Phase A failure ordering for the Postmark server token | Decided | In-memory buffer + retries on the 1Password write; fail loud with redacted summary on permanent failure; manual operator action to recover | R1-Phase3 |
| DQ-R1-014 | cdk.context.json commit policy for Phase A’s outputs | Decided | Commit cdk.context.json — public values only, standard CDK convention, deterministic re-synth on a fresh checkout | R1-Phase3 |
| DQ-R1-015 | DMARC reporting mailbox (rua / ruf) for _dmarc.arda.ardamails.com | Decided | dmarc-reports@arda.cards; operator action to create the mailbox in Arda’s Google Workspace before Phase B deploy | R1-Phase3 |
| DQ-R1-016 | Reserved-name registry scope at arda.ardamails.com | Decided | Documentation-only; corporate-cli.ts enforces locally via a conflict-check at Phase A entry against pre-existing Postmark Sender Signatures, servers, and 1Password items | R1-Phase3 |
| DQ-R1-017 | Postmark Sender Signature granularity per partition | Decided | One Signature per partition sub-zone; leaves inherit DKIM; per-tenant Signatures deferred to Phase 5b | R1-Phase4 |
| DQ-R1-018 | corporate-drift rename and scope | Decided | Keep corporate-drift; add a parallel runtime-platform-drift workflow with shared reusable scripts | R1-Phase4 |
| DQ-R1-019 | Per-partition email server-token encryption key | Decided | Single SM secret per partition with native versioning; two-axis envelope a{N}.k{SM-VERSION-ID}; hot-swap via AWSCURRENT+AWSPREVIOUS mounts; lazy + coroutine migration; SDK fallback | R1-Phase4 |
| DQ-R1-020 | DNS-provisioning + SM-fallback IAM roles | Decided | Fresh per-purpose roles assumed via STS from the operations pod role (mirroring the image-asset-bucket preSigningRole pattern); trust policy = account principal + ArnLike on the partition role-name prefix | R1-Phase4 |
| DQ-R1-021 | Order of partition rollout | Decided | Partial order dev → (stage or demo, either order or in parallel) → prod; kyle excluded (partition suspended). Total order relaxed to partial order in 2026-05-13 amendment. | R1-Phase4 |
| DQ-R1-022 | Operator CLI shape for Phase 4 | Decided | Integrate into amm.sh; extract reusable utilities shared with corporate-cli (no standalone partition-mail-cli) | R1-Phase4 |
| DQ-R1-023 | Per-tenant Postmark Sender Signature introduction (Phase 5b) | Open — TBC at Phase 5b planning | Four options (α status quo / β per-tenant v1 / γ hybrid opt-in / δ remediation-only). No Phase 4 dependency. | R1-Phase5b |
| DQ-R1-027 | AppError.Application introduction | Decided | Add sealed class Application with PreconditionFailed, PolicyRejected, ConflictingState subtypes; reportable() = emptyList() at branch root; no HTTP-status hints. | R1-Phase5a |
| DQ-R1-028 | Internal.IncompatibleState reclassification sweep | Decided | Discovery-then-classify methodology; sweep lands as the final Phase 5a PR (after the four additive minors); covers common-module only; major bump (10.0.0). | R1-Phase5a |
| DQ-R1-029 | sanitizeHeader value-cleaning primitive | Decided | sanitizeHeader(name, value): Result<String?> in new lib/api/headers/ package; composes downstream of HeadersAllowList; minor (Added; 9.3.0). | R1-Phase5a |
| DQ-R1-030 | TokenCipher + Hmac cryptographic helpers | Decided | companion operator fun invoke(info, materials, currentVersionId) factory returning Result<TokenCipher>; AES-GCM auth-tag failure -> Internal.IncompatibleState; unknown versionId on decrypt -> Transient.FailoverFailed (bounded propagation lag handled by existing retry layers); Hmac extracted to share between TokenCipher, OpaqueId, S3AssetService. Minor (Added). | R1-Phase5a |
| DQ-R1-031 | Idempotency helpers with native JsonElement + typed wrapper | Decided | RawIdempotencyStore (native JsonElement) + IdempotencyStore<Req, Res> via inline fun typedAs(); decode-failure -> Result.failure(AppError.Internal.IncompatibleState); Mismatch carries recordedRequest; schema-evolution caller-controlled; JSONB storage. Minor (Added; 9.5.0). | R1-Phase5a |
Round 1: Initial Design Decisions
DQ-001: Tenant Sending Domain Structure
Context: Each tenant needs an isolated sending domain for DKIM, SPF, and DMARC. The domain shape affects FQDN length, DNS zone management, and future extensibility. The choice of mail root domain itself is a separate decision (see DQ-009); this decision addresses the structure beneath whatever root is chosen.
| Option | Description | Trade-offs |
|---|---|---|
| A | <tenant>.{mail-root-domain} (prod), <tenant>.<partition>.{mail-root-domain} (non-prod) | Short prod FQDNs but requires prod tenant records in the root zone (cross-account writes, mixed static/dynamic records). |
| B | <tenant>.<partition>.{mail-root-domain} uniformly for all partitions | One extra label in prod FQDNs. Consistent structure, clean IAM scoping, root zone stays static. |
| C | Full canonical: <partition>.<infra>.{mail-root-domain} per tenant | Consistent with existing Arda pattern but longest FQDNs; tenant identity buried in subdomain hierarchy. |
Recommendation: Option A initially; revised to Option B after DQ-010.
Decision: Option B. Uniform <tenant>.<partition>.{mail-root-domain} across all partitions. The one-label cost in prod is outweighed by consistent zone structure, clean IAM, and a static root zone. See DQ-010 for the detailed rationale.
Applied to:
- preliminary-exploration.md § Domain Structure (Working Assumption C) — note: exploration doc predates this revision
- infrastructure.md § DNS, Tenant Domain Shape
- DNS-structure diagram: see mail-dns-structure.drawio.svg in public/assets/diagrams/ (rendered inline in exploration/infrastructure.md § DNS).
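The uniform shape chosen in DQ-001 can be sketched as a small helper. This is a minimal illustration, not code from the Arda repositories: the function names and the `Partition` type are hypothetical, and the default root value is the one recorded in DQ-009.

```typescript
// Illustrative only: these names are hypothetical, not taken from the Arda codebase.
type Partition = "dev" | "stage" | "demo" | "prod";

// The root is a parameter (DQ-009); ardamails.com is the value this log records.
function tenantSendingDomain(
  tenantSlug: string,
  partition: Partition,
  mailRootDomain: string = "ardamails.com",
): string {
  // Same shape for every partition, including prod (Option B).
  return `${tenantSlug}.${partition}.${mailRootDomain}`;
}

// The delegated zone that holds tenant records for a partition (DQ-010).
function partitionZone(
  partition: Partition,
  mailRootDomain: string = "ardamails.com",
): string {
  return `${partition}.${mailRootDomain}`;
}

console.log(tenantSendingDomain("acme", "prod")); // acme.prod.ardamails.com
console.log(tenantSendingDomain("acme", "dev")); // acme.dev.ardamails.com
```

Keeping the root as a parameter mirrors the requirement that the implementation stay parametric on {mail-root-domain}.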
DQ-002: Multi-Configuration Domain Strategy
Context: A tenant may eventually need multiple email configurations (e.g., separate sending domains for procurement vs. shipping). The v1 domain structure must not block this. Builds on DQ-001 and DQ-010, which fix the canonical Application-Runtime-tenant shape as <tenant-slug>.<partition>.{mail-root-domain}; this decision adds the <conf-slug> label.
| Option | Description | Trade-offs |
|---|---|---|
| A | Sub-subdomain: <conf-slug>.<tenant-slug>.<partition>.{mail-root-domain} | Each config gets independent DKIM key and reputation. DMARC can apply at tenant level with subdomain policy. DNS hierarchy is explicit and parseable. Adds a label. |
| B | Composite slug: <conf-slug>-<tenant-slug>.<partition>.{mail-root-domain} | Flat structure at the conf-tenant boundary, shorter. But: hyphen boundary is ambiguous (parsing fragility), all configs share one DKIM key (defeats isolation purpose), no per-config DMARC override. |
Recommendation: Option A — sub-subdomain preserves DKIM isolation and DNS hierarchy.
Decision: Option A. v1 provisions at <tenant-slug>.<partition>.{mail-root-domain} (single config, no <conf-slug> label; partition included per DQ-001 / DQ-010). Schema includes nullable config_slug field for v2+. Adding <conf-slug>.<tenant-slug>.<partition>.{mail-root-domain} later is additive — no migration of existing domains. Trade-off noted: if v2+ wants the default config to also live at a sub-subdomain, existing supplier address books would need updating, but this is opt-in, not forced.
Applied to:
- (No surviving design artefact references this decision; recorded here for traceability.)
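To illustrate why Option A's hierarchy is described as explicit and parseable, the sketch below splits a sending domain into its labels. It is a hypothetical helper, assuming the v1 shape has two labels under the root and the v2+ shape has three; the `shipping` config slug is an invented example.

```typescript
// Hypothetical parser; the field names are illustrative, not from the Arda schema.
interface SendingDomainParts {
  confSlug: string | null; // null for the v1 single-config shape
  tenantSlug: string;
  partition: string;
}

function parseSendingDomain(
  fqdn: string,
  mailRootDomain: string,
): SendingDomainParts | null {
  const suffix = `.${mailRootDomain}`;
  if (!fqdn.endsWith(suffix)) return null;

  // Labels to the left of the root, leftmost first.
  const labels = fqdn.slice(0, -suffix.length).split(".");
  if (labels.some((label) => label.length === 0)) return null;

  const [a, b, c] = labels;
  if (labels.length === 2 && a !== undefined && b !== undefined) {
    // v1: <tenant-slug>.<partition>.{mail-root-domain}
    return { confSlug: null, tenantSlug: a, partition: b };
  }
  if (labels.length === 3 && a !== undefined && b !== undefined && c !== undefined) {
    // v2+: <conf-slug>.<tenant-slug>.<partition>.{mail-root-domain}
    return { confSlug: a, tenantSlug: b, partition: c };
  }
  return null;
}
```

With the composite-slug alternative (Option B), a name such as `east-coast-acme` could not be split this way without extra rules, because hyphens are also legal inside a slug.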
DQ-003: Tenant Slug Source
Context: The sending domain uses a tenant slug (<slug>.{mail-root-domain}). The slug must be DNS-safe (lowercase alphanumeric + hyphens), validated against reserved words, and permanent (changing it requires DKIM reputation re-warming and supplier address book updates).
| Option | Description | Trade-offs |
|---|---|---|
| A | New field on Tenant entity | Explicit, decoupled from display name. Requires schema change and UI for CS to set it. |
| B | Derived from tenant name automatically | No new field. But: tenant names may contain spaces/special chars, derivation rules need defining, name changes would create inconsistency. |
| C | Provided by CS at provisioning time as separate input | No schema change on Tenant. Slug stored only in tenant_email_config. But: not visible in tenant management UI, potential for typos. |
Recommendation: Option A — a permanent identifier deserves an explicit field with validation.
Decision: The tenant slug is provided as part of the provisioning request alongside tenantEId and tenantName. The slug and name may be null; the emailConfiguration service determines the final slug using a combination of the three inputs. The specific derivation algorithm is deferred to implementation. The slug is stored on the EmailConfiguration entity, not on the Tenant entity.
Applied to:
- functional.md § Service: emailConfiguration, ProvisionRequest data type
- architectural-scenarios.md § Scenario 1
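The slug constraints named in the DQ-003 context (DNS-safe, checked against reserved words) can be sketched as a validator. This is a minimal sketch under stated assumptions: the reserved list is invented for illustration, the 63-character limit and the ban on leading or trailing hyphens come from general DNS label rules rather than from this log, and the derivation algorithm itself remains deferred.

```typescript
// Hypothetical reserved list; the real registry is not specified in this log.
const RESERVED_SLUGS = new Set(["www", "mail", "admin", "postmaster"]);

// One DNS label: 1 to 63 characters, lowercase alphanumerics and hyphens,
// with no leading or trailing hyphen.
const DNS_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

// Returns a list of problems; an empty list means the slug is acceptable.
function validateTenantSlug(slug: string): string[] {
  const problems: string[] = [];
  if (!DNS_LABEL.test(slug)) problems.push("not a DNS-safe label");
  if (RESERVED_SLUGS.has(slug)) problems.push("reserved word");
  return problems;
}
```

Because the slug is permanent once provisioned, a check of this kind would most plausibly run before any DNS or ESP resource is created.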
DQ-004: Reply-To Editability
Context: When sending an order by email, should the user be able to edit the Reply-To address in the send dialog?
| Option | Description | Trade-offs |
|---|---|---|
| A | Editable (To, Cc, Reply-To all editable) | Maximum flexibility. Risk: user sets Reply-To to an address they don’t control, replies go to wrong person. |
| B | Read-only (To and Cc editable, Reply-To resolved by system) | Controlled. Reply-To is always the procurement contact or user’s own email. v2+: tenant-configured functional address. |
Recommendation: Option B — Reply-To should be system-controlled to prevent misdirected replies.
Decision: Option B. Reply-To resolved in order: (1) procurement.email from order header, (2) user email from JWT/ApplicationContext. Displayed as read-only in send dialog. v2+: tenant may configure a functional Reply-To (e.g., “procurement inbox”).
Applied to:
- product/features/general-behaviors/email-communications.md (feature; not yet authored) § Sending Model
- product/features/procurement/email-orders.md (feature; not yet authored) § Recipient Resolution, Requirements FR-0004
- product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0001::0003
- product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0001::0004
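The Reply-To resolution order in the DQ-004 decision can be sketched as follows. The interfaces are hypothetical stand-ins for the order header and the JWT/ApplicationContext user; only the two-step precedence comes from the decision itself.

```typescript
// Hypothetical shapes; the real order header and user context types are not shown in this log.
interface OrderHeader {
  procurement?: { email?: string | null } | null;
}

interface UserContext {
  email: string;
}

// Resolution order per DQ-004:
// (1) procurement.email from the order header, (2) the user's own email.
function resolveReplyTo(order: OrderHeader, user: UserContext): string {
  const procurementEmail = order.procurement?.email?.trim();
  return procurementEmail ? procurementEmail : user.email;
}
```

Treating a blank procurement email as absent is an assumption of this sketch; the decision only states the precedence.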
DQ-005: Email Order Send Paths
Context: Email orders currently use a copy-paste workflow (side panel renders text, user copies to their own client). The new email capability adds system-send. Should copy-paste be removed?
| Option | Description | Trade-offs |
|---|---|---|
| A | Replace copy-paste with system send | Simpler UX, one path. But: breaks existing workflow, users who prefer their own client lose that option. |
| B | Both paths coexist | Backward compatible. Copy-paste preserved for email orders; system send added as new option. PO orders are system-send only (no existing copy-paste path for PO). |
Recommendation: Option B — backward compatibility with no user disruption.
Decision: Option B. Copy-paste is the existing path that stays as-is. System send is a new parallel path. For orderMethod=PURCHASE_ORDER, only system send is available (PDF attachment requires system involvement).
Applied to:
- product/features/procurement/email-orders.md (feature; not yet authored) § Overview, Requirements FR-0011, FR-0012
- product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0002::0002
DQ-006: CS Alerting Scope in v1
Context: The feature specifies bounce rate > 5% and complaint rate > 0.1% thresholds triggering CS alerts. Should Arda build this alerting in v1?
| Option | Description | Trade-offs |
|---|---|---|
| A | Arda-built alerting from day one | Full control, custom thresholds. Engineering cost in v1. |
| B | Rely on ESP’s built-in alerting in v1, Arda-built in v2+ | Postmark provides bounce/complaint alerting OOTB via its console. No engineering cost. Less customizable. |
Recommendation: Option B — Postmark’s console alerting is sufficient for v1 at 100-150 tenants.
Decision: Option B. v1 relies on Postmark’s built-in alerting. Arda-built alerting with configurable thresholds is v2+.
Applied to:
- product/features/general-behaviors/email-communications.md (feature; not yet authored) § Administration
- product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0004::0003
DQ-007: Document Generation Responsibility
Context: For PO-by-email, a PDF must be generated and attached. Should the general email capability generate documents, or receive them pre-generated?
| Option | Description | Trade-offs |
|---|---|---|
| A | Email capability generates documents | Centralized, but couples email to PDF pipeline. Email capability needs to know about order rendering. |
| B | Calling feature generates document, passes Blob/URL to email capability | Clean separation. Email capability is document-agnostic. Calling feature handles generation errors before invoking email. |
Recommendation: Option B — email capability should not know about document types.
Decision: Option B. The calling feature generates the PDF and passes it as a Blob or URL. If generation fails, the calling feature handles the error; email capability is never invoked.
Applied to:
- product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0002::0002
- product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0003::0002
DQ-008: Send Dialog Interaction Model
Context: Should the send flow have separate “edit addresses” and “confirm send” steps, or a single combined dialog?
| Option | Description | Trade-offs |
|---|---|---|
| A | Two steps: address resolution → confirmation dialog | Explicit separation. But: unnecessary friction if defaults are correct — user clicks through two dialogs to send. |
| B | Single-step dialog with editable fields + preview | One interaction: if defaults are correct, user just hits “Send.” Cancel with edits prompts for confirmation. |
Recommendation: Option B — minimize friction for the happy path.
Decision: Option B. Single-step send dialog with To/Cc editable, Reply-To read-only, content preview. Cancel prompts if edits were made.
Applied to:
- product/use-cases/general-behaviors/email-communications.md (use cases; not yet authored) § GEN::EML::0001::0003 (merged from former 0003+0004)
- product/use-cases/procurement/email-orders.md (use cases; not yet authored) § PRO::EML::0001::0001
DQ-009: Mail Root Domain Choice
Context: All tenant sending domains are subdomains of a root mail domain. The choice of root domain affects reputation separability from the app domain (arda.cards), DNS delegation mechanics, and FQDN length.
| Option | Description | Trade-offs |
|---|---|---|
| A | mail.arda.cards (subdomain of app domain) | No new domain registration. Shorter FQDNs if tenants are already familiar with arda.cards. But: shares reputation baseline with arda.cards — a deliverability incident on the app domain could affect mail, and vice versa. NS delegation from GoDaddy apex. |
| B | Standalone domain (e.g., arda-mail.com or similar) | Fully independent reputation from app domain. Clean separation for compliance or brand reasons. But: requires new domain registration and management. Tenants see an unfamiliar domain. |
| C | Other subdomain of arda.cards (e.g., email.arda.cards, send.arda.cards) | Same trade-offs as Option A with a different label. |
Recommendation: Option B — standalone domain for full reputation separation.
Decision: Option B. ardamails.com (already owned, registered with Route53 in platformRoot account). Implementation must be parametric on the root domain value so it can be changed later if needed. The {mail-root-domain} parameter in infrastructure.md resolves to ardamails.com.
Applied to:
- infrastructure.md § Parameters (entire document parametrized)
- All documents using {mail-root-domain} notation
DQ-010: Prod Tenant Zone Placement
Context: The original design (exploration doc, Working Assumption C) placed prod tenant records directly in the root zone ({mail-root-domain}) to achieve shorter prod FQDNs (3 labels: acme.ardamails.com). Non-prod partitions each had their own delegated zone. This creates an asymmetry where the root zone contains both static infrastructure records (SPF, DMARC, NS delegations) and runtime-provisioned tenant records, and the operations service in Alpha001 needs write access to a zone in platformRoot.
| Option | Description | Trade-offs |
|---|---|---|
| A | Prod tenants in root zone (original) | Shorter prod FQDNs (3 labels). But: root zone mixes static and dynamic records, prod provisioning needs cross-account write access to platformRoot, IAM scoping is more complex, root zone is not CDK-only. |
| B | Prod gets its own partition zone (prod.{mail-root-domain}) | One extra label in prod FQDNs (4 labels: acme.prod.ardamails.com). Uniform structure across all partitions, clean IAM (Alpha001 writes to its own zones), root zone stays static/CDK-only, no cross-account writes for tenant records. |
Recommendation: Option B — consistency, clean IAM boundaries, and a static root zone outweigh one label of FQDN length.
Decision: Option B. All partitions (dev, stage, demo, prod) get their own delegated zone under {mail-root-domain}. The root zone contains only NS delegations and parent SPF/DMARC records — no runtime-provisioned records. This supersedes the “Working Assumption C” FQDN shape from the exploration doc for prod.
Applied to:
- DQ-001 — revised from Option A to Option B
- infrastructure.md § DNS (Tenant Domain Shape table, zone tables, IAM scoping)
- DNS-structure diagram: see mail-dns-structure.drawio.svg in public/assets/diagrams/ (rendered inline in exploration/infrastructure.md § DNS).
DQ-011: Webhook Authentication Mechanism
Context: Postmark sends delivery status events (Delivery, Bounce, SpamComplaint) to a webhook URL on the Arda backend. The endpoint must verify that incoming requests are genuinely from Postmark. Postmark does not sign webhook payloads (no HMAC/signature). Two authentication approaches are available.
| Option | Description | Trade-offs |
|---|---|---|
| A | HTTP Basic Auth credentials embedded in the webhook URL | Supported via legacy server-level fields (DeliveryWebhook etc.) and modern API (HttpAuth field). Credentials appear in URL strings, which may be logged by proxies and access logs. Requires a new credential type separate from existing API auth. |
| B | Bearer token via HttpHeaders on the modern Webhooks API | Configured via POST /webhooks with HttpHeaders: [{"Name": "Authorization", "Value": "Bearer <token>"}]. Reuses the existing ARDA_API_KEY validation already implemented in the backend. No credentials in URL strings. Requires the modern Webhooks API (not the legacy server-level fields). |
Recommendation: Option B — reuses existing auth infrastructure, cleaner security posture.
Decision: Option B. Use Bearer token authentication via the modern Postmark Webhooks API. The token can be the same ARDA_API_KEY already used for API authentication, validated by the same backend mechanism. Webhooks are configured per server during provisioning via POST /webhooks (Server Token), not via the legacy server-level URL fields.
Applied to:
- postmark-service.md § Webhook Authentication, § Step 5: Configure Webhooks, § Provisioning Sequence, § Legacy vs Modern Webhook API
- functional.md § postmark-events endpoint, § Tenant Provisioning
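The Bearer-token approach chosen in DQ-011 can be sketched as a request-body builder plus the matching check on the receiving endpoint. This is an illustration, not the provisioning code: the function names and the example URL are hypothetical, and the field names (Url, MessageStream, HttpHeaders, Triggers) follow the author's reading of Postmark's Webhooks API and should be verified against Postmark's current documentation.

```typescript
// Body for POST /webhooks (sent with the per-server token). Field names are
// assumed from Postmark's Webhooks API documentation; verify before use.
interface WebhookRequestBody {
  Url: string;
  MessageStream: string;
  HttpHeaders: { Name: string; Value: string }[];
  Triggers: {
    Delivery: { Enabled: boolean };
    Bounce: { Enabled: boolean; IncludeContent: boolean };
    SpamComplaint: { Enabled: boolean; IncludeContent: boolean };
  };
}

function buildWebhookRequest(url: string, bearerToken: string): WebhookRequestBody {
  return {
    Url: url,
    MessageStream: "outbound",
    // The credential travels in a header, so it never appears in the URL string.
    HttpHeaders: [{ Name: "Authorization", Value: `Bearer ${bearerToken}` }],
    Triggers: {
      Delivery: { Enabled: true },
      Bounce: { Enabled: true, IncludeContent: false },
      SpamComplaint: { Enabled: true, IncludeContent: false },
    },
  };
}

// Receiving side: accept only requests carrying the expected Bearer token.
// A production check would normally use a constant-time comparison.
function isAuthorized(authorizationHeader: string | undefined, expectedToken: string): boolean {
  return authorizationHeader === `Bearer ${expectedToken}`;
}

const body = buildWebhookRequest("https://api.example.test/postmark-events", "example-token");
console.log(JSON.stringify(body.HttpHeaders));
```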
DQ-012: Per-Tenant Server Token Storage
Context: Each Postmark server has an API token used at runtime to send email. This token must be stored securely. The operations service follows the ESO pattern where secrets are delivered to the pod at startup via External Secrets Operator, not fetched from Secrets Manager at runtime.
| Option | Description | Trade-offs |
|---|---|---|
| A | Per-tenant Secrets Manager secrets | Each provisioning creates a new SM secret. ESO would need to sync all per-tenant secrets, or the service would need runtime SM read access (breaking the ESO pattern). IAM write access needed during provisioning. Scales poorly (N secrets per N tenants). |
| B | Encrypted in database (Aurora volume encryption only) | Tokens stored as plaintext columns, encrypted at the storage layer by Aurora’s KMS-backed volume encryption. Sufficient against disk theft, but plaintext to any DB user with SELECT access. No additional key management. |
| C | Encrypted in database (application-level encryption) | Service encrypts tokens with a partition-wide symmetric key before INSERT, decrypts after SELECT. The encryption key is a single static secret delivered via ESO at startup. DB dumps and SQL injection do not expose raw tokens. Key rotation is one key, not N. |
Recommendation: Option C — maintains the ESO pattern (one static secret), eliminates per-tenant SM writes, and adds defense-in-depth beyond Aurora volume encryption.
Decision: Option C. Per-tenant server tokens are encrypted with a partition-wide encryption key and stored in the serverTokenEncrypted column of tenant_email_config. The encryption key is created by CDK in Secrets Manager and delivered to the pod via ESO as extras.email.encryptionKey in HOCON config. Only the emailConfiguration service handles encryption/decryption; the emailJob service calls emailConfiguration.getActiveConfiguration() to receive the decrypted token as an in-memory value.
Applied to:
- infrastructure.md § AWS Secrets Manager (SM-3, SM-4 added; per-tenant SM writes removed; IAM-3 removed)
- functional.md § Email Configuration (Secret Storage section, internal service method)
- postmark-service.md § Authentication, § Step 6, § Provisioning Sequence
- architectural-scenarios.md § Scenario 1 (encrypt + persist), § Scenario 2 (getActiveConfiguration with decrypted token)
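The encrypt-before-INSERT, decrypt-after-SELECT pattern of Option C can be sketched with Node's built-in crypto module. This is an assumption-laden illustration rather than the service's implementation: the log does not fix the cipher for DQ-012 (AES-256-GCM is chosen here because DQ-R1-030 mentions AES-GCM), the stored layout of IV, tag, and ciphertext is invented, and the function names are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_BYTES = 12; // nonce size commonly used with GCM
const TAG_BYTES = 16; // GCM authentication tag size

// Encrypts a server token with the partition-wide key (32 bytes for AES-256).
// Output layout (an assumption of this sketch): base64(iv || tag || ciphertext).
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(IV_BYTES); // fresh nonce for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

// Reverses encryptToken; throws if the key is wrong or the value was tampered with.
function decryptToken(stored: string, key: Buffer): string {
  const raw = Buffer.from(stored, "base64");
  const iv = raw.subarray(0, IV_BYTES);
  const tag = raw.subarray(IV_BYTES, IV_BYTES + TAG_BYTES);
  const ciphertext = raw.subarray(IV_BYTES + TAG_BYTES);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Usage: the key stands in for the single secret delivered at startup.
const exampleKey = randomBytes(32);
const storedValue = encryptToken("example-server-token", exampleKey);
console.log(decryptToken(storedValue, exampleKey) === "example-server-token");
```

A value produced this way is what a column such as serverTokenEncrypted would hold, so a database dump alone does not reveal the raw token.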
Round 2: Infrastructure Implementation Decisions
DQ-013: IAM Role Extraction from Root Stack
Context: The root CDK stack (RootConfiguration) contains both DNS hosted zones and the AllowCreatingNSRecordsRole IAM role used for cross-account NS delegation. As this project renames the stack class to RootDnsStack and adds new stacks to the root application, the question is whether to extract the IAM role to a dedicated RootSecurityStack for cleaner separation of concerns.
| Option | Description | Trade-offs |
|---|---|---|
| A | Extract role to RootSecurityStack via two-step deploy | Cleaner separation. But: requires two-step deploy due to IAM physical name collision. Creates a 2-5 minute window where the role doesn’t exist. If step 2 fails, role is gone until manually recreated. |
| B | Extract role to RootSecurityStack via CloudFormation stack refactoring | Cleaner separation. No danger window — the role transfers ownership atomically without delete/recreate. But: requires a manual CloudFormation operation outside the CDK workflow, followed by CDK code realignment. Relatively new AWS feature; should be tested in non-production first. |
| C | Keep role in RootDnsStack | No migration work. Role is conceptually tied to DNS delegation (it enables writing NS records). A RootSecurityStack with a single resource doesn’t justify the extra work in this project’s scope. |
Recommendation: Option C for this project. Option B is the viable future path when extraction is justified.
Decision: Option C. The AllowCreatingNSRecordsRole stays in RootDnsStack (CloudFormation name: RootConfiguration). The role is functionally tied to DNS delegation and is acceptable in the DNS stack. Extraction is known to be operationally safe via CloudFormation stack refactoring (Option B), but adds complexity for no immediate functional benefit. When a RootSecurityStack is needed for additional security resources, use stack refactoring to move the role atomically. See root-refactor-analysis.md for the full analysis.
Applied to:
- infrastructure/root-refactor-analysis.md (full analysis)
- infrastructure/specification.md § Task 3 (root stack rename only, no role extraction)
- infrastructure/analysis.md § Root Configuration
Round R1-Phase1: External Resources Provisioning Decisions
DQ-R1-001 through DQ-R1-005 resolve the Open Questions in 1-external-resources/specification.md § 5. DQ-R1-007 is an additional Phase 1 decision captured in the same round (vault separation for the Free Kanban Tool server token). All entries follow the DQ-R1-NNN convention introduced in architecture-overview.md § 10.
DQ-R1-001: Drift Workflow Filename
Context: The CI workflow that asserts the live external-resource invariants needs a stable filename. Phase 1 originally raised three candidates (external-resources-drift.yml, phase-1-drift.yml, op-drift.yml).
Decision: external-resources-drift.yml. The filename describes the invariant asserted (drift of the external resources Arda consumes), not the phase that introduced the workflow. This keeps the filename stable across phases as the workflow evolves.
Applied to:
DQ-R1-002: Drift-Check TypeScript Module Location
Context: The drift-check module is dual-purpose: an operator runs it locally with 1Password DesktopAuth; CI runs it with OP_SERVICE_ACCOUNT_TOKEN. Two candidate locations existed in the infrastructure repo: scripts/drift-check.ts (alongside legacy script utilities) or tools/drift-check.ts (a fresh top-level convention).
Decision: tools/drift-check.ts. The module is operator-runnable in addition to CI-runnable, and the tools/ convention better matches the dual-purpose nature than scripts/ (which the prior implementation largely used for one-shot orchestrators). The tools/ convention is forward-compatible with the eventual move of scripts/gha-secrets/ to tools/gha-secret.ts (out of scope of this project but on the trajectory).
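The dual-purpose nature of the module can be pictured as a single entry point that picks its 1Password authentication from the environment. The sketch below is hypothetical: the log states only that operators use DesktopAuth and CI uses OP_SERVICE_ACCOUNT_TOKEN, so selecting the mode by the presence of that variable is an assumption, as are the type and function names.

```typescript
// Hypothetical auth selection for a dual-purpose tool; not taken from tools/drift-check.ts.
type OpAuth =
  | { mode: "service-account"; token: string } // CI path
  | { mode: "desktop" }; // operator path (local 1Password desktop auth)

// The environment is passed in explicitly so the function is easy to test.
function selectOpAuth(env: Record<string, string | undefined>): OpAuth {
  const token = env["OP_SERVICE_ACCOUNT_TOKEN"];
  return token ? { mode: "service-account", token } : { mode: "desktop" };
}

console.log(selectOpAuth({}).mode); // desktop
console.log(selectOpAuth({ OP_SERVICE_ACCOUNT_TOKEN: "example" }).mode); // service-account
```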
Applied to:
DQ-R1-003: Operator Runbook Sign-Off Mechanism
Context: REQ-OPS-003 requires the runbook to capture sign-off (operator name, date, deviations) so the document is itself the audit record. Three encoding options were considered: a code block, a YAML frontmatter field, or a designated Markdown section with a small table.
Decision: A designated ## Operator Sign-Off section containing a Markdown table with columns Step / Operator / Date / Deviations / Notes, with one pre-populated empty row per REQ-EXT-NNN. The table is human-readable, diff-friendly under git, and does not require new tooling. YAML frontmatter would conflict with Starlight’s required frontmatter schema and would not naturally express per-step rows.
Applied to:
- 1-external-resources/specification.md § Task 5
- current-system/oam/postmark-service/operator-runbook.md (Phase 1 deliverable)
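The sign-off section described in DQ-R1-003 could be scaffolded by a helper like the one below. It is a sketch only: the function is hypothetical, the requirement IDs passed in are placeholders for the REQ-EXT-NNN entries, and the column set is the one named in the decision.

```typescript
// Hypothetical generator for the runbook's sign-off section.
function signOffSection(requirementIds: string[]): string {
  const header = [
    "## Operator Sign-Off",
    "",
    "| Step | Operator | Date | Deviations | Notes |",
    "|---|---|---|---|---|",
  ];
  // One pre-populated empty row per requirement, filled in by the operator.
  const rows = requirementIds.map((id) => `| ${id} |  |  |  |  |`);
  return [...header, ...rows].join("\n");
}

console.log(signOffSection(["REQ-EXT-001", "REQ-EXT-002"]));
```

Because the output is plain Markdown, each completed row shows up as an ordinary line change in a git diff, which is the property the decision relies on.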
DQ-R1-004: Disposition of Legacy Parser-Gated Runbook
Context: The prior Phase-0 implementation maintained a parser-gated operator runbook (HUMAN-STEPS.md) under infrastructure/scripts/postmark-foundations/, whose state was enforced by a TypeScript parser as a CI gate. REQ-OPS-004 retires the parser gate entirely; the runbook in the documentation repo becomes the canonical operator artefact.
Decision: Delete the parser-gated runbook and its parser code in Phase 1, gated on the canonical runbook (current-system/oam/postmark-service/operator-runbook.md) being merged. Two-step ordering preserves operator availability during the cut-over: docs land first, then the legacy artefact is removed in the infrastructure PR (T-C6 in the task plan).
Applied to:
- 1-external-resources/specification.md § Task 5
- 1-external-resources/plan/task-plan.md § T-C6
- 1-external-resources/verification.md § V-OPS-002
DQ-R1-005: API-Surface Freshness Cadence
Context: The API observations note (postmark-api-observations.md) records observed Postmark API behaviour. Surface drift (Postmark adding/changing endpoints) would invalidate parts of the note. The question is when to refresh: annually, on every Postmark major-update post, or on first drift-test failure attributable to surface drift.
Decision: Refresh on first drift-test failure attributable to surface drift, augmented by an annual review. A scheduled-only cadence (annual) without the failure trigger would let regressions sit unnoticed for up to a year; a per-update cadence would create unnecessary documentation churn since most Postmark updates do not affect the small surface Arda uses. The combination keeps the note current where it matters and bounds staleness.
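The combined cadence can be expressed as a small predicate. This is a sketch under stated assumptions: the type and function names are hypothetical, and "annual" is approximated as 365 days.

```typescript
// Sketch of the combined cadence: refresh when a drift-test failure is
// attributed to surface drift, or when the annual review is overdue.
// All names here are illustrative assumptions.
const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000;

interface FreshnessInput {
  lastReviewed: Date;
  now: Date;
  // True when a drift-test failure has been attributed to API-surface drift.
  surfaceDriftFailure: boolean;
}

function refreshReason(input: FreshnessInput): "surface-drift" | "annual-review" | null {
  // The failure trigger takes priority so regressions do not wait for the calendar.
  if (input.surfaceDriftFailure) return "surface-drift";
  const ageMs = input.now.getTime() - input.lastReviewed.getTime();
  return ageMs >= ONE_YEAR_MS ? "annual-review" : null;
}
```

The failure trigger bounds how long a regression can sit unnoticed, while the annual check bounds staleness when no failure ever fires.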
Applied to:
- current-system/oam/postmark-service/postmark-api-observations.md (Phase 1 deliverable; freshness cadence noted in version-pin section)
DQ-R1-007: Vault Separation for Free Kanban Tool Server Token
Context: The Free Kanban Tool sends transactional email from freekanban.arda.ardamails.com. Its Postmark server token is the runtime sending credential: a leak yields the ability to send arbitrary email under that domain. The original cross-cutting design placed this item in Arda-SystemsOAM alongside the OAM-tier credentials (Postmark account tokens, IAC service-account tokens). OP_SERVICE_ACCOUNT_TOKEN, the GitHub Actions secret that authenticates CI to 1Password, is scoped read-only to Arda-SystemsOAM. So the Free Kanban server token sat in the same blast radius as every other OAM credential, contradicting the bounded-blast-radius framing in cross-cutting-design.md § 2.5.
Discovered: 2026-05-05, during the Phase 1 operator-walkthrough preparation. Re-running tools/drift-check.ts locally surfaced the placement and prompted a re-evaluation of vault scoping for runtime credentials.
| Option | Description | Trade-offs |
|---|---|---|
| A | Keep the item in Arda-SystemsOAM. | One vault to manage. But: Free Kanban Tool’s runtime credential is reachable by OP_SERVICE_ACCOUNT_TOKEN, which expands the blast radius of any CI compromise to include the live sending key. |
| B | Move the item to a dedicated Arda-CorporateOAM vault. The Free Kanban Tool’s runtime resolves the credential via its own SDK auth path; OP_SERVICE_ACCOUNT_TOKEN does not have read access to this vault. | One additional vault to provision. The Free Kanban server token is now isolated from the OAM-tier credentials; a CI / OP_SERVICE_ACCOUNT_TOKEN compromise does not yield it. Matches the rev1 design intent: deploy-time / OAM credentials in Arda-SystemsOAM; runtime sending credentials in instance-group-scoped vaults. |
Recommendation: Option B — bounded blast radius outweighs the single-vault simplicity.
Decision: Option B. The Free Kanban Tool’s Postmark server token lives at:
| Field | Value |
|---|---|
| Vault | Arda-CorporateOAM |
| Item title | Free-Kanban-Generator-Postmark-Server |
| Field | credential |
| Canonical reference | op://Arda-CorporateOAM/Free-Kanban-Generator-Postmark-Server/credential |
The vault was provisioned 2026-05-05 (operator action by Miguel). The 1Password item itself is created by Phase 3 (Corporate CLI Phase A writes the Postmark server token into the item the first time it runs). Phase 1 does not create or assert the existence of this item.
This decision establishes a vault-naming convention that future instance groups follow: Arda-<InstanceGroup>OAM for runtime sending credentials owned by that instance group. The existing partition-scoped vaults (Arda-DevOAM, Arda-StageOAM, Arda-DemoOAM, Arda-ProdOAM, Arda-SandboxKyle) already follow this pattern; Arda-CorporateOAM extends it to the new Corporate Resource Group.
Clarification on item naming within partition vaults. The Arda-SystemsOAM vault holds both Postmark accounts (Postmark-Prod and Postmark-NonProd) and therefore uses qualified item names (the account suffix disambiguates the two within the single vault). In contrast, each per-partition vault (Arda-DevOAM, Arda-ProdOAM, etc.) holds only one Postmark account reference — the one relevant to that partition — so the service-name-only item title Postmark is used (the vault name itself carries the environment). This follows the workspace CLAUDE.md 1Password vault convention: vaults are scoped by usage; store independently even when the value is currently shared.
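The naming convention and the canonical reference format can be sketched as two small helpers. The function names are hypothetical; the vault, item title, and field used in the example come from the decision table above.

```typescript
// Sketch of the naming convention. Function names are hypothetical; the
// vault, item, and field values below come from the decision table.
function runtimeVaultName(instanceGroup: string): string {
  // Arda-<InstanceGroup>OAM, per the convention stated above.
  return `Arda-${instanceGroup}OAM`;
}

function opReference(vault: string, item: string, field: string): string {
  // 1Password secret-reference form: op://<vault>/<item>/<field>
  return `op://${vault}/${item}/${field}`;
}

const freeKanbanRef = opReference(
  runtimeVaultName("Corporate"),
  "Free-Kanban-Generator-Postmark-Server",
  "credential",
);
```

Deriving the vault name from the instance group keeps the typed reference and the convention from drifting apart when a new instance group is added.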
Consequences:
- Phase 1: the typed reference FREE_KANBAN_POSTMARK_ITEM is removed from infrastructure/src/main/cdk/platform/one-password.ts. Phase 1 declares only the three items it creates (Postmark-Prod, Postmark-NonProd, IAC-SCRIPTS Service Account Token). tools/drift-check.ts and the Phase 1 V-PLAT-002 test surface shrink correspondingly.
- Phase 3: Corporate Updates (re)introduces the typed reference with the new vault, item title, and field. Phase 3’s spec explicitly enumerates the SDK auth path the Free Kanban Tool’s runtime uses to read the credential (out of scope of this project’s IaC, but documented for the Free Kanban Tool team).
- Threat model: cross-cutting-design.md § 2.1 line 39 (“attacker holding OP_SERVICE_ACCOUNT_TOKEN reads every credential reachable from Arda-SystemsOAM”) remains true; the Free Kanban server token is no longer in that set. § 2.5 is updated to explicitly call out the vault-separation guarantee.
Applied to:
- cross-cutting-design.md § 1 (defended threats): “Free Kanban Tool sending integrity” line updated to name the new vault.
- cross-cutting-design.md § 2.5: Phase A description, item name, and isolation note updated.
- cross-cutting-design.md § 4.1 secret-inventory table: vault and reference columns updated.
- cross-cutting-design.md § 4.5 rotation runbook: Free Kanban Tool token step updated with new vault.
- phases.md Phase 3 J1 interim mechanism: item name and vault.
- architecture-overview.md § 8 Postmark-server creation: interim-mechanism description updated.
- infrastructure/src/main/cdk/platform/one-password.ts: FREE_KANBAN_POSTMARK_ITEM removed (Phase 1).
- infrastructure/tools/drift-check.ts: import + ALL_OP_ITEMS entry removed.
- infrastructure/src/main/cdk/platform/platform.test.ts, infrastructure/tools/drift-check.test.ts: test surface adjusted from 4 items to 3.
- Phase 3 planning artefacts (when authored): the typed reference is reintroduced under platform/one-password.ts with the new vault, title, and field.
Round R1-Phase2: Root Updates Decisions
This round captures decisions made while planning Phase 2 (Root Updates).
DQ-R1-006: Locus of Cross-Zone NS-Delegation Writes
Context: The Root account owns the ardamails.com mail-root zone (Phase 2 introduces it) and the four arda.cards family zones. Child zones (arda.ardamails.com for Corporate in Phase 3; {partition}.ardamails.com per partition in Phase 4) need NS-delegation records in the parent zone. The question is which stack writes those NS records:
| Option | Description | Trade-offs |
|---|---|---|
| A | Root stack writes the per-child NS record set. The parent stack reads each child zone’s hostedZoneNameServers via cross-stack import or live API lookup and writes the NS record into the parent zone. | Centralises NS records in one stack. But: creates a Phase-2-on-Phase-3 (and Phase-2-on-Phase-4) deploy-order dependency; Root cannot complete its NS-delegation writes until every child zone has been provisioned. Inverts the natural “owner of a zone owns its delegation” intuition. |
| B | Child stack writes its own NS record into the parent zone using a cross-account assume-role pattern. Root owns only the assume-role IAM target (AllowCreatingNSRecordsRole); each child stack instantiates a WriteNSRecordsToUpstreamDns construct that runs a Lambda + Custom Resource in the child account, assumes the Root role, and writes the parent NS record. | Matches the existing arda.cards family pattern (every partition’s IngressStack already writes its own NS records into Root’s arda.cards family zones). Phase 2 is fully self-contained; Phase 3 / Phase 4 depend on Phase 2 only for the role and the parent zone existence. Slightly more constructs per child stack, but the constructs already exist. |
Recommendation: Option B — consistency with the existing pattern, clean dependency direction, no joint-deploy requirements between phases.
Decision: Option B. The WriteNSRecordsToUpstreamDns construct (at src/main/cdk/constructs/xgress/write-ns-records-to-upstream-dns.ts) is owned and instantiated by the child zone stack. It internally creates a Lambda execution role in the child account, a NodejsFunction from constructs/inline-lambdas/write-platform-root-ns-record.ts, and a cdk.CustomResource that on stack lifecycle events assumes the Root role (AllowCreatingNSRecordsRole, deterministic name from aws-configuration.ALLOW_WRITE_NS_RECORDS_ROLE.name) and writes / updates / deletes the NS record set in the parent zone. The child zone’s own hostedZoneNameServers token is passed in as the nameServers property — no live cross-zone lookup is required.
Consequences:
- Phase 2 does not write NS records for any child zone. Its scope is limited to: renaming the existing app/stack, declaring the ardamails.com zone, exporting the zone ID and the IAM role ARN, and adding the instances/Root/dns.ts declarative configuration.
- Phase 3 (Corporate) instantiates WriteNSRecordsToUpstreamDns against the ardamails.com zone with subdomain: "arda" and nameServers: arda.hostedZoneNameServers.
- Phase 4 (per-partition) does the same, once per partition, with subdomain: "<partition>" and nameServers: <partition>Zone.hostedZoneNameServers.
- Phase 2 → Phase 3 / Phase 4 dependency reduces to deploy order (Root must deploy first because the child stacks’ lambdas assume the Root role at deploy time).
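The per-child inputs listed above can be modelled as plain data. This sketch does not reproduce the real construct's signature: only the property names `subdomain` and `nameServers` appear in the decision text, while the interface name, `parentZoneName`, both helpers, and the example name servers are hypothetical.

```typescript
// Sketch only: `subdomain` and `nameServers` are named in the decision text;
// the interface, `parentZoneName`, and both helpers are hypothetical.
interface NsDelegationRequest {
  parentZoneName: string;
  subdomain: string;
  nameServers: readonly string[];
}

function delegationFor(subdomain: string, nameServers: readonly string[]): NsDelegationRequest {
  // The child passes in its own name servers, so no cross-zone lookup is needed.
  if (nameServers.length === 0) throw new Error(`no name servers for ${subdomain}`);
  return { parentZoneName: "ardamails.com", subdomain, nameServers };
}

function delegationRecordName(req: NsDelegationRequest): string {
  // The NS record set is written into the parent zone under the child's label.
  return `${req.subdomain}.${req.parentZoneName}`;
}

// Corporate (Phase 3) and one partition (Phase 4) use the same shape.
const corporate = delegationFor("arda", ["ns-1.example.net", "ns-2.example.net"]);
const prod = delegationFor("prod", ["ns-3.example.net", "ns-4.example.net"]);
```

In a real CDK stack the name servers would be unresolved tokens at synth time rather than literal strings, so a length check like the one above would only be meaningful inside the Lambda at deploy time.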
Applied to:
- 2-root-updates/specification.md: Phase 2 scope explicitly excludes NS-delegation writes.
- phases.md § Phase 2: deliverables list updated; the “NS-delegation for arda.ardamails.com” row replaced with the ardamails.com zone declaration.
- phases.md § Phase 3: Corporate Email stack deliverable extended to mention the WriteNSRecordsToUpstreamDns instantiation.
DQ-R1-008: Adopt vs. Create the existing ardamails.com Hosted Zone
Context: When cdk diff was run against the deployed RootConfiguration stack to validate the Phase 2 implementation (Gate 3), it surfaced an additive-only result, as expected by design. But a separate AWS investigation (motivated by an offhand challenge from the operator: “is the zone already there?”) revealed that the ardamails.com hosted zone already existed in the Root account as Z0721066239FWCD47EJDX, with two records (apex NS and SOA) and the four AWS-assigned nameservers (ns-2046.awsdns-63.co.uk, ns-944.awsdns-54.net, ns-158.awsdns-19.com, ns-1497.awsdns-59.org). The zone was auto-created by AWS Route53 Domains when the ardamails.com domain was originally registered through the registrar service.
The original Phase 2 implementation declared a brand-new r53.PublicHostedZone(this, "ArdamailsZone", {...}). Deploying as written would have created a second hosted zone for ardamails.com with a different NS set; the registrar would still have pointed at the original four nameservers, so the new zone would have been orphaned at the DNS level. The deploy-as-coded path was unsafe.
Discovered: 2026-05-05, after Gate 3 cleared (the cdk-diff against deployed also reported additive-only because both zones were missing from the deployed stack and the synthesized template added a new one — the diff couldn’t see the duplication risk because the duplicated resource is in Route53 but not in CloudFormation).
| Option | Description | Trade-offs |
|---|---|---|
| A | cdk import the existing zone into RootConfiguration (logical ID ArdamailsZone1DCDDC15). Zone becomes CDK/CFN-managed; no duplicate created; registrar’s NS chain preserved. | One-time operator action; zone properties must match the import target exactly; CFN’s IMPORT change-set type doesn’t allow Output additions or other resource modifications, so the deploy is two-phase (import-only template, then full deploy). |
| B | Reference the existing zone via r53.HostedZone.fromHostedZoneAttributes() and export its ID via CfnOutput without trying to manage it. | Zone stays outside CDK control; future record additions (root-level SPF, DMARC) require ad-hoc tooling. Doesn’t match the “Phase 2 declares the ardamails.com zone” intent in phases.md. |
Recommendation: Option A.
Decision: Option A. The CDK code at src/main/cdk/stacks/root/root-dns-stack.ts was extended in two ways:
- The ArdamailsZone declaration now sets comment: "HostedZone created by Route53 Registrar" (the AWS-default comment string on the live zone) so the IMPORT change-set is read-only (CFN reports Scope: [], no property writes).
- applyRemovalPolicy(cdk.RemovalPolicy.RETAIN) defends the imported zone against accidental cdk destroy of the production root stack.
The root-dns-stack.test.ts file’s V-ROOT-001 was extended with a strict-equality assertion that locks the synthesized resource block to the live zone’s properties (Name, HostedZoneConfig.Comment) plus the RETAIN retention policies. Future CDK code changes that drift from the import target fail at test time.
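A strict-match check of this kind can be sketched without any CDK dependency. The expected values mirror the properties named above; the trailing dot on Name, the helper names, and the key-order-insensitive comparison are assumptions and are not copied from the real V-ROOT-001 test.

```typescript
// Sketch of a strict-match check against a synthesized resource block.
// The trailing dot on Name and all helper names are assumptions.
function canonical(value: unknown): string {
  // Serialise with sorted object keys so key order does not affect equality.
  if (Array.isArray(value)) return `[${value.map((v) => canonical(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const entries = Object.keys(record)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(record[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

const expectedArdamailsZone = {
  Type: "AWS::Route53::HostedZone",
  Properties: {
    Name: "ardamails.com.",
    HostedZoneConfig: { Comment: "HostedZone created by Route53 Registrar" },
  },
  UpdateReplacePolicy: "Retain",
  DeletionPolicy: "Retain",
};

function matchesImportTarget(synthesizedResource: unknown): boolean {
  // Any extra, missing, or changed property makes the comparison fail.
  return canonical(synthesizedResource) === canonical(expectedArdamailsZone);
}
```

Matching the whole block, rather than individual properties, means an added property fails the check as well as a changed one.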
The deployment proceeded in two CFN operations:
- IMPORT change-set with a stripped template (deployed-state + just the ArdamailsZone resource added; no Outputs added, no other resource modifications). Executed cleanly: Action: Import, Replacement: null, Scope: []. Stack transitioned to IMPORT_COMPLETE.
- Normal cdk deploy with the full synthesized template, adding the ardamailsZone Output (publishing the arda-ardamails-zone CFN export) and reconciling CDKMetadata. Stack transitioned to UPDATE_COMPLETE. Final cdk diff reported zero differences.
Forward implications:
- Phase 3’s arda.ardamails.com zone is created fresh by the Corporate Email stack (no pre-existing zone in Route53); no IMPORT detour needed.
- Phase 4’s per-partition {partition}.ardamails.com zones are created fresh in each partition’s AWS account (no pre-existing zone); no IMPORT detour needed.
- Future zone-creation work in this project follows the standard cdk deploy flow.
Applied to:
- infrastructure/src/main/cdk/stacks/root/root-dns-stack.ts: comment + retention policy on ArdamailsZone.
- infrastructure/src/main/cdk/stacks/root/root-dns-stack.test.ts: V-ROOT-001 strict-match.
- infrastructure/CHANGELOG.md [2.29.0]: Added entry refined to mention the import.
- 2-root-updates/implementation/learnings.md, alternatives.md, skipped.md: project-completion byproducts.
Round R1-Phase3: Corporate Updates Decisions
This round captures decisions made while planning Phase 3 (Corporate Updates). All decisions resolved during the Pass-1 analysis (3-corporate-updates/analysis.md) on 2026-05-06.
DQ-R1-009: Postmark Domain-Verification Target (Parent vs Leaf)
Context: The Free Kanban Tool sends from freekanban.arda.ardamails.com. Postmark verifies sending domains via DKIM + Return-Path records published in DNS. The verification can target either the leaf sub-domain (freekanban.arda.ardamails.com) or the Corporate-zone parent (arda.ardamails.com). Verifying the parent makes leaves inherit DKIM through the parent’s signing key, removing the need for a per-leaf verification click as future Corporate consumers (HubSpot, marketing) are added.
| Option | Description | Trade-offs |
|---|---|---|
| A | Verify each leaf sub-domain individually as it is created. | Simple per-leaf isolation; failure of one leaf’s DKIM doesn’t affect siblings. But: each new Corporate consumer requires its own verification click (or API call) and its own DKIM rotation runbook. |
| B | Verify once at the Corporate-zone parent (arda.ardamails.com); leaves inherit DKIM via the parent’s signing key. | One verification step covers every current and future leaf under arda.ardamails.com. Single DKIM key rotation runbook. Aligns with Postmark’s parent-domain verification semantics. |
Recommendation: Option B — parent verification. Pre-decided 2026-05-05 during the Phase 1 operator-walkthrough preparation.
Decision: Option B. Phase 3’s PostmarkSendingDomain thin-wrapper registers arda.ardamails.com as the Sender Signature in PostmarkProd. The Corporate CLI invokes verifyDkim and verifyReturnPath against this parent. Leaf sub-domains (freekanban.arda.ardamails.com, future siblings) do not receive their own Sender Signature.
Applied to:
- 3-corporate-updates/analysis.md § “Note on what becomes ‘known to Postmark’” and gaps G-1, G-7, G-8.
- operator-domain-verification-checklist.md: the stub already pointed at this decision; the just-in-time expansion at implementation time formalizes the verification target.
- Phase 3 specification (Pass 2): the PostmarkSendingDomain configuration is arda.ardamails.com, not the leaf.
Implementation note (added post-Phase-3): The first implementation pass diverged from this decision — Phase A’s CLI honored it by accident while the CDK construct silently placed the DKIM TXT under the leaf sub-domain. Surfaced by Phase B post-deploy verification when Postmark’s DKIMPendingHost did not match the deployed FQDN. The root cause was that the decision was prose-only (this entry, a docstring, a runbook) with no value or function any code consumed. Resolved by Arda-cards/infrastructure PR #450 commit cd85527: a typed source-of-truth sendingDomainPlacement() function in platform/constructs/postmark/sending-domain.ts is now consumed identically by the CLI, the CDK construct, and the drift check; cross-seam assertions in tools/corporate-drift.ts verify Postmark’s reported state agrees with the placement function. Full narrative at 3-corporate-updates/implementation/dqr1009-divergence.md; the structural lesson is captured in 3-corporate-updates/implementation/learnings.md L-1.
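The structural fix, one typed value that every consumer reads, can be illustrated as follows. The function name `sendingDomainPlacement` comes from the note above, but its signature, the return shape, the selector value, and `assertPlacementAgrees` are hypothetical and do not reproduce the code in PR #450.

```typescript
// Sketch of a single typed source of truth for DKIM placement. Only the
// function name comes from the note above; the shape is hypothetical.
interface SendingDomainPlacement {
  // Domain registered with Postmark (the instance-group parent, not a leaf).
  signatureDomain: string;
  // FQDN where the DKIM TXT record is published for a given selector.
  dkimHost: (selector: string) => string;
}

function sendingDomainPlacement(parentZone: string): SendingDomainPlacement {
  return {
    signatureDomain: parentZone,
    dkimHost: (selector) => `${selector}._domainkey.${parentZone}`,
  };
}

// Cross-seam assertion: the host Postmark reports must equal the host
// the placement function computes.
function assertPlacementAgrees(
  placement: SendingDomainPlacement,
  selector: string,
  reportedHost: string,
): void {
  // Compare case-insensitively and ignore a trailing dot.
  const normalize = (host: string): string => host.toLowerCase().replace(/\.$/, "");
  const expected = placement.dkimHost(selector);
  if (normalize(expected) !== normalize(reportedHost)) {
    throw new Error(`DKIM host mismatch: expected ${expected}, Postmark reports ${reportedHost}`);
  }
}
```

With the CLI, the CDK construct, and the drift check all calling the same function, a leaf-versus-parent divergence becomes a failing assertion instead of a silent difference between two readings of the prose.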
The scope of this decision is the Corporate instance group: verification at the Corporate-zone parent (arda.ardamails.com); leaves under it inherit. Phase 4’s per-partition Sender Signatures apply the same “verify at the instance-group parent” pattern at their own level ({partition}.ardamails.com), with each partition having its own DKIM key for receiver-side reputation isolation. The ardamails.com apex is not a verification target. The Phase 4 granularity decision is pinned in DQ-R1-017 (Round R1-Phase4).
DQ-R1-010: Locus of Corporate’s NS-Delegation Write (Same-Account Case)
Context: DQ-R1-006 settled that the child zone owner writes the NS-delegation record upstream into the parent zone. The construct (WriteNSRecordsToUpstreamDns) was designed for the cross-account case where Application-Runtime partitions (in Alpha001 / Alpha002) write into Root’s ardamails.com zone (in platformRoot). For Phase 3, Corporate currently lives in platformRoot, the same account as Root. The question is whether Corporate’s stack still uses the assume-role construct or writes Route53 directly.
| Option | Description | Trade-offs |
|---|---|---|
| A | Always go through WriteNSRecordsToUpstreamDns and assume the role even when same-account; preserves the pattern uniformly across instance groups. | One extra STS AssumeRole call per deploy (~tens of milliseconds, negligible). Construct behavior is invariant under the future Corporate-account migration (architecture-overview § 6.4). DQ-R1-006’s “child writes upstream” intent is preserved. |
| B | Branch the construct so same-account writes skip the assume-role hop (direct Route53 write). | Slightly faster deploy; no STS call. But: introduces a same-account vs cross-account branch in the construct, expanding the test surface and creating a behavior change at the future Corporate-account migration moment. |
| C | Write the NS record from Root’s stack instead (revisits DQ-R1-006 for this case). | Simpler in the same-account case. But: re-opens DQ-R1-006 and breaks the “child owns its delegation” invariant. |
Recommendation: Option A — uniform pattern.
Decision: Option A. Phase 3’s CorporateMailDns stack instantiates WriteNSRecordsToUpstreamDns exactly as a partition would, with targetAccountId set to platformRoot’s account ID. The assume-role hop fires; the role grants ChangeResourceRecordSets on ardamails.com (the only zone the role’s allowedParentHostedZoneIds whitelists). The construct’s behavior is identical between the same-account (Phase 3 today) and cross-account (future Corporate-account migration) cases.
Applied to:
- 3-corporate-updates/analysis.md gap G-5.
- Phase 3 specification (Pass 2): CorporateMailDns stack composition.
DQ-R1-011: route-53-hosted-zone.ts → dns-zone.ts Migration Shape
Context: The existing constructs/xgress/route-53-hosted-zone.ts is the arda.cards-shaped hosted-zone construct (its overrideDomainName defaults to arda.cards). Phase 3 needs a generalized DnsZone construct that supports any registrable domain (ardamails.com, arda.ardamails.com, future). Two construct names cannot survive long-term; the question is the migration shape.
| Option | Description | Trade-offs |
|---|---|---|
| A | Rename in place: dns-zone.ts replaces route-53-hosted-zone.ts; existing callers updated in the same PR. | One PR, contained blast radius. The repo’s validateProps discipline catches missed callers at synth time. |
| B | Coexist for a transition window: dns-zone.ts is added; route-53-hosted-zone.ts becomes a thin re-export with a deprecation notice; followup PR removes the old name. | Smaller per-PR diff, easier review. But: two PRs land in sequence; the deprecation alias outlives any actual deprecation period. |
| C | Leave the old construct, add the new one; the old continues to serve arda.cards-family callers. | No caller migration. But: construct sprawl — two near-identical constructs co-exist indefinitely. |
Recommendation: Option A — rename in place.
Decision: Option A. The construct is renamed in the same Phase 3 PR; validateProps catches missed callers at synth, which is exercised by the repo’s CDK matrix in CI.
Applied to:
- 3-corporate-updates/analysis.md refactor R-1.
- Phase 3 specification (Pass 2): one task carries the rename + caller migration.
DQ-R1-012: Corporate Drift-Workflow Filename and Scope
Context: Phase 1 added external-resources-drift.yml (one workflow that exercises every external resource the platform consumes). Phase 3 introduces the first Corporate asset (Free Kanban Tool); future Corporate assets (HubSpot, marketing-site) follow. The question is whether to scope the drift workflow per asset or per instance group.
| Option | Description | Trade-offs |
|---|---|---|
| A | corporate-free-kanban-tool.yml (asset-specific, one workflow per asset). | One failure isolates to one workflow run. But: workflow file count grows linearly with Corporate assets; each new asset requires a new workflow file. |
| B | corporate-drift.yml (instance-group-scoped, one workflow that exercises every Corporate asset). | Workflow count proportional to instance groups, not assets. The driver script enumerates instances/Corporate/ and exercises each. New Corporate assets are picked up automatically. |
| C | <asset>-drift.yml per asset with a shared tools/corporate-drift.ts driver. | Combines the worst of A and B. |
Recommendation: Option B — instance-group-scoped.
Decision: Option B. The workflow file is corporate-drift.yml. The driver enumerates instances/Corporate/ and exercises each asset’s Postmark server, DNS records, and 1Password item. Failures open one issue per failed asset (label includes the asset name).
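The driver's loop can be sketched with injected checks. The three check names mirror the resources named in the decision; the types, function name, and label format are hypothetical, and the checks are synchronous here only to keep the sketch short (real checks would call Postmark, Route53, and 1Password asynchronously).

```typescript
// Sketch of an instance-group-scoped drift driver. Types, names, and the
// label format are hypothetical.
type CheckName = "postmark-server" | "dns-records" | "1password-item";
// A check returns null when no drift is found, or a finding string otherwise.
type AssetCheck = (asset: string) => string | null;

interface DriftIssue {
  asset: string;
  label: string;
  findings: string[];
}

const CHECK_ORDER: readonly CheckName[] = ["postmark-server", "dns-records", "1password-item"];

function runCorporateDrift(
  assets: readonly string[],
  checks: Record<CheckName, AssetCheck>,
): DriftIssue[] {
  const issues: DriftIssue[] = [];
  for (const asset of assets) {
    const findings: string[] = [];
    for (const name of CHECK_ORDER) {
      const finding = checks[name](asset);
      if (finding !== null) findings.push(`${name}: ${finding}`);
    }
    // One issue per failed asset, labelled with the asset name.
    if (findings.length > 0) {
      issues.push({ asset, label: `corporate-drift:${asset}`, findings });
    }
  }
  return issues;
}
```

Because the asset list is an input, a new entry under instances/Corporate/ is exercised without adding a workflow file.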
Applied to:
- 3-corporate-updates/analysis.md gap G-16.
- Phase 3 specification (Pass 2): workflow + driver.
DQ-R1-013: Phase A Failure Ordering for the Postmark Server Token
Context: Phase A of the Corporate CLI creates a Postmark server (which yields the Server API token), writes the token to 1Password, and writes public values to cdk.context.json. Postmark’s API surfaces the token once at server creation; it cannot be re-retrieved. If the 1Password write fails after the server is created, the token is unrecoverable from Postmark’s side.
| Option | Description | Trade-offs |
|---|---|---|
| A | Write to 1P first, then cdk.context.json; on 1P-write failure, roll back by calling Postmark’s delete-server API. | Atomic-looking. But: delete-server is a destructive operation that runs against the live Postmark account; a botched rollback (e.g., after partial state was already created) destroys observable history. The rollback path is harder to test than the forward path. |
| B | Hold the token in a process-local secret-handling buffer immediately on receipt; write to 1P with retries (exponential backoff, finite). Fail loudly on permanent 1P-write failure with the buffer’s redacted summary; manual operator action to recover. | Token is never persisted outside 1P. The 1P-write failure surfaces clearly with a redacted alert; the operator pastes the buffer summary into 1P via DesktopAuth or chooses to call delete-server deliberately as recovery. Forward path is the only tested path. |
Recommendation: Option B — buffer + retries.
Decision: Option B. The Corporate CLI implements a process-local secret buffer for the freshly issued server token. The 1P write retries up to N times with exponential backoff (defaults TBD by implementer; the spec lists the parameter). On exhaustion, the CLI exits with a clearly redacted summary that allows the operator to either manually paste the token into 1P (DesktopAuth) or invoke delete-server to reset. cdk.context.json is written after the 1P write succeeds; a 1P-write failure leaves cdk.context.json untouched.
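The buffer-plus-retries flow can be sketched as below. The attempt count, base delay, redaction format, and every name are assumptions, since the decision leaves the defaults to the implementer; the 1Password write and the sleep function are injected so the sketch makes no claim about any SDK.

```typescript
// Sketch of the buffer + retries flow. Defaults and names are assumptions.
function backoffDelaysMs(maxAttempts: number, baseMs: number): number[] {
  // Delay before each retry: base, 2*base, 4*base, ... (none before attempt 1).
  return Array.from({ length: Math.max(0, maxAttempts - 1) }, (_unused, i) => baseMs * Math.pow(2, i));
}

function redact(token: string): string {
  // Keep only the last four characters so the summary is safe to print.
  return token.length <= 4 ? "****" : `****${token.slice(-4)}`;
}

async function writeTokenWithRetries(
  token: string, // held only in process memory until the write succeeds
  writeTo1Password: (token: string) => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 5,
  baseMs: number = 500,
): Promise<void> {
  if (maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  const delays = backoffDelaysMs(maxAttempts, baseMs);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await writeTo1Password(token);
      return; // only after this point may the caller write cdk.context.json
    } catch (_err) {
      if (attempt === maxAttempts) {
        // Fail loudly with a redacted summary; recovery is a manual operator step.
        throw new Error(`1Password write failed after ${maxAttempts} attempts; token ${redact(token)}`);
      }
      await sleep(delays[attempt - 1] ?? baseMs);
    }
  }
}
```

The caller writes cdk.context.json only after this function resolves, so a rejected promise leaves that file untouched, which is the ordering the decision requires.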
Applied to:
- 3-corporate-updates/analysis.md gap G-9.
- Phase 3 specification (Pass 2): corporate-cli.ts Phase A semantics.
DQ-R1-014: cdk.context.json Commit Policy for Phase A’s Outputs
Context: Phase A writes postmark.free-kanban.serverId, .dkimSelector, .dkimKey, .returnPathTarget into cdk.context.json. These are public values (DKIM selector and key are published in DNS; serverId is non-sensitive). Standard CDK practice is to commit cdk.context.json so synth is deterministic on a fresh checkout.
| Option | Description | Trade-offs |
|---|---|---|
| A | Commit cdk.context.json — standard CDK practice; deterministic re-synth on a fresh checkout. | New developers / CI checkouts can cdk synth without re-running Phase A. The values are public and DNS-published; no leak surface. |
| B | Local-only with .gitignore; CI re-runs Phase A to repopulate. | Eliminates the commit-of-generated-values pattern. But: re-running Phase A in CI requires Postmark Account API credentials in CI’s environment, which is the opposite of the design intent (only OP_SERVICE_ACCOUNT_TOKEN should be in CI; everything else is resolved at runtime via the SDK). |
| C | Commit, but exclude the postmark.* keys via a custom serializer. | Adds tooling complexity for no benefit; the public values are not sensitive. |
Recommendation: Option A — commit.
Decision: Option A. cdk.context.json is committed to the repo with the postmark.free-kanban.* keys populated by Phase A. The keys are DNS-public; commit is safe. Re-running Phase A is idempotent and updates the file when Postmark issues a new value (e.g., a DKIM-key rotation).
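An idempotent update of the context file could look like the sketch below. The four field names come from the context paragraph; storing them as a structured value under a single flat key, the numeric type of serverId, and all type and function names are assumptions.

```typescript
// Sketch of an idempotent context update. The single flat key with a
// structured value is an assumption, as are the names and types.
interface PostmarkContextValues {
  serverId: number;
  dkimSelector: string;
  dkimKey: string;
  returnPathTarget: string;
}

type CdkContext = Record<string, unknown>;

function mergePostmarkContext(
  existing: CdkContext,
  key: string,
  values: PostmarkContextValues,
): { context: CdkContext; changed: boolean } {
  const previous = existing[key] as Partial<PostmarkContextValues> | null | undefined;
  const changed =
    previous == null ||
    previous.serverId !== values.serverId ||
    previous.dkimSelector !== values.dkimSelector ||
    previous.dkimKey !== values.dkimKey ||
    previous.returnPathTarget !== values.returnPathTarget;
  // Unchanged values return the same object so the caller can skip the file write.
  if (!changed) return { context: existing, changed: false };
  return { context: { ...existing, [key]: { ...values } }, changed: true };
}
```

Re-running with identical values reports `changed: false`, so the committed file only produces a diff when Postmark issues a new value, such as after a DKIM-key rotation.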
Long-term direction (status: interim): Using cdk.context.json as the channel through which a tool-side pre-deploy step hands DKIM / Return-Path values to CDK synth is the interim mechanism. It is consistent with CDK’s own provider-cache convention (flat key, structured value — see e.g. CDK’s auto-cached hosted-zone:account=...:region=... entries) and with how the Phase A Corporate CLI already populates this file. The target mechanism, when the platform adopts Lambda-backed Custom Resources more widely, is a CustomResource that calls Postmark inside CFN’s Create/Update lifecycle and returns DKIM values as resource attributes — removing the tool-side pre-deploy step and the operator handoff channel entirely. Migration is intentionally deferred: the Custom-Resource path is the long-term answer for all of Phase 3’s Corporate Signature, Phase 4’s per-partition Signatures, and any Phase 5b per-tenant Signatures, so the migration is a coordinated change rather than a per-phase one. Tracked outside this project; do not migrate piecemeal during Phase 4.
Applied to:
- 3-corporate-updates/analysis.md gap G-9 and refactor R-2.
- Phase 3 specification (Pass 2): the cdk.context.json task explicitly commits.
- Phase 4’s per-partition extension reuses this decision; see DQ-R1-025 for the write strategy under the namespaced partitionMail:<infra>:<partition> key.
DQ-R1-015: DMARC Reporting Mailbox (rua / ruf) for _dmarc.arda.ardamails.com
Context: The DMARC record at _dmarc.arda.ardamails.com (per architecture-overview § 5.2) has an initial monitoring policy of p=quarantine; sp=quarantine. The aggregate-report destination (rua=mailto:...) and forensic-report destination (ruf=mailto:..., optional) need a reachable mailbox to be meaningful.
| Option | Description | Trade-offs |
|---|---|---|
| A | dmarc-reports@arda.cards (existing arda.cards-family Google Workspace inbox). | Least operational cost; mailbox provisioning is one Google Workspace step. Reports aggregate over time and are reviewed periodically, not in real time. |
| B | A new dmarc-reports@ardamails.com mailbox, hosted independently. | Cleaner naming alignment with the mail-root domain. But: requires standing up MX records on ardamails.com, which is currently a sending-only domain; introduces inbound-mail handling that this project deliberately avoids. |
| C | No rua / ruf in v1; revisit when DMARC reporting becomes a routine input. | No mailbox to provision. But: DMARC monitoring (p=quarantine) is meaningless without a reporting destination; the policy effectively reduces to “do whatever your local rules say.” |
Recommendation: Option A — dmarc-reports@arda.cards.
Decision: Option A. The DMARC record carries rua=mailto:dmarc-reports@arda.cards. The mailbox is provisioned by the operator in Arda’s Google Workspace before Phase B deploy; the operator companion (G-18 in the analysis) captures the step at implementation time. ruf is omitted in v1 (forensic reports are noisier and not actioned today).
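The resulting TXT value can be assembled as in this sketch. The tag values come from the decision; the function and option names are illustrative.

```typescript
// Sketch: builds the DMARC TXT value described above. Function and option
// names are illustrative assumptions.
type DmarcPolicy = "none" | "quarantine" | "reject";

interface DmarcOptions {
  policy: DmarcPolicy;
  subdomainPolicy: DmarcPolicy;
  rua: string; // aggregate-report mailbox
  ruf?: string; // forensic-report mailbox, omitted in v1
}

function dmarcTxtValue(opts: DmarcOptions): string {
  const tags = [
    "v=DMARC1",
    `p=${opts.policy}`,
    `sp=${opts.subdomainPolicy}`,
    `rua=mailto:${opts.rua}`,
  ];
  if (opts.ruf !== undefined) tags.push(`ruf=mailto:${opts.ruf}`);
  return tags.join("; ");
}

const corporateDmarc = dmarcTxtValue({
  policy: "quarantine",
  subdomainPolicy: "quarantine",
  rua: "dmarc-reports@arda.cards",
});
```

One point worth confirming separately: because the report mailbox sits under a different organizational domain (arda.cards) than the policy record (ardamails.com), RFC 7489 § 7.1 describes receivers checking for an authorization TXT record at arda.ardamails.com._report._dmarc.arda.cards before sending aggregate reports there.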
Applied to:
- 3-corporate-updates/analysis.md gaps G-6 and G-20.
- Phase 3 specification (Pass 2): the DMARC TXT record content; the operator companion captures the prerequisite mailbox step.
- Operator companion at implementation time: explicit pre-deploy step.
DQ-R1-016: Reserved-Name Registry Scope at arda.ardamails.com
Context: Architecture-overview § 6.5 reserves arda at the ardamails.com level so future tenant slugs (in any partition) cannot collide. The question is whether to also reserve sub-domain slugs at the arda.ardamails.com level (freekanban, future hubspot, …): import them into a constants list and have partition validators reject them, or leave the arda.ardamails.com-level registry as documentation only.
| Option | Description | Trade-offs |
|---|---|---|
| A | Register freekanban (and future Corporate slugs) in platform/ari-configuration.ts; partition validators import the constant. | Cross-instance-group collision detection is mechanical. But: Application-Runtime partitions and the Corporate instance group become coupled through a shared constants list; any change to the Corporate registry forces a re-deploy of every partition (or at least invalidates their lint). |
| B | Documentation-only registry at arda.ardamails.com; partition validators do not import; corporate-cli.ts enforces the registry locally on Phase A entry by listing pre-existing Postmark Sender Signatures, servers, and 1P items. | No cross-instance-group import coupling. The CLI’s Phase A is the only writer; it can enforce uniqueness against live Postmark + 1P state. Adds a conflict-check requirement to the CLI. |
Recommendation: Option B — documentation-only with CLI-enforcement.
Decision: Option B. Partition validators do not import a Corporate slug list. corporate-cli.ts Phase A entry includes a conflict-check: it lists existing Postmark Sender Signatures (in the configured account), existing Postmark servers (in the configured account), and existing 1Password items (in Arda-CorporateOAM); if a name collision exists for the asset being created, the CLI exits before any state-mutating call. This catches both intra-Corporate collisions (two assets with overlapping names) and cross-instance-group collisions (a partition somehow registered an arda.ardamails.com slug).
Applied to:
- 3-corporate-updates/analysis.md gaps G-15 and G-17.
- Phase 3 specification (Pass 2): the conflict-check is a corporate-cli.ts Phase A acceptance criterion.
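The Phase A conflict-check described in DQ-R1-016 can be sketched as a pure function over state that has already been listed. This is a minimal sketch, not the corporate-cli.ts implementation: the ExistingState shape, the function names, and the case-insensitive exact-match rule are assumptions for illustration, and the listing calls to Postmark and 1Password are left out.

```typescript
// Hypothetical shapes: the real corporate-cli.ts types are not shown in this log.
interface ExistingState {
  senderSignatures: string[]; // names listed from the configured Postmark account
  servers: string[];          // server names listed from the same account
  vaultItems: string[];       // item titles listed from the 1Password vault
}

interface Conflict {
  source: "senderSignatures" | "servers" | "vaultItems";
  name: string;
}

// Case-insensitive comparison is an assumption made for this sketch.
function normalizeName(name: string): string {
  return name.trim().toLowerCase();
}

// Returns every collision between the asset about to be created and live state.
function findConflicts(assetName: string, state: ExistingState): Conflict[] {
  const wanted = normalizeName(assetName);
  const conflicts: Conflict[] = [];
  const sources: Conflict["source"][] = ["senderSignatures", "servers", "vaultItems"];
  for (const source of sources) {
    for (const existing of state[source]) {
      if (normalizeName(existing) === wanted) {
        conflicts.push({ source: source, name: existing });
      }
    }
  }
  return conflicts;
}

// Phase A entry guard: throws before any state-mutating call would be attempted.
function assertNoConflicts(assetName: string, state: ExistingState): void {
  const conflicts = findConflicts(assetName, state);
  if (conflicts.length > 0) {
    const detail = conflicts.map((c) => c.source + ":" + c.name).join(", ");
    throw new Error("Name collision for '" + assetName + "': " + detail);
  }
}
```

Running the guard before the first write keeps the check and the mutation in the same process, which is what lets the CLI act as the single writer for the Corporate registry.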
Round R1-Phase4: Runtime Platform Updates Decisions
This round captures decisions made during Phase 4: per-partition mail capability for the Application Runtime instance group. Decision IDs DQ-R1-017 through DQ-R1-022 are reserved for this round; all entries are resolved.
DQ-R1-017: Postmark Sender Signature Granularity per Partition
Context: Phase 4 brings per-partition mail capability online for the Application Runtime instance group across four active partitions (prod, demo, dev, stage; kyle excluded per DQ-R1-021). Each partition has its own mail sub-zone {partition}.ardamails.com. The question is whether each partition gets its own Postmark Sender Signature (with its own DKIM key, independent reputation), whether multiple partitions share a parent Signature in the spirit of DQ-R1-009 (which used parent verification for the Corporate instance group), and how per-tenant isolation fits in.
| Option | Description | Trade-offs |
|---|---|---|
| A | One Signature at ardamails.com (root); all partitions and Corporate inherit via parent verification. | One Signature covers the entire tree. But: reputation pools across every environment and the Corporate consumer; abuse on dev taints prod. Defeats the per-partition isolation goal. |
| B | One Signature per partition sub-zone (prod.ardamails.com, etc.); each carries its own DKIM key; leaves under each partition (per-tenant sub-domains) inherit. | Per-partition reputation independence. Matches the Postmark account split (Prod vs NonProd). Future per-tenant Signatures (for stricter isolation) can be added in Phase 5b without changing this layer. |
| C | One Signature per tenant from day one. | Strictest isolation. But: thousands of Signatures to manage; per-tenant verification cost; premature when tenant volume is zero. |
Recommendation: Option B — per-partition Signature, parent-verified at the partition sub-zone, leaves inherit.
Decision: Option B. Phase 4 registers one Postmark Sender Signature per active partition at the partition’s sub-zone ({partition}.ardamails.com). The Signature is anchored at the partition apex; per-tenant sub-domains within the partition inherit DKIM via the partition’s signing key. Production partitions (prod, demo) land on the PostmarkProd account; non-production partitions (dev, stage) on PostmarkNonProd. The first non-prod Signature (dev.ardamails.com) also satisfies Postmark Compliance’s pending approval for arda-nonprod. Per-tenant Signature granularity is deferred to Phase 5b when tenant volume exists.
Applied to:
- phases.md § Phase 4 Scope and Deliverables (Postmark Sender Signature rows).
- 4-runtime-platform-updates/goal.md Success Criteria #5 (first non-prod Signature verified).
- Phase 5b Email module design (whether to add per-tenant Signatures becomes a tractable choice once tenants exist).
DQ-R1-018: corporate-drift Rename and Scope
Context: Phase 3 introduced tools/corporate-drift.ts and .github/workflows/corporate-drift.yml, a scheduled drift check that asserts Postmark account state and DNS state for the Corporate instance group, with cross-seam Postmark↔placement-function assertions added by the DQ-R1-009 fix. Phase 4 adds per-partition Postmark Sender Signatures that need equivalent drift coverage. The question is whether corporate-drift is renamed and generalized (e.g., to mail-drift) to cover Corporate + every partition Signature, or kept as Corporate-only with a parallel new workflow added for the partition surfaces.
| Option | Description | Trade-offs |
|---|---|---|
| A | Rename corporate-drift to mail-drift; one workflow asserts Corporate + every partition Signature. | One workflow to maintain. Single failure-issue stream. But: future runtime-platform drift checks unrelated to email (e.g., asserting CloudFront cache configuration, asserting Lambda function counts) would need their own naming; mail-drift is mail-centric. |
| B | Keep corporate-drift unchanged. Add a new runtime-platform-drift workflow in parallel, covering partition surfaces. Share logic via reusable shell scripts or GitHub Actions composite actions. | Names reflect scope (Corporate is one instance group; runtime-platform is another). Future non-mail runtime-platform drift checks plug into runtime-platform-drift without mail-centric naming. Two workflows to maintain, but shared logic minimizes drift between them. |
Recommendation: Option B — parallel workflows with shared logic.
Decision: Option B. corporate-drift stays as-is. A new .github/workflows/runtime-platform-drift.yml and driver under tools/ (Phase 4 deliverable) asserts the cross-seam Postmark↔DNS↔placement-function invariants for every active partition Signature. The two workflows share reusable shell scripts or GitHub Actions composite actions so the drift-check logic doesn’t drift between them. Future runtime-platform drift checks unrelated to email plug into the same workflow without renaming.
Applied to:
- phases.md § Phase 4 Scope and Deliverables.
- 4-runtime-platform-updates/goal.md Success Criteria #6.
- 3-corporate-updates/implementation/suggestions.md S-5 (originally suggested the mail-drift rename; this decision supersedes that recommendation).
DQ-R1-019: Per-Partition Email Server-Token Encryption Key
Context: DQ-012 decided that per-tenant Postmark server tokens are encrypted application-side with a partition-wide symmetric key before INSERT, with the key in AWS Secrets Manager and delivered via ESO. DQ-202 fixed the on-disk format as an AES-256-GCM versioned envelope; DQ-203 specified that the SM value is a 64-byte HKDF input. Phase 4 must close three open sub-questions: (1) how the SM secret is named and declared in CDK, (2) what the envelope’s version prefix tracks (algorithm version, secret material version, or both), (3) how rotation works.
The full design is documented in 4-runtime-platform-updates/design/email-server-key-encryption.md. This entry summarizes the three sub-decisions.
| Option | Description | Trade-offs |
|---|---|---|
| A | Single-axis envelope vN, with vN coupling algorithm and secret material. Sibling SM secrets per rotation (-v1, -v2, …). | Operationally simple at the data-model layer. But: every rotation churns the code-side dispatch table, conflating algorithm cadence (rare) with material cadence (frequent). |
| B | Two-axis envelope a{N}.k{SM-VERSION-ID}. One SM secret per partition; rotation via update-secret (SM-native versioning). Hot-swap via two ExternalSecret mounts (AWSCURRENT + AWSPREVIOUS). Lazy + coroutine migration. SDK fallback for rare older versions. | Algorithm and material lifecycles cleanly separated. SM’s native versioning enables future AWS Rotation Lambdas natively. Operationally clean. But: the dispatch model is slightly more elaborate than Option A. |
Recommendation: Option B.
Decision: Option B. Phase 4 deploys one aws_secretsmanager.Secret per partition named {fqn}-I-EmailEncryptionKey (the -I- marker matches the convention as practiced for intra-partition resources), passwordLength: 64, RemovalPolicy.RETAIN. The Phase 5b on-disk envelope is a{N}.k{SM-VERSION-ID}:<base64-payload>; a{N} is the algorithm version (code-indexed; bumps require a release; never retired); k{...} is the AWS SM versionId of the SM version used at write time (runtime-indexed via two ExternalSecret mounts for AWSCURRENT and AWSPREVIOUS, plus a SM SDK fallback for rare older versions). Rotation is aws secretsmanager update-secret; migration is lazy on the first non-up-to-date read + a per-pod coroutine mop-up for the rest of the partition. Automated rotation via AWS SM Rotation Lambdas is enabled by this design and deferred to a future deliverable.
Applied to:
- cross-cutting-design.md Secret-handling table row for the Per-partition encryption key (line ~163) and the “Encryption key” rotation subsection (line ~293).
- phases.md § Phase 4 Deliverables: explicit row for {fqn}-I-EmailEncryptionKey.
- 4-runtime-platform-updates/goal.md Open Design Questions table row 3.
- 4-runtime-platform-updates/design/email-server-key-encryption.md: the canonical design document for this decision.
- Phase 5b email module (consumes the per-partition SM secret via two ExternalSecrets; implements TokenCipher with the dispatch + migration model).
- An operator runbook (Phase 4 deliverable) documenting the rotation procedure end-to-end.
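To make the two-axis envelope from DQ-R1-019 concrete, the sketch below parses a{N}.k{SM-VERSION-ID}:<base64-payload> and picks where the key material would come from (the AWSCURRENT mount, the AWSPREVIOUS mount, or the SDK fallback). The type and function names are hypothetical, the accepted character set for the version ID is an assumption, and no decryption is performed; the canonical design lives in email-server-key-encryption.md.

```typescript
interface Envelope {
  algorithmVersion: number; // the a{N} axis: code-indexed
  keyVersionId: string;     // the k{...} axis: SM versionId used at write time
  payload: string;          // base64 payload, left undecoded in this sketch
}

// Assumption: the version ID contains no ':' and the payload is standard base64.
const ENVELOPE_PATTERN = /^a(\d+)\.k([^:]+):([A-Za-z0-9+\/=]+)$/;

function parseEnvelope(value: string): Envelope {
  const match = ENVELOPE_PATTERN.exec(value);
  if (match === null) {
    throw new Error("Not a two-axis envelope");
  }
  return {
    algorithmVersion: parseInt(match[1], 10),
    keyVersionId: match[2],
    payload: match[3],
  };
}

function formatEnvelope(envelope: Envelope): string {
  return "a" + envelope.algorithmVersion + ".k" + envelope.keyVersionId + ":" + envelope.payload;
}

type MaterialSource = "current" | "previous" | "sdk-fallback";

// Chooses between the two mounted versions and the rare SDK lookup.
function resolveMaterialSource(
  envelope: Envelope,
  currentVersionId: string,
  previousVersionId: string | null
): MaterialSource {
  if (envelope.keyVersionId === currentVersionId) {
    return "current";
  }
  if (previousVersionId !== null && envelope.keyVersionId === previousVersionId) {
    return "previous";
  }
  return "sdk-fallback";
}
```

Keeping the algorithm index and the key version as separate fields is what allows a rotation to change only the k axis while the a axis stays tied to released code.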
DQ-R1-020: DNS-Provisioning + SM-Fallback IAM Roles
Context: Phase 4 introduces two new AWS capabilities that the operations component’s pod must exercise at runtime in each partition:
- Route53 ChangeResourceRecordSets on the partition’s mail sub-zone ({partition}.ardamails.com), consumed by the Phase 5b Email module for per-tenant DKIM / Return-Path / DMARC record provisioning.
- secretsmanager:GetSecretValue on {fqn}-I-EmailEncryptionKey, consumed by the Phase 5b TokenCipher SDK-fallback path (DQ-R1-019) for the rare case of decrypting envelopes whose k{SM-VERSION-ID} is older than AWSPREVIOUS.
Both permissions target partition-scoped resources and need to be available to the same workload (the operations pod). The decision is the IAM topology: which mechanism authenticates the pod to AWS, and where the permissions live.
Codebase precedent: A search of infrastructure/src/main/cdk/ shows IRSA is the sole adopted pod-identity mechanism (the partition EksStack already configures an OpenIdConnectProvider and exports {fqn}-EksPodRoleArn). Crucially, the exported pod role is never extended with workload-specific permissions anywhere in the codebase. The established pattern — exemplified by infrastructure/src/main/cdk/constructs/storage/image-asset-bucket.ts and public-upload-bucket.ts — is to create a fresh purpose-specific role with a trust policy that lets the pod role assume it via STS:
    const preSigningRole = new iam.Role(this, "ImageUploadPreSigningRole", {
      roleName: `${fqn}-ImageUploadPreSigningRole`,
      assumedBy: new iam.AccountPrincipal(account).withConditions({
        ArnLike: { "aws:PrincipalArn": clientRoleArnPattern }, // e.g. `arn:aws:iam::<acct>:role/<fqn>-*`
      }),
    });
    preSigningRole.addToPolicy(new iam.PolicyStatement({ /* purpose-specific perms */ }));

The pod federates into the partition pod role via IRSA at pod startup; the application code then performs sts:AssumeRole into the purpose-specific role at the call site (DQ-204 STS chain). Permissions live on the purpose role, not the pod role.
| Option | Description | Trade-offs |
|---|---|---|
| α | Extend the existing {fqn}-EksPodRole with the new Route53 and SM GetSecretValue statements. | Single role to audit per partition; one stack change. But: not how anything else in the codebase is structured — the pod role is treated as an STS-chain origin, not a permission accumulator. Adopting α here would diverge from established practice. |
| β | Create two fresh per-purpose roles ({fqn}-EmailDnsProvisioningRole, {fqn}-EmailEncryptionKeyFallbackRole) with trust policies that allow the partition pod role to assume them via STS. Mirrors ImageUploadPreSigningRole. | Aligns with codebase precedent. Cleanest least-privilege: each call path can only chain into the role it needs. Two new CDK roles and two new exports — a normal Phase 4 cost (Phase 4’s purpose is precisely to provision the partition infrastructure 5b needs). |
| γ | Adopt EKS Pod Identity (pods.eks.amazonaws.com) for these new roles. | Simpler trust-policy shape. But: not used anywhere else in Arda; introduces a second pod-identity mechanism alongside IRSA; no concrete benefit for this use case. Reject. |
| δ | Node-level instance profile / long-lived static keys. | Violates DQ-204; reject. |
Recommendation: Option β.
Decision: Option β. Phase 4 declares two fresh per-purpose IAM roles in each partition:
- {fqn}-EmailDnsProvisioningRole: permissions route53:ChangeResourceRecordSets and route53:ListResourceRecordSets on the partition’s mail hosted-zone ARN ({partition}.ardamails.com). Exported as {fqn}-EmailDnsProvisioningRoleArn. (route53:GetChange is intentionally omitted: it requires arn:aws:route53:::change/* resource scope rather than the hosted-zone ARN, and the Email module does not wait on Route53 propagation, because Postmark verification is API-driven via verifyDkim / verifyReturnPath, which probe DNS from Postmark’s side.)
- {fqn}-EmailEncryptionKeyFallbackRole: permission secretsmanager:GetSecretValue on ${encryptionKeySecret.secretArn}* (full SM-secret ARN; the trailing wildcard tolerates the SM-appended random 6-character suffix. SM versions are selected at API call time via VersionId / VersionStage, not encoded in the resource ARN). Exported as {fqn}-EmailEncryptionKeyFallbackRoleArn.
Both roles share the same trust-policy shape:
    assumedBy: new iam.AccountPrincipal(account).withConditions({
      ArnLike: { "aws:PrincipalArn": `arn:aws:iam::${account}:role/${fqn}-*` },
    }),

This mirrors ImageUploadPreSigningRole: any role in the partition that matches the {fqn}-* name prefix may assume the role. In practice, the partition’s pod role ({fqn}-EksPodRole) is the only such role that an operations-component pod can federate into; the ArnLike condition limits the blast radius to the partition without coupling the role declaration to the pod role’s exact name.
The Phase 5b Email module performs sts:AssumeRole into these roles at the call site — same DQ-204 STS-chain pattern that operations already uses for the image-upload presign flow.
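The effect of the shared ArnLike trust condition can be illustrated with a small matcher. This is an approximation written for this log, not IAM’s evaluator: it compares the six colon-delimited ARN components one by one with * and ? wildcards, case-sensitively. The account ID and the fqn values in the example are hypothetical, as is the assumption that an fqn combines an infrastructure name and a partition name.

```typescript
// Converts a pattern with * and ? wildcards into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$");
}

// Splits an ARN into its six components; the resource part may itself contain ':'.
function splitArn(arn: string): string[] {
  const parts = arn.split(":");
  if (parts.length < 6) {
    return parts;
  }
  return parts.slice(0, 5).concat([parts.slice(5).join(":")]);
}

// Approximates ArnLike: every component must match its pattern component.
function arnLike(arn: string, pattern: string): boolean {
  const arnParts = splitArn(arn);
  const patternParts = splitArn(pattern);
  if (arnParts.length !== patternParts.length) {
    return false;
  }
  for (let i = 0; i < arnParts.length; i++) {
    if (!globToRegExp(patternParts[i]).test(arnParts[i])) {
      return false;
    }
  }
  return true;
}

// Same pattern shape as the trust policy above.
function partitionRolePattern(account: string, fqn: string): string {
  return "arn:aws:iam::" + account + ":role/" + fqn + "-*";
}

// Hypothetical values for illustration only.
const devPattern = partitionRolePattern("123456789012", "Alpha001-dev");
const devPodRoleAllowed = arnLike("arn:aws:iam::123456789012:role/Alpha001-dev-EksPodRole", devPattern);
const prodPodRoleAllowed = arnLike("arn:aws:iam::123456789012:role/Alpha001-prod-EksPodRole", devPattern);
```

Under these assumptions the dev pod role matches and the prod pod role does not, which is the partition-scoped blast radius the decision describes. The pattern also matches any other role carrying the same prefix, so the first open follow-up (confirming the naming convention in every partition) still applies.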
Implementation route — construct reuse with byte-identical Root output. The decision above pins the behavior of the DNS-provisioning role (STS-chained, account-principal + ArnLike trust, partition-scoped permissions). The implementation route refined during analysis: rather than hand-rolling a fresh role, reuse the existing AllowCreatingNSRecordsRole construct (Phase 2; constructs/oam/allow-creating-ns-records-role.ts). Despite the name, the construct’s permissions are already generic Route53 record-set CRUD (ChangeResourceRecordSets, ListResourceRecordSets, ListHostedZonesByName) with allowedParentHostedZoneIds scope-tightening. What needs to change: the trust principal, today hard-coded to iam.ServicePrincipal("lambda.amazonaws.com") with an OrgID condition, must be parameterizable so the Phase-4 instantiation can supply iam.AccountPrincipal(account).withConditions({ ArnLike: ... }).
This generalization carries two hard constraints that must hold simultaneously:
- Byte-identical Root-account output. The existing Root-account instantiation in root-dns-stack.ts must produce a CloudFormation template that is byte-identical before and after the construct change. Guarded by a CDK Template.fromStack() snapshot equality unit test (in root-dns-stack.test.ts or allow-creating-ns-records-role.test.ts) that pins the Root resource shape; fails closed if the generalization regresses Root output.
- Verified zero drift in deployed Root. A post-deploy verification step (operator-driven; tracked as V-PART-NNN in verification.md) diffs the Root account’s currently-deployed CFN template against the synthesized output post-generalization. Expected diff is empty. Runs before any partition-mail deploy so the Root assertion holds with the construct-as-of-Phase-4 code.
The optional construct rename (e.g., AllowCreatingNSRecordsRole → AllowCreatingDnsRecordsRole) is name-only and reflects the construct’s already-generic Route53 record-set CRUD permissions (the “NSRecords” suffix is a Phase-2 historical artefact).
Update (2026-05-12, applied at design time): the rename can land in the same PR as the construct generalization, provided the CDK construct ID at the call site is preserved. CloudFormation logical IDs derive from the construct’s path (parent ID + construct ID), not from the class name. Concretely: the Root call site new AllowCreatingNSRecordsRole(this, "AllowCreatingNSRecordsRole", …) becomes new AllowCreatingDnsRecordsRole(this, "AllowCreatingNSRecordsRole", …) — the second argument (the construct ID string) stays unchanged, so the synthesized template’s logical IDs are unchanged, and the byte-identity guarantee holds.
The earlier note above (“If the rename is desired, it lands as a separate change after Phase 4’s role-reuse work is verified stable”) is superseded by this update. The Phase 4 design (analysis.md G-IAM-1 + specification.md T-I1 step 2) bundles the rename into PR #1 alongside the byte-identity guard (T-I2), with the call-site mitigation above documented inline. No cascading effect on Phase 4’s spec, requirements, or verification regime — the byte-identity test (T-I2 / V-IAC-002) catches any logical-ID regression regardless of whether it originates in the rename or elsewhere.
Open follow-ups (Phase 4 specification, not blocking the decision):
- Confirm the arn:aws:iam::${account}:role/${fqn}-* ArnLike pattern matches the partition’s actual pod-role naming convention in every partition (Alpha001 + Alpha002; spot-check both exports during specification).
- Decide whether the two roles live in the same partition-email stack or split (recommend: same stack, since both are Phase 4 partition-mail deliverables with the same lifecycle and the same RemovalPolicy).
- Confirm whether route53:ListResourceRecordSets is needed in addition to Change* for the Phase 5b idempotency / pre-check path (recommend: yes; the Email module checks existing records before issuing changes).
Applied to:
- phases.md § Phase 4 Deliverables: IAM-role row split into two; trust-policy and permission scoping updated to the STS-chain pattern.
- 4-runtime-platform-updates/goal.md Success Criteria #4, Deliverables list, Open Design Questions row 4 (now Decided).
- 5b-email-module/pre-existing-decisions.md: Phase 5b consumes the role ARNs and implements the STS-chain calls at the L1 / L2 boundary.
DQ-R1-021: Order of Partition Rollout
Context: Phase 4 fans out across four active partitions. The open points are the order of the rollout waves, whether to include the kyle partition (which is suspended at Phase 4 start), and how the order aligns with Phase 5b’s deployment cadence.
| Option | Description | Trade-offs |
|---|---|---|
| A | dev, kyle first; then stage, demo; then prod. Per the original phases.md Phase 5b recommendation. | Standard non-prod-first cascade. But: kyle is suspended at Phase 4 start; including it would mean provisioning a partition that has no operational use case. |
| B | dev → stage → demo → prod. Exclude kyle entirely. | Matches operational reality (kyle has no live use). dev first still satisfies the arda-nonprod Postmark account-approval prerequisite. Production lands last after non-prod wave validates the pattern. |
Recommendation: Option B.
Decision: Option B. Phase 4 rolls out to dev, stage, demo, prod. The kyle partition is excluded from Phase 4 (suspended; the kyle.ardamails.com sub-zone is not provisioned). kyle stays reserved at the ardamails.com level so it cannot be appropriated as a tenant slug while the partition is suspended; the partition can be re-introduced later by replaying the per-partition deploy procedure if it resumes operation. Phase 5b inherits the same order.
Amendment (2026-05-13) — partial-ordering refinement: the original total order dev → stage → demo → prod is relaxed to the partial order dev → {stage || demo} → prod. dev must go first (it satisfies the arda-nonprod Postmark account-approval prerequisite and validates the design end-to-end in the lowest-blast-radius partition); prod must go last (production deploys after the non-prod wave validates the pattern); but stage and demo carry no technical dependency on each other and may be rolled out in parallel once dev is verified. Rationale: per-partition deploys are independent at the AWS level (separate CFN stacks, no shared resources — see goal.md Constraint #1), and the Postmark Compliance gate (arda-nonprod account approval) only blocks the second non-prod Sender Signature on the arda-nonprod account; it does not block demo or prod (which are on PostmarkProd). Phase 5b inherits the same partial order.
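The partial order dev → {stage || demo} → prod from the amendment can be expressed as a prerequisite table. The sketch below is a hypothetical gating helper, not part of amm.sh: given the partitions already verified, it returns the ones that may be deployed next. Treating prod as dependent on both stage and demo is this sketch’s reading of “prod must go last”.

```typescript
type Partition = "dev" | "stage" | "demo" | "prod";

// Prerequisites per partition; stage and demo depend only on dev.
const PREREQUISITES: Record<Partition, Partition[]> = {
  dev: [],
  stage: ["dev"],
  demo: ["dev"],
  prod: ["stage", "demo"],
};

const ALL_PARTITIONS: Partition[] = ["dev", "stage", "demo", "prod"];

// Returns the partitions that are not yet verified and whose prerequisites all are.
function deployableNext(verified: Partition[]): Partition[] {
  return ALL_PARTITIONS.filter(function (candidate) {
    const alreadyDone = verified.indexOf(candidate) !== -1;
    const prerequisitesMet = PREREQUISITES[candidate].every(function (required) {
      return verified.indexOf(required) !== -1;
    });
    return !alreadyDone && prerequisitesMet;
  });
}
```

For example, once dev is verified the helper offers stage and demo together, reflecting that the two may roll out in parallel; kyle is absent from the table because it is excluded from Phase 4.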
Applied to:
- phases.md § Phase 4 Recovery / partial-failure handling: recommended deploy order.
- phases.md § Phase 5b Recovery / partial-failure handling: recommended order.
- 4-runtime-platform-updates/goal.md Scope, Success Criteria #1, and Open Design Questions row 5.
DQ-R1-022: Operator CLI Shape for Phase 4
Context: Phase 3 introduced tools/corporate-cli.ts (a TypeScript CLI for the Corporate instance group’s two-phase Postmark + DNS provisioning). Phase 4 needs an equivalent operator surface for per-partition mail provisioning. The question is whether to generalize corporate-cli over a partition argument, introduce a parallel partition-mail-cli, or integrate the Phase 4 work into the existing amm.sh operator script that already deploys partition-level resources.
| Option | Description | Trade-offs |
|---|---|---|
| A | Generalize corporate-cli to take an asset+partition pair. Both Corporate and partition mail work flow through the same CLI. | One CLI surface. But: stretches corporate-cli beyond its Corporate-instance scope; the partition path mixes with the Corporate path in implementation. |
| B | Introduce a parallel tools/partition-mail-cli.ts. Each instance group has its own CLI. | Scope-aligned naming. But: duplicates corporate-cli’s structure (idempotency, retries, redaction, conflict checks); adds a maintenance surface. |
| C | Integrate the Phase 4 partition-mail work into amm.sh (the existing Application Runtime deploy script). Phase 4 work follows amm.sh’s rules (idempotency, security, pre-flight checks, partition selection). Extract reusable bash + TypeScript utilities from corporate-cli so both amm.sh and corporate-cli share logic. | Aligns with existing operator surface for partition deploys. Familiar workflow. Reusable utilities prevent duplication across the two scripts. Requires refactoring Phase 3 deliverables to extract the shared utilities. |
Recommendation: Option C — amm.sh integration with shared utilities.
Decision: Option C. Phase 4 partition-mail provisioning is part of the product runtime platform deployment and is invoked through amm.sh (and its rules: idempotency, security, pre-flight checks, partition selection). Not a standalone partition-mail-cli. Reusable sub-scripts / utilities are extracted from corporate-cli so both amm.sh’s partition path and corporate-cli can share logic; this includes refactoring Phase 3 deliverables as needed to keep each script’s complexity bounded.
Implementation route — TypeScript helpers under tools/, invoked from amm.sh via ts-node. Phase 4 stays with Phase 3’s imperative-then-declarative (Phase A / Phase B) pattern:
- The extracted utilities (Postmark Account API client, idempotent list-then-create, retry / backoff, output redaction, conflict-check) live as TypeScript modules under tools/lib/ (or equivalent shared location).
- A new entry script, tools/register-partition-mail-signature.ts, composes these utilities into Phase 4’s partition-mail Phase-A flow: read the Postmark account-level token from the partition’s Arda-{Env}OAM 1P vault (using the Phase 1 1P SDK helper), call the Postmark Account API to register the {partition}.ardamails.com Sender Signature (idempotent: list-then-create), capture the DKIM selector / public key / Return-Path target, and write those values into cdk.context.json (committed). The same utilities back corporate-cli’s Phase-A flow.
- amm.sh’s direct calls collapse to three: (i) op read the Postmark account-level token (bash; remains in amm.sh for GHA ::add-mask:: hygiene); (ii) npx ts-node tools/register-partition-mail-signature.ts <infrastructure> <partition> (Phase A: Postmark API + context write); (iii) cdk deploy ${infrastructure}-${partition}-Email --parameters PostmarkAccountToken=… (Phase B: declarative CDK deploy).
- No bash reimplementation of Postmark / 1P logic. amm.sh stays a thin orchestrator; the TS scripts hold the imperative logic.
- corporate-cli retains its TS entry-point and its Corporate-specific responsibilities (Free Kanban Tool server provisioning, 1P writes for the server token); only the shared helpers move into tools/lib/.
- CR Lambda migration explicitly deferred (the “future architecture” called out in Phase 3: the PostmarkSendingDomain thin-wrapper’s public surface is designed to be invariant under that migration). Phase 4 does not pull it forward; doing so would materially expand scope without a forcing function. Future migration is a construct-internals change isolated to platform/constructs/postmark/.
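The idempotent list-then-create step named in the Phase-A flow above can be sketched as follows. Everything here is illustrative: the SenderSignature fields, the SignatureClient interface, and the in-memory fake are hypothetical stand-ins, and the client is synchronous only to keep the sketch short (a real Postmark Account API client would be asynchronous, with retry / backoff around each call).

```typescript
// Hypothetical shapes; the real client's field names are not shown in this log.
interface SenderSignature {
  domain: string;
  dkimSelector: string;
  returnPathTarget: string;
}

interface SignatureClient {
  list(): SenderSignature[];
  create(domain: string): SenderSignature;
}

interface EnsureResult {
  signature: SenderSignature;
  created: boolean; // false when an existing Signature was reused
}

// List first and create only when nothing matches, so re-runs are no-ops.
function ensureSignature(client: SignatureClient, domain: string): EnsureResult {
  const wanted = domain.toLowerCase();
  const existing = client.list().filter(function (s) {
    return s.domain.toLowerCase() === wanted;
  });
  if (existing.length > 0) {
    return { signature: existing[0], created: false };
  }
  return { signature: client.create(domain), created: true };
}

interface FakeClient extends SignatureClient {
  createCalls: number;
}

// In-memory stand-in used to exercise the flow without network access.
function makeFakeClient(): FakeClient {
  const store: SenderSignature[] = [];
  const fake: FakeClient = {
    createCalls: 0,
    list: function (): SenderSignature[] {
      return store.slice();
    },
    create: function (domain: string): SenderSignature {
      fake.createCalls += 1;
      const signature: SenderSignature = {
        domain: domain,
        dkimSelector: "selector-" + store.length,
        returnPathTarget: "return-path.example.invalid",
      };
      store.push(signature);
      return signature;
    },
  };
  return fake;
}
```

Calling ensureSignature twice for the same partition domain issues a single create call in this sketch; the second run reuses the listed Signature, so the values written to cdk.context.json would stay stable across re-runs.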
Applied to:
- phases.md § Phase 4 Scope and Deliverables: “Operator surfaces integrated into amm.sh” bullet; “amm.sh-integrated partition-mail steps” deliverable row.
- 4-runtime-platform-updates/goal.md Open Design Questions row 6.
- Phase 4 implementation work: includes refactoring Phase 3’s corporate-cli to extract reusable utilities consumed by both amm.sh and corporate-cli.
Pre-design follow-ups closed (Round R1-Phase4)
After DQ-R1-017..022 were resolved, planning surfaced eight smaller follow-ups (B1..B5, C1..C3) that needed pinning before Phase 4 design could start. Each is “pick the default and move on” rather than load-bearing; collectively they are recorded here for traceability without individual DQ-R1-NNN entries. Full text in 4-runtime-platform-updates/goal.md § Pre-design follow-ups.
| ID | Item | Resolution |
|---|---|---|
| B1 | Phase 5a TokenCipher location | Ships in common-module as a general-purpose encrypted-field utility (not Email-specific) |
| B2 | Postmark account-token deploy-time delivery | δ.1 — amm.sh reads via op, passes to cdk deploy as NoEcho parameter; partition-email stack uses SecretValue.cfnParameter(). Mirrors partitionSecrets.cfn.yaml |
| B3 | amm.sh extraction scope from corporate-cli | Minimal: extract only what amm.sh’s partition-mail steps need; backfill on demand |
| B4 | kyle reservation registry | Extend the Phase 3 mechanism used to reserve arda at the ardamails.com level |
| B5 | Cross-partition deploy gating in CI | Operator-enforced via amm.sh; no tools/cdk-runner.js matrix change |
| C1 | CDK stack name | ${infrastructure}-${partition}-Email (parallels existing -Secrets, -Amplify stacks); immutable — locked at first deploy |
| C2 | Per-partition DMARC reporting mailbox | Reuse dmarc-reports@arda.cards for all partitions (DMARC report content already identifies the source domain) |
| C3 | runtime-platform-drift schedule + labels | Daily cron; failure-issue labels drift + runtime-platform; mirrors corporate-drift shape |
These resolutions also drive a new Phase 4 deliverable: current-system/oam/security/secret-delivery-pattern.md, documenting the canonical op → amm.sh → CFN NoEcho parameter → SM secret → consumer flow with partitionSecrets.cfn.yaml and the Phase 4 Postmark token as worked examples.
DQ-R1-023: Per-Tenant Postmark Sender Signature Introduction (Phase 5b)
Status: Open (to be confirmed at Phase 5b planning). No Phase 4 dependency; Phase 4 provisions the enabling infrastructure (EmailDnsProvisioningRole, partition mail sub-zone) regardless of which way this decision goes.
Context: DQ-R1-017 (Round R1-Phase4) decided that Phase 4 ships one Postmark Sender Signature per partition ({partition}.ardamails.com) and defers per-tenant Signatures to Phase 5b. The Phase 4 design works for sending — tenants sending from {config}.{tenant}.{partition}.ardamails.com use the partition’s DKIM key via Postmark sub-domain inheritance and DMARC relaxed alignment. The trade-off: all tenants in a partition share the partition’s DKIM-domain reputation at the receiver side (Gmail, Microsoft, Yahoo, etc., track reputation by the DKIM d= domain, not by the Postmark Server identifier).
The question is whether Phase 5b should introduce per-tenant Sender Signatures to give per-tenant reputation isolation, and if so, on what schedule.
| Option | Description | Trade-offs |
|---|---|---|
| α | Status quo — all tenants in a partition share the partition Signature; per-tenant Servers exist for token / activity-log isolation but DKIM-domain reputation is shared. | No additional Phase 5b work for sending. But: one bad tenant degrades reputation for every tenant in that partition. No remediation path for tenants with persistent bounce / spam issues. |
| β | Per-tenant Signature from v1 — every tenant onboarded in Phase 5b gets its own Sender Signature registered via the Postmark Account API; per-tenant DKIM TXT + Return-Path CNAME records written at tenant onboarding via EmailDnsProvisioningRole. | Best reputation isolation. But: additional tenant-onboarding cost (Postmark API call + DNS write per tenant); operational surface grows linearly with tenant count. |
| γ | Hybrid — opt-in per-tenant Signature — Phase 5b ships with partition Signature as the default; tenants flagged as high-volume or reputation-sensitive (operator-driven or automated based on send volume) are migrated to per-tenant Signatures on demand. | Balances cost and isolation. But: introduces an operator decision per tenant; migration path needs design. |
| δ | Remediation-only per-tenant Signature — partition Signature is the default; per-tenant Signature is the remediation when a tenant generates a reputation incident. | Lowest operational cost. But: by the time remediation is needed, reputation damage has already affected siblings. |
Recommendation: To be made at Phase 5b planning, informed by:
- Actual tenant send volume and bounce / spam rates in Phase 5b’s pilot phase.
- Postmark’s own guidance at the time (their best practices may evolve).
- Compliance / contractual requirements specific to tenant cohorts (e.g., enterprise tenants may contractually require reputation isolation).
Phase 4 work that this affects: None. The EmailDnsProvisioningRole (G-IAM-3 in 4-runtime-platform-updates/design/analysis.md) is provisioned regardless — it is the explicit enabler for whichever way this decision goes. Phase 4 ships the infrastructure; Phase 5b decides when to exercise it.
Applied to:
- 5b-email-module/pre-existing-decisions.md: listed as a Phase 5b decision pending.
- phases.md § Phase 5b: referenced in the Phase 5b open design questions.
- 4-runtime-platform-updates/design/analysis.md § 5.5: explicit forward-reference from C-Postmark-Sending’s out-of-scope edges.
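The Context of DQ-R1-023 relies on DMARC relaxed alignment to let tenant sub-domains send under the partition’s DKIM domain. The sketch below illustrates that check under a loudly simplified assumption: the organizational domain is taken to be the last two labels, which holds for ardamails.com but not in general (a real implementation would consult the Public Suffix List). The tenant and config labels in the example are hypothetical.

```typescript
// Assumption: organizational domain = last two labels (true for ardamails.com only).
function organizationalDomain(domain: string): string {
  const labels = domain.toLowerCase().replace(/\.$/, "").split(".");
  return labels.slice(-2).join(".");
}

// Relaxed alignment: the From domain and the DKIM d= domain only need to
// share an organizational domain.
function isRelaxedAligned(fromDomain: string, dkimDomain: string): boolean {
  return organizationalDomain(fromDomain) === organizationalDomain(dkimDomain);
}

// Strict alignment requires the two domains to be identical.
function isStrictAligned(fromDomain: string, dkimDomain: string): boolean {
  return fromDomain.toLowerCase() === dkimDomain.toLowerCase();
}

// Hypothetical tenant sending domain signed with the partition's d= domain.
const tenantFrom = "orders.acme.prod.ardamails.com";
const partitionDkim = "prod.ardamails.com";
const relaxedOk = isRelaxedAligned(tenantFrom, partitionDkim);
const strictOk = isStrictAligned(tenantFrom, partitionDkim);
```

In this example the relaxed check passes while the strict one does not, which is why every tenant in a partition ends up sharing the reputation attached to the partition’s d= domain until per-tenant Signatures exist.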
DQ-R1-024: EmailEncryptionKey initial-value generation mechanism
Status: Superseded by DQ-R1-032 (originally Resolved, Round R1-Phase4).
Superseded (Round R1-Phase5b). Option A (CFN-native GenerateSecretString) was implemented in Phase 4 but proved undeployable for the email module: it can only emit a flat {"key": "<random chars>"} value, whereas the operations module’s MaterialRegistryRefresher requires a UUID-keyed { "<versionId>": "<base64 64-byte key>" } registry and fails fast on anything else. DQ-R1-032 reverses this decision in favour of Option C (the δ.1 NoEcho-parameter path, sourced from 1Password). The original analysis below is retained for the audit trail.
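The shape mismatch described in the note above can be made concrete with a small check. This is a hedged TypeScript sketch, not the operations module's code: `isVersionedKeyRegistry` is a hypothetical name, and it assumes every entry is a UUID key mapped to standard base64 that decodes to exactly 64 bytes.

```typescript
// Hypothetical helper: returns true only for the UUID-keyed registry shape
// { "<versionId>": "<base64 64-byte key>" } described in the note above.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function isVersionedKeyRegistry(secretString: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(secretString);
  } catch {
    return false; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const entries = Object.entries(parsed as Record<string, unknown>);
  if (entries.length === 0) return false; // assumption: an empty registry is unusable
  return entries.every(([versionId, material]) => {
    if (!UUID_RE.test(versionId) || typeof material !== "string") return false;
    const bytes = Buffer.from(material, "base64");
    // Re-encoding guards against Buffer's lenient base64 decoding.
    return bytes.length === 64 && bytes.toString("base64") === material;
  });
}
```

Under these assumptions the flat value that GenerateSecretString emits fails the check because its only key, "key", is not a UUID.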
Context: DQ-R1-019 decided what the SM secret looks like (one aws_secretsmanager.Secret per partition, passwordLength: 64, RemovalPolicy.RETAIN, two-axis envelope a{N}.k{SM-VERSION-ID}). It did not pin how the 64-byte initial value gets generated at first deploy. Three mechanisms are available; the choice affects both implementation cost and the audit story for “immutable post-launch” (V-PART-016).
| Option | Mechanism | Trade-offs |
|---|---|---|
| A | CFN-native GenerateSecretString on AWS::SecretsManager::Secret. CDK declares new sm.Secret({ generateSecretString: { passwordLength: 64, excludePunctuation: true } }). CFN generates the value at first deploy; re-deploys are no-ops because the GenerateSecretString block is identity-stable. | Zero custom code. Existing precedent in the repo (partition-secrets.ts uses it for SentryScrubSalt with identical shape). CFN guarantees the value never regenerates unless the operator explicitly forces it. describe-secret versionId before/after re-deploy is identical — V-PART-016 verifies via CFN’s own behaviour. |
| B | CDK Custom Resource (inline Lambda) that generates the random value at deploy time and writes it to SM via the SDK. | Maximum flexibility (custom entropy source, key-derivation steps). But: Lambda boilerplate, opaque error handling, no out-of-the-box immutability story — the Lambda’s behaviour on re-deploy is whatever the author writes. Precedent exists for asymmetric keys (generate-signing-key.ts) but is heavier than required here. |
| C | Pre-Deploy script + NoEcho CFN parameter (δ.1 pattern). The Pre-Deploy script generates entropy locally, op writes it into 1Password, then amm.sh reads it and passes via --parameters EmailEncryptionKey=$value (NoEcho). | Mirrors the EmailPostmarkAccountToken delivery path (δ.1). Cleanly separates “operator-supplied” from “CFN-generated” secrets at the operational boundary. But: introduces an out-of-band 1P write that V-PART-016 would have to assert idempotency for; mixes secret-delivery patterns inside the same stack. |
Recommendation: Option A.
Decision: Option A — CFN-native GenerateSecretString. PartitionEmailStack declares the encryption-key secret with the same shape as partition-secrets.ts’s SentryScrubSalt:

    new sm.Secret(this, "EmailEncryptionKey", {
      secretName: `${publishingPrefix}-I-EmailEncryptionKey`,
      generateSecretString: {
        secretStringTemplate: JSON.stringify({}),
        generateStringKey: "key",
        passwordLength: 64,
        excludePunctuation: true,
      },
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

The δ.1 NoEcho pattern is reserved for the EmailPostmarkAccountToken (externally supplied from 1Password). EmailEncryptionKey is CDK-internal generation — no NoEcho parameter, no Custom Resource, no separate amm.sh step.
Applied to:
- 4-runtime-platform-updates/design/specification.md — T-I4 PartitionEmailStack body refers to this decision when declaring the secret.
- 4-runtime-platform-updates/design/verification.md — V-PART-014, V-PART-016 procedures assert via describe-secret versionId before/after re-deploy (CFN-native immutability).
DQ-R1-025: Pre-Deploy script’s cdk.context.json write strategy
Status: Resolved. Round R1-Phase4.
Context: T-I8 (tools/register-partition-mail-signature.ts) writes DKIM selector / public key / Return-Path target into cdk.context.json after the Postmark Sender Signature is registered. Those values are then consumed by PartitionEmailStack at synth time. DQ-R1-014 decided whether to commit cdk.context.json (yes, with the public values). It did not pin how the entry script writes to it. There is no precedent in this repo for a tool writing context.json — CDK normally writes it at runtime via its providers.
| Option | Mechanism | Trade-offs |
|---|---|---|
| A | Hand-rolled atomic JSON merge in a new tools/lib/context-store.ts. Read cdk.context.json, set keys under a namespaced path (e.g. partitionMail:<infrastructure>:<partition>), JSON.stringify(obj, null, 2), write to .tmp, atomic rename. | Simple, zero deps. Deterministic key ordering via insertion-order. Pure unit-testable. Namespaced keys avoid collisions with CDK provider entries. Atomic-write boilerplate is ~15 LoC. |
| B | CDK ContextProvider helpers. Bootstrap the CDK runtime in the CLI script and use ContextProvider.getValue() / Stage.synth() mutation paths. | “Official” CDK path. But: the provider API is read-only at the public surface; mutation requires reaching into private CDK internals. Heavyweight (bootstraps a CDK runtime in a CLI tool). |
| C | npx cdk context --set CLI wrap. Invoke the CDK CLI’s context --set <key>=<value> per field. | Uses CDK’s own mutation surface. But: writes flat key=value pairs (strings only) — forces flattening nested objects into partitionMail.Alpha002.dev.dkimSelector=..., which is ugly to grep and brittle. One CLI invocation per field inflates run time. |
Recommendation: Option A.
Decision: Option A — hand-rolled atomic JSON merge. Phase 4 lands a tools/lib/context-store.ts helper with the signature:

    export function writePartitionMailContext(
      infrastructure: string,
      partition: string,
      values: { dkimSelector: string; dkimPublicKey: string; returnPathTarget: string },
    ): void;

It reads cdk.context.json from the repo root, sets the path partitionMail:<infrastructure>:<partition> to the values object, and writes back atomically (.tmp + rename) with JSON.stringify(..., null, 2). The corresponding read accessor in PartitionEmailStack reads the same namespaced key at synth time. Pure unit-testable with a temp directory; no CDK runtime bootstrap.
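The read, merge, and atomic-rename steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the helper that ships in tools/lib/context-store.ts: `writePartitionMailContextAt` and its `contextPath` parameter are hypothetical (added so the sketch can target a temp file), the namespaced path is treated as a single flat key, and the DKIM values are placeholders.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface PartitionMailValues {
  dkimSelector: string;
  dkimPublicKey: string;
  returnPathTarget: string;
}

// Hypothetical variant of the decided helper: takes the file path explicitly
// instead of resolving cdk.context.json from the repo root.
function writePartitionMailContextAt(
  contextPath: string,
  infrastructure: string,
  partition: string,
  values: PartitionMailValues,
): void {
  // Start from the existing file so CDK provider entries are preserved.
  const current: Record<string, unknown> = fs.existsSync(contextPath)
    ? JSON.parse(fs.readFileSync(contextPath, "utf8"))
    : {};
  // Assumption: the namespaced path is stored as one flat key.
  current[`partitionMail:${infrastructure}:${partition}`] = values;

  // Write a sibling temp file, then rename over the target; a rename within
  // one directory replaces the file atomically on POSIX filesystems.
  const tmpPath = `${contextPath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(current, null, 2), "utf8");
  fs.renameSync(tmpPath, contextPath);
}

// Usage against a throwaway directory with placeholder values.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "ctx-"));
const contextFile = path.join(dir, "cdk.context.json");
fs.writeFileSync(contextFile, JSON.stringify({ "existing-provider-key": "kept" }));
writePartitionMailContextAt(contextFile, "Alpha002", "dev", {
  dkimSelector: "example-selector",
  dkimPublicKey: "example-public-key",
  returnPathTarget: "return-path.example.invalid",
});
const written = JSON.parse(fs.readFileSync(contextFile, "utf8"));
```

Because the merge starts from the parsed file and only assigns one key, entries written by CDK providers survive the update, and insertion order keeps the output stable across runs.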
Applied to:
- 4-runtime-platform-updates/design/specification.md — T-I8 entry-script body references this helper instead of inlining the JSON merge.
- 4-runtime-platform-updates/design/verification.md — V-CLI-001 (entry-script tests) includes a context-write assertion against a temp cdk.context.json fixture.
DQ-R1-026: Consolidation of Per-Partition Rollout Runs into a Single Operator-Cascade Run
Status: Resolved. Round R1-Phase4, post-PR-462 amendment (2026-05-26).
Context: The original Phase 4 plan decomposed the partition rollout into four runs: Run-2 (dev, code + dev deploy), Run-3 (stage), Run-4 (demo), Run-5 (prod). After PR #462 (Run-2) entered review, two facts became clear:
- PR #462 already ships the code for all four active partitions: platforms.ts carries the mail block for dev, stage, demo, and prod; apps/Al1x/partition.ts iterates every partition whose mail block is set; PartitionEmailStack, the Pre-Deploy CLI, and the amm.sh step are all parameterised by <infrastructure> <partition>. cdk synth against each of the four {infra}-{partition}-Email stacks succeeds today against the as-built Run-2 tree. No code diff is needed to deploy stage, demo, or prod.
- Each remaining run’s PR diff is, by default, a single CHANGELOG line. Runs 3 / 4 / 5 are operator-driven deploys (run amm.sh against a partition; capture the resulting cdk.context.json values; verify), not code work. Three sequential CHANGELOG-only PRs would multiply review and approval steps without commensurate technical content per PR.
| Option | Description | Trade-offs |
|---|---|---|
| A | Keep Runs 3 / 4 / 5 as separate CHANGELOG-only PRs. Each PR is one bullet under ### Added; operator runs one partition deploy per PR; sign-off per PR. | Per-partition isolation in approval flow; partition-by-partition retreat path if anything diverges. But: three review cycles for what is mechanically the same shape; the meaningful artefact (the execution log, cdk.context.json updates, any code fixes) accumulates across all four partitions and is fragmented across three PRs. |
| B | Collapse into a single Run-3 (operator cascade): one PR captures the CHANGELOG entry reflecting the deployed-system state, the accumulated cdk.context.json updates from all four partition deploys, and any code fixes that surfaced during execution. The execution log becomes the run’s primary deliverable (an artefact document, not the PR body). | Single PR approval gate; the deliverable artefact (the execution log) is intact, not split. The operator still has full per-partition retreat: if prod fails to deploy, only its sub-section of the log is unpopulated; the run does not close until all four partitions are verified. Cost: a single PR review must cover four partition deploys; production deploy approval is part of the same gate as stage/demo. |
| C | Collapse 3 + 4 only; keep prod (Run-5) separate. Stage and demo land together; production is its own approval gate. | Splits the difference. But: stage and demo are on different Postmark accounts (PostmarkNonProd vs PostmarkProd); pairing them does not reduce risk asymmetry the way the natural split ({dev, stage} on NonProd vs {demo, prod} on Prod) might. The collapsed shape still leaves the worst review-churn case (three PRs in total: one code, one cascade, one prod) without the benefit of a clean cascade artefact. |
Recommendation: Option B.
Decision: Option B. Runs 3, 4, and 5 collapse into a single Run-3 (operator cascade). The run’s deliverables are:
- An execution log (4-runtime-platform-updates/plan/runs/run-3-operator-cascade/execution-log.md) — one section per partition (dev, stage, demo, prod), capturing pre-flight outcomes, amm.sh run output, post-deploy verification (dig checks, Postmark Console state, CFN export presence), and any anomalies. The log is written as the operator executes each partition; it is the primary artefact of the run.
- A single infra PR opened from the existing phase-4/infrastructure-run-3 worktree, base = main (auto-retargets when PR #462 merges), containing:
  - CHANGELOG.md — one new entry (### Added) describing the deployed-system state across all four partitions.
  - cdk.context.json — populated with each partition’s partitionMail:<infra>:<partition> block (DKIM + Return-Path target) captured during the operator’s Pre-Deploy step.
  - Any code fixes that emerged during partition deploys (e.g., a PostmarkProd-account quirk surfaced on demo).
- Operator sign-off rows in 4-runtime-platform-updates/design/verification.md populated for the per-partition V-checks (V-OPS-005-dev, V-OPS-005-stage, V-OPS-005-demo, V-OPS-005-prod) as each partition lands.
Note on Run-2 boundary: PR #462’s operator step (running ./amm.sh Alpha002 dev after merge to deploy dev) is part of the new Run-3 cascade, not a post-action of Run-2. Run-2 closes when PR #462 merges; Run-3 begins immediately afterwards with the dev partition as its first cascade entry.
Note on Run-6 and Run-7 numbering: Runs 6 (drift workflow) and 7 (documentation) keep their original numbers. Renumbering them to 4 and 5 would churn specification.md, verification.md, phases.md, and PR #462’s existing CHANGELOG entry text for no operational benefit. The gap (runs 4 and 5 vacant) is documented here and in the phases.md Run table.
Note on retreat path: If a partition’s deploy fails mid-cascade (CFN rollback, Postmark error, DNS propagation gap), the operator captures the failure in the execution log, files any necessary code fix, and resumes from the failed partition once the underlying issue is addressed. The cascade does not roll back successfully deployed partitions: the per-partition isolation that drove the original Phase-4 decomposition (DQ-R1-021) still holds at the resource level.
Applied to:
- 4-runtime-platform-updates/plan/choreography.md — Run table, dependency diagram, hand-offs, deliverables, failure modes all collapsed to a single Run-3 row.
- 4-runtime-platform-updates/plan/evaluation.md — Run table reduced; working-directory count revised.
- 4-runtime-platform-updates/design/specification.md § 3 — worktree list collapses to a single phase-4/infrastructure-run-3 entry.
- 4-runtime-platform-updates/plan/runs/run-2-dev-rollout/project-plan.md — T-O4 reference updated to “unblocks Run-3 cascade (stage entry)”.
- 4-runtime-platform-updates/plan/runs/run-6-drift-workflow/project-plan.md — entry criterion now references “Run-3 cascade has at least dev partition live”.
- plan/runs/run-3-stage-rollout/, plan/runs/run-4-demo-rollout/, plan/runs/run-5-prod-rollout/ — directories removed.
- New plan/runs/run-3-operator-cascade/ — consolidated project plan, execution log skeleton, and validate-exit.sh.
Round R1-Phase5a: Component Library Updates Decisions
This round captures decisions made during Phase 5a — additive helpers to the common-module library consumed by the Phase 5b Email module. Decision IDs DQ-R1-027 through DQ-R1-031 are reserved for this round; all entries are resolved. Phase 5b’s consumer-side adoption work (lifting the new common-module version, applying the Email module’s classification of Internal.IncompatibleState sites, wiring the typed idempotency view) lands separately under Phase 5b decisions.
DQ-R1-027: AppError.Application Introduction
Status: Resolved. Round R1-Phase5a, 2026-05-28.
Context: The existing AppError hierarchy splits caller error (Invocation — ArgumentValidation, NullArgument, NotFound, Duplicate, Authorization; reportable() = emptyList()) from system error (Internal — Implementation, Infrastructure, IncompatibleState, InternalService, InternalTimeout, Transient; reportable() = listOf(this)). Neither captures the third real category: the call was well-formed, the system is healthy, but the application’s current state does not allow this operation right now. Today these get squeezed into Invocation.GeneralValidation (which misclassifies a domain-state outcome as caller error) or Internal.IncompatibleState (which misclassifies a recoverable application outcome as a bug-class signal). The misclassification has real cost: Internal.IncompatibleState pages on-call; the L4 mapping table can’t distinguish “caller passed bad input” from “system in unexpected state”.
| Option | Description | Trade-offs |
|---|---|---|
| A | Keep the two-branch hierarchy. Document the convention that application-state outcomes use Invocation.GeneralValidation. | No new types. But: the type system stops carrying the information; reviewers must enforce the convention by inspection; reportable() still pages on-call for cases that aren’t bugs. |
| B | Add Application as a third top-level branch, peer to Internal and Invocation. Three concrete subtypes — PreconditionFailed (operation requires prior state the system doesn’t have), PolicyRejected (operation disallowed by policy), ConflictingState (operation race-lost or expectation drifted). reportable() returns empty list for the whole branch. | Three new types. Source-incompatibility for exhaustive-when consumers of AppError. But: the type system carries the classification; reportable() is correct by construction; the L4 mapping table dispatches on subtype. |
| C | Add Application as a flat peer of three data classes (no enclosing sealed class Application). | Avoids the source-incompatibility from sealed-class nesting. But: loses the shared reportable() = emptyList() override; loses the typed-dispatch case at L4. |
Recommendation: Option B.
Decision: Option B. sealed class Application is added as a third top-level branch under AppError, with three concrete subtypes (PreconditionFailed, PolicyRejected, ConflictingState). All three are data classes with message: String, context: LazyMessage? = null, cause: Throwable? = null. reportable() returns emptyList() at the Application level (single inheritance point). REST-status mapping is the single responsibility of the L4 mapping table (HttpErrorResponses.kt); Application subtypes do not carry HTTP-status hints. The companion Internal.IncompatibleState reclassification sweep is tracked separately (DQ-R1-028).
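The three-branch classification can be illustrated with a compact model. The real hierarchy is a Kotlin sealed class in common-module; the TypeScript below is only a sketch of the reportable() contract, with the constructor reduced to a message (the context and cause parameters are omitted) and only one Internal subtype shown.

```typescript
// Illustration only: mirrors the branch names from the decision, not the
// Kotlin API. Each branch fixes reportable() once for all of its subtypes.
abstract class AppError {
  constructor(readonly message: string) {}
  abstract reportable(): AppError[];
}

// Caller error: nothing to report.
abstract class Invocation extends AppError {
  reportable(): AppError[] {
    return [];
  }
}

// System error: the error itself is reportable (and pages on-call).
abstract class Internal extends AppError {
  reportable(): AppError[] {
    return [this];
  }
}

// Application-state outcome: well-formed call, healthy system, operation
// not allowed right now. Single inheritance point for the empty list.
abstract class Application extends AppError {
  reportable(): AppError[] {
    return [];
  }
}

class PreconditionFailed extends Application {}
class PolicyRejected extends Application {}
class ConflictingState extends Application {}
class IncompatibleState extends Internal {}
class GeneralValidation extends Invocation {}
```

Because the override lives on the branch, a new Application subtype cannot accidentally page on-call, which is the "correct by construction" property the trade-off table refers to.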
Applied to:
- design/index.md § 1 — API surface and consumer guidance.
- task-plan.md PR #1 — additive common-module minor release introducing the three subtypes (Added category; 9.2.0).
- Phase 5b email module — L3 services in the Email module use Application.PreconditionFailed / Application.ConflictingState where appropriate.
DQ-R1-028: Internal.IncompatibleState Reclassification Sweep
Status: Resolved. Round R1-Phase5a, 2026-05-28.
Context: DQ-R1-027 introduces AppError.Application. The existing 62 construction sites of Internal.IncompatibleState in common-module/lib/src/main (and additional sites in operations) need case-by-case judgement: each site is one of (a) a genuine bug-class invariant violation that should keep Internal.IncompatibleState, (b) a recoverable application outcome that should move to Application.ConflictingState, or (c) a caller-error that should move to Invocation.GeneralValidation. The question is how to land the sweep — as a single bulk PR, as a series of per-area PRs, or as the last PR of Phase 5a after the other helpers are in.
| Option | Description | Trade-offs |
|---|---|---|
| A | Sweep + AppError.Application introduction in a single PR. One review captures both the new types and the reclassification. | Locks the rationale to the same review. But: bigger PR, harder to atomically revert one decision without the other. |
| B | Sweep as one of several parallel PRs that each carry an additive helper. Each PR is Added-only; the sweep is Changed. Five PRs land in any order; consumers absorb when convenient. | Sweep can land mid-stream; would force a major bump (Changed) in the middle of the additive-minor sequence. Confusing release history. |
| C | Sweep as the last PR of Phase 5a, after the four Added-only helpers. PR #1, #3, #4, #5 each Added and minor (9.2.0 → 9.5.0); the sweep is PR #2 by number but lands last and consolidates the major bump (10.0.0). | Each minor PR is small and reviewable in isolation; the sweep accumulates the discovery and reclassification across all 62+ sites in one focused review; consumers absorb one major + four minors in their adoption bump. Closest match to OQ-V’s “consumers absorb everything at once”. |
Recommendation: Option C.
Decision: Option C. The sweep lands as the final Phase 5a PR (numbered PR #2 by design topic, sequenced last by release order). Discovery is a separate planning step that builds the inventory of Internal.IncompatibleState construction sites in common-module and classifies each. The Phase 5a sweep covers common-module only; the matching sweep in operations is part of Phase 5b’s consumer adoption work. CHANGELOG category is Changed (consumers doing exhaustive when over AppError.Internal see reclassified sites move out); version bump is major (10.0.0 from the prior 9.x.x).
Applied to:
- design/index.md § 2 — sweep methodology and per-bucket criteria.
- task-plan.md PR #2 — sequenced last; Changed; major bump.
- Phase 5b — the matching sweep within operations is part of consumer adoption.
DQ-R1-029: sanitizeHeader Value-Cleaning Primitive
Status: Resolved. Round R1-Phase5a, 2026-05-28.
Context: The Phase 5b Email module accepts inbound HTTP requests carrying headers that get persisted or routed into downstream operations (idempotency keys via Idempotency-Key, tenant correlation via X-Tenant-Id, etc.). The existing HeadersAllowList (in common-module’s lib/runtime/observability/) controls which headers are safe to log; it does not control what values are safe to read into business logic or persist. The two concerns are independent and compose: deny-by-name first (observability scoping), then per-value cleaning. Phase 5a needs a primitive that owns the value-cleaning step.
| Option | Description | Trade-offs |
|---|---|---|
| A | Extend HeadersAllowList to also clean values. | One artefact. But: conflates two responsibilities (observability scoping vs. persistence safety); the allowlist’s existing scope (Sentry-payload-shaping) is not the same as L3 / persistence input cleaning. |
| B | Add sanitizeHeader(name, value): Result<String?> in a new package (lib/api/headers/). Composes downstream of HeadersAllowList. Reject by-value (control characters, length cap, charset) returning Result.failure(AppError.Invocation.GeneralValidation); clean (trim, normalize) returning Result.success(cleaned). | Two artefacts with crisp responsibilities. sanitizeHeader is callable from L4 inbound and (by package placement) L4 outbound when needed. Composes with the existing allowlist without coupling. |
Recommendation: Option B.
Decision: Option B. sanitizeHeader(name: String, value: String): Result<String?> ships in lib/api/headers/ (a new package). Returns Result.success(cleanedValue) on accept, Result.success(null) on policy reject (header dropped silently, no error), Result.failure(AppError.Invocation.GeneralValidation) on hard-rejection (control characters, oversize, charset violation). The composition pattern at L4 inbound is: HeadersAllowList.filter(...) first (drop disallowed-by-name; observability scoping), then sanitizeHeader(name, value) per surviving header (clean / reject by value; in-transaction). HeadersAllowList is left unchanged.
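The three outcomes in this decision (accept with a cleaned value, silent drop, hard rejection) can be sketched as a tagged result. The shipped primitive is Kotlin and returns Result<String?>; the TypeScript below is an illustration in which the 256-character cap, the printable-ASCII charset, and the "empty after trimming means drop" policy are all assumptions, since the decision does not pin those rules.

```typescript
type SanitizeOutcome =
  | { kind: "accepted"; value: string } // corresponds to Result.success(cleanedValue)
  | { kind: "dropped" } // corresponds to Result.success(null)
  | { kind: "rejected"; reason: string }; // corresponds to Result.failure(GeneralValidation)

const MAX_HEADER_LENGTH = 256; // assumed cap, not taken from the decision

function sanitizeHeader(name: string, value: string): SanitizeOutcome {
  // Hard reject: control characters anywhere in the raw value
  // (checked before trimming so a trailing newline is not silently accepted).
  if (/[\u0000-\u001f\u007f]/.test(value)) {
    return { kind: "rejected", reason: `${name}: control character in value` };
  }
  const cleaned = value.trim();
  // Hard reject: oversize value.
  if (cleaned.length > MAX_HEADER_LENGTH) {
    return { kind: "rejected", reason: `${name}: value exceeds ${MAX_HEADER_LENGTH} characters` };
  }
  // Hard reject: anything outside printable ASCII (assumed charset rule).
  if (!/^[\x20-\x7e]*$/.test(cleaned)) {
    return { kind: "rejected", reason: `${name}: unsupported character in value` };
  }
  // Assumed policy reject: an empty value carries no information, so drop it.
  if (cleaned.length === 0) {
    return { kind: "dropped" };
  }
  return { kind: "accepted", value: cleaned };
}
```

A caller at L4 inbound would run this per header that survives the name-based allowlist, treating "dropped" as absence and "rejected" as a caller error.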
Applied to:
- design/index.md § 3 — API surface, test plan, composition pattern.
- task-plan.md PR #3 — additive; 9.3.0.
- Phase 5b L4 endpoints (email-configuration, email-job, postmark-events) — consume sanitizeHeader for inbound header handling.
DQ-R1-030: TokenCipher + Hmac Cryptographic Helpers
Status: Resolved. Round R1-Phase5a, 2026-05-29.
Context: DQ-R1-019 pinned the per-partition encryption-key design and named TokenCipher as the Phase 5a primitive that implements the two-axis envelope (a{N}.k{SM-VERSION-ID}:<base64-payload>). The decision left four implementation-level shape questions for Phase 5a: (1) factory shape (the canonical Arda companion inline operator fun <reified T> invoke(...) pattern doesn’t apply because TokenCipher is non-generic); (2) how the cipher resolves a versionId that is not present in the in-memory MaterialRegistry at decrypt time; (3) auth-tag-failure classification (AES-GCM tag-verification failure on decrypt — Internal.IncompatibleState vs Application.ConflictingState); (4) whether to extract an Hmac micro-helper or leave the two existing JDK-Mac callsites duplicated.
| Option | Description | Trade-offs |
|---|---|---|
| Factory: A | Use plain primary constructor. Validation (non-empty info, registry contains currentVersionId) becomes the caller’s responsibility. | Total constructor. But: pushes validation up to every caller; loses an obvious place to centralize the contract. |
| Factory: B | companion operator fun invoke(info: String, materials: MaterialRegistry, currentVersionId: UUID): Result<TokenCipher>. Callers write TokenCipher(info, materials, currentVersionId) and get Result<TokenCipher> back. | Constructor-shaped call site; Result<T> carries validation failures; consistent with the workspace kotlin-coding standard. The Arda reified-inline pattern is for resolving type parameters and doesn’t apply here. |
| Resolution: A | Project every live key-material version into the in-memory MaterialRegistry from a single ESO-projected JSON map; cipher fails fast (transient) on a miss. | Single secret-delivery path (ESO); no application-side path to AWS Secrets Manager; bounded common-module surface (no SDK dependency, no fallback abstraction). Propagation lag covered by the existing transient-retry layer. |
| Resolution: B | Maintain a smaller in-memory registry (e.g., AWSCURRENT + AWSPREVIOUS only); fall back to a caller-supplied hook (typically AWS SDK direct call) for older versions. | Smaller registry, but introduces a parallel-to-ESO secret-delivery path with the future-drift concern that other features may start using the same path. |
| Auth-tag: A | Classify auth-tag mismatch as Application.ConflictingState. L3 caller handles re-read / retry from a different version. | Recoverable framing. But: misclassifies what is actually data corruption (or active tampering); doesn’t page on-call when it should. |
| Auth-tag: B | Classify auth-tag mismatch as Internal.IncompatibleState. Bug-worthy; pages on-call. | Pages on the rare-but-serious failure modes (storage corruption, key-material desync, active tampering). If operational reality reveals spurious tag failures, reclassify — but bug-worthy is the safer starting position. |
| Hmac DRY: A | Leave the two JDK-Mac callsites (OpaqueId.kt:67, S3AssetService.kt:143) inline-duplicated. | No new helper. But: two near-identical copies that drift independently; future HKDF wrapper inside TokenCipher needs the same pattern. |
| Hmac DRY: B | Extract Hmac micro-helper in lib/crypto/ (the package TokenCipher lives in). Both existing sites migrate. TokenCipher uses it internally for HKDF derivation. | One helper, three callsites consistent. TokenCipher’s HKDF logic is testable in isolation. Internal refactor only; no external API change at the migrated sites. |
Recommendation: Factory B + Resolution A + Auth-tag B + Hmac DRY B.
Decision: Factory B (companion operator fun invoke(info, materials, currentVersionId): Result<TokenCipher>), Resolution A (single ESO-projected registry; no application-side path to AWS Secrets Manager), Auth-tag B (Internal.IncompatibleState for AES-GCM tag mismatch), Hmac DRY B (extract Hmac to lib/crypto/). Public surface lives in cards.arda.common.lib.crypto:
- TokenCipher — two-axis envelope a{N}.k{SM-VERSION-ID}:<base64>; HKDF-SHA256 key derivation; AES-256-GCM encrypt / decrypt; MaterialRegistry keyed by versionId; private constructor + companion operator fun invoke(info, materials, currentVersionId). The cipher does not consult any external system at runtime — the MaterialRegistry is populated by the caller from a single ESO-projected JSON map carrying every live key-material version, and the cipher reads only what is currently in the registry. The caller may mutate the registry at runtime in response to ESO refresh events.
- MaterialRegistry — versioned 64-byte key-material store; the add and of operations enforce material length.
- Hmac — thin wrapper over javax.crypto.Mac for HmacSHA256. Used by TokenCipher for HKDF; OpaqueId.kt and S3AssetService.kt migrate to it as part of the same PR. Hmac exposure as a standalone helper for non-TokenCipher consumers is deferred; HKDF stays internal to TokenCipher in v1.
Decrypt failure classification (two distinct modes):
- Auth-tag mismatch → Result.failure(AppError.Internal.IncompatibleState(...)) — bug-worthy; pages on-call.
- Unknown versionId → Result.failure(AppError.Transient.FailoverFailed(cause)) where cause is a synthetic Throwable whose message names the missing version. Bounded transient (propagation lag between AWS Secrets Manager and ESO’s projection into the pod); existing transient-retry layers (Postmark webhook, outbound idempotency, L4 client retries) fire after timescales that exceed ESO’s reconciliation interval and find the registry refreshed on the next attempt. Class name FailoverFailed is observability noise — diagnostic info lives in the cause’s message and structured logging at the catch site; a more specific subtype is intentionally not added to avoid the breaking change of extending the sealed Transient hierarchy.
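The two decrypt failure modes can be demonstrated with Node's built-in crypto. This is a hedged sketch, not TokenCipher itself (which is Kotlin): the payload layout (12-byte IV, ciphertext, 16-byte tag), the empty HKDF salt, the fixed algorithm axis a1, and the "malformed" outcome are assumptions made for the illustration.

```typescript
import { createCipheriv, createDecipheriv, hkdfSync, randomBytes, randomUUID } from "node:crypto";

// Outcome names are illustrative; comments map them to the decided classes.
type DecryptOutcome =
  | { kind: "ok"; plaintext: string }
  | { kind: "malformed" } // not covered by the decision; added for completeness
  | { kind: "unknownVersion"; versionId: string } // decided: bounded transient
  | { kind: "authTagMismatch" }; // decided: Internal.IncompatibleState

type Registry = Map<string, Buffer>; // versionId -> 64-byte key material

function deriveKey(material: Buffer, info: string): Buffer {
  // HKDF-SHA256 down to a 32-byte AES-256 key. The empty salt is an assumption.
  return Buffer.from(hkdfSync("sha256", material, Buffer.alloc(0), info, 32));
}

function encrypt(registry: Registry, versionId: string, info: string, plaintext: string): string {
  const material = registry.get(versionId);
  if (!material) throw new Error(`no key material for ${versionId}`);
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", deriveKey(material, info), iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Assumed payload layout: IV, then ciphertext, then the GCM auth tag.
  const payload = Buffer.concat([iv, body, cipher.getAuthTag()]).toString("base64");
  return `a1.k${versionId}:${payload}`;
}

function decrypt(registry: Registry, info: string, envelope: string): DecryptOutcome {
  const colon = envelope.indexOf(":");
  const header = colon < 0 ? "" : envelope.slice(0, colon);
  const dot = header.indexOf(".k");
  if (!header.startsWith("a") || dot < 0) return { kind: "malformed" };
  const versionId = header.slice(dot + 2);
  const raw = Buffer.from(envelope.slice(colon + 1), "base64");
  if (raw.length < 28) return { kind: "malformed" }; // IV (12) + tag (16) minimum
  // The cipher reads only what is currently in the registry.
  const material = registry.get(versionId);
  if (!material) return { kind: "unknownVersion", versionId };
  const decipher = createDecipheriv("aes-256-gcm", deriveKey(material, info), raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(raw.length - 16));
  try {
    const body = raw.subarray(12, raw.length - 16);
    const plain = Buffer.concat([decipher.update(body), decipher.final()]);
    return { kind: "ok", plaintext: plain.toString("utf8") };
  } catch {
    return { kind: "authTagMismatch" }; // GCM tag verification failed
  }
}
```

Note the ordering: the registry lookup happens before any cryptographic work, so a missing version surfaces as the retryable outcome and never as a tag failure.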
Rotation enablement (the JSON-map schema for the SM secret value, the disposition of the deployed EmailEncryptionKeyFallbackRole, the operator rotation script, and the future AWS SM Rotation Lambda) is tracked as a follow-up in PDEV-659.
Applied to:
- design/index.md § 4 — envelope format, key-derivation, error classification, internal refactor sites.
- task-plan.md PR #4 — additive common-module minor release (Added). The OpaqueId.kt / S3AssetService.kt migration is a private-call-site refactor with no external API change; CHANGELOG remains Added-only.
- Phase 5b email module (consumes TokenCipher per DQ-R1-019).
DQ-R1-031: Idempotency Helpers with Native JsonElement + Typed Wrapper
Status: Resolved. Round R1-Phase5a, 2026-05-28.
Context: Phase 5b’s Email module needs idempotency at two seams: inbound HTTP requests (Intent (a) — caller supplies an Idempotency-Key header; the L3 service de-duplicates retries) and outbound side-effecting calls (Intent (b) — the L3 service generates a deterministic idempotency key from a natural key for the downstream API). Phase 5a ships the primitives; Phase 5b wires them.
Two structural questions dominate the design:
- Type shape at the store boundary — whether the store is parameterised by the consumer’s Req / Res types (forcing the store to own kotlinx serialization via reified KSerializer<Req> / KSerializer<Res> references) or operates on JsonElement symmetrically with a typed wrapper on top.
- Schema evolution — when a consumer changes the shape of Req in a serialisation-affecting way (field rename, type change), in-flight idempotency records hash differently after the change. The defensive mitigation can be a request_schema_version column on every row (store-side knob), procedural drain-before-deploy (operational), or the caller projects Req to a stable hash shape explicitly (caller-controlled).
| Option | Description | Trade-offs |
|---|---|---|
| A | Typed store IdempotencyStore<Req, Res> with reified factory. Schema-evolution is procedural (drain-before-deploy). | Typed-by-default; naive callers get correctness for free. But: conflates request DTO with hash projection; schema-evolution mitigation is store-wide procedure; store API depends on KSerializer<Req>. |
| B | Native RawIdempotencyStore operating on JsonElement symmetrically (Req and Res). Typed wrapper IdempotencyStore<Req, Res> produced by an inline extension fun <reified Req, reified Res> RawIdempotencyStore.typedAs(json: Json = JsonConfig.standardJson): IdempotencyStore<Req, Res>. Caller chooses the layer; schema-evolution is the caller’s responsibility (custom adapter or stable JsonElement projection). | Explicit separation of request DTO from idempotency-hash projection. Schema-evolution is caller-controlled per consumer. Native store API surface is one type narrower (no KSerializer<Req>). One extra line of caller-side encoding boilerplate at the call site (folds into a per-service helper). |
Recommendation: Option B.
Decision: Option B. The native interface is RawIdempotencyStore (operating on JsonElement symmetrically); the typed view is IdempotencyStore<Req, Res> produced by inline fun <reified Req, reified Res> RawIdempotencyStore.typedAs(json: Json = JsonConfig.standardJson): IdempotencyStore<Req, Res>. The typed wrapper holds resolved KSerializer<Req>/KSerializer<Res> references captured once at wiring time. On replay-time decode failure (Req or Res bytes no longer match the current type), the wrapper returns Result.failure(AppError.Internal.IncompatibleState(...)) — decode failures are bugs (the consumer changed schema without a coordinated drain or adapter), not normal operational outcomes. The native store always carries Mismatch.recordedRequest: JsonElement so consumers (typed or raw) can log the recorded request for debugging on hash collision. Schema-evolution defence is caller-controlled: a consumer that wants stable hashes across Req versions writes a custom KSerializer<Req> (passed via a refined Json to typedAs) or projects Req to a stable JsonElement shape before calling the native store. No store-wide request_schema_version column ships in v1.
Other implementation-level decisions:
- `error_payload` projection (failure-path): encoded via a well-structured `@Serializable` data class. The exact projection shape is fixed in the implementation PR; consumers see `IdempotencyOutcome.PriorError(error: AppError)` on replay (durable contract). The on-disk bytes are private to the store.
- `Mismatch.equals`/`hashCode` override: kept. `ByteArray` content-equality is required for tests and consumer-side caches to compare `Mismatch` values meaningfully.
- `replayWindowOverride: Duration? = null` parameter on `begin()`: kept. Operator-driven retry endpoints that want a shorter replay window than the per-store default pass it per-request. Default `null` means use store default.
- `result_payload` / `error_payload` storage column type: `JSONB` in PostgreSQL. PG validates payloads at write time; SQL-level inspection produces readable output; JSONB has native indexing if ever needed. Switching to a binary format later would be a column-type migration, judged unlikely.
- `IdempotencyKeyMinter` (Intent (b)): separate helper taking `parts: List<String>`. Deterministic SHA-256-based; orthogonal to either store interface; not parameterised by Req/Res.
Applied to:
- `design/idempotency-design.md`: full implementation-shaped design.
- `task-plan.md` PR #5: additive `common-module` minor release introducing the package `cards.arda.common.lib.runtime.idempotency` (Added; 9.5.0). The `idempotency_record` Flyway migration ships in Phase 5b’s consumer adoption (`operations`), not in `common-module`, since common-module has no production-side migrations.
- Phase 5b email module: L3 services in the Email module consume the typed view (`IdempotencyStore<EmailSendRequest, EmailJob>`) and the `IdempotencyKeyMinter` for outbound Postmark retry safety.
Round R1-Phase5b: Email Module Decisions
DQ-R1-032: EmailEncryptionKey registry delivery and source of truth
Status: Resolved. Round R1-Phase5b. Supersedes DQ-R1-024.
Context: DQ-R1-024 chose CFN-native GenerateSecretString (Option A) to populate the per-partition EmailEncryptionKey secret. During Phase 5b email-module deployment this proved undeployable. GenerateSecretString can only emit a flat {"<fixedKey>": "<random characters>"} value, but the operations module’s MaterialRegistryRefresher reads the secret as a key registry: a non-empty JSON map { "<versionId-UUID>": "<base64 of exactly 64 bytes>" }, ordered newest-last, where the current encryption version is the last key. It fails fast (AppError.Infrastructure, pod dies) on any other shape, and the GenerateSecretString output violates the contract on every count: the key is not a UUID, the value is 64 random characters (about 48 bytes if read as base64) rather than base64 of 64 bytes, and there is no version map. CFN cannot generate a UUID key, base64-of-N-random-bytes, or an ordered map, so Option A can never satisfy the consumer. The registry must therefore originate outside CFN.
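To make the shape mismatch concrete, the registry contract described above can be sketched as a small TypeScript check. This is an illustration only: the real validation lives in the Kotlin MaterialRegistryRefresher, and the function names, error messages, and UUID pattern below are assumptions, not the module's code.

```typescript
// Illustrative only: mirrors the registry contract described above, not the
// actual MaterialRegistryRefresher (Kotlin). All names here are hypothetical.

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const BASE64_PATTERN = /^[A-Za-z0-9+\/]+={0,2}$/;

// Decoded byte count of a padded base64 string, or -1 if it is malformed.
function base64ByteLength(value: string): number {
  if (value.length % 4 !== 0 || !BASE64_PATTERN.test(value)) return -1;
  const padding = value.length - value.replace(/=+$/, "").length;
  return (value.length / 4) * 3 - padding;
}

// Returns the current (last) version id, or throws on any shape violation,
// analogous to the fail-fast behaviour described for the consumer.
function resolveCurrentVersionId(secretJson: string): string {
  const parsed: unknown = JSON.parse(secretJson); // throws on invalid JSON
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("registry must be a JSON map");
  }
  const registry = parsed as { [versionId: string]: unknown };
  const versionIds = Object.keys(registry); // insertion order for non-numeric keys
  if (versionIds.length === 0) throw new Error("registry must not be empty");
  for (const versionId of versionIds) {
    if (!UUID_PATTERN.test(versionId)) {
      throw new Error("key is not a UUID: " + versionId);
    }
    const value = registry[versionId];
    if (typeof value !== "string" || base64ByteLength(value) !== 64) {
      throw new Error("value is not base64 of exactly 64 bytes: " + versionId);
    }
  }
  return versionIds[versionIds.length - 1];
}
```

Under these assumptions, a flat GenerateSecretString-style value such as `{"key": "<64 random characters>"}` is rejected at the first check (the key is not a UUID), and would still be rejected on the value length even if the key were valid.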
| Option | Mechanism | Trade-offs |
|---|---|---|
| A (DQ-R1-024) | CFN-native GenerateSecretString. | Zero custom code, but structurally incapable of the registry shape. Rejected — undeployable. |
| B | CDK Custom Resource (Lambda) generates the registry at deploy time and writes it to SM. | Could produce the right shape, but adds Lambda boilerplate, an opaque immutability/rotation story, and a second secret-generation pattern in the stack. |
| C | Pre-Deploy script + NoEcho CFN parameter (δ.1 pattern). The registry JSON is operator-provisioned in the partition vault (op://Arda-{Env}OAM/EmailEncryptionKey/registry); amm.sh resolves it through the Pre-Deploy script (--encryption-key-out) and passes it as the NoEcho EmailEncryptionKeyJson parameter backing the secret. | Mirrors the EmailPostmarkAccountToken delivery path exactly. 1Password is a single, auditable source of truth; key material never appears in templates, change sets, or stack events. Cost: an out-of-band operator step to seed the registry, and rotation requires a redeploy to propagate. |
Recommendation: Option C.
Decision: Option C. 1Password is the source of truth for the encryption-key registry. The flow is strictly one-directional:
op://Arda-{Env}OAM/EmailEncryptionKey/registry (source of truth, per partition) → amm.sh Pre-Deploy resolves it → NoEcho CFN parameter EmailEncryptionKeyJson → {Infra}-{ns}-I-EmailEncryptionKey Secrets Manager secret (deploy-time projection) → ESO ExternalSecret sync → /app/secret/email/values.json (pod projection) → MaterialRegistryRefresher (current version = entries.keys.last())

Consequences that bind future work:
- Rotation = appending a new `{uuid: base64(64 bytes)}` entry to the 1Password registry, newest-last, retaining prior entries so tokens encrypted under older versions still decrypt. Secrets Manager and the pod are downstream copies, refreshed on the next `amm.sh` deploy (`--force`, since a parameter-only change does not alter the synthesised template) and ESO sync. Do not hand-edit the Secrets Manager value: it is overwritten from 1Password on every deploy.
- Order is significant: the module treats the last key in the JSON as the active version, so tooling must append (never re-sort).
- In-pod live reload without restart remains the operations-side concern (email S09 / `RotatingTokenCipher` + watch loop); even then the material still originates from 1Password.
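The append-only rule above can be sketched as follows. This is a hypothetical helper, not the tooling tracked as PDEV-881: the function name and error message are assumptions, and the new version id and key material are supplied by the caller rather than generated here.

```typescript
// Hypothetical rotation helper: appends a new entry newest-last, never re-sorts.
// Key generation and the 1Password write are out of scope for this sketch.

type KeyRegistry = { [versionId: string]: string };

function appendRegistryEntry(
  registryJson: string,
  versionId: string,
  keyBase64: string
): string {
  const current = JSON.parse(registryJson) as KeyRegistry;
  if (Object.prototype.hasOwnProperty.call(current, versionId)) {
    throw new Error("versionId already present: " + versionId);
  }
  // Copy existing entries in their recorded order so older versions are
  // retained, then add the new entry last so it becomes the active version.
  const next: KeyRegistry = {};
  for (const existingId of Object.keys(current)) {
    next[existingId] = current[existingId];
  }
  next[versionId] = keyBase64;
  return JSON.stringify(next);
}
```

The sketch relies on JavaScript preserving insertion order for string keys that are not integer-like, which holds for UUID keys. Tooling written in another language would need a map type or JSON writer with the same order-preserving guarantee.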
The -API-EmailEncryptionKeyArn export, the secret name, and RemovalPolicy.RETAIN are unchanged from DQ-R1-024, so no consumer-facing identifier moves.
Applied to:
- `infrastructure`: `src/main/cdk/stacks/purpose/partition-email.ts` (NoEcho `EmailEncryptionKeyJson` parameter + `secretStringValue`), `src/main/cdk/platforms.ts` (per-partition `mail.encryptionKeyOpReference` + helper), `tools/lib/partition-mail-signature.ts` (`--encryption-key-out`), `amm.sh` (parameter override). Shipped in infrastructure#485, CHANGELOG `[3.4.3]`. Tracked as PDEV-880.
- `operations`: `MaterialRegistryRefresher` is the consumer whose contract drives the shape (no change required by this decision).
- Operator prerequisite & rotation tooling: the registry must be seeded once per partition vault (item `EmailEncryptionKey`, field `registry`). Generation/rotation tooling is tracked as PDEV-881.
Summary
| # | Summary | Status | Downstream Impact | Decision |
|---|---|---|---|---|
| DQ-001 | Tenant sending domain shape | Resolved | DNS zone structure, CDK stacks, tenant provisioning scripts, supplier-facing FQDNs | <tenant>.<partition>.{mail-root-domain} uniformly (revised per DQ-010) |
| DQ-002 | Multi-config domain strategy for v2+ | Resolved | tenant_email_config schema (nullable config_slug), DNS record structure, v2+ provisioning | Sub-subdomain (<conf>.<tenant>.<partition>.{mail-root-domain}); v1 provisions at tenant level only |
| DQ-003 | Tenant slug source | Resolved | Provisioning request shape, slug derivation logic | From request (tenantEId, tenantName, tenantSlug); derivation algorithm deferred |
| DQ-004 | Reply-To editability in send dialog | Resolved | Send dialog UI, BFF route contract, GEN::EML and PRO::EML use cases | Read-only; system-resolved from procurement contact or user email |
| DQ-005 | Email order send paths (copy-paste vs system) | Resolved | SPA side panel UX, backend submit signal handler, PRO::EML use cases | Both coexist; copy-paste preserved for email orders, system send added as new path |
| DQ-006 | CS alerting scope in v1 | Resolved | Observability infrastructure, GEN::EML::0004 use case scoping | ESP OOTB alerting in v1; Arda-built is v2+ |
| DQ-007 | Document generation responsibility | Resolved | Email service interface contract, PO submit workflow, GEN::EML::0002 use case | Calling feature generates document, passes Blob/URL to email capability |
| DQ-008 | Send dialog interaction model | Resolved | SPA dialog component, GEN::EML::0001 scenario structure | Single-step dialog; cancel prompts if edits were made |
| DQ-009 | Mail root domain choice | Resolved | DNS zone creation, registrar delegation, all tenant FQDNs, infrastructure.md parameter resolution | ardamails.com (standalone, already owned); implementation parametric |
| DQ-010 | Prod tenant zone placement | Resolved | Root zone content, IAM scoping, cross-account access, DQ-001 FQDN shape | Own partition zone; root zone stays static/CDK-only |
| DQ-011 | Webhook authentication mechanism | Resolved | Postmark-events endpoint auth, provisioning flow (Step 5), Webhooks API usage | Bearer token via modern Webhooks API; reuses existing ARDA_API_KEY validation |
| DQ-012 | Per-tenant server token storage | Resolved | Secrets Manager scope, IAM roles, provisioning flow, emailConfiguration service, DB schema | Encrypted in DB with partition-wide key (via ESO); no per-tenant SM writes; emailConfiguration decrypts for emailJob |
| DQ-013 | IAM role extraction from root stack | Resolved | Root CDK application structure, deployment procedure | Do not extract; role stays in RootDnsStack (CF name: RootConfiguration). Extraction deferred to future need. |
| DQ-R1-006 | Locus of cross-zone NS-delegation writes | Resolved | Phase 2 / Phase 3 / Phase 4 ownership boundaries; deploy-order dependency between Root and child stacks | Child stack writes upstream via WriteNSRecordsToUpstreamDns; Root only owns the assume-role IAM target |
| DQ-R1-007 | Vault separation for Free Kanban Tool server token | Resolved | Phase 1 typed surface (item removed); Phase 3 reintroduces with new location; threat model — credential out of OP_SERVICE_ACCOUNT_TOKEN blast radius | op://Arda-CorporateOAM/Free-Kanban-Generator-Postmark-Server/credential (separate vault from Arda-SystemsOAM) |
| DQ-R1-008 | Adopt vs create the existing ardamails.com zone | Resolved | RootConfiguration stack composition; deployment workflow (IMPORT change-set + normal deploy); registrar NS chain preserved | Adopt via cdk import against Z0721066239FWCD47EJDX; CDK code mirrors the live AWS-default comment to keep the import read-only; RemovalPolicy.RETAIN defends against accidental destroy |
| DQ-R1-009 | Postmark domain-verification target (parent vs leaf) | Resolved | PostmarkSendingDomain configuration; operator companion; future Corporate-consumer onboarding | Verify at the Corporate-zone parent (arda.ardamails.com); leaves inherit DKIM |
| DQ-R1-010 | Locus of Corporate’s NS-delegation write (same-account) | Resolved | CorporateMailDns stack composition; behavior under future Corporate-account migration | Always go through WriteNSRecordsToUpstreamDns and assume the Root role even when same-account |
| DQ-R1-011 | route-53-hosted-zone.ts → dns-zone.ts migration shape | Resolved | Construct catalogue; Phase 3 PR scope; existing callers (partitions + Root) | Rename in place; existing callers updated in the same PR |
| DQ-R1-012 | Corporate drift-workflow filename and scope | Resolved | .github/workflows/ shape; tools/corporate-drift.ts driver design; future Corporate-asset onboarding | corporate-drift.yml — one workflow per instance group, exercising every asset listed in instances/Corporate/ |
| DQ-R1-013 | Phase A failure ordering for the Postmark server token | Resolved | corporate-cli.ts Phase A semantics; recovery path on 1P-write failure; testability | In-memory buffer + retries on the 1P write; fail loud with redacted summary on permanent failure; manual operator action to recover |
| DQ-R1-014 | cdk.context.json commit policy | Resolved | Repo .gitignore; CI re-synth determinism | Commit cdk.context.json with the postmark.free-kanban.* keys (public values) |
| DQ-R1-015 | DMARC reporting mailbox | Resolved | DMARC TXT record content at _dmarc.arda.ardamails.com; operator companion (mailbox provisioning prerequisite) | rua=mailto:dmarc-reports@arda.cards; operator provisions the mailbox in Arda’s Google Workspace before Phase B deploy |
| DQ-R1-016 | Reserved-name registry scope at arda.ardamails.com | Resolved | Cross-instance-group import coupling; corporate-cli.ts Phase A acceptance criteria | Documentation-only registry; CLI enforces locally via a Phase-A conflict-check against pre-existing Sender Signatures, servers, and 1P items |
| DQ-R1-017 | Postmark Sender Signature granularity per partition | Resolved | Phase 4 partition-email stacks; Postmark account split; per-tenant deferral to Phase 5b | One Signature per partition sub-zone; leaves inherit DKIM; per-tenant Signatures deferred |
| DQ-R1-018 | corporate-drift rename and scope | Resolved | .github/workflows/ shape; future runtime-platform drift checks unrelated to email | Keep corporate-drift; add parallel runtime-platform-drift with shared reusable scripts |
| DQ-R1-019 | Per-partition email server-token encryption key | Resolved | Phase 4 SM secret; Phase 5b TokenCipher + Helm ExternalSecret mounts; future AWS Rotation Lambda | Single SM secret per partition with native versioning; two-axis envelope a{N}.k{SM-VERSION-ID}; hot-swap dual-mount; lazy + coroutine migration; SDK fallback |
| DQ-R1-020 | DNS-provisioning + SM-fallback IAM roles | Resolved | Phase 4 partition-email stack IAM declarations; Phase 5b STS-chain call sites in the Email module; AllowCreatingNSRecordsRole construct generalization (R-4) with Root no-drift guard | Two per-purpose roles per partition: DNS-records role via reuse of the existing AllowCreatingNSRecordsRole construct (generalized for a configurable trust principal; Root output byte-identical, guarded by unit test + verification); EmailEncryptionKeyFallbackRole fresh. Both STS-chained from the partition pod role; trust policy = account principal + ArnLike on {fqn}-*; mirrors the ImageUploadPreSigningRole pattern |
| DQ-R1-021 | Order of partition rollout | Resolved | Phase 4 + Phase 5b deploy order; kyle suspension | Partial order dev → {stage || demo} → prod; kyle excluded (per 2026-05-13 amendment) |
| DQ-R1-022 | Operator CLI shape for Phase 4 | Resolved | Phase 4 operator surface; refactoring of Phase 3 corporate-cli to extract shared utilities | Integrate into amm.sh; share utilities with corporate-cli; no standalone partition-mail-cli |
| DQ-R1-023 | Per-tenant Postmark Sender Signature introduction (Phase 5b) | Open — TBC at Phase 5b planning | Phase 5b tenant-onboarding flow; per-tenant reputation isolation strategy; whether EmailDnsProvisioningRole is exercised per-tenant or held in reserve | Four options (α / β / γ / δ); no Phase 4 dependency. To be confirmed when Phase 5b sees pilot data on tenant send volume and bounce / spam rates. |
| DQ-R1-024 | EmailEncryptionKey initial-value generation mechanism | Superseded by DQ-R1-032 | Phase 4 PartitionEmailStack secret declaration; V-PART-016 immutability test shape | GenerateSecretString |
| DQ-R1-025 | Pre-Deploy script’s cdk.context.json write strategy | Resolved | T-I8 entry-script implementation; tools/lib/context-store.ts helper; V-CLI-001 test shape | Hand-rolled atomic JSON merge namespaced under partitionMail:<infrastructure>:<partition>. No CDK runtime bootstrap; no cdk context --set per field. |
| DQ-R1-026 | Consolidation of per-partition rollout runs | Resolved | Run table in choreography / evaluation / specification; Run-3 deliverables (execution log + single infra PR); retirement of run-3-stage / run-4-demo / run-5-prod plans | Collapse Runs 3 / 4 / 5 into a single Run-3 operator cascade. The cascade walks the partial order from DQ-R1-021 — dev first, then stage and demo in either order or in parallel, then prod last — inside the single PR. One PR captures CHANGELOG + accumulated cdk.context.json + any code fixes. Run-6 / Run-7 numbering retained (gap intentional). |
| DQ-R1-027 | AppError.Application introduction | Resolved | common-module AppError hierarchy gains a third top-level branch; L4 error-mapping table updated; Phase 5b L3 services in the Email module use Application.PreconditionFailed / ConflictingState | Add sealed class Application with three subtypes (PreconditionFailed, PolicyRejected, ConflictingState); reportable() = emptyList() at the branch root; no HTTP-status hints on subtypes |
| DQ-R1-028 | Internal.IncompatibleState reclassification sweep | Resolved | common-module (62+ construction sites) reclassified per Phase 5a methodology; operations sweep is Phase 5b’s responsibility; CHANGELOG Changed -> major bump (10.0.0) | Discovery-then-classify methodology; sweep lands as the final Phase 5a PR (after the four additive minors); covers common-module only |
| DQ-R1-029 | sanitizeHeader value-cleaning primitive | Resolved | New lib/api/headers/ package in common-module; Phase 5b L4 endpoints consume it; composes downstream of the existing HeadersAllowList | sanitizeHeader(name, value): Result<String?>; accept / silent-drop / hard-reject; separate from HeadersAllowList (observability scoping); minor release (Added; 9.3.0) |
| DQ-R1-030 | TokenCipher + Hmac cryptographic helpers | Resolved | New lib/crypto/ package in common-module with two-axis envelope per DQ-R1-019; MaterialRegistry populated by the caller from a single ESO-projected JSON map (no application-side path to AWS Secrets Manager); OpaqueId.kt and S3AssetService.kt internally refactored to use the new Hmac helper (no external API change at those sites); Phase 5b consumes TokenCipher for at-rest server-token encryption; rotation enablement tracked as PDEV-659 | companion operator fun invoke(info, materials, currentVersionId) factory returning Result<TokenCipher>; AES-GCM auth-tag failure classified as Internal.IncompatibleState (bug-worthy); unknown versionId on decrypt classified as Transient.FailoverFailed (bounded propagation lag handled by existing retry layers); Hmac extracted and shared; Added |
| DQ-R1-031 | Idempotency helpers with native JsonElement + typed wrapper | Resolved | New lib/runtime/idempotency/ package in common-module; Phase 5b consumer wires IdempotencyStore<EmailSendRequest, EmailJob> and IdempotencyKeyMinter; the idempotency_record Flyway migration ships in Phase 5b’s consumer adoption, not in common-module | Two-tier API: RawIdempotencyStore (native JsonElement, symmetric Req/Res) + IdempotencyStore<Req, Res> typed wrapper via inline fun typedAs(). Decode-failure -> Result.failure(AppError.Internal.IncompatibleState); Mismatch carries recordedRequest: JsonElement; schema-evolution is caller-controlled; JSONB storage columns; Added; 9.5.0 |
| DQ-R1-032 | EmailEncryptionKey registry delivery and source of truth | Resolved (supersedes DQ-R1-024) | PartitionEmailStack secret declaration; platforms.ts op-references; amm.sh + Pre-Deploy script; operator vault-seeding step; key-rotation tooling (PDEV-881) | Deliver the {versionId: base64(64-byte key)} registry via a NoEcho CFN parameter (δ.1) sourced from op://Arda-{Env}OAM/EmailEncryptionKey/registry. 1Password is the source of truth; SM + pod are deploy-time projections; rotation appends entries newest-last and redeploys. |
Copyright: © Arda Systems 2025-2026, All rights reserved