Phase 3 -- Implementation Learnings

Substantive insights from Phase 3 implementation that future phases (and future operator walkthroughs) should benefit from. Each learning ties back to a concrete moment in the implementation and to an artefact that captures it.

L-1: Design decisions that constrain code must be encoded as code

The single most expensive defect in Phase 3 (dqr1009-divergence.md) traces to a design decision that was recorded in three prose locations (the decision log entry, a docstring, and the operator runbook) but in no value or function that any code consumed. The CLI side honored the decision inline by accident; the CDK construct’s API conflated two distinct record placements under a single parameter and silently violated the decision. Single-side tests passed because each side was internally consistent.

Take-away: when a decision constrains code in more than one place, encode it as a typed value or function and have every consumer read it. Adding cross-seam tests (assert that consumer A’s derived value equals consumer B’s derived value) closes the gap that single-side tests inherently miss. The pattern is documented in PR #450 commit cd85527 (a typed sendingDomainPlacement() plus a corporate-drift cross-seam assertion).

Applies workspace-wide. The corollary for the broader codebase: any decision recorded in decision-log.md whose constraint reaches code should either (a) point at a function or value that implements it, or (b) carry an explicit acknowledgement that no code expression exists and the constraint is operator-enforced.
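A minimal sketch of the take-away above, not the code from PR #450. Only the name sendingDomainPlacement() comes from this document; its signature, the RecordKind and Placement types, the two consumer functions, and the zone names and placement values are all hypothetical placeholders chosen for illustration.

```typescript
// All names except sendingDomainPlacement() are hypothetical, and the
// placement values and zone names below are placeholders.
type RecordKind = "dkim" | "returnPath";
type Placement = "parent" | "leaf";

// The design decision, expressed once as a typed value.
const PLACEMENT: Readonly<Record<RecordKind, Placement>> = {
  dkim: "parent",
  returnPath: "leaf",
};

function sendingDomainPlacement(kind: RecordKind): Placement {
  return PLACEMENT[kind];
}

const PARENT_ZONE = "example.com";
const LEAF_ZONE = "mail.example.com";

// Consumer A: stand-in for the CLI's expectation of where a record lives.
function cliExpectedRecordName(kind: RecordKind, host: string): string {
  const zone = sendingDomainPlacement(kind) === "parent" ? PARENT_ZONE : LEAF_ZONE;
  return `${host}.${zone}`;
}

// Consumer B: stand-in for the name the CDK construct would synthesize.
function constructRecordName(kind: RecordKind, host: string): string {
  const zone = sendingDomainPlacement(kind) === "parent" ? PARENT_ZONE : LEAF_ZONE;
  return `${host}.${zone}`;
}

// Cross-seam check: returns the record kinds on which the consumers disagree.
function crossSeamMismatches(host: string): RecordKind[] {
  const kinds: RecordKind[] = ["dkim", "returnPath"];
  return kinds.filter(
    (k) => cliExpectedRecordName(k, host) !== constructRecordName(k, host),
  );
}
```

In this sketch the two consumers agree by construction because both read the same function. The cross-seam check earns its keep once one of them drifts, for example by hard-coding a zone instead of calling sendingDomainPlacement().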

L-2: A `cdk diff` of zero against a deployed stack means “the template did not change”, not “the deployed system is correct”

cdk diff reported zero differences for CorporateMailDns after the DQ-R1-009 fix because Corporate’s records (zone, SPF, DMARC, NS-delegation CR) were unchanged by the fix; only FreeKanbanToolMailDns’s DKIM record changed. Before the fix, cdk diff against the deployed system had also reported “no differences” for both stacks. That report was accurate as far as it went: the live state matched the code. Both the code and the live state, however, were wrong relative to the design intent.

Take-away: cdk diff is a code-vs-deployed comparison. It is not a design-vs-deployed comparison. The drift check is the artefact that crosses the seam between code (and what the code intends) and the deployed-plus-external state (DNS, Postmark, 1Password). Phase 3’s drift check now performs both classes of assertion; future phases should add cross-seam drift checks for any third-party state the IaC depends on.
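The distinction can be sketched as a comparison between records the design expects and a snapshot of what DNS actually serves. This is an illustrative sketch, not the Phase 3 drift check: ExpectedRecord, LiveSnapshot, DriftFinding, and findDrift are hypothetical names, and the snapshot is assumed to have been fetched elsewhere (for example with resolveTxt and resolveCname from node:dns/promises) and keyed as shown.

```typescript
// Hypothetical shapes; the real drift check's types are not shown in this document.
interface ExpectedRecord {
  name: string;
  type: "TXT" | "CNAME";
  value: string;
}

// Live answers keyed by "TYPE:name" (lower-case, no trailing dot).
type LiveSnapshot = Record<string, string[]>;

interface DriftFinding {
  name: string;
  type: string;
  expected: string;
  actual: string[];
}

function snapshotKey(name: string, type: string): string {
  return `${type}:${name.toLowerCase().replace(/\.$/, "")}`;
}

// Design-vs-deployed: report every expected record that live DNS does not serve.
function findDrift(expected: ExpectedRecord[], live: LiveSnapshot): DriftFinding[] {
  const findings: DriftFinding[] = [];
  for (const rec of expected) {
    const actual = live[snapshotKey(rec.name, rec.type)] || [];
    if (!actual.some((v) => v === rec.value)) {
      findings.push({ name: rec.name, type: rec.type, expected: rec.value, actual });
    }
  }
  return findings;
}
```

Because the expected list is derived from the design rather than from the synthesized template, a record that code and deployment agree on but that contradicts the design still shows up as a finding.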

L-3: Postmark’s “domain verification” indicators are not symmetric across the three faces (SPF, DKIM, Return-Path)

In the buggy state we observed SPFVerified: true on arda.cards even though the live DNS SPF record had no Postmark include: mechanism. Postmark reported a stored SPFTextValue that matched its template (include:spf.mtasv.net), not what the customer’s DNS actually served. DKIMVerified, by contrast, was a strict, dynamic check (Postmark re-polls and updates), and ReturnPathDomainVerified was similarly dynamic. The three indicators therefore answer subtly different questions: “did this verify at some point?” vs “is this currently correct?”.

Take-away: for any drift check that consumes Postmark Account API fields, treat them as indicators of distinct contracts and verify each against live DNS independently. Do not infer one from another (e.g., do not assume that all-three-true means the system is healthy). The Phase 3 drift check now uses Postmark’s DKIMPendingHost / DKIMHost / ReturnPathDomain as authoritative expectations and cross-references them against (a) the placement function and (b) live DNS lookups.
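A sketch of checking each face against live DNS independently, reporting Postmark's flag and the live result side by side. The field names beyond those mentioned above (Name, DKIMTextValue, DKIMPendingTextValue, ReturnPathDomainCNAMEValue) are this sketch's understanding of the Postmark Domains API response and should be verified against the current API reference; LiveDns, FaceReport, and checkFaces are hypothetical names.

```typescript
// Subset of a Postmark domain-details response (field names assumed; verify).
interface PostmarkDomainDetails {
  Name: string;
  SPFVerified: boolean;
  DKIMVerified: boolean;
  DKIMHost: string;
  DKIMTextValue: string;
  DKIMPendingHost: string;
  DKIMPendingTextValue: string;
  ReturnPathDomain: string;
  ReturnPathDomainVerified: boolean;
  ReturnPathDomainCNAMEValue: string;
}

// Hypothetical snapshot of live answers; keys are lower-case, no trailing dot.
interface LiveDns {
  txt: Record<string, string[]>;
  cname: Record<string, string>;
}

interface FaceReport {
  postmarkSays: boolean;
  liveDnsSays: boolean;
}

const norm = (s: string): string => s.toLowerCase().replace(/\.$/, "");

function checkFaces(
  d: PostmarkDomainDetails,
  live: LiveDns,
): Record<"spf" | "dkim" | "returnPath", FaceReport> {
  // SPF: look for the Postmark include in the SPF record actually served.
  const spfLive = (live.txt[norm(d.Name)] || [])
    .filter((t) => /^v=spf1(\s|$)/i.test(t))
    .some((t) => /(^|\s)include:spf\.mtasv\.net(\s|$)/i.test(t));

  // DKIM: use the active host if set, otherwise the pending one.
  const dkimHost = d.DKIMHost || d.DKIMPendingHost;
  const dkimValue = d.DKIMTextValue || d.DKIMPendingTextValue;
  const dkimLive =
    dkimHost !== "" && (live.txt[norm(dkimHost)] || []).some((t) => t === dkimValue);

  // Return-Path: the CNAME target must match what Postmark expects.
  const target = live.cname[norm(d.ReturnPathDomain)];
  const rpLive = target !== undefined && norm(target) === norm(d.ReturnPathDomainCNAMEValue);

  return {
    spf: { postmarkSays: d.SPFVerified, liveDnsSays: spfLive },
    dkim: { postmarkSays: d.DKIMVerified, liveDnsSays: dkimLive },
    returnPath: { postmarkSays: d.ReturnPathDomainVerified, liveDnsSays: rpLive },
  };
}
```

Any face where postmarkSays and liveDnsSays disagree is worth surfacing on its own; the sketch deliberately does not collapse the three results into a single health flag.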

L-4: Account-level vs server-level controls on Postmark are independent

Phase B’s send-a-test was blocked by Postmark’s pending-account-approval gate (account level) even though the FreeKanbanTool server was already DeliveryType: Live (server level). The account is in “pending approval” until Postmark Compliance approves it; until then, recipient addresses must share the From: domain regardless of any server’s delivery type. Approval is a compliance review distinct from per-server provisioning.

Take-away: the Phase A server-delivery-type log event tells you “the server can in principle deliver real mail”, not “the account can deliver real mail externally”. The runbook should mention both checkpoints (server DeliveryType and account approval status) as separate prerequisites for external sending. For new Postmark accounts in any future project, expect a 1-2 business day Compliance loop before external send-a-test will succeed.
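The two checkpoints can be modelled as separate preconditions. This is a hedged sketch: SendingState, SendCheck, and canSendTestTo are hypothetical names, and accountApproved is assumed to be supplied by the operator, since this document does not name an API field that reports account approval.

```typescript
type DeliveryType = "Live" | "Sandbox";

// Hypothetical view of the two independent gates.
interface SendingState {
  serverDeliveryType: DeliveryType; // server-level checkpoint
  accountApproved: boolean; // account-level checkpoint (operator-supplied)
  fromAddress: string;
}

interface SendCheck {
  ok: boolean;
  reason: string;
}

function domainOf(address: string): string {
  return address.slice(address.lastIndexOf("@") + 1).toLowerCase();
}

// Decide whether a send-a-test to `recipient` is expected to be accepted.
function canSendTestTo(recipient: string, s: SendingState): SendCheck {
  if (s.serverDeliveryType !== "Live") {
    return { ok: false, reason: "server DeliveryType is not Live" };
  }
  if (!s.accountApproved && domainOf(recipient) !== domainOf(s.fromAddress)) {
    return {
      ok: false,
      reason: "account pending approval: recipient must share the From: domain",
    };
  }
  return { ok: true, reason: "no checkpoint blocks this send" };
}
```

Keeping the two conditions as separate branches means the failure reason tells the operator which gate to chase: server provisioning or Compliance approval.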

L-5: Postmark Sender Signature deletion is non-destructive at the DNS layer

When we deleted the stale arda.cards Sender Signatures (one per account, after Jon’s suggestion), the DNS records in the arda.cards zone (SPF + Return-Path CNAME) remained untouched. Postmark’s DELETE /domains/{id} removes the Sender Signature record on Postmark; it does not touch customer DNS. Note also that the arda.cards zone is not managed by this project’s CDK: the SPF and Return-Path CNAME records there are operator-managed.

Take-away: Postmark Sender Signature lifecycle is decoupled from DNS lifecycle. Cleanup of the unused DNS records in arda.cards is a separate, operator-driven step (out of scope for Phase 3; harmless if left in place). For sub-domains whose DNS is CDK-managed (e.g., arda.ardamails.com), removing a Sender Signature would orphan the DKIM TXT and Return-Path CNAME until the next cdk deploy. This is worth documenting in any future operator runbook that retires a Sender Signature.

L-6: CFN `RecordSet` Name changes force `replace`, not in-place update

The DKIM TXT record move from leaf to parent was an update-with-replace in CloudFormation (create new, then delete old), not an in-place modification. Route53 keys record sets on (Name, Type), so changing Name yields a new resource. The resulting brief overlap (both records exist for seconds) is intentional and avoids a DNS outage during the move. cdk diff flagged the change as “requires replacement” with a clear [~] indicator.

Take-away: any future change that alters a record’s Name (zone-relative or absolute) will replace, not update. For records that are load-bearing (DKIM, MX, etc.), this is a feature — the new record is fully created and propagated before the old one is removed. For records whose values evolve in place (TTL, ResourceRecords content), CDK performs an in-place update with no overlap. The drift check’s TTL=300s default on the email records keeps the propagation window short.
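The rule above can be sketched as a small classifier for reviewing planned record changes. RecordSetProps, ChangeKind, and classifyChange are hypothetical names; the sketch models only the behaviour described in this section (a Name change replaces, a TTL or value change updates in place) and does not attempt to cover other RecordSet properties.

```typescript
// Hypothetical, simplified view of a record set's properties.
interface RecordSetProps {
  name: string;
  type: string;
  ttl: number;
  values: string[];
}

type ChangeKind = "none" | "in-place" | "replace";

function sameValues(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Classify a planned change the way this section describes CloudFormation's behaviour.
function classifyChange(before: RecordSetProps, after: RecordSetProps): ChangeKind {
  // A different Name means a new resource: create new, then delete old.
  if (before.name !== after.name) return "replace";
  // TTL or record content evolves in place, with no overlap window.
  if (before.ttl !== after.ttl || !sameValues(before.values, after.values)) {
    return "in-place";
  }
  return "none";
}
```

The name comparison is a literal string comparison, so a change that only adds or removes a trailing dot is classified as a replace here; whether CloudFormation treats it the same way is not established in this document.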