Phase 0 Infrastructure PR Review (PR #445)

Historical record. This document reviews an earlier Phase-0 implementation that has been superseded by the rev1 design. The artifacts referenced below (scripts, constructs, paths under scripts/postmark-foundations/, etc.) are no longer part of the active codebase — they survive only as a closed-PR record. The review is preserved here for traceability of decisions and concerns that informed the current design; do not treat any path or filename as a live pointer.

Phase 0 Infrastructure PR Review — PR #445

In-depth review of Arda-cards/infrastructure#445 — the implementation of Phase 0: Postmark Foundations of the email-integration project. Scope intentionally constrained to Phase 0 (not Phase 1 / 2). This is a production-touching script + CDK review; the review was requested in depth and thorough.

The review document is co-authored from a static reading of the worktree at commit a17eab6, the Phase 0 specification artifacts under 0-postmark-foundations/, and a complete local execution of typecheck, lint, unit tests, npm-audit, and CDK assertion tests against the construct.

[!note] Reviewer: Claude Opus 4.7 for Miguel Pinilla. No PR-side files modified.

1. Executive Summary

Recommendation: PR is high-quality and substantively ready for review-pass-2 / merge to main after addressing the three Major findings below. No Blocker findings.

Highlights:

Idempotent orchestrator with bounded retries, structured logging with token redaction, dual-auth (DesktopAuth / service-account-token) 1Password wrapper, and a separate gha-secrets CLI — all behind a clear test surface (86 unit tests, 100% pass).
Defensive CDK construct with offline-safe synth, fail-fast validation of synth-time inputs, and a gated wire-in (INCLUDE_FREE_KANBAN_RECORDS=true) that keeps existing CI synth jobs unaffected.
A well-designed CI workflow with empty default permissions, scoped per-job permissions, path-filtered cheap jobs, integration job behind workflow_dispatch + schedule only, and an if: always() cleanup step that deletes integration-test resources unconditionally.
A safety guard on delete-platform-server that refuses to operate against PostmarkProd without an explicit --allow-prod-delete flag.
HUMAN-STEPS pre-flight checklist parsing as a real (not theatrical) gate on live-side-effecting code paths.

Top concerns:

M1 (RPN 64) — Octokit chain has 6 moderate ReDoS CVEs. Pinned @octokit/rest@19.0.13 brings transitive deps with known advisories. Real-world risk is low (only GitHub-controlled responses traverse the regexes), but compliance posture and npm audit exit-code-1 in CI matter.
M2 (RPN 80) — Hardcoded op:// PROD reference for the delete safety guard. The guard compares against a magic string; a 1Password-item rename silently bypasses the guard.
M5 (RPN 96) — cdk-synth job uses npm install (mutating) rather than npm ci (reproducible).

There are also Minor issues around env-var-bridged DKIM data into CDK, missing TXT-chunking test for >255-char DKIM values, action pinning to major tag rather than commit SHA, and several documentation deliverables specified in requirements.md that are not included in this PR.

2. Methodology

2.1 Review Dimensions

The review covered the dimensions agreed at session start, in priority order:

Specification conformance to 0-postmark-foundations/.
Security posture (auth surface, secret handling, IAM blast radius, supply chain).
Robustness (error handling, retries, idempotency, rollback / cleanup).
Engineering principles (DRY, single responsibility, modular cohesion, magic-literal hygiene, single return points where applicable — adapted from the project’s Kotlin conventions to TypeScript / CDK).
Test coverage (unit + assertion + integration).
Idempotency & re-runnability.
Blast radius & environment isolation.
Drift, deletion policy, replacement risk.
Stack-name & logical-ID stability.
Secrets handling end-to-end.
Observability & operability.
Cost & quotas.
Change-management mechanics.
CI/CD coverage.
Documentation & runbook.
FMEA: each finding scored on Severity × Occurrence × Detection (each 1-10), Risk Priority Number (RPN) reported. Action thresholds: RPN ≥ 100 = Blocker; 50-99 = Major; 20-49 = Minor; < 20 = Nit.

2.2 Local Verification Runs

All output captured in infrastructure/scratch/phase0-review/ (gitignored).

Check	Result
`cd scripts && npm ci`	✅ clean
`cd scripts && npm run build` (`tsc --noEmit`)	✅ clean
`cd scripts && npm run lint`	✅ clean
`cd scripts && npm run test:ci`	✅ 86 tests, all pass (PR description says 65 — count grew with `verify` + `delete` subcommands and is now higher)
`cd scripts && npm audit`	⚠️ 6 moderate (ReDoS) — see M1
Repo `npm install`	✅ clean
Repo `npm run build`	✅ clean
Repo `npm run lint`	✅ clean
Repo `npm test`	✅ 54 pass, 16 skipped (pre-existing), 2 suites skipped
Repo `npm audit`	⚠️ 23 vulns — pre-existing, not introduced by this PR
`npx jest src/main/cdk/constructs/email --ci`	✅ 9/9 pass

CDK synth:named was deliberately not run for this review:

The PR description already documents that synth:named -- Alpha001/prod succeeds both with and without INCLUDE_FREE_KANBAN_RECORDS=true.
The latest CI run on the PR shows all 7 synth-each-cdk-app (...) matrix jobs succeeding (build job included).
The FreePlatformServerRecordsStack uses fromHostedZoneAttributes (offline-safe lookup), so no AWS API calls are required to synth that stack.
Running synth would require an AWS-profile login. Per the agreed protocol the user is asked before each profile is used; for this review the green CI evidence + offline-safe code path is sufficient.

2.3 Source-of-Truth Inputs

Specification: 0-postmark-foundations/specification.md, requirements.md, verification.md, analysis.md.
Plan and decision context: 0-postmark-foundations/plan/task-plan.md, decision-log.md, goal.md.
PR metadata, file list, CI status, reviewer threads: GitHub MCP fetches against Arda-cards/infrastructure#445.
Codex automated reviewer (the only third-party reviewer) flagged two issues; both were resolved in commits 3180c8d and 2b5a12b and the review threads are marked resolved + outdated.

3. Specification Conformance

Phase 0 (per requirements.md) is divided into 6 requirement clusters: A (Account), B (Free Server), C (Documentation), D (Orchestrator), D-bis (Shared Lib), E (Reserved Slug), E-bis (CI Integration), F (Dependencies), G (Out of Scope).

Cluster	Requirement(s)	This PR delivers?
A	REQ-PM-ACCT-001/002/003	Operator runbook only (HUMAN-STEPS.md). Account creation is manual; PR delivers the gating checklist. ✅
B	REQ-PM-FREE-001..005	Orchestrator + CDK construct. ✅
C	REQ-PM-DOC-001/002/003/004	Not in this PR. `postmark-api-observations.md` is referenced (REQ-PM-DOC-003) but the OAM service docs (-001/002) and gha-secrets script docs (-004) are absent. See Finding N7.
D	REQ-PM-SCRIPT-001/002/003	All three present (orchestrator, CDK construct, gha-secrets standalone). ✅
D-bis	REQ-PM-LIB-001/002/003	Shared `scripts/lib/` with `OnePasswordClient` + `GitHubConfig` wrappers and dual-auth discovery. ✅
E	REQ-PM-SLUG-001	Reserved slug `platform` lives in the `operations` repo’s HOCON config; the requirements doc explicitly says “No change required”. This PR does not need to address it — but Phase-1 verification should confirm the constant is present. ✅ (out of this repo’s scope)
E-bis	REQ-PM-CI-001/002/003/004	All four jobs present in `.github/workflows/postmark-foundations.yml`. The integration job goes one step further than the spec by also running on a monthly `schedule` for drift detection. ✅
F	REQ-PM-DEP-001/002	Phase 1 dependency formally noted; the gated wire-in via `INCLUDE_FREE_KANBAN_RECORDS=true` honours it. ✅
G	Out-of-scope items	Honoured: no token rotation, no Secrets Manager copy, no per-tenant DMARC, no decommission flow (delete-platform-server is integration-test cleanup, not decommission). ✅

3.1 Plan Deviations (declared in PR body)

The PR body declares three plan deviations. Reviewer assessment:

Deviation	Assessment
Construct lives at `src/main/cdk/constructs/email/` not `lib/email/`	✅ Better. Matches the repo’s existing CDK layout convention. The spec’s `lib/email/` path predates that convention.
`@octokit/rest@19.0.13` pinned because v20+ is “ESM-only” and breaks ts-jest	⚠️ Partially incorrect. Octokit v20-21 stayed CJS-friendly; v22 was the ESM cutover. Bumping to 21.x is safe and would resolve the ReDoS chain (M1).
CDK construct uses env-var bridge for synth-time DKIM data	⚠️ Pragmatic but risky. See Finding M3. A future synth-time 1Password helper would close the gap.

3.2 Undeclared Scope Additions

The PR adds three things that go beyond the original Phase 0 spec. These are net-positive but worth noting:

delete-platform-server subcommand (orchestrator + CLI). Adds a CI-cleanup capability that was not in the spec but is operationally necessary because the integration job creates timestamped resources.
Monthly drift-detection schedule on the integration job. Good defensive engineering.
Auto-issue on scheduled-run failure (gh issue create step). Good operability hygiene.

4. Findings

4.1 Major Findings

M1 — Octokit dependency chain has 6 moderate ReDoS CVEs

Where: scripts/package.json pins @octokit/rest@19.0.13; transitive deps @octokit/plugin-paginate-rest@6.1.2, @octokit/request@6.2.8, @octokit/request-error@3.0.3 carry the advisories.
Advisories:
- GHSA-h5c3-5r3r-rr8q (@octokit/plugin-paginate-rest)
- GHSA-rmvr-2pp2-xj38 (@octokit/request)
- GHSA-xx4v-prfh-6cgc (@octokit/request-error)
Detection: npm audit (run locally) reports 6 moderate vulnerabilities, exit code 1.
Why it matters:
- ReDoS via catastrophic backtracking on URL / header parsing in the Octokit chain. Triggered only by responses traversing those regexes; in practice only GitHub’s own API responses do, and GitHub does not emit pathological strings.
- However: this is an infrastructure repo whose audit-clean posture should be a quality bar; a npm audit --omit=dev --audit-level=high gate in CI would be reasonable.
PR rationale (from PR body): “v20+ is ESM-only and breaks ts-jest under the parent repo’s CommonJS toolchain.” Verify against Octokit’s actual ESM cut-over: v20 dropped Node 14/16 but stayed CJS; v22 went ESM-only. Bumping to 21.x is CJS-friendly and resolves the chain.
FMEA:
- Severity = 4 (DoS of one operator script, no privilege escalation; not a network-facing service)
- Occurrence = 2 (only triggers on hostile GitHub responses, which are not realistic)
- Detection = 8 (already auto-detected by npm audit; CI does not currently fail on it)
- RPN = 64 — Major.
Recommendation: bump @octokit/rest to the latest CJS-friendly major (verify each minor against ts-jest), or accept residual risk and add npm audit --audit-level=high to CI so the moderate advisories don’t auto-resolve into a blind spot.

M2 — Hardcoded `op://` PROD reference for delete safety guard

Where: scripts/postmark-foundations/src/orchestrator.ts:464
```
const PROD_REF = 'op://Arda-SystemsOAM/Postmark-Prod/credential';
```
guarded by if (inputs.accountSecretReference === PROD_REF && !allowProdDelete && !integrationTest) throw ... (lines 474-483).
Why it matters: this is the only safety guard preventing accidental delete of PostmarkProd resources. It compares against a magic string. If the 1Password item is ever renamed (e.g., to Postmark-Prod-Account for clarity, or as part of a vault-restructure), the comparison silently returns false and the guard bypasses. The CLI still passes accountSecretReference(account) derived from the --account flag, but the resolution is keyed to the same magic string, so a rename in 1Password does not propagate to a code change.
Better shape: pass an explicit isProd: boolean from CLI -> orchestrator (CLI already knows from --account PostmarkProd), or extract a shared ACCOUNT_REFS table (used by cli.ts:220 and orchestrator.ts:464) so the magic strings collapse to one place.
FMEA:
- Severity = 10 (irreversible: PostmarkProd delete cascades to lost transactional email sender domains, broken DKIM, lost server tokens)
- Occurrence = 2 (rename is rare, but happens during vault clean-ups)
- Detection = 4 (unit tests would not detect; the test fixture asserts the guard fires for the current string)
- RPN = 80 — Major.
Recommendation: replace the magic-string equality with a typed flag passed from CLI. Add a unit test that asserts the guard fires when --account PostmarkProd is set even if the secret reference is changed.

M3 — CDK construct’s env-var bridge for synth-time DKIM data

Where: src/main/cdk/constructs/email/free-platform-server-records.ts:81-103 (readValuesFromEnv).
Why it matters: Operators must populate three env vars (FREE_KANBAN_DKIM_SELECTOR, FREE_KANBAN_DKIM_KEY, FREE_KANBAN_RETURN_PATH_TARGET) before cdk deploy. The values come from the 1Password item Free Kanban Generator Postmark. The bridge is a manual two-step coupling. Risk: when the DKIM key rotates (Postmark or operator-initiated rotation), the operator must remember to re-export env vars before re-deploying. A stale env var produces records that point at a no-longer-valid DKIM key, breaking email auth silently until the DKIM verify call notices.
Mitigations already present:
- Construct fails fast on empty / whitespace env vars (good).
- The verify subcommand can re-poll Postmark verification after deploy, which would surface drift (but only if the operator runs it).
What’s missing: no synth-time consistency check between env vars and the 1Password item. The PR explicitly defers a synth-time 1Password helper to a future iteration (“the construct’s external surface does not change when a future synth-time 1Password helper lands”).
FMEA:
- Severity = 6 (broken email auth for the Free Prod Kanban Tool until corrected)
- Occurrence = 4 (operator forgetfulness is common with two-step deploys, especially in low-frequency rotation events months apart)
- Detection = 3 (silent drift; only verify subcommand or downstream email-bounce surveillance catches it)
- RPN = 72 — Major.
Recommendation: short-term, add an operator-runbook reminder + an explicit npm run pre-deploy-check script that compares env-var values to the 1P item via the existing OnePasswordClient. Long-term, build the synth-time 1P helper the PR’s own comments anticipate.

M5 — `cdk-synth` CI job uses `npm install`, not `npm ci`

Where: .github/workflows/postmark-foundations.yml:155 (postmark-foundations:cdk-synth job).
Why it matters: npm install can mutate package-lock.json if any transitive dep advances. The other three jobs (typecheck-lint, unit, integration) correctly use npm ci. This is the only outlier. CI reproducibility + supply-chain hygiene argue for npm ci everywhere.
FMEA:
- Severity = 4 (CI-only impact; reproducibility / supply-chain drift)
- Occurrence = 4 (every CI run; transient drift only when a transitive minor advances)
- Detection = 6 (drift would not show in CI since it doesn’t commit back, but local devs running same command may see different node_modules; subtle)
- RPN = 96 — Major. (Highest RPN among the Major findings precisely because Occurrence is high; severity is low.)
Recommendation: change line 155 to run: npm ci.

M4 — No test for TXT-record chunking with >255-char DKIM keys

Where: src/main/cdk/constructs/email/free-platform-server-records.test.ts — the test fixture uses a 50-character DKIM key.
Why it matters: real RSA 2048-bit DKIM TXT values are >256 chars of base64. CDK’s TxtRecord chunks long strings into 255-byte segments at the formatTxt level (current aws-cdk-lib behaviour), but the test does not pin that contract. A future CDK upgrade that changed chunking, or an unexpected character class in the DKIM key, would not surface until live deploy.
FMEA:
- Severity = 6 (broken DKIM)
- Occurrence = 2 (CDK chunking is well-established)
- Detection = 4 (no test coverage; only deploy-time observation — and the construct does no synth-time validation of TXT chunking)
- RPN = 48 — borderline Minor / Major. Calling Major because the consequence is broken email auth, not a cosmetic issue.
Recommendation: add a fixture test using a realistic 400-char DKIM key (e.g., 'k=rsa; p=' + 'A'.repeat(400)) and assert via Template that the resulting TXT is correctly chunked.

4.2 Minor Findings

N1 — Logger field redaction relies on a key-name allowlist

Where: scripts/postmark-foundations/src/logging.ts:15-33 (SENSITIVE_KEYS).
The set is finite and case-sensitive. accountToken is in the set; account_token would not match. Token-pattern regexes (TOKEN_PATTERNS) catch typical token shapes (Postmark UUIDs, ghp_*, ops_*), but a hypothetical field with key account_token and a value that did not match any pattern would leak.
RPN: S=8 × O=2 × D=4 = 64. Borderline Major. Practical risk is low because the project’s own code uses camelCase keys.
Recommendation: normalise keys to a stable form (lowercase + strip non-alphanumeric) before allowlist comparison. Add a regression test for a account_token / ACCOUNT_TOKEN / Account-Token key shape.

N2 — `responseSnippet` in error logs is unredacted slice

Where: scripts/postmark-foundations/src/postmark-client.ts:275
```
responseSnippet: responseText.slice(0, 200),
```
The 200-character slice is from the Postmark error body. Postmark error bodies do not include tokens, but defensive practice is to pass through redact() (from logging.ts) before slicing.
RPN: S=4 × O=2 × D=6 = 48. Minor.
Recommendation: responseSnippet: redactString(responseText.slice(0, 200)) (or import redact from ./logging and apply).

N3 — `accountSecretReference` in CLI duplicates magic string from orchestrator

Where: scripts/postmark-foundations/src/cli.ts:220-228. The CLI hard-codes 'op://Arda-SystemsOAM/Postmark-Prod/credential' and the NonProd counterpart. The orchestrator hard-codes the same Prod string at line 464.
Twin definitions; rename in one place silently de-syncs from the other. Mild DRY violation.
Related to M2 — fixing M2 by collapsing both into a single ACCOUNT_REFS map would also fix this.

N4 — GitHub Actions pinned to major version, not commit SHA

Where: .github/workflows/postmark-foundations.yml — actions/checkout@v6, actions/setup-node@v6, dorny/paths-filter@v3.
Best practice for security-sensitive workflows is commit-SHA pinning to defeat tag-hijacking attacks. The repo’s other workflows likely follow the same major-tag convention; cross-cutting hardening is out of scope of this PR.
RPN: S=8 × O=1 × D=2 = 16. Nit / Minor borderline.

N5 — Missing OAM documentation deliverables

The Phase 0 specification calls for three documentation pages:
- current-system/oam/postmark-service/index.md (REQ-PM-DOC-001)
- current-system/oam/postmark-service/free-platform-server.md (REQ-PM-DOC-002)
- current-system/oam/scripts/{index,gha-secrets}.md (REQ-PM-DOC-004)
None are in PR #445. Likely shipped via a separate documentation-repo PR (typical for this workspace’s split between infrastructure and documentation repos), but PR #445 should cross-link to the documentation PR if it exists.
Action: add a “Companion documentation PR” line to the PR body (or a check on the PR body’s “Test plan” if such PR is not yet open).

N6 — Coverage gaps in CLI / logging / one-password

scripts-test-ci reports:
- postmark-foundations/src/cli.ts: 47.3% statements
- scripts/postmark-foundations/src/logging.ts: 68.75% statements (security-sensitive code; should be higher)
- scripts/lib/one-password.ts: 72.72% statements
- scripts/lib/github-config.ts: 97.91% statements ✅
- scripts/postmark-foundations/src/orchestrator.ts: 96.11% statements ✅
The CLI’s interactive defaultConfirm path is intentionally hard to test; the rest is reachable.
RPN for logging.ts specifically: S=8 × O=2 × D=6 = 96 — borderline Major. The redaction code paths for arrays / nested objects / token-pattern matching are not all covered. Calling Minor because the code path is small and recently reviewed; flag for next iteration.

N7 — Stack does not set `RemovalPolicy` on records

Where: src/main/cdk/stacks/purpose/free-platform-server-records-stack.ts. Records inherit the default DESTROY policy. Stack deletion deletes records, breaking email delivery during the destroy window.
This is the desired behaviour for clean rollback (orphaned records would be worse), so this is informational only.

N8 — Magic string `partition.locator.id === "prod"` in partition.ts

Where: src/main/cdk/apps/Al1x/partition.ts:214. Uses a string literal "prod" to gate the wire-in. The Partition type likely has an enum or constants; a typed comparison would be more robust.
RPN: trivial. Style nit.

4.3 Nit Findings

Nit-1: Uncommitted edit to scripts/postmark-foundations/HUMAN-STEPS.md in the worktree (B.1, B.2, B.3 marked [X]). Reflects real operator progress but should be committed before final merge so CI parses the same checklist as the operator. The orchestrator’s parser is case-insensitive (x|X) so this is purely cosmetic for now.
Nit-2: console.log("DONE WITH ${partitionPrefix} PURPOSE BUILD") at partition.ts:233 — pre-existing pattern in the file; this PR doesn’t introduce or change it.
Nit-3: PR description claims “65 tests” but coverage now reports 86 (delete + verify subcommands added in commits 3180c8d / 2b5a12b). Update PR body for accuracy.
Nit-4: The integration workflow’s gh issue create HEREDOC interpolates $RUN_URL, $POSTMARK_ACCOUNT, $(date). All values are server-controlled (github-context or constrained workflow_dispatch choice). Consider a leading comment documenting why interpolation here is safe for future maintainers.

5. Findings by Review Dimension

5.1 Security Posture

Strengths:

Empty default permissions: {} at workflow root; per-job least-privilege grants (contents: read, issues: write only on the integration job).
Secrets only injected into the integration job (not into typecheck/lint/unit/cdk-synth).
Secrets fail-fast precondition step (Verify required secrets are present) before any 1Password / Postmark call.
1Password dual-auth wrapper: DesktopAuth for local dev (no shared service-account token leaves the operator’s machine), service-account token for CI only.
gha-secrets libsodium-encrypts secret values client-side before upload via Octokit.
Token redaction in structured logs: by both field name (allowlist) and value shape (regex for UUIDs, ghp_*, ops_*).
delete-platform-server refuses to operate against PostmarkProd unless --allow-prod-delete (subject to the M2 caveat).
Integration workflow’s workflow_dispatch account input is a choice constrained to PostmarkNonProd only — operator cannot point a manual run at PostmarkProd via the dispatch UI.
HUMAN-STEPS pre-flight is enforced before any live API call.

Weaknesses:

M1 (Octokit ReDoS chain).
M2 (magic-string PROD safety guard).
N1 (case-sensitive redaction allowlist).
N2 (unredacted response snippets in error logs).
N4 (Action major-tag pinning vs. commit SHA).

5.2 Robustness

Strengths:

Idempotent create-or-skip on server, domain, and 1Password item.
--force gate on 1Password overwrite (refuses without explicit consent).
Bounded retry on 5xx + network errors with exponential backoff.
Structured PostmarkApiError preserves Postmark’s ErrorCode for programmatic handling.
Bounded verification polling (default 5 × 60s) with structured verify-incomplete log on budget exhaustion.
delete-platform-server is best-effort and idempotent (skip-if-not-found).
Cleanup step (if: always()) in the integration workflow guarantees no orphan resources.

Weaknesses:

M3 (env-var bridge drift).
N6 (coverage gaps in security-sensitive logging.ts).
The extractDkimSelector helper (orchestrator.ts:355-363) is fragile: assumes Postmark DKIM hosts always contain ._domainkey. Defensive code returns empty string on miss, which then causes downstream “selector empty” rejection — self-healing, but a pinned API observation in postmark-api-observations.md would reduce surprise.

5.3 Engineering Principles

Strengths:

Strong separation of concerns: shared lib (scripts/lib/) vs. consumer scripts (scripts/{postmark-foundations,gha-secrets}/); CLI vs. orchestrator vs. transport (postmark-client).
Dependency injection throughout: every external (fetch, setTimeout, OnePasswordClient, readChecklist, confirm) is a parameter or option, enabling clean testing.
Pinned dependency versions (no ^ / ~) for both 1Password SDK and Octokit — per the project’s stated v0-SDK risk posture.
Strict TypeScript configuration (strict, noImplicitAny, noImplicitReturns, noFallthroughCasesInSwitch).
Single export surface via lib/index.ts.

Weaknesses:

M2 / N3 (DRY violation on op:// references).
N8 (magic string "prod").
The CLI’s three subcommand entry points (runCreateSubcommand, runVerifySubcommand, runDeleteSubcommand) repeat the requireFlag plumbing. A small extraction could DRY the input collection. Cosmetic.

5.4 Test Coverage

Concern	Coverage
Postmark client: 5xx retry / 4xx fast-fail	✅ unit-tested with mocked fetch
Orchestrator: create / skip / overwrite / refuse paths	✅ 86 unit tests, all green
Orchestrator: verify-only path	✅ added in commit `3180c8d`
Orchestrator: delete-only path	✅ added in commit `a17eab6` (delete subcommand)
HUMAN-STEPS pre-flight bypass for `--integration-test`	✅ regression test added in commit `2b5a12b`
1Password wrapper: dual-auth discovery	✅ unit-tested with mocked SDK constructor
GitHub config: libsodium encrypt + Octokit upload	✅ unit-tested with mocked Octokit
CDK construct: TXT + CNAME assertions	✅ 9 assertion tests including 3 throw-path tests
CDK construct: TXT chunking on >255-char DKIM	❌ M4 — not exercised
Logger redaction: case-insensitive key matching	❌ N1 — not exercised
CDK construct: gating logic in partition.ts	⚠️ wire-in present, not unit-tested separately

5.5 Idempotency & Re-runnability

Orchestrator: ✅ list-then-create-or-skip on Postmark resources; --force gate on 1P.
CDK: ✅ records are CDK-managed; re-deploys are no-ops when env vars stable.
Delete: ✅ skip-if-not-found semantics on Postmark + 1P.
Integration job: ✅ unconditional cleanup step.

5.6 Blast Radius & Environment Isolation

✅ Construct wired into prod partition only (partition.locator.id === "prod"); other partitions unaffected.
✅ Workflow integration job restricted to PostmarkNonProd via the dispatch input choice list.
✅ Default-off gate (INCLUDE_FREE_KANBAN_RECORDS=true required).
⚠️ M2: the only safeguard against accidental PostmarkProd deletion is a magic-string comparison.

5.7 Drift, Deletion Policy, Replacement Risk

✅ Stack uses fromHostedZoneAttributes (no drift between synth and reality on the zone reference).
✅ TTL 300s on both records as future-migration hygiene (per the spec’s “long-term direction is a dedicated platform.prod.ardamails.com zone” intent).
✅ No stateful resource changes; CDK records are stateless.
⚠️ N7: stack delete = record delete = email-auth break. Acceptable but documented as a known fact.

5.8 Stack-Name & Logical-ID Stability

✅ Stack name composed deterministically: ${partitionPrefix}-FreeKanbanPostmarkRecords -> Alpha001-prod-FreeKanbanPostmarkRecords.
✅ Logical IDs (prod-ardamails-zone, Records, DkimRecord, ReturnPathRecord) are stable and not derived from anything mutable.
✅ This is a new stack, so no migration risk; on first deploy, names lock in.
⚠️ Per the project’s CF-name-immutability rule: once deployed, do not rename FreePlatformServerRecordsStack’s stack id (${partitionPrefix}-FreeKanbanPostmarkRecords). If a future refactor wants a different name, it must use the CF stack-refactoring protocol (move resources between stacks via metadata).

5.9 Secrets Handling End-to-End

Boundary	Mechanism
Operator → 1Password	DesktopAuth biometric prompt (local) / service-account token (CI)
1Password → Orchestrator	`client.secrets.resolve("op://...")` returns plaintext token in process memory only
Orchestrator → Postmark	`X-Postmark-Account-Token` header; never logged
Postmark → 1Password Free-server item	`serverToken` field marked `Concealed`
GitHub Actions secret upload (gha-secrets)	libsodium client-side encryption with the repo’s public key before Octokit POST
Logs	Field-name allowlist + token-shape regexes (subject to N1 / N2)

This is a clean chain. The two lifecycle gaps:

No token rotation runbook — explicitly out of scope per requirements.md § G.
No AWS Secrets Manager mirror of the Free server token — explicitly out of scope per requirements.md § G (Free server is not consumed by Arda’s runtime, so it doesn’t need ESO projection).

5.10 Observability & Operability

Strengths:

Structured JSON logs at every phase boundary (orchestrator.checklist, orchestrator.list-servers, orchestrator.create-server, …).
Auto-issue creation on scheduled-run failure (with run URL, account, and likely-cause hints).
HUMAN-STEPS.md as a parsable runbook — the orchestrator surfaces unchecked items by line number.
Exit codes: 0 / 1 / 2 are distinct (success / runtime-failure / argument-failure) — helpful for CI gating.

Gaps:

No CloudWatch logs / metrics / alarms (out of scope — the construct is just two Route53 records).
No dashboard or runbook link in the orchestrator’s success log line. Minor.

5.11 Cost & Quotas

Two Route53 record sets (~$0.50/year combined). Negligible.
One Postmark Platform-plan account contract — already in place per HUMAN-STEPS B.1.
One 1Password vault item — negligible.
No NAT / KMS / Lambda cost added. ✅

5.12 Change-Management Mechanics

Deployment order: Phase 1 (prod.ardamails.com zone) must land first; this PR’s gate enforces that ordering by requiring PROD_ARDAMAILS_ZONE_ID to be set.
Rollback: cdk destroy of the gated stack removes the records. (Consistent with N7.)
Manual approval: prod-targeted records require operator to set three env vars + the gate flag, which is a strong ceremony gate.
The companion documentation PR (REQ-PM-DOC-001/002/004) should land before / alongside this PR so operators have the full runbook.

5.13 CI/CD Coverage

Five jobs: filter, typecheck-lint, unit, cdk-synth, integration. Plus the existing synth-each-cdk-app matrix and build jobs (unaffected, all green on the latest commit).
Path-filtered cheap jobs.
Draft-PR exclusion (!github.event.pull_request.draft) on the cheap jobs — the latest CI run on the PR shows them skipped because the PR is still draft.
Schedule-driven drift detection.
Issue-on-failure for unattended runs.
Outlier: M5 (npm install in cdk-synth).

5.14 Documentation & Runbook

HUMAN-STEPS.md is excellent (machine-parseable, parsable line-by-line).
scripts/lib/README.md documents the auth conventions and “how to add a new wrapper”.
cli.ts has a HELP_TEXT block that is comprehensive and accurate.
Gaps: REQ-PM-DOC-001/002/004 OAM service docs (N5).

5.15 FMEA Summary

ID	Finding	S	O	D	RPN	Class
M5	`cdk-synth` job uses `npm install`	4	4	6	96	Major
M2	Hardcoded `op://` PROD reference safety guard	10	2	4	80	Major
M3	CDK env-var bridge for synth-time DKIM data	6	4	3	72	Major
M1	Octokit chain ReDoS CVEs	4	2	8	64	Major
N1	Logger field-name allowlist case-sensitive	8	2	4	64	Minor
M4	No TXT-chunking test for >255-char DKIM keys	6	2	4	48	Major
N2	`responseSnippet` unredacted in error logs	4	2	6	48	Minor
N3	Magic-string `op://` ref duplicated in CLI	4	3	4	48	Minor
N6	Coverage gaps in `logging.ts` / `one-password.ts`	8	2	6	96	Minor*
N4	Action major-tag pin not commit SHA	8	1	2	16	Nit
N5	Missing OAM documentation deliverables	—	—	—	—	Scope
N7	Stack `RemovalPolicy` not `RETAIN`	—	—	—	—	Info
N8	Magic string `"prod"` in partition.ts	—	—	—	—	Style

* N6’s coverage RPN is high (96) due to the security sensitivity of logging.ts, but the absolute coverage % (68.75) is acceptable for a Phase-0 deliverable. Calling Minor because the lines actually covered include all the tested-redaction paths.

The threshold table:

RPN ≥ 100 = Blocker. None.
50-99 = Major. 5 findings (M1, M2, M3, M5, M4 borderline).
20-49 = Minor. 3 findings (N1, N2, N3 — and the borderline M4 if you classify it Minor).
< 20 = Nit. 1 finding (N4).

6. Recommendations (Prioritised)

Before merge

M5: change npm install to npm ci in the cdk-synth job. One-line fix; trivial; high RPN.
M2: replace the PROD_REF magic-string equality with a typed isProd: boolean flag from CLI -> orchestrator (or a single shared ACCOUNT_REFS map used by both). Add a regression test.
M1: bump @octokit/rest to the latest CJS-friendly major (verify 21.x + ts-jest works); rerun npm audit and confirm clean. If a breaking change is unavoidable on the SDK consumer side, file a follow-up issue and add npm audit --audit-level=high to a CI job (so future regressions surface as failures, not warnings).

Before live deploy

M3: add a pre-deploy-check script that compares operator env vars to the 1Password item via OnePasswordClient.readField. Document in HUMAN-STEPS.md.
M4: add a CDK construct test using a 400-char synthetic DKIM key to pin the TXT-chunking contract.
N1: case-insensitive normalisation of redaction key allowlist + a regression test.

Phase 1 / cleanup

N5: confirm the documentation PR (REQ-PM-DOC-001/002/004) is open and link it from PR #445.
N6: bring logging.ts coverage above 90% (it’s security-sensitive code).
N4: as a workspace-wide hardening initiative (not this PR), pin all workflow Actions to commit SHAs.
Long-term M3 follow-up: build the synth-time 1Password helper the construct’s docstring already anticipates.

After merge

Run the operator deploy of Phase 1 first (so PROD_ARDAMAILS_ZONE_ID becomes available), then exercise the orchestrator against PostmarkProd, then export the three FREE_KANBAN_* env vars + INCLUDE_FREE_KANBAN_RECORDS=true, then cdk deploy Alpha001-prod-FreeKanbanPostmarkRecords. Run the orchestrator’s verify subcommand to confirm DKIM + Return-Path verification.
Schedule a one-time agent in 30 days to confirm the monthly drift-detection job has fired at least once and surface its result (or open the auto-issue).

7. Tooling and Process Gaps Encountered During Review

These are observations about the review process, not the PR itself. Captured for the workspace’s continuous-improvement track.

No security-review skill specific to AWS / CDK / Node infrastructure that codifies the threat-model checklist used in §5.1. The Anthropic claude-code-security-review GitHub Action exists but is CI-shaped, not local-review-shaped.
No fmea / production-readiness-review skill. The RPN scoring in §4 / §5.15 was applied freehand; codifying the rubric (S/O/D scales, action thresholds) would make future reviews directly comparable.
No infra-pr-review skill as a counterpart to pr-steward. Key checklist items (stack-name immutability, removal-policy hygiene, cross-stack ref pinning, TXT-chunking, env-var-bridge drift) are reusable across PRs.
awslabs/agent-plugins plugins (now installed: aws-amplify, aws-serverless, codebase-documentor-for-aws, databases-on-aws, deploy-on-aws) were not directly invoked during this review; their value will surface during Phase 1 implementation. Note the Snyk ToxicSkills caveat from the install discussion: even AWS-published plugins should be inspected before relying on them in security-sensitive contexts.

8. Appendices

8.1 Files Changed in PR #445

30 files; +13580 / -0:

New: .github/workflows/postmark-foundations.yml
New: scripts/{.prettierrc.json, eslint.config.mjs, jest.config.js, package.json, package-lock.json, tsconfig.json}
New: scripts/lib/{README.md, github-config.ts, github-config.test.ts, one-password.ts, one-password.test.ts, index.ts}
New: scripts/gha-secrets/src/cli.ts + scripts/gha-secrets/test/cli.test.ts
New: scripts/postmark-foundations/{HUMAN-STEPS.md, src/{cli.ts, orchestrator.ts, postmark-client.ts, logging.ts, human-steps.ts}, test/{cli.test.ts, orchestrator.test.ts, postmark-client.test.ts}}
New: src/main/cdk/constructs/email/{free-platform-server-records.ts, free-platform-server-records.test.ts}
New: src/main/cdk/stacks/purpose/free-platform-server-records-stack.ts
Modified: src/main/cdk/apps/Al1x/partition.ts (+32 lines for the gated wire-in)
Modified: CHANGELOG.md (one v2.28.0 entry covering all of Phase 0)
Modified: .gitignore (+1 line: **/coverage/)

8.2 Local Verification Outputs

Captured under infrastructure/scratch/phase0-review/:

npm-ci-scripts.log, scripts-build.log, scripts-lint.log, scripts-test-ci.log — all clean
scripts-audit.txt, scripts-audit.json — 6 moderate ReDoS advisories (M1)
repo-install.log, repo-build.log, repo-lint.log, repo-tests.log — all clean (54 pass / 16 skipped pre-existing)
repo-audit.txt, repo-audit.json — 23 vulns total; pre-existing transitive deps in aws-cdk-lib / aws-sdk / brace-expansion / etc., not introduced by this PR
cdk-construct-tests.log — 9/9 pass for the new construct’s assertion tests

8.3 Reviewed PR Comment Threads

Two Codex-automated review threads:

“Split verify flow from create-platform-server” (P1) — Resolved by commit 3180c8d (verify subcommand with its own flag surface). ✅
“Preserve integration-test checklist bypass in CLI wiring” (P1) — Resolved by commit 2b5a12b (wrapConfirmWithIntegrationBypass at the CLI boundary; regression test included). ✅

No outstanding human-reviewer comments on the PR at time of review.

8.4 References

PR: https://github.com/Arda-cards/infrastructure/pull/445
Spec: specification.md
Requirements: requirements.md
Verification: verification.md
Plan: plan/task-plan.md
Decision log: decision-log.md
KB pointer (added in this session): infrastructure/knowledge-base/platform-configuration.md

[!note] Review authored by Claude Opus 4.7 for Miguel Pinilla on 2026-05-01.

Phase 0 Infrastructure PR Review (PR #445)

Phase 0 Infrastructure PR Review — PR #445

1. Executive Summary

2. Methodology

2.1 Review Dimensions

2.2 Local Verification Runs

2.3 Source-of-Truth Inputs

3. Specification Conformance

3.1 Plan Deviations (declared in PR body)

3.2 Undeclared Scope Additions

4. Findings

4.1 Major Findings

M1 — Octokit dependency chain has 6 moderate ReDoS CVEs

M2 — Hardcoded op:// PROD reference for delete safety guard

M3 — CDK construct’s env-var bridge for synth-time DKIM data

M5 — cdk-synth CI job uses npm install, not npm ci

M4 — No test for TXT-record chunking with >255-char DKIM keys

4.2 Minor Findings

N1 — Logger field redaction relies on a key-name allowlist

N2 — responseSnippet in error logs is unredacted slice

N3 — accountSecretReference in CLI duplicates magic string from orchestrator

N4 — GitHub Actions pinned to major version, not commit SHA

N5 — Missing OAM documentation deliverables

N6 — Coverage gaps in CLI / logging / one-password

N7 — Stack does not set RemovalPolicy on records

N8 — Magic string partition.locator.id === "prod" in partition.ts

4.3 Nit Findings

5. Findings by Review Dimension

5.1 Security Posture

5.2 Robustness

5.3 Engineering Principles

5.4 Test Coverage

5.5 Idempotency & Re-runnability

5.6 Blast Radius & Environment Isolation

5.7 Drift, Deletion Policy, Replacement Risk

5.8 Stack-Name & Logical-ID Stability

5.9 Secrets Handling End-to-End

5.10 Observability & Operability

5.11 Cost & Quotas

5.12 Change-Management Mechanics

5.13 CI/CD Coverage

5.14 Documentation & Runbook

5.15 FMEA Summary

6. Recommendations (Prioritised)

Before merge

Before live deploy

Phase 1 / cleanup

After merge

7. Tooling and Process Gaps Encountered During Review

8. Appendices

8.1 Files Changed in PR #445

8.2 Local Verification Outputs

8.3 Reviewed PR Comment Threads

8.4 References

M2 — Hardcoded `op://` PROD reference for delete safety guard

M5 — `cdk-synth` CI job uses `npm install`, not `npm ci`

N2 — `responseSnippet` in error logs is unredacted slice

N3 — `accountSecretReference` in CLI duplicates magic string from orchestrator

N7 — Stack does not set `RemovalPolicy` on records

N8 — Magic string `partition.locator.id === "prod"` in partition.ts