Skip to content

Infrastructure Stream — Learnings

A constants map only solves the problem if the code consumes it (PDEV-452)

Section titled “A constants map only solves the problem if the code consumes it (PDEV-452)”

The AMPLIFY_REGION_OVERRIDES associative array in amm.sh was added with the intent of supporting cross-region partitions like Alpha001:prod (whose Amplify app lives in us-east-2 while the rest of the partition is in us-east-1). The map was documented as “consumed by GitHub Actions workflows” and parked at the top of the file as a constants block. Meanwhile, the partition-loop code paths that performed aws amplify ... calls — drift check, compute-role attach, env-var merge — silently relied on the AWS CLI default region. For three of the four partitions this matched the Amplify region by coincidence; for prod it didn’t. The drift check failed against prod with NotFoundException and stopped the rollout in its tracks.

Lesson: when a constants map exists to override behaviour, the code paths that need the override must consume the map directly — not “the AWS CLI will figure it out” or “the outer workflow will pass --region”. Document this as a structural invariant: every aws amplify ... call site in amm.sh resolves its region from AMPLIFY_REGION_OVERRIDES (or falls back to the script-resolved AWS_REGION) and passes --region explicitly. No fall-through to CLI defaults.

A second-order lesson: latent regressions of this kind surface only when the previously-coincidental match breaks. Cross-region anomalies should be treated as first-class test cases — even if there’s only one such partition in the fleet today.

Default-fall-through is a fragile pattern in deploy scripts

Section titled “Default-fall-through is a fragile pattern in deploy scripts”

The reason PDEV-452 surfaced during the rollout (not before) is that no test exercised the prod-region branch independently. The script was relying on the outer caller to pass --region us-east-2 explicitly for prod operations. When the steward attempt used --region us-east-2 at the outer flag, the aws eks describe-cluster call in the same script aimed at the wrong region (EKS is in us-east-1) and failed elsewhere. Two regions in one partition is rare but real, and the script’s “one outer region” model couldn’t represent it.

Lesson: explicit --region at every call site has three concrete benefits: (a) it makes the call self-documenting, (b) it isolates each call from outer-region overrides, and (c) it removes the conditional branch that says “if the override applies, do X; else do Y”. Lower cyclomatic complexity, fewer failure modes. The cost is a few extra characters per call site.

1Password is the single source of truth — don’t duplicate

Section titled “1Password is the single source of truth — don’t duplicate”

The Phase-2 plan briefly considered a parallel AMAZON_CREATORS_API_JSON_<partition> GitHub Org secret pattern as a CI-only short-circuit (similar to how some other vendors are wired). The plan was reversed in favour of installing the op CLI in the workflow and reading from 1Password directly, using the already-provisioned OP_SERVICE_ACCOUNT_TOKEN. The decision avoided a parallel provisioning channel and kept one credential to rotate per partition, not two.

Lesson: prefer the existing secret-resolution channel even when it requires installing a CLI in CI. Parallel channels are tempting because the new channel is simpler in isolation, but they double the maintenance surface and create drift risk between local (amm.sh) and CI (amm.yml) behaviour.

Drift checks must be louder than the thing they check

Section titled “Drift checks must be louder than the thing they check”

The inline-BuildSpec drift check was added because three of the four Amplify apps had a stale inline BuildSpec that overrode the in-repo amplify.yml. The check is now load-bearing: every partition deploy fails fast if a non-empty inline BuildSpec reappears, with a remediation command in the error output. Two refinements during implementation:

  • The check originally treated None (AWS CLI --output text null marker) and the JSON string null as non-empty, producing false-positive drift failures on partitions where buildSpec was unset. Both are now treated as empty.
  • SSO login was moved before the AWS calls in the drift check; on an expired local SSO session the check would otherwise silently skip via suppressed stderr instead of surfacing the auth failure.

Lesson: a drift check that fails open is worse than no drift check at all — readers assume “no error → no drift”, which is wrong when the check silently skipped. Audit every failure mode of the check (empty-output marker, expired auth, missing app) and decide which fails-loud vs. fails-skip behaviour you want.