Operator Stream — Learnings

A runbook surfaces gaps the code path never will

Drafting the operator runbook forced answers that the implementation streams hadn’t needed to nail down: Which 1Password vault holds prod credentials? What’s the canonical item title? What’s the op:// URI? What’s the rotation procedure? What fails if the GH Org token has the wrong vault scope? None of these has to be answered to merge a CFN change, but all of them have to be answered to roll out, so they surfaced together when the runbook was written. The Secrets-vault reference page, the OP_SERVICE_ACCOUNT_TOKEN documentation, and the Arda-{Env}OAM convention page all came out of writing this runbook, not out of writing the infrastructure code.

Lesson: schedule the operator runbook before the rollout, even when it feels like it can wait. Treat it as a deliverable on the critical path. The class of question it forces is different from the class the code answers, and you’d rather find the holes during planning than during a partition rollout at 5 PM.

Operator instructions need the error paths, not just the happy path

The shipped runbook includes per-step diagnosis, rollback procedures, troubleshooting tables, and cleanup commands. Every step that can fail names the symptom, the cause, and the fix. That structure came from a deliberate prompt-design choice (“operator instructions need error handling”, implementation-task skill rule 1.8) and proved its value during the prod rollout: when the drift check aborted, the runbook’s troubleshooting table and the deployment plan’s rollback procedure were already written.
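
As a rough illustration of that structure, a runbook step can be modeled as data and checked for missing error paths before rollout. This is a minimal sketch in Python, not the shipped runbook's format: RunbookStep, FailureMode, missing_error_paths, and the placeholder step contents are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FailureMode:
    """One row of a troubleshooting table (hypothetical shape)."""
    symptom: str
    cause: str
    fix: str


@dataclass
class RunbookStep:
    """One operator step plus its error paths (hypothetical shape)."""
    title: str
    command: str
    failure_modes: List[FailureMode] = field(default_factory=list)
    rollback: str = ""


def missing_error_paths(steps: List[RunbookStep]) -> List[str]:
    """Return titles of steps that lack a failure mode or a rollback."""
    return [s.title for s in steps if not s.failure_modes or not s.rollback]


# Placeholder content only; real commands and symptoms belong to the runbook.
steps = [
    RunbookStep(
        title="Run drift check",
        command="<drift check command>",
        failure_modes=[
            FailureMode(
                symptom="<what the operator sees>",
                cause="<why it happens>",
                fix="<what to do about it>",
            )
        ],
        rollback="<rollback command>",
    ),
    RunbookStep(title="Deploy partition", command="<deploy command>"),
]

# Lists the steps that still document only the happy path.
print(missing_error_paths(steps))
```

A check like this could run in review so that a step without a symptom, cause, fix, and rollback is flagged before the runbook ships.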

Lesson: for any operator-facing doc, the success path is the easy 30% — the error paths, rollback procedure, and per-step diagnostics are the load-bearing 70%. Treat them as required content, not “to-be-added”.

Partition vaults are scoped by usage, not by uniqueness

The Secrets-vault reference page documents the Arda-{Env}OAM convention with one rule: each partition vault holds the credentials used by that environment, even when the same value is currently shared across environments. The values are stored independently from day one, so any environment can later rotate or diverge with no infrastructure change. The convention was implicit before this project; documenting it stopped the recurrent “shouldn’t we deduplicate?” conversations and gave a clear rule for new partition-scoped secrets going forward.
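
The rule is simple enough to encode. The sketch below, in Python, builds a 1Password secret reference of the form op://vault/item/field for each environment from the Arda-{Env}OAM pattern. The environment names, item title, and field name are placeholders assumed for illustration, not the real values.

```python
def partition_vault(env: str) -> str:
    """Vault name for one environment, following the Arda-{Env}OAM pattern."""
    return f"Arda-{env}OAM"


def secret_ref(env: str, item: str, field: str) -> str:
    """Secret reference in op://<vault>/<item>/<field> form."""
    return f"op://{partition_vault(env)}/{item}/{field}"


# Hypothetical environments, item title, and field name.
refs = {
    env: secret_ref(env, "example-item", "credential")
    for env in ("Dev", "Prod")
}

for env, ref in refs.items():
    print(env, ref)
```

Each environment resolves its own reference even when both vaults currently hold the same value, so rotating one environment means editing one vault item and nothing else.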

Lesson: when an implicit convention recurs in conversation, write it down once. The cost is a page of documentation; the saving is the conversation never happening again.

Phase-3 review is the last chance to catch latent bugs

PDEV-452 (the amm.sh cross-region bug) was a latent bug from an earlier PR (infrastructure#438, six weeks earlier). It surfaced for the first time when the operator runbook drove the prod rollout, which is, by design, the first operator-driven invocation of the full per-partition amm.sh deploy path on Alpha001:prod. Before this project, prod deploys were rare, region-aware by accident, and run by hand outside amm.sh.

Lesson: Phase-3 (deployment) is genuinely the last chance to catch the class of bugs that depends on per-partition execution. Treat each partition’s rollout as an independent verification event, not a copy-paste of the previous one. When prod is the last partition in the choreography (which it usually is), the latent bug that only fires on prod is the bug you’ll discover at the worst time. Build that into the risk register and the rollback procedure.
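
One way to make “independent verification event” concrete is a rollout loop that verifies every partition on its own and stops at the first failure. This is a sketch under assumptions: the deploy, verify, and rollback callables and the partition names are hypothetical stand-ins, not the amm.sh interface.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def roll_out(
    partitions: Sequence[str],
    deploy: Callable[[str], None],
    verify: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Deploy partitions in order, verifying each one independently.

    Returns (verified partitions, first failed partition or None).
    A pass on an earlier partition never excuses the check on a later one.
    """
    completed: List[str] = []
    for partition in partitions:
        deploy(partition)
        if not verify(partition):
            # Roll back only the partition that failed, then stop.
            rollback(partition)
            return completed, partition
        completed.append(partition)
    return completed, None


# Hypothetical run in which the failure only shows up on the last partition.
rolled_back: List[str] = []
done, failed = roll_out(
    ["dev", "stage", "prod"],
    deploy=lambda p: None,
    verify=lambda p: p != "prod",
    rollback=rolled_back.append,
)
print(done, failed, rolled_back)  # ['dev', 'stage'] prod ['prod']
```

The point of the shape is that the rollback is wired in before the first deploy runs, so a prod-only failure lands on a path that already exists.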