Aurora parameter group + operations 2.5.0 bump rollout

This runbook covers a single coordinated rollout that lands four interlocking changes across all four Arda partitions (Alpha002-dev, Alpha002-stage, Alpha001-demo, Alpha001-prod):

Change	Repo	Reference
Aurora cluster parameter group with slow-query logging, lock-wait logging, `pg_stat_statements` preload, and (prod only) `max_connections=500` + `db.r7g.large` instance class	`infrastructure`	PDEV-479
`{Infra}-SentryDsn` Secrets Manager secret in `Alpha001` and `Alpha002`	`infrastructure`	PDEV-500
Per-application-database `pg_stat_statements` extension via the new initializer (image `2.5.0`)	`postgres-database-initializer`	PDEV-498
Operations chart bump: init container `2.3.0` → `2.5.0` + `SENTRY_DSN` env via the new secret + pod resizing	`operations-monolith-component`	PDEV-488, PR #170

This is one-time-use — once the four partitions are through it, the steady-state procedure is the standard amm.sh deploy + helm upgrade. Future Aurora parameter-group changes follow the same shape but get their own dated operation note.

Plan

Prerequisites

Before starting any environment:

infrastructure main is at 929b3c8 (or newer) — the merge of PR #457. This carries both the tuning sub-interface on the Aurora construct (PDEV-479) and the {Infra}-Secrets stack (PDEV-500), plus the amm.sh changes that resolve the Sentry DSN from 1Password and add --force to the infrastructure CDK invocation.

postgres-database-initializer v2.5.0 is published as ghcr.io/arda-cards/postgres-database-initializer:2.5.0. Verify:

docker manifest inspect ghcr.io/arda-cards/postgres-database-initializer:2.5.0 >/dev/null \
  && echo "image present" || echo "MISSING — stop"

operations-monolith-component PR #170 is open and CI-green and has been reviewed-and-approved. The rollout deploys this PR’s chart to each partition AFTER the database reboot for that partition has completed. Do not merge PR #170 until the prod rollout is complete and observed healthy — the chart on main should match what is running everywhere.
1Password access: the operator must have op signed in and read access to op://Arda-SystemsOAM/be-sentry-dsn/dsn. Test:
```
op read 'op://Arda-SystemsOAM/be-sentry-dsn/dsn' >/dev/null \
  && echo "1P reachable" || echo "MISSING — stop"
```

AWS SSO sessions for both infrastructures (Alpha002-Admin for dev/stage, Admin-Alpha1 for demo/prod):

aws sso login --profile Alpha002-Admin   # for dev + stage
aws sso login --profile Admin-Alpha1     # for demo + prod
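
The prerequisites above assume several command-line tools are installed on the operator's machine. A minimal preflight sketch follows. The helper name `require_tools` is hypothetical, and the tool list in the usage comment is inferred from the commands used in this runbook, so adjust it to your workstation.

```shell
# Hypothetical preflight helper (not part of amm.sh): report every CLI
# that is missing from PATH and return non-zero if any is absent.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "MISSING tool: ${tool}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (tool list is an assumption based on the commands in this runbook):
#   require_tools aws op docker kubectl helm psql git || echo "MISSING, stop"
```

The function only checks that each binary resolves on PATH; it does not verify versions or login state, which the checks above already cover.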

Rollout sequence

Order	Partition	Profile	applyImmediately	Instance class change
1	`Alpha002-dev`	`Alpha002-Admin`	true	none
2	`Alpha002-stage`	`Alpha002-Admin`	true	none
3	`Alpha001-demo`	`Admin-Alpha1`	true	none
4	`Alpha001-prod`	`Admin-Alpha1`	false	`db.t3.medium` → `db.r7g.large`

Each partition is an independent gate: do not pipeline. After step 8 of every environment, run a Playwright application-health verification against that environment and wait for the operator’s explicit go-ahead before starting step 1 of the next environment.

The per-environment runbook below is identical in structure for all four environments. The only difference is that the prod writer + reader also change instance class during the failover step, which adds ~5–10 min per node to that step. Plan the prod window accordingly.

Per-environment runbook

The eight steps below are written for Alpha002-dev (the first environment). For each subsequent environment, substitute:

Placeholder	dev	stage	demo	prod
`${INFRA}`	`Alpha002`	`Alpha002`	`Alpha001`	`Alpha001`
`${PARTITION}`	`dev`	`stage`	`demo`	`prod`
`${PROFILE}`	`Alpha002-Admin`	`Alpha002-Admin`	`Admin-Alpha1`	`Admin-Alpha1`
`${CLUSTER_ID}`	`Alpha002-dev-AuroraCluster`	`Alpha002-stage-AuroraCluster`	`Alpha001-demo-AuroraCluster`	`Alpha001-prod-AuroraCluster`

All commands use --profile ${PROFILE} at the end of the command line (per workspace convention). Region is us-east-1 for all four partitions. Capture each step’s wall-clock start and end time for the per-step log entries at the bottom of this file.
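
To avoid retyping the substitutions from the table above, the placeholders can be derived from the partition name. This is a sketch only: the function name `set_rollout_env` is hypothetical, and the values are copied from the placeholder table rather than read from AWS, so the drift checks below still apply.

```shell
# Hypothetical helper: export the runbook placeholders for one partition.
# Values mirror the placeholder table; they are NOT verified against AWS.
set_rollout_env() {
  case "$1" in
    dev|stage) INFRA="Alpha002"; PROFILE="Alpha002-Admin" ;;
    demo|prod) INFRA="Alpha001"; PROFILE="Admin-Alpha1" ;;
    *) echo "unknown partition: $1" >&2; return 1 ;;
  esac
  PARTITION="$1"
  CLUSTER_ID="${INFRA}-${PARTITION}-AuroraCluster"
  export INFRA PARTITION PROFILE CLUSTER_ID
}

# Usage:
#   set_rollout_env dev && echo "${CLUSTER_ID} via ${PROFILE}"
```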

Drift checks (run before Step 1 of each environment)

The runbook placeholders are the CDK logical names. Real values can drift (renamed instances, alternative namespace conventions). Before starting Step 1 in each environment, run these read-only checks and record the actual values in the Execution Log:

Aurora writer + reader instance identifiers — do not trust ${CLUSTER_ID}Writer / ${CLUSTER_ID}Reader1. Read them from RDS:
```
aws rds describe-db-clusters \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBClusters[0].DBClusterMembers[].{Id:DBInstanceIdentifier,Writer:IsClusterWriter}'
```
Use the returned identifiers for steps 4–6 instead of the placeholders. If they differ from the placeholders, pause and consult the operator before continuing — a rename may indicate the cluster is not the one this runbook was authored for.
Operations namespace exists — runbook assumes ${PARTITION}-operations. Confirm:
```
kubectl --context "${INFRA}" get ns "${PARTITION}-operations"
```
If the namespace name differs, pause and consult the operator before continuing.

Any drift detected by these checks must be recorded in the Execution Log section under that environment’s entry, with a note explaining the operator decision that resolved it.
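
The instance-identifier comparison can be scripted so that drift is flagged rather than spotted by eye. The sketch below assumes the member identifiers arrive one per line on stdin; `check_member_drift` is a hypothetical name, and the `--output text` piping in the usage comment is one possible way to feed it.

```shell
# Hypothetical helper: read instance identifiers (one per line) on stdin
# and confirm the placeholder names ${CLUSTER_ID}Writer and
# ${CLUSTER_ID}Reader1 are both present. Returns 1 on drift.
check_member_drift() {
  cluster_id="$1"
  members="$(cat)"
  drift=0
  for expected in "${cluster_id}Writer" "${cluster_id}Reader1"; do
    if ! printf '%s\n' "$members" | grep -qxF "$expected"; then
      echo "DRIFT: expected instance ${expected} not found" >&2
      drift=1
    fi
  done
  return "$drift"
}

# Usage (assumes text output is tab-separated on a single line):
#   aws rds describe-db-clusters --db-cluster-identifier "${CLUSTER_ID}" \
#     --query 'DBClusters[0].DBClusterMembers[].DBInstanceIdentifier' \
#     --output text --region us-east-1 --profile "${PROFILE}" \
#     | tr '\t' '\n' | check_member_drift "${CLUSTER_ID}"
```

A non-zero return is a prompt to pause and consult the operator, not a decision in itself.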

Step 1 — Snapshot the cluster

SNAPSHOT_ID="${INFRA}-${PARTITION}-pdev479-pre-rollout"

aws rds create-db-cluster-snapshot \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 \
  --profile "${PROFILE}"

aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 \
  --profile "${PROFILE}"

Success check: the wait command returns 0 (typically 1–3 min). Confirm:

aws rds describe-db-cluster-snapshots \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBClusterSnapshots[0].Status'
# expect: "available"

Failure handling:

create-db-cluster-snapshot fails with DBClusterSnapshotAlreadyExistsFault: a previous attempt left a snapshot with the same name. Delete it via aws rds delete-db-cluster-snapshot --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" (after confirming it is not still needed as rollback insurance for a previous partition’s run), then retry step 1.
create-db-cluster-snapshot fails with InvalidDBClusterStateFault: the cluster is mid-modification (e.g. a maintenance-window task is in flight). Wait 5 min, retry. If persistent, run aws rds describe-db-clusters --db-cluster-identifier "${CLUSTER_ID}" and check Status — investigate whatever is in flight before proceeding.
wait times out: snapshot is taking longer than expected. Re-run the wait command. Do not proceed to step 2 until the snapshot is available — without it, you have no rollback insurance.
IAM AccessDenied: the AWS SSO session has expired or the wrong profile is selected. Re-run aws sso login --profile "${PROFILE}" and confirm aws sts get-caller-identity --profile "${PROFILE}" shows the expected account.
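
Several of the remedies above amount to "wait, then retry". A small generic wrapper can make that repeatable; this is a sketch, and `retry` is a hypothetical helper name, not something amm.sh provides.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping
# DELAY seconds between failures. Returns 0 on the first success and
# 1 if every attempt fails.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    echo "attempt ${n}/${attempts} failed; sleeping ${delay}s" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage: re-run the snapshot wait up to 3 times, 60 s apart.
#   retry 3 60 aws rds wait db-cluster-snapshot-available \
#     --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
#     --region us-east-1 --profile "${PROFILE}"
```

Keep the attempt count small: a persistent failure here should lead to investigation, as described above, rather than an open-ended loop.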

Step 2 — Run `amm.sh` for the partition

From an up-to-date infrastructure working tree (must be on main at 929b3c8 or newer):

git -C /Users/jmp/code/arda/infrastructure status --short --branch
# expect: ## main...origin/main  (and clean working tree)

cd /Users/jmp/code/arda/infrastructure
./amm.sh "${INFRA}" "${PARTITION}"

This single command does three things in sequence:

Resolves SENTRY_DSN from 1Password and exports it (masked under ${GITHUB_ACTIONS}; visible only in the local shell otherwise).
Deploys the ${INFRA} infrastructure stacks (networking, compute, ingress, and the new ${INFRA}-Secrets stack). --force is on, so even no-op infra stacks resubmit a change set to CloudFormation — this is intentional so DSN rotations propagate. Stack-scoped --parameters ${INFRA}-Secrets:SentryDsn=… is appended to the invocation.
Deploys the ${INFRA}-${PARTITION} partition stacks (Authentication, BulkStores, ImageStorage, Aurora, Ingress, Compute, Dns). The Aurora cluster stack attaches the new parameter group at the cluster level.

Success check:

amm.sh exits 0.

The new ${INFRA}-Secrets stack exists in CloudFormation:

aws cloudformation describe-stacks \
  --stack-name "${INFRA}-Secrets" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'Stacks[0].StackStatus'
# expect: "CREATE_COMPLETE" or "UPDATE_COMPLETE"

The ${INFRA}-SentryDsn Secrets Manager secret exists with a valid Sentry DSN value:

aws secretsmanager get-secret-value \
  --secret-id "${INFRA}-SentryDsn" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'SecretString' --output text | grep -qE '^https://[a-f0-9]+@.+\.sentry\.io/[0-9]+$' \
  && echo "DSN shape OK" || echo "DSN shape BAD — investigate"

The cluster’s DBClusterParameterGroup is the newly created one:

aws rds describe-db-clusters \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBClusters[0].DBClusterParameterGroup'
# expect a value that ends with -ClusterParameterGroup-<hash>, not the
# engine default "default.aurora-postgresql16".

Failure handling:

amm.sh aborts at the 1P resolution step (ERROR: Sentry DSN not available in 1Password): confirm op is signed in (op whoami) and that the entry exists (op item get 'be-sentry-dsn' --vault 'Arda-SystemsOAM'). Re-run step 2 once op read 'op://Arda-SystemsOAM/be-sentry-dsn/dsn' prints the DSN. Nothing was deployed, so no rollback needed.
amm.sh fails inside a CDK stack deploy (CloudFormation UPDATE_FAILED or ROLLBACK_COMPLETE):
- ${INFRA}-Networking / ${INFRA}-Compute / ${INFRA}-Ingress failure: unrelated to this rollout. CFN auto-rolled-back; the Aurora parameter group attach in the partition stack never ran (CDK aborts on first failure). Investigate the failing stack directly via aws cloudformation describe-stack-events, fix, rerun amm.sh. Snapshot from step 1 is still valid.
- ${INFRA}-Secrets stack failure: typically a parameter validation rejection (empty SentryDsn value or shape mismatch). The cluster has not yet been touched (Secrets is part of the infrastructure layer, deployed before the partition’s Aurora stack). Investigate, rerun. Snapshot still valid.
- ${INFRA}-${PARTITION}-AuroraDBCluster stack failure: this is where the cluster parameter group attach happens. CFN auto-rolls-back the stack to its pre-deploy state, which means the old parameter group is reattached. The cluster ends up in the same configuration it had before step 2 — no manual rollback needed. Investigate via stack events, fix the construct/platforms change if applicable (in a hotfix PR), and rerun the rollout from step 1 on a clean main.
The new ${INFRA}-Secrets stack reports CREATE_COMPLETE but the secret value is empty: 1P read returned an empty string. Stop and investigate the 1P entry; the cluster has the new parameter group attached but the operations chart will be unable to source SENTRY_DSN. Treat as a stalled rollout: continue with steps 3–7 (the DB side is fine), then before step 8 verify aws secretsmanager get-secret-value returns a non-empty DSN; if not, restore the 1P entry and run amm.sh again to repopulate the secret.
The cluster’s DBClusterParameterGroup query still shows the engine default after amm.sh returns 0: CDK didn’t synth the parameter group resource for this partition. This indicates the platforms.ts databaseTuning block is missing for the partition or the construct change wasn’t merged. Cross-check the infrastructure HEAD is at 929b3c8 or newer (git -C /Users/jmp/code/arda/infrastructure rev-parse HEAD). If yes, file a bug and stop the rollout.
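
The parameter-group success check from this step can be expressed as a small predicate. This is a sketch under assumptions: `is_custom_param_group` is a hypothetical name, and the comparison is case-insensitive because RDS may report the group name lowercased even though the CDK logical name is mixed case.

```shell
# Hypothetical helper: succeed only if the name looks like the
# CDK-generated cluster parameter group, not an engine default.
is_custom_param_group() {
  name="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$name" in
    ""|default.*) return 1 ;;
    *-clusterparametergroup-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   pg="$(aws rds describe-db-clusters --db-cluster-identifier "${CLUSTER_ID}" \
#     --query 'DBClusters[0].DBClusterParameterGroup' --output text \
#     --region us-east-1 --profile "${PROFILE}")"
#   is_custom_param_group "$pg" || echo "still on engine default: stop"
```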

Step 3 — Verify dynamic parameters propagated

Wait ~90 seconds after step 2 completes. Connect to the writer (via the existing bastion / port-forward) and confirm dynamic parameters are live:

SHOW log_min_duration_statement;   -- expect: 500ms
SHOW log_statement;                 -- expect: ddl
SHOW log_lock_waits;                -- expect: on
SHOW log_temp_files;                -- expect: 0
SHOW shared_preload_libraries;      -- still old value (pending reboot)
SHOW max_connections;               -- still old value (pending reboot)

The static parameters (shared_preload_libraries, max_connections) will still show their old values — those activate at step 5. If any of the dynamic ones is wrong, stop and investigate before rebooting; a reboot will not fix dynamic-parameter propagation issues.

Failure handling:

Dynamic params still show old values after 5 minutes: the cluster parameter group may not have associated correctly even though CFN reported success. Run aws rds describe-db-cluster-parameters --db-cluster-parameter-group-name <name> (use the name from step 2’s success check) and verify the expected values are in the parameter group itself. If they are, the cluster has not yet applied them. The apply status is reported per cluster member, so check aws rds describe-db-clusters ... --query 'DBClusters[0].DBClusterMembers[].DBClusterParameterGroupStatus' for pending-reboot (the change never propagated dynamically) vs in-sync (propagation is done). If parameters that should be dynamic report pending-reboot, the parameter group may have been built with the wrong apply type; this is likely a construct bug, so file an issue and roll back via the Rollback section below.
One dynamic parameter is wrong (e.g. log_min_duration_statement is -1 instead of 500): the parameter group was synthesised with an incorrect value. Treat as a partial-fix candidate: you can hot-patch the cluster parameter group in place with aws rds modify-db-cluster-parameter-group --db-cluster-parameter-group-name <name> --parameters "ParameterName=...,ParameterValue=...,ApplyMethod=immediate" to correct the value, then re-run step 3’s verify. This is faster than a full CDK re-deploy. Fix the construct/platforms code in a follow-up PR so the next deploy doesn’t drift back.
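
The four dynamic-parameter expectations from this step can be checked mechanically. The sketch below assumes `name=value` lines on stdin; the function name `verify_dynamic_params` and the psql invocation in the usage comment are illustrative assumptions (connection arguments are omitted).

```shell
# Hypothetical helper: read "name=value" lines on stdin and compare the
# four dynamic parameters against the values this runbook expects.
verify_dynamic_params() {
  bad=0
  seen=0
  while IFS='=' read -r name value; do
    case "$name" in
      log_min_duration_statement) want="500ms" ;;
      log_statement)              want="ddl" ;;
      log_lock_waits)             want="on" ;;
      log_temp_files)             want="0" ;;
      *) continue ;;
    esac
    seen=$((seen + 1))
    if [ "$value" != "$want" ]; then
      echo "MISMATCH ${name}: got '${value}', want '${want}'" >&2
      bad=1
    fi
  done
  if [ "$seen" -ne 4 ]; then
    echo "expected 4 parameters, saw ${seen}" >&2
    bad=1
  fi
  return "$bad"
}

# Usage (current_setting() returns the same text as SHOW):
#   psql -At -F '=' -c "SELECT name, current_setting(name) FROM pg_settings
#     WHERE name IN ('log_min_duration_statement','log_statement',
#                    'log_lock_waits','log_temp_files')" \
#     | verify_dynamic_params
```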

Step 4 — Reboot reader1

Reboot the reader first. This picks up the static parameters on the reader-side instance while the writer continues serving traffic.

aws rds reboot-db-instance \
  --db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"

aws rds wait db-instance-available \
  --db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"

Success check: wait returns 0 (~30–60s). The reader instance is back to available. Application traffic is unaffected (writer is still serving). For prod, this step does not swap the instance class — that happens during the failover in step 5.

Failure handling:

Reader stays in rebooting for >5 minutes: very rare. Run aws rds describe-events --source-identifier "${CLUSTER_ID}Reader1" --source-type db-instance --duration 30 to see what AWS reports. If the instance is genuinely stuck, open an AWS support case — do not proceed to step 5, because a failover with one unhealthy instance can leave the cluster writerless.
Reader returns to available but with pending-reboot parameter group status still set: the reboot did not pick up the static parameters. Re-run step 4 once (single retry). If still pending-reboot, treat as a construct bug and roll back via the section below.

Step 5 — Failover to reader1

Explicitly promote reader1 to writer. The old writer is restarted during the demotion and picks up the static parameters; for prod, it also swaps from db.t3.medium to db.r7g.large during the restart.

aws rds failover-db-cluster \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --target-db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"

# Wait for both instances to settle.
aws rds wait db-instance-available \
  --db-instance-identifier "${CLUSTER_ID}Writer" \
  --region us-east-1 --profile "${PROFILE}"
aws rds wait db-instance-available \
  --db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"

Application impact: ~10–15 seconds of writer endpoint disruption. In-flight requests at the moment of the endpoint flip fail; HikariCP’s connectionTimeout=30000 + validationTimeout=1000 absorb the gap; new requests succeed once the new writer is up. For prod, total window is ~15–25s because the demoted writer additionally provisions the new instance class.

Success check (post-failover): the promoted reader1 is now the cluster writer:

aws rds describe-db-clusters \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier'
# expect: ["${CLUSTER_ID}Reader1"]

For prod, also confirm both instances are now on db.r7g.large:

aws rds describe-db-instances \
  --filters "Name=db-cluster-id,Values=${CLUSTER_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBInstances[].DBInstanceClass'
# expect for prod: ["db.r7g.large","db.r7g.large"]
# expect for dev/demo/stage: ["db.t3.medium","db.t3.medium"]

Failure handling:

failover-db-cluster returns InvalidDBClusterStateFault: the target reader is not in sync, or the cluster is mid-modification. Wait 2 min, retry. Application traffic continues on the original writer in the meantime.
Failover initiates but the writer endpoint never resolves to the new node (>2 min): the cluster is in a degraded state.
- Check aws rds describe-db-clusters ... --query 'DBClusters[0].Status'. failing-over is expected briefly; inaccessible-encryption-credentials or similar terminal states require an AWS support case.
- Application is hard-down (writer unreachable). If the prior writer is healthy enough to take traffic back, you can re-fail-over to it: aws rds failover-db-cluster --db-cluster-identifier "${CLUSTER_ID}" --target-db-instance-identifier "${CLUSTER_ID}Writer" (note: this targets the original writer name, which is now the reader). That restores service while you debug.
For prod only: failover succeeded but one of the two instances is still on db.t3.medium after both report available: the instance-class swap did not take. This indicates the applyImmediately: false deferred change did not apply during the reboot. Trigger an explicit reboot of the affected instance with aws rds reboot-db-instance --db-instance-identifier <id> --force-failover (note: that flag triggers a failover-style restart). Re-check the instance class after available. If it still does not change, the modification is pending another window — aws rds describe-db-instances ... --query 'DBInstances[].PendingModifiedValues' will show what is queued.
Application sees prolonged 5xx after the failover (more than ~30s): HikariCP may have all connections wedged on the dead endpoint. kubectl rollout restart deployment/operations --context "${INFRA}" -n "${PARTITION}-operations" forces a pod refresh and re-establishes connections cleanly. Not needed if the application recovers on its own within the timeout window.
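
The post-failover writer check can be turned into a pass/fail assertion. This sketch assumes tab-separated `identifier<TAB>True|False` lines on stdin, which is how the AWS CLI is expected to render a two-column query with `--output text`; `assert_writer` is a hypothetical name.

```shell
# Hypothetical helper: read "InstanceId<TAB>True|False" lines on stdin
# and succeed only if the single writer matches the expected identifier.
assert_writer() {
  expected="$1"
  actual="$(awk -F '\t' '$2 == "True" { print $1 }')"
  if [ "$actual" != "$expected" ]; then
    echo "writer is '${actual}', expected '${expected}'" >&2
    return 1
  fi
}

# Usage (substitute the identifier recorded during the drift checks):
#   aws rds describe-db-clusters --db-cluster-identifier "${CLUSTER_ID}" \
#     --query 'DBClusters[0].DBClusterMembers[].[DBInstanceIdentifier,IsClusterWriter]' \
#     --output text --region us-east-1 --profile "${PROFILE}" \
#     | assert_writer "${CLUSTER_ID}Reader1"
```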

Step 6 — Verify static parameters + `pg_stat_statements`

Reconnect to the writer (the cluster writer endpoint is the same DNS; your client must drop stale connections from the pre-failover writer).

SHOW shared_preload_libraries;
-- expect: pg_stat_statements

SHOW max_connections;
-- expect: 500 in prod, default-for-class elsewhere (around 800 on
-- t3.medium per Aurora's instance-derived defaults)

-- Confirm pg_stat_statements is functional (the verify query the
-- initializer would run):
SELECT 1 FROM pg_stat_statements LIMIT 1;
-- expect: a single row, NOT
-- "pg_stat_statements must be loaded via shared_preload_libraries"

If pg_stat_statements errors with the preload-missing message, the failover did not pick up the new parameter group on the new writer. Stop and investigate before deploying the operations chart in step 8 — the new init container will fail-loud and the operations pod will CrashLoopBackOff.

Failure handling:

SHOW shared_preload_libraries returns the empty / old value on the new writer:
1. Confirm the reader-promoted-to-writer was actually restarted. The failover restarts the demoted writer, not the promoted reader. Run aws rds describe-db-instances --db-instance-identifier "${CLUSTER_ID}Reader1" --query 'DBInstances[0].DBParameterGroups[0].ParameterApplyStatus' and look for in-sync. If pending-reboot, you need to reboot this instance too:
```
aws rds reboot-db-instance \
  --db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"
```
  This is now the cluster writer, so the reboot causes another ~10–15s writer disruption. Confirm static params after.
2. If still wrong, the parameter group itself is misconfigured. Inspect via aws rds describe-db-cluster-parameters and compare to the expected values from platforms.ts’s sharedDatabaseParameters block. Fix via direct modify-db-cluster-parameter-group for an emergency unblock, or roll back per the section below.
SELECT 1 FROM pg_stat_statements LIMIT 1 errors with pg_stat_statements must be loaded via shared_preload_libraries: same root cause as above. The schema objects are installed but the shared-memory hash is not allocated because the library wasn’t preloaded at server start. Apply the same investigate-then-reboot remedy.
For prod only: SHOW max_connections still returns the t3.medium default (~800) instead of 500: same family of failure as static params not picked up. The instance was restarted but did not read the new value. Re-run the reboot for whichever instance is wrong; if both are wrong, the parameter group itself is suspect.
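
When checking `shared_preload_libraries`, note that the value is a comma-separated list, and the managed service may include other libraries alongside `pg_stat_statements` (that possibility is an assumption here, not something this runbook states). A membership test is therefore more robust than an exact string match. `has_preload` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if library $2 appears as a whole entry
# in the comma-separated shared_preload_libraries value $1.
has_preload() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qxF "$2"
}

# Usage (connection arguments omitted):
#   libs="$(psql -At -c 'SHOW shared_preload_libraries')"
#   has_preload "$libs" pg_stat_statements || echo "preload missing: stop"
```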

Step 7 — Delete the snapshot

Only after step 6 is green and dashboards (Performance Insights, CloudWatch slow-query log) show normal traffic on the new writer:

aws rds delete-db-cluster-snapshot \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 --profile "${PROFILE}"

Why now: the snapshot is rollback insurance for the rollout window only. Once the cluster is healthy on the new configuration, it is pay-for-storage overhead.

Failure handling:

Delete fails with InvalidDBClusterSnapshotStateFault: snapshot is being copied or shared. Wait 1 min and retry. Cosmetic — does not block the rollout.
If you discover an issue after deleting the snapshot but the cluster is otherwise healthy: roll back via direct modify-db-cluster-parameter-group to neutralise problematic parameter values without needing the snapshot. The snapshot is insurance for catastrophic cluster-state corruption only.

Step 8 — Deploy PR #170 to this partition

This is the gate between database-side and application-side. Do not run step 8 until step 6 is green: the new init container’s verify query will fail otherwise and the pod will enter Init:CrashLoopBackOff.

The operations component is deployed via the standard Arda helm upgrade path for operations-monolith-component. Build PR #170’s chart and apply to the partition:

# (Substitute the exact command for your local operations workflow;
# see the operations repo README for the standard helm-upgrade
# invocation. The chart on PR #170's head ref is what should land.)

helm upgrade --install operations \
  ./src/main/helm \
  --namespace "${PARTITION}-operations" \
  --values src/main/helm/values-"${PARTITION}".yaml \
  --kube-context "${INFRA}"

Success check:

Operations pod transitions through Init → Running:

kubectl --context "${INFRA}" -n "${PARTITION}-operations" \
  get pods -l app=operations -o wide -w
# expect: STATUS Running, READY 1/1, RESTARTS 0

The init container ran the new initializer image and exited 0:

kubectl --context "${INFRA}" -n "${PARTITION}-operations" \
  logs -l app=operations -c db-init --tail=50
# expect: "Creating database … with role …" followed by the SELECT
# against pg_stat_statements returning, no error stack.

The Sentry SDK initialised against the new DSN:

kubectl --context "${INFRA}" -n "${PARTITION}-operations" \
  logs -l app=operations -c operations --tail=200 | grep -iE 'sentry|dsn'
# expect: a "Sentry SDK initialized" line or equivalent.
# If you see "Sentry SDK disabled because no DSN was set", the
# ExternalSecret has not yet synced — wait 30s and re-check.

pg_stat_statements is queryable in the application database (not just the admin DB), exercising what the new init container created:
```
\c <app-db-name>
SELECT count(*) FROM pg_stat_statements;
-- expect: an integer >= 0, NOT the preload-missing error
```

If any of those fail, do not proceed to the next environment; roll back per the section below and surface the failure.

Failure handling:

Init container exits with pg_stat_statements must be loaded via shared_preload_libraries: the cluster did not actually activate the preload, even though SHOW reported it correctly at step 6. Re-run step 6’s verifications from a fresh psql session; the previous result may have been from a cached connection routed to a pre-failover instance. If the cluster genuinely has the preload, the init container’s verify query should not be erroring; in that case, open an issue against postgres-database-initializer with the container logs and roll back the operations chart via the Rollback section below.
Init container exits with a pg_trgm / btree_gin failure: those extensions were already in the default floor before PR #30, so a failure here is unrelated to this rollout (cluster permissions, Postgres version mismatch, etc.). Treat as a regular operations incident.
Operations pod runs but logs show Sentry SDK disabled because no DSN was set: the ExternalSecret has not synced from ${INFRA}-SentryDsn to a Kubernetes secret yet.
- Wait 60s and re-check.
- If still missing: kubectl --context "${INFRA}" -n "${PARTITION}-operations" describe externalsecret be-sentry-dsn will show the sync status and any error.
- Common cause: the partition’s IRSA role does not yet have the Secrets Manager read permission for ${INFRA}-SentryDsn. The ${INFRA}-ReadSecrets managed policy on the cluster’s IRSA role chain uses the wildcard ${INFRA}-* and should already cover this, but if it doesn’t, follow up on the IAM side; the operations pod stays fail-soft (Sentry just stays off) in the meantime.
Operations pod transitions through Running but then crashes / restarts: regular incident, not specific to this rollout. Roll back the chart per the section below to unblock the rollout sequence, then debug separately.
HikariCP cannot acquire connections on startup (Connection is not available): often the prior step’s failover left the new writer with a cold connection pool that hasn’t recovered. Wait 30s; if persistent, re-check the cluster endpoint resolves to the new writer.
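
The pod success criteria for this step (STATUS Running, all containers ready, RESTARTS 0) can be checked from `kubectl get pods --no-headers` output. The sketch below relies on the default column order NAME, READY, STATUS, RESTARTS; `pods_healthy` is a hypothetical name.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` lines on
# stdin. Fails if no pods are listed, or if any pod is not Running,
# not fully ready, or has restarted.
pods_healthy() {
  awk '
    {
      seen++
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2] || $4 != "0") {
        print "UNHEALTHY: " $0
        bad = 1
      }
    }
    END { if (bad || seen == 0) exit 1 }
  '
}

# Usage:
#   kubectl --context "${INFRA}" -n "${PARTITION}-operations" \
#     get pods -l app=operations --no-headers | pods_healthy
```

This is a point-in-time check; it complements, rather than replaces, watching the pod move through Init to Running.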

When to deploy PR #170 — quick reference

Phase	dev	stage	demo	prod
1. Snapshot	✓
2. `amm.sh dev`	✓
3. Verify dynamic params	✓
4. Reboot reader1	✓
5. `failover-db-cluster`	✓
6. Verify static params + pg_stat_statements	✓
7. Delete snapshot	✓
8. Deploy PR #170 → dev	✓
9. Playwright application-health check + operator go-ahead	✓
10–17. Repeat 1–8 for stage		✓
18. Playwright application-health check + operator go-ahead		✓
19–26. Repeat 1–8 for demo			✓
27. Playwright application-health check + operator go-ahead			✓
28. Schedule prod window				(pending)
29–36. Repeat 1–8 for prod				✓
37. Merge PR #170 to operations `main`	—	—	—	✓

The PR #170 merge to main happens only after all four environments are running its chart. Until then, the chart on PR #170 is what is live in each environment that has completed steps 1–8.

Playwright application-health verification

After step 8 of each environment, run a Playwright-driven login + smoke test against the partition’s web frontend. This replaces the time-based soak: the gate is “operator confirms the app is healthy”, not a wall-clock interval.

Endpoint

The frontend host follows a fixed pattern, all lowercase:

https://${PARTITION}.${INFRA_LOWER}.app.arda.cards

where ${INFRA_LOWER} is the lowercased infrastructure name.

Environment	URL
dev	`https://dev.alpha002.app.arda.cards`
stage	`https://stage.alpha002.app.arda.cards`
demo	`https://demo.alpha001.app.arda.cards`
prod	`https://prod.alpha001.app.arda.cards`
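
The URL pattern can be derived from the same placeholders used elsewhere in this runbook. A minimal sketch, with `frontend_url` as a hypothetical helper name:

```shell
# Hypothetical helper: build the frontend URL from PARTITION ($1) and
# INFRA ($2), lowercasing the infrastructure name as the pattern requires.
frontend_url() {
  infra_lower="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  printf 'https://%s.%s.app.arda.cards\n' "$1" "$infra_lower"
}

# Usage:
#   frontend_url "${PARTITION}" "${INFRA}"
```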

Credentials

All test credentials live in the operator’s Private 1Password vault. One item per environment:

Environment	1Password item (vault `Private`)
dev	`Miguel-new-dev`
stage	`Arda-stage`
demo	`arda-demo`
prod	`Arda-live`

Read username and password via op read:

# Example for dev: substitute the per-environment item name.
OP_ITEM="Miguel-new-dev"
APP_USERNAME="$(op read "op://Private/${OP_ITEM}/username")"
APP_PASSWORD="$(op read "op://Private/${OP_ITEM}/password")"

If either op read returns an empty string, stop and confirm the item’s field names with op item get "${OP_ITEM}" --vault Private before retrying — some older items use email instead of username.

Playwright MCP procedure

The verification is driven through the mcp__playwright__* tools. Per environment:

1. Resolve credentials with the op read snippet above. Keep them out of the conversation transcript: pass them only to the browser_fill_form / browser_type tool calls below.
2. Navigate to the environment URL: mcp__playwright__browser_navigate with url set to the entry from the table above. The unauthenticated landing should redirect to the login page.
3. Snapshot the page with mcp__playwright__browser_snapshot to identify the username / password input refs and the submit button.
4. Fill the login form with mcp__playwright__browser_fill_form, supplying the username and password values from step 1. Submit via browser_click on the login button (or browser_press_key with Enter if the form supports it).
5. Wait for navigation with mcp__playwright__browser_wait_for (e.g. wait for a known post-login element, the main navigation shell or the user-menu avatar, to appear). A successful login lands on the application home; an authentication failure stays on /login with an error banner.
6. Smoke check at least one authenticated page (typically the default landing route after login). Confirm the page renders without 5xx and that no console error in mcp__playwright__browser_console_messages references the operations backend (/api/... 5xx, Sentry-init failure, etc.).
7. Record timings. This is a required measurement, not optional. The point of this rollout is the slow-query / Sentry observability work; the post-login timings are the headline before/after data point.
- Render time: capture the time from the login-submit click (step 4/5) until the post-login landing’s “ready” signal, i.e. the same element that browser_wait_for resolved on. Read it from performance.timing / PerformanceNavigationTiming via mcp__playwright__browser_evaluate, e.g. performance.getEntriesByType('navigation')[0].domContentLoadedEventEnd - performance.getEntriesByType('navigation')[0].startTime, and also note the wall-clock seconds between the click and the browser_wait_for resolution.
- Network request times: pull the full network log with mcp__playwright__browser_network_requests after the landing resolves. For every backend call (anything to the operations API under /api/), record the URL, HTTP status, and duration in ms. Flag any request slower than 500ms (the new slow-query threshold) so the operator can correlate against pg_stat_statements.
- Persist both measurements into the per-step log section for this environment under “Step 9 Playwright check” so the four environments are comparable side-by-side at the end of the rollout.
8. Close the tab with mcp__playwright__browser_close so the next environment starts from a clean context.

Pass / fail criteria

Pass: login succeeds, the post-login landing renders, no backend 5xx in the network panel, and no operations-side error in the console. Operator confirms — proceed to ask permission for the next environment’s step 1.
Fail: login fails, the page errors out, or the backend returns 5xx. Treat as a stalled rollout for the affected environment: surface the failure with the captured snapshot + console log and follow step 8’s failure-handling guidance or roll back via the section below.
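
Once the backend request timings from the procedure above are written down, the slow ones can be filtered mechanically. This sketch assumes the operator has transcribed the network log into whitespace-separated `URL STATUS MS` lines; that line format and the name `flag_slow_requests` are hypothetical, since the raw output shape of the network-requests tool is not specified here.

```shell
# Hypothetical helper: read "URL STATUS MS" lines on stdin and print
# the /api/ requests slower than the threshold (default 500 ms).
flag_slow_requests() {
  threshold_ms="${1:-500}"
  awk -v limit="$threshold_ms" \
    '$1 ~ /\/api\// && ($3 + 0) > (limit + 0) { print }'
}

# Usage (requests.txt is a hypothetical file of transcribed timings):
#   flag_slow_requests 500 < requests.txt
```

The printed lines are the candidates to correlate against pg_stat_statements and to copy into the per-step log.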

Rollback

Two levers, in order of preference:

Revert at the operations layer (step 8) — if the operations pod fails to start cleanly after step 8 of an environment:
```
helm rollback operations <previous-revision> \
  --namespace "${PARTITION}-operations" \
  --kube-context "${INFRA}"
```
The previous revision is the operations chart with :2.3.0 init container and no SENTRY_DSN env. The cluster is already on the new parameter group at this point, but the old init container did not touch pg_stat_statements, so the rollback is safe even though the cluster has changed underneath.

Restore the cluster from the step-1 snapshot — if the failure is at the database layer (steps 3–6) and cannot be diagnosed quickly:

# Restore creates a new cluster identifier; the application chart
# must be re-pointed to it. Coordinate with the operations chart
# values for the partition.
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier "${CLUSTER_ID}-restored" \
  --snapshot-identifier "${SNAPSHOT_ID}" \
  --engine aurora-postgresql \
  --region us-east-1 --profile "${PROFILE}"

Use this only as a last resort — restoring is a multi-hour operation and disrupts everything in the partition.
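
One detail to keep in mind for the second lever: for Aurora, `restore-db-cluster-from-snapshot` restores only the cluster, and DB instances have to be created in it afterwards. A minimal sketch of that follow-up is below. The `-writer` identifier suffix and the helper name `create_restored_writer` are assumptions for illustration, and the instance class should be matched to the partition.

```shell
# Hypothetical follow-up to the restore command above: a restored Aurora
# cluster has no instances until one is created in it.
create_restored_writer() {
  cluster_id="$1"      # e.g. "${CLUSTER_ID}-restored"
  instance_class="$2"  # assumption: same class as the original writer
  profile="$3"

  aws rds create-db-instance \
    --db-instance-identifier "${cluster_id}-writer" \
    --db-cluster-identifier "${cluster_id}" \
    --db-instance-class "${instance_class}" \
    --engine aurora-postgresql \
    --region us-east-1 --profile "${profile}" &&
  # Block until the new instance is usable before re-pointing the chart.
  aws rds wait db-instance-available \
    --db-instance-identifier "${cluster_id}-writer" \
    --region us-east-1 --profile "${profile}"
}
```

A call might look like `create_restored_writer "${CLUSTER_ID}-restored" db.t3.medium "${PROFILE}"`.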

Per-step log

Fill in as you go. Replicate this block per environment.

Alpha002-dev — operator: ____

Step 1 snapshot: __:__ start / __:__ available
Step 2 amm.sh: __:__ start / __:__ complete
Step 3 dynamic verify: __:__ ok
Step 4 reboot reader1: __:__ start / __:__ available
Step 5 failover: __:__ start / __:__ both available
Step 6 static verify: __:__ ok (pg_stat_statements returned __ rows)
Step 7 snapshot drop: __:__
Step 8 PR #170 deploy: __:__ start / __:__ Running
Sentry first event: __:__
Step 9 Playwright check:
- login submit → landing ready: __ ms wall-clock / __ ms navigation
- backend /api/ requests (URL, status, ms):
- requests > 500ms (slow-query threshold): ____
Anomalies: ____

Alpha002-stage — operator: ____

(same shape)

Alpha001-demo — operator: ____

(same shape)

Alpha001-prod — operator: ____ — window: ____

(same shape; note: failover step includes instance-class swap, expect +5–10 min vs other environments)

References

infrastructure PR #456 — Aurora parameter group + production sizing (PDEV-479) — 35180de
infrastructure PR #457 — {Infra}-SentryDsn secret (PDEV-500) — 929b3c8
postgres-database-initializer PR #30 — per-DB pg_stat_statements extension (PDEV-498) — afd9798, released as v2.5.0
operations-monolith-component PR #170 — chart bump + Sentry wiring (PDEV-488)
Project goal: _docs/analysis/db-configuration.md (parameter group rationale + activation theory)
Project implementation plan: _docs/implementation/infrastructure/db-plan.md (multi-env rollout sequence that this runbook is the operator-facing version of)
Per-environment status during rollout: append to the per-step log above

Execution Log

Operator: Miguel Pinilla (driven by Claude Code, session perf-upgrades). Timestamps are local PDT from the operator's system clock unless suffixed with Z (UTC); convert as needed.

Prerequisites — 2026-05-14

infrastructure HEAD: 929b3c8c8654868e7629cfe888903596a1666f1b == origin/main. Tree clean. Ancestry check confirms 929b3c8 is included. PASS.
ghcr.io/arda-cards/postgres-database-initializer:2.5.0 — docker manifest inspect returned 0. PASS.
1Password op://Arda-SystemsOAM/be-sentry-dsn/dsn — initial read returned MISSING; retry after operator confirmed the entry exists succeeded (op whoami shows miguel@arda.cards signed in). PASS.
AWS SSO — amm.sh self-manages; non-amm.sh aws commands will prompt the operator on session expiry.

Alpha002-dev

Operator: Miguel Pinilla. Driver: Claude Code.

Drift checks

aws rds describe-db-clusters --db-cluster-identifier Alpha002-dev-AuroraCluster ... --profile Alpha002-Admin →
- Writer: alpha002-dev-auroraclusterwriter
- Reader1: alpha002-dev-auroraclusterreader1
- RDS lowercases instance identifiers; cluster identifier is case-preserved on the lookup. Not a rename — proceeding without operator consultation. Steps 4–6 will use the lowercased IDs.
kubectl --context Alpha002 get ns dev-operations → namespace dev-operations exists, age 272d. Matches the placeholder convention. No drift.
AWS SSO for Alpha002-Admin was expired on first attempt; operator ran aws sso login --profile Alpha002-Admin and confirmed.

Step 1 — Snapshot

Snapshot ID: alpha002-dev-pdev479-pre-rollout
aws rds create-db-cluster-snapshot ... at 11:09:55 PDT — accepted, status creating, snapshot create time 2026-05-14T18:09:56Z.
aws rds wait db-cluster-snapshot-available ... returned at 11:13:56 PDT (≈4 min). Final status available, progress 100%.
Step 1 PASS.

Step 2 — `amm.sh` attempt 1 — FAIL (1P)

Start 11:15:29 PDT, end 11:16:33 PDT (~64s), exit 1.
Log: infrastructure/scratch/amm-Alpha002-dev.log.
Root cause: [ERROR] could not read secret 'op://Arda-SystemsOAM/Amplify_GitHub_AccessToken/password': error initializing client: authorization timeout.
Note: amm.sh resolves a second 1P secret beyond the Sentry DSN called out in Plan prereq #4 — the Amplify GitHub access token. The Plan’s prereq check is therefore insufficient to fully validate 1P access; operator should consider adding this item to a future revision of the Plan (deferred — Plan is locked).
Post-failure CloudFormation state:
- Alpha002-Secrets stack does not exist (ValidationError).
- Sentry DSN secret value is empty / unread.
- Cluster parameter group still default.aurora-postgresql16 (engine default — unchanged).
Per runbook failure handling (Step 2, “amm.sh aborts at the 1P resolution step”): nothing was deployed, no rollback needed.
Operator action required: re-sign into op (session expired — op whoami returns “account is not signed in”). Snapshot from Step 1 still valid as rollback insurance.

Step 2 — `amm.sh` attempt 2 — FAIL (Plan/script drift)

Start 11:21:55 PDT, exit 1 within seconds.
Root cause: amm.sh auto-derives the AWS profile name as Admin-${infrastructure} (line 240). For Alpha002 that produces Admin-Alpha002, which does not exist on this machine; the real profile is Alpha002-Admin (naming asymmetry vs. Admin-Alpha1). Also: aws_region defaulted to "".
Operator decision: pass overrides on the command line — ./amm.sh --profile Alpha002-Admin --region us-east-1 Alpha002 dev. No script or AWS-config changes. Same approach will be used for the Alpha002-stage run.
Plan implication (deferred — Plan is locked): the runbook’s Step-2 command should call out the override pattern for Alpha002, or the AWS config should grow an Admin-Alpha002 alias.

Step 2 — `amm.sh` attempt 3 — PARTIAL (1P timeout at Step 2.2.4)

Start 11:22:25 PDT, end 11:34:33 PDT (~12 min), exit 1.
Reached Step 2.2.4 (“Secrets”) after full CDK deploy succeeded.
Failure: [ERROR] could not read secret 'op://Arda-DevOAM/ARDA-API-KEY/password': error initializing client: authorization timeout. A second 1P item beyond the Sentry DSN + Amplify GitHub token; this lives in the partition vault Arda-DevOAM. The 1P CLI re-prompted during the long run and the prompt was not surfaced (see operator note re: biometric-unlock-for-CLI prompts not appearing when op is invoked from inside a script).
Rollout-critical artifacts already in place at this point:
- Alpha002-Secrets stack CREATE_COMPLETE (18:23:33Z)
- Alpha002-SentryDsn secret value shape OK
- Cluster parameter group is alpha002-dev-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-dvlftis1srce (not engine default)
- Alpha002-dev-AuroraDBCluster UPDATE_COMPLETE (18:27:22Z)
Operator decision: retry amm.sh so it runs to completion; the remaining work is idempotent.
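
The 1Password failures in attempts 1 and 3 suggest checking every secret reference that `amm.sh` touches before starting, rather than only the Sentry DSN. A minimal preflight sketch follows. It uses the three references observed in this log for Alpha002-dev; that list may not be exhaustive, and `check_op_refs` is a hypothetical helper, not part of `amm.sh`.

```shell
# Hypothetical preflight: try to read each 1Password reference and report
# the ones that cannot be read. Returns non-zero if any read fails.
# Secret values are discarded, never printed.
check_op_refs() {
  failed=0
  for ref in "$@"; do
    if op read "$ref" >/dev/null 2>&1; then
      echo "OK      $ref"
    else
      echo "MISSING $ref"
      failed=1
    fi
  done
  return "$failed"
}

# Example call (references as seen in this log for Alpha002-dev):
# check_op_refs \
#   "op://Arda-SystemsOAM/be-sentry-dsn/dsn" \
#   "op://Arda-SystemsOAM/Amplify_GitHub_AccessToken/password" \
#   "op://Arda-DevOAM/ARDA-API-KEY/password"
```

A passing preflight only shows that the session is valid at that moment. Attempt 3 failed because the CLI re-prompted partway through a 12-minute run, so the check reduces surprises but does not remove the need to keep the desktop app unlocked.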

Step 2 — `amm.sh` attempt 4 — PASS

Start 11:37:15 PDT, end 11:42:12 PDT (~5 min), exit 0, status succeeded.
Re-runs of the earlier CDK deploys were no-ops. The previously-failing post-CDK secret push at Step 2.2.4 succeeded this time and the script reached Step 3: Done!.
All four runbook success checks still green (re-validated after attempt 3; no further work needed).
Step 2 PASS.

Step 3 — Verify dynamic parameters

Verified ~9 min after amm.sh finish (>>90s requirement met).
Bastion module chosen: system.reference.item → DB dev-operations.item_db. Pod miguel-psql-show-dev-<pid> (postgres:16-alpine3.20, ephemeral, cleaned up via EXIT trap).
Dynamic parameter values from psql against the writer endpoint:
- log_min_duration_statement = 500ms ✅ (expected 500ms)
- log_statement = ddl ✅ (expected ddl)
- log_lock_waits = on ✅ (expected on)
- log_temp_files = 0 ✅ (expected 0)
Pre-reboot static-parameter readings:
- shared_preload_libraries: permission denied — the module DB user (dev-operations.ItemDb) lacks pg_read_all_settings. Expected to still be the old value at this stage; will reattempt at Step 6 from the post-failover writer (may hit the same role limitation — flagged for that step).
- max_connections = 401 (t3.medium default; runbook expected “still old value (pending reboot)” — consistent).
Step 3 PASS.

Step 4 — Reboot reader1

Target: alpha002-dev-auroraclusterreader1.
aws rds reboot-db-instance at 11:57:20 PDT — accepted, status rebooting.
aws rds wait db-instance-available returned at 11:58:23 PDT (~63s).
Post-reboot status: available, parameter-apply status in-sync (no pending-reboot residue — static params were picked up).
Writer (alpha002-dev-auroraclusterwriter) continued serving traffic throughout.
Step 4 PASS.

Step 5 — Failover to reader1

Target: alpha002-dev-auroraclusterreader1.
aws rds failover-db-cluster accepted at 12:03:03 PDT.
Initial wait db-instance-available calls returned within 1–2s because the failover had not yet transitioned instances out of available — the wait pattern in the runbook only catches the trailing edge of the state machine. Switched to a 5s-poll loop on cluster status / writer designation.
12:03:34 PDT — cluster status failing-over, writer already flipped to alpha002-dev-auroraclusterreader1.
12:07:36 PDT — cluster status returned to available.
Final state:
- Writer: alpha002-dev-auroraclusterreader1 (was reader).
- Reader: alpha002-dev-auroraclusterwriter (was writer).
- Both db.t3.medium, both ParamApply: in-sync.
Plan implication (deferred — Plan is locked): Step 5’s success-check commands as written can race the failover. A 5s-poll on cluster status until available is more reliable than the wait calls.
Step 5 PASS.
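
A polling loop of the kind described in the Plan implication above might look like the sketch below. It waits until the cluster is `available` and the writer designation has moved to the failover target, so a poll taken before the failover starts cannot pass. The function names, the default 5s interval, and the poll cap are assumptions for illustration; only the `aws rds describe-db-clusters` call and its field names are real CLI surface.

```shell
# Hypothetical helpers for watching a failover.
cluster_status() {
  aws rds describe-db-clusters --db-cluster-identifier "$1" \
    --query 'DBClusters[0].Status' \
    --output text --region us-east-1 --profile "$2"
}

cluster_writer() {
  aws rds describe-db-clusters --db-cluster-identifier "$1" \
    --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier | [0]' \
    --output text --region us-east-1 --profile "$2"
}

# Polls until the cluster is "available" AND the writer is the target.
# Returns 0 on success, 1 if max_polls is exhausted first.
wait_for_failover() {
  cluster_id="$1"; target_writer="$2"; profile="$3"
  interval_s="${4:-5}"; max_polls="${5:-120}"
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    status=$(cluster_status "$cluster_id" "$profile")
    writer=$(cluster_writer "$cluster_id" "$profile")
    echo "$(date +%H:%M:%S) cluster=${status} writer=${writer}"
    if [ "$status" = "available" ] && [ "$writer" = "$target_writer" ]; then
      return 0
    fi
    polls=$((polls + 1))
    sleep "$interval_s"
  done
  return 1
}
```

For this partition the call would be along the lines of `wait_for_failover Alpha002-dev-AuroraCluster alpha002-dev-auroraclusterreader1 Alpha002-Admin`.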

Step 6 — Verify static parameters + `pg_stat_statements`

Bastion pod via system.reference.item module against the new writer (alpha002-dev-auroraclusterreader1):

SHOW max_connections = 401. The cluster parameter group’s formula is LEAST({DBInstanceClassMemory/9531392},5000), which yields ~401 for db.t3.medium. That is the same value the engine default gives for this class, so the number alone cannot show which group is in effect; proceed via the AWS-API cross-check to confirm the source is the new group, not the default.
SHOW shared_preload_libraries — permission denied (module user lacks pg_read_all_settings, same as Step 3). Falling back to functional + API checks below.
SELECT count(*) FROM pg_stat_statements = 184 rows. Strongest possible verification: the view returns data only when the extension is loaded via shared_preload_libraries. No “must be loaded via shared_preload_libraries” error. PASS.
AWS API cross-check against parameter group alpha002-dev-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-dvlftis1srce:
- shared_preload_libraries = pg_stat_statements (ApplyType static)
- max_connections = LEAST({DBInstanceClassMemory/9531392},5000) (ApplyType static)
- Both ApplyMethod: pending-reboot (group-level apply policy for static params — not a current pending state on the instance, which is ParamApply: in-sync).
Step 6 PASS.
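
To make the 401 figure concrete: the formula divides `DBInstanceClassMemory` (in bytes) by 9531392 and caps the result at 5000. `DBInstanceClassMemory` is the instance memory left after RDS's own processes, so it is lower than the nominal 4 GiB of a db.t3.medium. The arithmetic below is a sketch; the second memory figure is back-calculated from the observed 401 rather than read from AWS, and the helper name is hypothetical.

```shell
# LEAST({DBInstanceClassMemory/9531392},5000), using integer division
# as the shell performs it.
max_connections_for() {
  mem_bytes="$1"
  conns=$((mem_bytes / 9531392))
  if [ "$conns" -gt 5000 ]; then
    conns=5000
  fi
  echo "$conns"
}

max_connections_for 4294967296   # nominal 4 GiB        -> 450
max_connections_for 3822088192   # 401 * 9531392 bytes  -> 401
```

So an observed 401 corresponds to roughly 3.82 GB (about 3.56 GiB) of usable memory on the instance, not the full 4 GiB.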

Step 7 — Delete snapshot

Operator confirmed Performance Insights for the new writer (db-CA2DXHCIXASDGHILIGGULDT2HU) shows ~zero AAS — two tiny bars at 12:07–12:08 (HikariCP reconnects post-failover) then flat, consistent with dev idle load.
aws rds delete-db-cluster-snapshot ... Alpha002-dev-pdev479-pre-rollout at 12:26:34 PDT — accepted, prior status available. Subsequent describe-db-cluster-snapshots returned DBClusterSnapshotNotFoundFault (deletion was effectively immediate).
Step 7 PASS.

Step 8 — Deploy PR #170 (operations chart bump) — PASS (with restart)

Deploy executed via operations CI workflow run https://github.com/Arda-cards/operations/actions/runs/25871797543 (deploy (dev) / dev: 2.24.0, 2m 7s, green). However this ran ~2h before the Aurora rollout work today, and the running pods at the start of this step were 170m old.
Pre-restart check on operations-7bd795b655-*:
- Deployment spec had SENTRY_DSN env wired from secret be-sentry-dsn/dsn.
- Running pods env did not have SENTRY_DSN populated — only SENTRY_RELEASE, SENTRY_ENVIRONMENT, SENTRY_TRACES_SAMPLE_RATE.
- Root cause: pods rolled out at ~9:45 PDT, but Alpha002-SentryDsn was created at 11:23 PDT by amm.sh attempt 3. The ExternalSecret synced (Ready=True reason=SecretSynced) later, but the old pods never saw the new secret.
Verified be-sentry-dsn k8s secret exists in dev-operations and contains the expected DSN value.
kubectl rollout restart deployment/operations issued at ~12:36 PDT. Rollout completed cleanly (two new replicas up, two old replicas terminated).
Post-restart verification on operations-59d5d94b99-*:
- Both pods Running 1/1, RESTARTS 0.
- SENTRY_DSN env now present in operations container.
- All 6 init containers (init-create-{order,businessaffiliate,item,facility,kanban,station}-db) exited 0, image postgres-database-initializer:2.5.0. Each ran CREATE EXTENSION IF NOT EXISTS pg_stat_statements (no-op, already present) and the new fail-loud PERFORM 1 FROM pg_stat_statements LIMIT 1 verify — all succeeded.
Plan implication (deferred — Plan is locked): when the CI helm deploy runs before amm.sh, the resulting pods come up before Alpha002-SentryDsn exists and do not pick up the DSN. Future revisions should either (a) sequence helm deploy strictly after amm.sh for the partition, or (b) add an explicit kubectl rollout restart deployment/operations as the last action in Step 8.
Linear PDEV-509 updated with an observation from this rollout (zero AAS on the demoted instance post-failover) and a worked-out options menu for read/write split via two HikariCP pools.
Step 8 PASS.
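
Option (b) from the Plan implication above could be made conditional, restarting only when the running container lacks the variable. The sketch below assumes the container is named `operations` and that its image ships `printenv`; `ensure_sentry_dsn` is a hypothetical helper name. The DSN value itself is never printed.

```shell
# Hypothetical check-and-restart for the operations deployment.
ensure_sentry_dsn() {
  context="$1"; namespace="$2"
  # Assumption: container "operations" exists and has printenv available.
  dsn=$(kubectl --context "$context" -n "$namespace" \
    exec deploy/operations -c operations -- printenv SENTRY_DSN 2>/dev/null) || dsn=""
  if [ -n "$dsn" ]; then
    echo "SENTRY_DSN present; no restart needed"
    return 0
  fi
  echo "SENTRY_DSN missing; restarting deployment/operations"
  kubectl --context "$context" -n "$namespace" rollout restart deployment/operations &&
  kubectl --context "$context" -n "$namespace" rollout status deployment/operations
}
```

For this partition the call would be `ensure_sentry_dsn Alpha002 dev-operations`.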

Step 9 — Playwright application-health check

URL: https://dev.alpha002.app.arda.cards → redirected to /signin then auto-completed to /items?justSignedIn=true via browser-saved credentials (no explicit submit click needed; the auth flow ran end-to-end).
Credentials used: op://Private/Miguel-new-dev/{username,password} → resolved successfully (login succeeded, account miguel-new-dev@arda.cards, role Account Admin).
Post-login landing: Items list with 3 rows rendered; sidebar (Dashboard, Items, Order Queue, Receiving) populated; user menu shows correct identity.

Page timings (Performance Navigation Timing on /items):

Metric	Value
`responseEnd`	4193 ms
`domContentLoaded`	4400 ms
`loadEvent`	4848 ms
`first-paint`	4424 ms
`first-contentful-paint`	8824 ms
total `duration`	4848 ms
resource entries	106

Backend request times (22 /api/ + Cognito calls):

Total: 22 requests. Median: 536 ms. All returned 200 except one 401 /api/pylon/email-hash that immediately retried to 200.
Requests >500 ms (the slow-query threshold the new parameter group is configured to log): 12 of 22, including 8 over 2 s.
- 3463 ms POST /api/arda/kanban/kanban-card/details/requesting
- 3384 ms POST /api/auth/secret-hash
- 3289 ms POST /api/auth/secret-hash
- 2751 ms POST /api/arda/kanban/kanban-card/details/requesting
- 2663 ms POST /api/arda/kanban/kanban-card/details/requested
- 2579 ms POST /api/arda/items/query-ssrm
- 2449 ms POST /api/arda/kanban/kanban-card/details/in-process
- 2444 ms POST /api/arda/kanban/kanban-card/details/requested
- 1203 ms POST cognito-idp.us-east-1.amazonaws.com/ (Cognito auth)
- 574 ms POST /api/arda/kanban/kanban-card/query-details-by-item
- 536 ms POST /api/arda/kanban/kanban-card/query-details-by-item
- 515 ms GET /api/arda/kanban/kanban-card/query-by-item?eId=...
These are pre-existing slow paths, surfaced now by the rollout’s slow-query logging at the log_min_duration_statement=500ms threshold. The repeated kanban-card detail endpoints, at roughly 2.4–3.5 s each, are likely the most impactful starting point for the performance investigation this rollout enables. Where the time is spent in a single statement slower than 500 ms, expect that statement to appear in the Postgres slow-query log on the new writer (cluster Alpha002-dev-AuroraCluster).

Console errors (9 total):

1 × 401 /api/pylon/email-hash — retried to 200 in subsequent call.
7 × [WorkspaceSwitcher] Zero workspaces returned — this should never happen for an authenticated user. — appears unrelated to this rollout (frontend error condition; the page rendered with data despite the message). Worth a follow-up ticket if not already tracked.

Pass / fail: PASS (functional). Login flow works, page renders, backend returns 200 for all real calls. The slow-request count is expected to drop once PDEV-509 (graceful degradation / read split) and follow-up query tuning land; tracked separately.

Step 9 PASS.

Alpha002-dev — rollout complete

All 9 steps green. Pausing for operator go-ahead before starting Alpha002-stage.

Alpha002-stage

Operator: Miguel Pinilla. Driver: Claude Code. Operator authorized start at ~12:43 PDT.

Drift checks

aws rds describe-db-clusters --db-cluster-identifier Alpha002-stage-AuroraCluster ... --profile Alpha002-Admin →
- Writer: alpha002-stage-auroraclusterwriter
- Reader1: alpha002-stage-auroraclusterreader1
- Same lowercase pattern as dev. No drift.
kubectl --context Alpha002 get ns stage-operations → exists, age 279d. Matches placeholder convention. No drift.
Pre-rollout Aurora cluster param group: default.aurora-postgresql16 (engine default) — confirms stage has not yet been touched.
op whoami → “account is not signed in”. 1P session expired again; operator action required before Step 2.

Step 1 — Snapshot

Snapshot ID: alpha002-stage-pdev479-pre-rollout
Created at 12:49:13 PDT (run in background while resolving the 1P session + Linear ticket draft for PDEV-513).
aws rds wait db-cluster-snapshot-available returned at 12:54:46 PDT (~5.5 min). Final status available, progress 100%.
Step 1 PASS.

Step 2 — `amm.sh` PASS (first attempt)

Start 14:34:03 PDT, end 14:42:27 PDT (~8 min), exit 0, status: succeeded. No retries needed (1P session was warm in this shell after op signin re-authorized the parent process).
Success checks (all green):
- Alpha002-Secrets stack CREATE_COMPLETE (already created during dev’s amm.sh, not re-created).
- Cluster parameter group is alpha002-stage-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-gnhrcc7nclxh (not engine default).
- Both instances db.t3.medium, ParamApply: in-sync.
Notable CFN behavior observed: amm.sh’s CDK deploy automatically rebooted both instances during the partition stack update (writer 14:36:59 → 14:39:05, reader1 14:39:06 → 14:40:10). Investigation against dev’s CFN history (attempt 3) shows the same behavior occurred there too (writer 11:28:06 → 11:30:13, reader1 11:30:14 → 11:31:19), so Steps 4 and 5 in the runbook were effectively redundant for dev: the static parameters were already live before the manual reboot and failover.
Root cause / Plan implication (deferred — Plan is locked): infrastructure/src/main/cdk/platforms.ts sets databaseTuning.applyImmediately per partition. For dev / stage / demo: applyImmediately: true → CDK propagates the cluster param-group attach to each AWS::RDS::DBInstance modification with applyImmediately=true → AWS reboots immediately. For prod: applyImmediately: false → CDK queues changes; manual reboot required, plus an extra modify-db-instance --apply-immediately step to flush the queued instance-class swap (db.t3.medium → db.r7g.large). The runbook conflates “param group attached” with “param group live” and over-prescribes for the applyImmediately=true partitions.
Step 2 PASS.
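
Whether Steps 4 and 5 are still needed for a partition can be checked directly instead of inferred from the `applyImmediately` setting. The sketch below lists cluster members whose cluster parameter group status is anything other than `in-sync`; the helper name is hypothetical, and it covers only the parameter group, not a queued instance-class change.

```shell
# Hypothetical check: prints the identifiers of cluster members whose
# cluster parameter group status is not "in-sync" (e.g. "pending-reboot").
# Empty output means no member is waiting on a reboot for the group.
members_needing_reboot() {
  aws rds describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].DBClusterMembers[].[DBInstanceIdentifier,DBClusterParameterGroupStatus]' \
    --output text --region us-east-1 --profile "$2" |
  awk '$2 != "in-sync" { print $1 }'
}
```

An empty result from `members_needing_reboot Alpha002-stage-AuroraCluster Alpha002-Admin` after Step 2 would indicate that the manual reboot can be skipped for parameter activation.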

Step 3 — Verify dynamic parameters

Bastion pod via system.reference.item module against the writer (alpha002-stage-auroraclusterwriter at the time of check):

log_min_duration_statement = 500ms ✅
log_statement = ddl ✅
log_lock_waits = on ✅
log_temp_files = 0 ✅

Static-param probe (since CDK already rebooted, these should be live):

max_connections = 401 (consistent with new param group’s LEAST({DBInstanceClassMemory/9531392},5000) formula for db.t3.medium).
SELECT count(*) FROM pg_stat_statements → ERROR: relation "pg_stat_statements" does not exist. The extension was not yet created in stage-operations.item_db. Expected: stage operations is still on the old chart (init 2.3.0); the 2.5.0 init container has not run yet. Will be installed by Step 8 (helm upgrade) and re-verified afterward.
Step 3 PASS (dynamic params correct; pg_stat_statements extension creation deferred to Step 8).

Step 4 — Reboot reader1 — SKIPPED

Per operator decision after the Plan-vs-CDK reconciliation above: both stage instances were already rebooted by amm.sh’s CDK deploy during Step 2 (writer 14:36:59–14:39:05, reader1 14:39:06–14:40:10). An additional manual reboot would be redundant.

Step 5 — Failover to reader1 (smoke test)

Run as a smoke test per operator direction — not strictly required for parameter activation (CDK already covered that), but useful as a validation that failover works in stage.

aws rds failover-db-cluster ... --target-db-instance-identifier alpha002-stage-auroraclusterreader1 at 14:54:59 PDT.
Polling cluster status / writer designation every 10s:
- 14:55:02 cluster=available, writer=...writer (pre-flip)
- 14:55:13 cluster=failing-over, writer=...writer
- 14:55:25 cluster=failing-over, writer=...reader1 ← flipped
- 14:55:37 cluster=available, writer=...reader1 ← settled
Total failover window: ~38s. Faster than dev (~4 min cluster-status return), probably because stage has lighter ambient load.
Step 5 PASS.

Step 8 — Deploy PR #170 (operations chart 2.24.0) — PASS

Operator approved the queued stage deploy in the operations CI run.
Rollout watched live:
- New ReplicaSet operations-5dd7cc9546 (image operations:2.24.0, init postgres-database-initializer:2.5.0) spun up at 14:58 PDT.
- Old ReplicaSet operations-7c894bf4b9 (was running operations:2.23.0) terminated.
- kubectl rollout status returned successfully rolled out at 14:59:00 PDT.
- Both new pods (-m2vbt, -tgrg7) Running 1/1, RESTARTS 0.
No race condition this time — stage deploy happened after amm.sh had already created Alpha002-Secrets (during dev’s run), so SENTRY_DSN was wired correctly from the start; no manual rollout-restart needed.
Verifications on new pods:
- All 6 init containers (init-create-{order,businessaffiliate,item,facility,kanban,station}-db) exited 0.
- SENTRY_DSN populated, SENTRY_ENVIRONMENT=Alpha002-stage.
Step 8 PASS.

Step 6 — pg_stat_statements verify (post-Step-8)

Re-ran the bastion SELECT against the stage writer:

SELECT count(*) FROM pg_stat_statements = 659 rows. Significantly more than dev’s 184; stage has been collecting query stats since amm.sh rebooted both instances (reboots finished at 14:40:10 PDT, roughly 20 min before this check going by the Step 8 and Step 7 timestamps).
Step 6 PASS (deferred from earlier — now satisfied).

Step 7 — Delete snapshot — PASS

aws rds delete-db-cluster-snapshot ... Alpha002-stage-pdev479-pre-rollout at 15:02:13 PDT — accepted, prior status available.
Step 7 PASS.

Step 9 — Playwright application-health check

URL: https://stage.alpha002.app.arda.cards → /signin → typed credentials from op://Private/Arda-stage/{username,password} → Sign in → landed on /items?justSignedIn=true.
User: miguel@arda.cards (resolved from 1P).

Page timings:

Metric	Stage	Dev (for comparison)
`responseEnd`	598 ms	4193 ms
`domContentLoaded`	696 ms	4400 ms
`loadEvent`	810 ms	4848 ms
`first-paint`	728 ms	4424 ms
`first-contentful-paint`	1260 ms	8824 ms

Backend request times (22 calls, median 399 ms):

Requests > 500 ms: 7 of 22 (dev: 12 of 22). Peak 1099 ms on Cognito (expected auth overhead). Application API peak 593 ms on POST /api/arda/kanban/kanban-card/query-details-by-item.
Slowest 8 backend calls all between 475–593 ms (vs dev’s 2444–3463 ms range on the same kanban endpoints).
Stage is materially faster than dev — likely a combination of warmer caches, fewer items, and a more recently rebooted writer with no accumulated noisy queries.

Console errors: 0 (dev: 9, including 7× WorkspaceSwitcher “Zero workspaces returned”). The absence of the WorkspaceSwitcher error on stage with a different user account reinforces that the issue tracked in PDEV-513 is data-/account-state driven, not a generic code bug. Updated PDEV-513 with this observation implicitly via this log; explicit Linear comment skipped to avoid churn.

Step 9 PASS.

Alpha002-stage — rollout complete

All 9 steps green. Pausing for operator go-ahead before starting Alpha001-demo.

Alpha001-demo

Operator: Miguel Pinilla. Driver: Claude Code. Operator authorized start after running aws sso login --profile Admin-Alpha1.

Drift checks

aws rds describe-db-clusters --db-cluster-identifier Alpha001-demo-AuroraCluster ... --profile Admin-Alpha1 →
- Writer: alpha001-demo-auroraclusterwriter
- Reader1: alpha001-demo-auroraclusterreader1
- Same lowercase pattern as dev/stage. No drift.
kubectl --context Alpha001 get ns demo-operations → exists, age 87d. Matches placeholder convention. No drift.
Pre-rollout Aurora cluster param group: default.aurora-postgresql16 (engine default) — confirms demo has not yet been touched.
1P paths all reachable: op://Arda-SystemsOAM/be-sentry-dsn/dsn OK, op://Arda-SystemsOAM/Amplify_GitHub_AccessToken/password OK, op://Arda-DemoOAM/ARDA-API-KEY/password OK.

Step 1 — Snapshot

Snapshot ID: alpha001-demo-pdev479-pre-rollout
Created at 16:34:57 PDT.
aws rds wait db-cluster-snapshot-available returned at 16:38:30 PDT (~3.5 min). Final status available, progress 100%.
Step 1 PASS.

Step 2 — `amm.sh` attempt 1 — PARTIAL (1P timeout at Step 2.2.4)

Start 17:01:29 PDT, end 17:13:31 PDT (~12 min), exit 1.
Same failure mode as dev attempt 3: [ERROR] could not read secret 'op://Arda-DemoOAM/ARDA-API-KEY/password': error initializing client: authorization timeout. Operator: “I never got a prompt” — the desktop integration’s prompt did not surface during the long-running script. Confirmed later that even an interactive op signin from a fresh shell could time out silently when the desktop app is locked.
Rollout-critical artifacts already in place at failure point:
- Alpha001-Secrets stack CREATE_COMPLETE (17:02:33 PDT)
- Alpha001-SentryDsn Secrets Manager value shape OK
- Cluster param group is alpha001-demo-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-v3sluztqcnrj
- CDK rebooted both instances during the partition deploy (writer 17:07:10–17:09:16, reader1 17:09:17–17:10:22) — same applyImmediately=true behavior as dev and stage.

Step 2 — `amm.sh` retry 1 — FAIL (op signin timeout)

17:45:24 PDT, op signin itself returned authorization timeout after 60s because the desktop app had re-locked and the prompt did not surface. amm.sh never ran (short-circuited by &&).

Step 2 — `amm.sh` retry 2 — PASS

Operator unlocked desktop 1Password, biometric prompt surfaced for op signin, approval granted.
Start 17:47:31 PDT, end 17:52:55 PDT (~5.5 min), exit 0, status: succeeded.
All re-runs of CDK deploys were no-ops; the previously-failing post-CDK secret push at Step 2.2.4 succeeded.
Step 2 PASS.

Step 3 — Verify dynamic parameters

Bastion via system.reference.item against the demo writer:

log_min_duration_statement = 500ms ✅
log_statement = ddl ✅
log_lock_waits = on ✅
log_temp_files = 0 ✅
max_connections = 401 (correct for db.t3.medium per the new param group’s LEAST({DBInstanceClassMemory/9531392},5000) formula).
SELECT count(*) FROM pg_stat_statements → “relation does not exist” — expected; demo operations is still on the OLD chart (init postgres-database-initializer:2.3.0, app operations:2.23.0, no SENTRY_DSN env, no be-sentry-dsn secret yet). Will be installed by Step 8.
Step 3 PASS.

Step 4 — Reboot reader1 — SKIPPED

CDK already rebooted both instances during Step 2’s partition deploy (writer 17:07:10–17:09:16, reader1 17:09:17–17:10:22). Same as dev and stage; an additional manual reboot would be redundant.

Step 5 — Failover to reader1 — SKIPPED

Operator decision: stage already validated the failover path (stage Step 5, 14:54:59–14:55:37 PDT, ~38s); no additional smoke test value in repeating on demo.

Step 6 — Verify pg_stat_statements — DEFERRED

Pending Step 8: demo operations is still on init container 2.3.0 which does not create the pg_stat_statements extension. Will re-run the SELECT after the operations chart upgrade.

Step 8 — Deploy operations chart 2.24.0 — PASS

Operator triggered the demo deploy.
Rollout watched: new ReplicaSet operations-579bdd455d spun up at ~18:08 PDT, kubectl rollout status returned success at 18:09:26 PDT (~30s rollout). Old operations-56684688fb pods terminated.
New pods both Running 1/1, RESTARTS 0.
New images: init postgres-database-initializer:2.5.0, app operations:2.24.0 (replaced 2.3.0 / 2.23.0).
All 6 init containers (init-create-{order,businessaffiliate,item,facility,kanban,station}-db) exited 0.
SENTRY_DSN populated, SENTRY_ENVIRONMENT=Alpha001-demo, SENTRY_TRACES_SAMPLE_RATE=0.1 (note: demo uses 10% sampling vs 100% on dev/stage — expected per-env override).
ExternalSecret be-sentry-dsn reports Ready=True, reason=SecretSynced.
Step 8 PASS.

Step 6 — pg_stat_statements verify (post-Step-8)

SELECT count(*) FROM pg_stat_statements against the demo writer via system.reference.item bastion = 839 rows.
Step 6 PASS.

Step 7 — Delete snapshot — PASS

aws rds delete-db-cluster-snapshot ... Alpha001-demo-pdev479-pre-rollout at 18:19:13 PDT — accepted, prior status available.
Step 7 PASS.

Step 9 — Playwright application-health check

URL: https://demo.alpha001.app.arda.cards → initial navigation showed “Your session has expired. Please sign in again.” (stale cookie from a prior session) → redirected to /signin?next=%2F → typed credentials from op://Private/arda-demo/{username,password} → Sign in → landed on /items?justSignedIn=true.
User: miguel@arda.cards (Account Admin). Items grid empty (no rows in demo). “For Trial Use Only” banner visible.

Page timings:

Metric	Demo	Stage	Dev
`responseEnd`	4919 ms	598 ms	4193 ms
`domContentLoaded`	5137 ms	696 ms	4400 ms
`loadEvent`	5404 ms	810 ms	4848 ms
`first-paint`	5152 ms	728 ms	4424 ms
`first-contentful-paint`	9604 ms	1260 ms	8824 ms

Backend request times (19 calls, median 462 ms):

Requests > 500 ms: 9 of 19 (47%). Profile closer to dev than to stage.
Slowest 8:
- 3599 ms POST /api/auth/secret-hash
- 3439 ms POST /api/auth/secret-hash
- 2563 ms POST /api/arda/kanban/kanban-card/details/requested
- 2010 ms POST /api/arda/user-account/query
- 1545 ms POST /api/arda/items/query-ssrm
- 1206 ms POST /api/arda/kanban/kanban-card/details/in-process
- 1190 ms POST /api/arda/kanban/kanban-card/details/requesting
- 876 ms POST cognito-idp.us-east-1.amazonaws.com/

Console errors (14):

1 × 401 /api/pylon/email-hash (transient, retry pattern).
3 × 400 from cognito-idp plus [CLIENT] Token refresh failed / Authentication token has expired — side effect of the stale session cookie that triggered the “session expired” toast on initial navigation. Resolved after fresh sign-in.
7 × AG Grid Enterprise “License Key Not Found” notice — cosmetic, trial mode. Demo-specific (not seen on dev/stage). Worth a separate ticket if not already tracked: production-bound charts should ship with a real license to avoid the watermark / banner.
No WorkspaceSwitcher errors (PDEV-513 is dev-account-specific, confirmed across stage and now demo).

Pass / fail: PASS. App functional, all real API calls return 200 modulo the transient pylon 401 (auto-retried). Performance profile is “slow path” like dev rather than “fast path” like stage — likely a cold-cache or scale-of-data effect that’s separate from this rollout.

Step 9 PASS.

Alpha001-demo — rollout complete

All 9 steps green. Pausing for operator go-ahead before starting Alpha001-prod.

Alpha001-prod

Operator: Miguel Pinilla. Driver: Claude Code. Window: 2026-05-14 23:00 PDT (planned failover moment). Actual failover fired at 23:16:22 PDT after recovering from a hard CFN failure (see Step 2 attempt 1 below).

Drift checks

AWS profile Admin-Alpha1 SSO confirmed.
Aurora cluster Alpha001-prod-AuroraCluster: writer alpha001-prod-auroraclusterwriter, reader1 alpha001-prod-auroraclusterreader1. Same lowercase instance-name pattern as the other partitions.
kubectl --context Alpha001 get ns prod-operations → exists, age 265d.
Pre-rollout cluster: param group default.aurora-postgresql16, both instances db.t3.medium, PerformanceInsightsRetentionPeriod=465, DatabaseInsightsMode=advanced. Advanced mode with 465-day retention turned out to be drift from the CDK construct (which models DatabaseInsightsMode=undefined ⇒ standard and performanceInsightRetention=DEFAULT ⇒ 7 days). Will reconcile during Step 2.
1P paths all reachable after op signin.
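
The Insights drift noted above is the kind of mismatch a mechanical pre-check could surface before Step 1. The following is a minimal sketch, assuming AWS CLI v2 with the Admin-Alpha1 profile active; the function name `check_insights_drift` and the expected values passed to it are illustrative and not part of the runbook.

```shell
# Hypothetical helper: compare live Insights settings with what the CDK construct models.
# Usage: check_insights_drift <cluster-id> <expected-mode> <expected-retention-days>
check_insights_drift() {
  local cluster="$1" want_mode="$2" want_days="$3" live
  live=$(aws rds describe-db-clusters \
    --db-cluster-identifier "$cluster" \
    --query 'DBClusters[0].[DatabaseInsightsMode,PerformanceInsightsRetentionPeriod]' \
    --output text) || return 2
  # Text output is "<mode><TAB><days>"; split it into $1 and $2.
  set -- $live
  if [ "$1" = "$want_mode" ] && [ "$2" = "$want_days" ]; then
    echo "OK: $cluster insights=$1 retention=$2"
  else
    echo "DRIFT: $cluster insights=$1 retention=$2 (expected $want_mode/$want_days)"
    return 1
  fi
}
```

For this partition the call would be `check_insights_drift Alpha001-prod-AuroraCluster standard 7`, which would have reported DRIFT against the pre-rollout state recorded above.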

Step 1 — Snapshot

Snapshot ID: alpha001-prod-pdev479-pre-rollout
Created 22:25:06 PDT, available 22:30:44 PDT (~5.5 min).
Step 1 PASS.
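
The snapshot step can be sketched as a create call followed by the RDS waiter. This is an illustration only; the function name `snapshot_cluster` is hypothetical and the exact invocation used during the rollout is not recorded in this log.

```shell
# Sketch: take the pre-rollout snapshot and block until it is usable.
# Usage: snapshot_cluster <cluster-id> <snapshot-id>
snapshot_cluster() {
  local cluster="$1" snap="$2"
  aws rds create-db-cluster-snapshot \
    --db-cluster-identifier "$cluster" \
    --db-cluster-snapshot-identifier "$snap" >/dev/null || return 1
  # The waiter polls until the snapshot status is "available".
  aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier "$snap" || return 1
  echo "snapshot $snap available"
}
```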

Step 2 — `amm.sh` attempt 1 — HARD FAIL (Advanced Insights drift)

Start 22:30:44 PDT, end 22:33:28 PDT (~3 min), exit 1.
Root cause: the Alpha001-prod-AuroraDBCluster CFN stack tried to modify the AWS::RDS::DBCluster to set PerformanceInsightsRetentionPeriod to 7 (the CDK default), but the live cluster had DatabaseInsightsMode=advanced, which requires a retention of at least 465 days. AWS rejected the change and CFN tried to roll back; the rollback failed with the same error, leaving the stack in UPDATE_ROLLBACK_FAILED.
Stack audit at this point:
- Alpha001-Networking, Alpha001-Compute, Alpha001-Ingress, Alpha001-CloudWatchLog, Alpha001-Secrets — all UPDATE_COMPLETE/CREATE_COMPLETE (from earlier deploys).
- Alpha001-prod-Imported/Compute/Authentication/BulkStores/ImageStorage — all UPDATE_COMPLETE (succeeded earlier in this attempt before the Aurora failure).
- Alpha001-prod-AuroraDBCluster — UPDATE_ROLLBACK_FAILED.
- Alpha001-prod-Ingress, Alpha001-prod-DnsConfiguration — NOT redeployed (CDK chain stopped at Aurora).
Cluster + instances functionally unchanged; customer traffic unaffected.
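
A stack audit like the one above can be approximated with a single `describe-stacks` query. This is a sketch under the assumption that every relevant stack name starts with the partition prefix; the function names `stack_audit` and `failed_stacks` are hypothetical.

```shell
# Sketch: list "<stack-name> <status>" for every stack whose name starts with a prefix.
# Usage: stack_audit <name-prefix>
stack_audit() {
  aws cloudformation describe-stacks \
    --query "Stacks[?starts_with(StackName, '$1')].[StackName,StackStatus]" \
    --output text
}

# Print only the stacks stuck in a *_FAILED state (for example UPDATE_ROLLBACK_FAILED).
failed_stacks() {
  stack_audit "$1" | awk '$2 ~ /FAILED$/ { print $1 }'
}
```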

Step 2 — Recovery: disable Advanced Insights + continue rollback

aws rds modify-db-cluster --db-cluster-identifier Alpha001-prod-AuroraCluster --database-insights-mode standard --apply-immediately at 22:40:28 PDT.
Cluster + both instances transitioned to standard mode immediately. PI retention now mutable.
continue-update-rollback retry 1 at 22:42:20 PDT failed with “DB cluster isn't available for modification with status configuring-enhanced-monitoring” (transient race with the manual modify above).
Waited for the cluster to return to available, then retried continue-update-rollback; it succeeded and the stack reached UPDATE_ROLLBACK_COMPLETE at 22:55:56 PDT.
Final state: PIRetention dropped to 7, cluster on default.aurora-postgresql16, no Advanced Insights, ready for re-deploy.
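
The recovery sequence above can be sketched as one function that waits out the transient cluster state before resuming the rollback. It is an illustration rather than the commands as typed during the incident; the function name `recover_rollback` and the `POLL_SECONDS` variable are hypothetical.

```shell
# Sketch: leave Advanced mode, wait for the cluster to be modifiable again,
# then resume the failed CloudFormation rollback.
# Usage: recover_rollback <cluster-id> <stack-name>
recover_rollback() {
  local cluster="$1" stack="$2" status
  aws rds modify-db-cluster \
    --db-cluster-identifier "$cluster" \
    --database-insights-mode standard \
    --apply-immediately >/dev/null || return 1
  # Retrying while the cluster is still "configuring-enhanced-monitoring" fails,
  # so poll until it reports "available".
  while :; do
    status=$(aws rds describe-db-clusters --db-cluster-identifier "$cluster" \
      --query 'DBClusters[0].Status' --output text) || return 1
    [ "$status" = "available" ] && break
    sleep "${POLL_SECONDS:-15}"
  done
  aws cloudformation continue-update-rollback --stack-name "$stack" || return 1
  # This waiter succeeds when the stack reaches UPDATE_ROLLBACK_COMPLETE.
  aws cloudformation wait stack-rollback-complete --stack-name "$stack"
}
```

One caveat: the cluster may still report available for a short time right after the modify call, so a real script would likely also confirm that the mode change has landed before continuing.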

Step 2 — `amm.sh` retry — PASS

Start 22:57:25 PDT, end 23:04:15 PDT (~7 min), exit 0.
Alpha001-prod-AuroraDBCluster UPDATE_COMPLETE at 22:59:43.
Notable CFN behavior (different from dev/stage/demo): the partition Aurora stack’s DBInstance modifications returned UPDATE_COMPLETE in ~30s each (no reboot), with the changes queued as PendingModifiedValues rather than applied immediately. This matches applyImmediately=false in platforms.ts for prod.
Alpha001-prod-Ingress UPDATE_COMPLETE 22:59:43 (was skipped in attempt 1).
Alpha001-prod-DnsConfiguration UPDATE_COMPLETE 23:00:03 (also skipped in attempt 1).
Post-amm state:
- Cluster param group: alpha001-prod-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-nanabjqq5qcd
- Both instances db.t3.medium, ParamApply: in-sync, PendingModifiedValues.DBInstanceClass: db.r7g.large queued.
Step 2 PASS.
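
The post-amm check of parameter-group status can be sketched with a query over the cluster members. This assumes the "ParamApply" value in this log corresponds to the `DBClusterParameterGroupStatus` field of each cluster member; the function names are hypothetical.

```shell
# Sketch: list "<instance> <is-writer> <cluster-param-group-status>" per member.
# Usage: member_param_status <cluster-id>
member_param_status() {
  aws rds describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].DBClusterMembers[].[DBInstanceIdentifier,IsClusterWriter,DBClusterParameterGroupStatus]' \
    --output text
}

# Succeeds only when at least one member is listed and all report "in-sync".
all_in_sync() {
  member_param_status "$1" |
    awk '$3 != "in-sync" { bad = 1 } END { exit (bad || NR == 0) }'
}
```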

Step 3 — Verify dynamic parameters

Bastion via system.reference.item against the (still-original) writer:
- log_min_duration_statement = 500ms ✅
- log_statement = ddl ✅
- log_lock_waits = on ✅
- log_temp_files = 0 ✅
max_connections = 401 (pre-class-swap t3.medium default; the 500 override is a static parameter and applies only after an instance restart).
Step 3 PASS.
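
The four checks above can be scripted as a comparison of `SHOW` output against expected values. This is a minimal sketch, assuming `psql` is run from the bastion with the libpq environment variables (PGHOST, PGUSER, PGDATABASE and so on) already pointing at the writer; the function names are hypothetical.

```shell
# Sketch: compare one live setting with the value the parameter group should set.
# Usage: check_setting <name> <expected>
check_setting() {
  local name="$1" want="$2" got
  got=$(psql -At -c "SHOW $name") || return 2
  if [ "$got" = "$want" ]; then
    echo "OK   $name = $got"
  else
    echo "FAIL $name = $got (expected $want)"
    return 1
  fi
}

# Run all four dynamic-parameter checks; non-zero if any of them fails.
verify_dynamic_parameters() {
  local rc=0
  check_setting log_min_duration_statement 500ms || rc=1
  check_setting log_statement ddl || rc=1
  check_setting log_lock_waits on || rc=1
  check_setting log_temp_files 0 || rc=1
  return "$rc"
}
```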

Step 4 — Flush reader1 pending mods (apply-immediately)

Prod-specific equivalent of “reboot reader1”. For applyImmediately=false the runbook’s plain reboot would not flush the queued class swap; an explicit modify-db-instance --apply-immediately is required.
23:06:15 PDT — aws rds modify-db-instance --db-instance-identifier alpha001-prod-auroraclusterreader1 --apply-immediately.
23:07:45 — reader1 enters modifying with pending class swap.
23:12:34 — reader1 visible as db.r7g.large, configuring-enhanced-monitoring.
23:13:37 — reader1 available, db.r7g.large, no pending. Total ~7 min.
Step 4 PASS.
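
The flush above can be sketched as a modify call followed by polling until no class change remains queued. Polling on the pending value, rather than only on instance status, matters because the timeline shows the instance stayed available for about 90 seconds before entering modifying. The function name `flush_pending` and the `POLL_SECONDS` variable are hypothetical.

```shell
# Sketch: apply queued instance modifications now and wait until none remain.
# Usage: flush_pending <instance-id>
flush_pending() {
  local id="$1" state
  aws rds modify-db-instance --db-instance-identifier "$id" \
    --apply-immediately >/dev/null || return 1
  while :; do
    # Text output is "<status> <pending-class>"; the CLI prints "None" for an empty queue.
    state=$(aws rds describe-db-instances --db-instance-identifier "$id" \
      --query 'DBInstances[0].[DBInstanceStatus,PendingModifiedValues.DBInstanceClass]' \
      --output text) || return 1
    set -- $state
    [ "$1" = "available" ] && [ "$2" = "None" ] && break
    sleep "${POLL_SECONDS:-30}"
  done
  echo "$id flushed"
}
```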

Step 5 — Failover to reader1

23:16:22 PDT — aws rds failover-db-cluster --db-cluster-identifier Alpha001-prod-AuroraCluster --target-db-instance-identifier alpha001-prod-auroraclusterreader1.
23:16:35 — cluster failing-over, writer still old (in flight).
23:16:46 — writer flipped to alpha001-prod-auroraclusterreader1.
23:16:57 — cluster available. Total failover window ~35s.
Post-failover state:
- Writer: alpha001-prod-auroraclusterreader1 on db.r7g.large, ParamApply: in-sync, no pending.
- Reader (the demoted instance): alpha001-prod-auroraclusterwriter still on db.t3.medium with PendingModifiedValues.DBInstanceClass: db.r7g.large. The failover-induced restart did not auto-flush the queue, confirming that the runbook’s prediction was over-optimistic for applyImmediately=false.
Step 5 PASS (writer flipped; class swap on demoted writer handled below).
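
The failover and the writer-flip observation can be sketched as a trigger plus a poll on the cluster's writer member. This is illustrative only; the function name `failover_to` and the `POLL_SECONDS` variable are hypothetical, and the reported duration is only as precise as the polling interval.

```shell
# Sketch: fail over to a named target and report how long the writer flip took.
# Usage: failover_to <cluster-id> <target-instance-id>
failover_to() {
  local cluster="$1" target="$2" start writer
  start=$(date +%s)
  aws rds failover-db-cluster --db-cluster-identifier "$cluster" \
    --target-db-instance-identifier "$target" >/dev/null || return 1
  while :; do
    # Identifier of whichever member currently holds the writer role.
    writer=$(aws rds describe-db-clusters --db-cluster-identifier "$cluster" \
      --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' \
      --output text) || return 1
    [ "$writer" = "$target" ] && break
    sleep "${POLL_SECONDS:-5}"
  done
  echo "writer is $writer after $(( $(date +%s) - start ))s"
}
```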

Step 5b — Flush demoted writer pending mods

23:18:31 PDT — aws rds modify-db-instance --db-instance-identifier alpha001-prod-auroraclusterwriter --apply-immediately.
23:18:50 — demoted writer modifying.
23:24:59 — class swap to db.r7g.large complete; configuring-enhanced-monitoring.
23:26:03 — available, no pending. Total ~7.5 min.
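
The durations quoted in this log can be re-derived from the timestamps. A small sketch, assuming both timestamps fall on the same day; the function name `elapsed_seconds` is hypothetical.

```shell
# Sketch: seconds between two same-day HH:MM:SS timestamps.
# Usage: elapsed_seconds <start> <end>
elapsed_seconds() {
  printf '%s %s\n' "$1" "$2" |
    awk -F'[: ]' '{ print ($4 * 3600 + $5 * 60 + $6) - ($1 * 3600 + $2 * 60 + $3) }'
}

elapsed_seconds 23:18:31 23:26:03   # demoted-writer class swap: prints 452 (about 7.5 min)
elapsed_seconds 23:16:22 23:16:57   # failover window: prints 35
```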

Final cluster state after failover + both class swaps

Cluster param group: alpha001-prod-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-nanabjqq5qcd
Writer: alpha001-prod-auroraclusterreader1 (promoted)
Reader: alpha001-prod-auroraclusterwriter (demoted)
Both instances db.r7g.large, ParamApply: in-sync, Status: available.
SHOW max_connections = 500 (prod’s explicit override) ✅.

Step 8 — Deploy operations chart 2.24.0 to prod

Operator triggered the deploy via the operations CI workflow.
Rollout watched: new ReplicaSet operations-f5956475d spun up at 23:30:05 PDT; kubectl rollout status reported “successfully rolled out” at 23:31:15 PDT (~70s). Both new pods Running 1/1, RESTARTS 0.
New images: init postgres-database-initializer:2.5.0, app operations:2.24.0 (replaced 2.3.0 / 2.23.0).
All 6 init containers exited 0.
SENTRY_DSN populated, SENTRY_ENVIRONMENT=Alpha001-prod, SENTRY_TRACES_SAMPLE_RATE=0.1 (same conservative sampling as demo).
ExternalSecret be-sentry-dsn Ready=True, reason=SecretSynced.
Step 8 PASS.
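
The post-deploy observations above can be gathered with a few `kubectl` calls. This is a sketch under two assumptions: the Deployment is named `operations` (inferred from the ReplicaSet name operations-f5956475d) and the ExternalSecret lives in the same namespace as the workload. The function name is hypothetical.

```shell
# Sketch: post-deploy checks for the operations workload.
# Usage: verify_operations_deploy <kube-context> <namespace>
verify_operations_deploy() {
  local ctx="$1" ns="$2" ready
  kubectl --context "$ctx" -n "$ns" rollout status deployment/operations --timeout=5m || return 1
  # One line per pod: pod name, then the images of its init containers.
  kubectl --context "$ctx" -n "$ns" get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[*].image}{"\n"}{end}'
  ready=$(kubectl --context "$ctx" -n "$ns" get externalsecret be-sentry-dsn \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') || return 1
  [ "$ready" = "True" ] || { echo "ExternalSecret not ready: $ready"; return 1; }
  echo "ExternalSecret be-sentry-dsn Ready=True"
}
```

For prod the call would be `verify_operations_deploy Alpha001 prod-operations`.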

Step 6 — pg_stat_statements verify (post-Step-8)

SELECT count(*) FROM pg_stat_statements against prod writer via system.reference.item = 769 rows.
Step 6 PASS.
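
The verification above can be extended to confirm that the extension itself is installed before counting rows. A minimal sketch, assuming `psql` runs with libpq environment variables pointing at the target database; the function name is hypothetical.

```shell
# Sketch: confirm the extension is installed and the view is collecting rows.
verify_pg_stat_statements() {
  local ext rows
  ext=$(psql -At -c "SELECT extversion FROM pg_extension WHERE extname = 'pg_stat_statements'") || return 2
  [ -n "$ext" ] || { echo "extension not installed in this database"; return 1; }
  # Querying the view fails outright if the library was not preloaded.
  rows=$(psql -At -c "SELECT count(*) FROM pg_stat_statements") || return 2
  [ "$rows" -gt 0 ] || { echo "view is empty"; return 1; }
  echo "pg_stat_statements $ext, $rows rows"
}
```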

Step 7 — Delete snapshot — PASS

aws rds delete-db-cluster-snapshot ... alpha001-prod-pdev479-pre-rollout at 23:35:44 PDT: accepted, prior status available.
Step 7 PASS.

Step 9 — Playwright application-health check

URL: https://prod.alpha001.app.arda.cards → /signin → typed credentials from op://Private/Arda-live/{username,password} → landed on /items?justSignedIn=true.
User: miguel@arda.cards (Account Admin).

Page timings (fastest of all four environments):

Metric	Prod	Stage	Demo	Dev
`responseEnd`	449 ms	598 ms	4919 ms	4193 ms
`domContentLoaded`	577 ms	696 ms	5137 ms	4400 ms
`loadEvent`	742 ms	810 ms	5404 ms	4848 ms
`first-paint`	608 ms	728 ms	5152 ms	4424 ms
`first-contentful-paint`	1176 ms	1260 ms	9604 ms	8824 ms

Backend request times (24 calls, median 577 ms):

Requests > 500 ms: 13 of 24 (54%), but the slowest 8 are all in the 644–794 ms range — much tighter than dev/demo’s multi-second tail.
Peak: 794 ms POST /api/arda/kanban/kanban-card/query-details-by-item (vs dev’s 3463 ms on the same path). The db.r7g.large upgrade is showing.
Cognito at 746 ms (auth overhead).
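
The "requests over 500 ms" figure is a simple count over the captured request durations. A sketch of that calculation, assuming the durations are available one per line in milliseconds; the function name `over_threshold` is hypothetical and the raw per-request data is not part of this log.

```shell
# Sketch: count latencies (ms, one per line on stdin) above a threshold.
# Usage: over_threshold <threshold-ms> < durations.txt
over_threshold() {
  awk -v t="$1" '$1 > t { n++ } END { if (NR) printf "%d of %d (%.0f%%)\n", n, NR, 100 * n / NR }'
}
```

With 13 of 24 values above 500 this prints `13 of 24 (54%)`, matching the figure above.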

Console errors: 0. No WorkspaceSwitcher, no AG Grid license warnings, no auth retries.

Step 9 PASS.

Alpha001-prod — rollout complete

All 9 steps green. Customer-visible failover at 23:16:22 PDT, ~35-second writer-endpoint disruption window. Cluster now on db.r7g.large with the new parameter group; operations component on chart 2.24.0 / init 2.5.0 with Sentry wired to Alpha001-SentryDsn. pg_stat_statements collecting (769 rows within the first minutes of the new init container running).
