Aurora parameter group + operations 2.5.0 bump rollout
This runbook covers a single coordinated rollout that lands four
interlocking changes across all four Arda partitions
(Alpha002-dev, Alpha002-stage, Alpha001-demo, Alpha001-prod):
| Change | Repo | Reference |
|---|---|---|
| Aurora cluster parameter group with slow-query logging, lock-wait logging, pg_stat_statements preload, and (prod only) max_connections=500 + db.r7g.large instance class | infrastructure | PDEV-479 |
| {Infra}-SentryDsn Secrets Manager secret in Alpha001 and Alpha002 | infrastructure | PDEV-500 |
| Per-application-database pg_stat_statements extension via the new initializer (image 2.5.0) | postgres-database-initializer | PDEV-498 |
| Operations chart bump: init container 2.3.0 → 2.5.0 + SENTRY_DSN env via the new secret + pod resizing | operations-monolith-component | PDEV-488, PR #170 |
This is one-time-use — once the four partitions are through it,
the steady-state procedure is the standard amm.sh deploy + helm
upgrade. Future Aurora parameter-group changes follow the same shape
but get their own dated operation note.
Prerequisites
Before starting any environment:

- `infrastructure` `main` is at `929b3c8` (or newer), the merge of PR #457. This carries both the `tuning` sub-interface on the Aurora construct (PDEV-479) and the `{Infra}-Secrets` stack (PDEV-500), plus the `amm.sh` changes that resolve the Sentry DSN from 1Password and add `--force` to the infrastructure CDK invocation.
- `postgres-database-initializer` `v2.5.0` is published as `ghcr.io/arda-cards/postgres-database-initializer:2.5.0`. Verify:

  ```sh
  docker manifest inspect ghcr.io/arda-cards/postgres-database-initializer:2.5.0 >/dev/null \
    && echo "image present" || echo "MISSING — stop"
  ```

- `operations-monolith-component` PR #170 is open and CI-green and has been reviewed-and-approved. The rollout deploys this PR’s chart to each partition AFTER the database reboot for that partition has completed. Do not merge PR #170 until the prod rollout is complete and observed healthy: the chart on `main` should match what is running everywhere.
- 1Password access: the operator must have `op` signed in and read access to `op://Arda-SystemsOAM/be-sentry-dsn/dsn`. Test:

  ```sh
  op read 'op://Arda-SystemsOAM/be-sentry-dsn/dsn' >/dev/null \
    && echo "1P reachable" || echo "MISSING — stop"
  ```

- AWS SSO sessions in both partitions (`Alpha002-Admin` for dev/stage, `Admin-Alpha1` for demo/prod):

  ```sh
  aws sso login --profile Alpha002-Admin   # for dev + stage
  aws sso login --profile Admin-Alpha1     # for demo + prod
  ```
Rollout sequence
| Order | Partition | Profile | applyImmediately | Instance class change |
|---|---|---|---|---|
| 1 | Alpha002-dev | Alpha002-Admin | true | none |
| 2 | Alpha002-stage | Alpha002-Admin | true | none |
| 3 | Alpha001-demo | Admin-Alpha1 | true | none |
| 4 | Alpha001-prod | Admin-Alpha1 | false | db.t3.medium → db.r7g.large |
Each partition is an independent gate: do not pipeline. After step 8 of every environment, run a Playwright application-health verification against that environment and wait for the operator’s explicit go-ahead before starting step 1 of the next environment.
The per-environment runbook below is identical in structure for all four environments. The only difference is that the prod writer + reader also change instance class during the failover step, which adds ~5–10 min per node to that step. Plan the prod window accordingly.
Per-environment runbook
The eight steps below are written for Alpha002-dev (the first
environment). For each subsequent environment, substitute:
| Placeholder | dev | stage | demo | prod |
|---|---|---|---|---|
| `${INFRA}` | Alpha002 | Alpha002 | Alpha001 | Alpha001 |
| `${PARTITION}` | dev | stage | demo | prod |
| `${PROFILE}` | Alpha002-Admin | Alpha002-Admin | Admin-Alpha1 | Admin-Alpha1 |
| `${CLUSTER_ID}` | Alpha002-dev-AuroraCluster | Alpha002-stage-AuroraCluster | Alpha001-demo-AuroraCluster | Alpha001-prod-AuroraCluster |
All commands use --profile ${PROFILE} at the end of the
command line (per workspace convention). Region is us-east-1 for
both infrastructures (Alpha001 and Alpha002). Capture each step’s
wall-clock start and end time for the per-step log entries at the
bottom of this file.
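The substitution table can also be captured as a small shell function, so that one call sets all four variables for an environment. This is only a sketch: `set_rollout_env` is a hypothetical name, not part of `amm.sh`, and the values are copied from the table above.

```sh
# Hypothetical helper: resolve the runbook placeholders for one environment.
# Values mirror the substitution table; nothing here talks to AWS.
set_rollout_env() {
  case "$1" in
    dev|stage) INFRA="Alpha002"; PROFILE="Alpha002-Admin" ;;
    demo|prod) INFRA="Alpha001"; PROFILE="Admin-Alpha1" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
  PARTITION="$1"
  CLUSTER_ID="${INFRA}-${PARTITION}-AuroraCluster"
  export INFRA PARTITION PROFILE CLUSTER_ID
}

set_rollout_env dev
echo "${INFRA} ${PARTITION} ${PROFILE} ${CLUSTER_ID}"
# prints: Alpha002 dev Alpha002-Admin Alpha002-dev-AuroraCluster
```

Calling it once at the top of a shell session keeps the later commands copy-pasteable without manual edits; an unknown environment name returns non-zero and leaves the variables untouched.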
Drift checks (run before Step 1 of each environment)
The runbook placeholders are the CDK logical names. Real values can drift (renamed instances, alternative namespace conventions). Before starting Step 1 in each environment, run these read-only checks and record the actual values in the Execution Log:

- Aurora writer + reader instance identifiers: do not trust `${CLUSTER_ID}Writer` / `${CLUSTER_ID}Reader1`. Read them from RDS:

  ```sh
  aws rds describe-db-clusters \
    --db-cluster-identifier "${CLUSTER_ID}" \
    --region us-east-1 --profile "${PROFILE}" \
    --query 'DBClusters[0].DBClusterMembers[].{Id:DBInstanceIdentifier,Writer:IsClusterWriter}'
  ```

  Use the returned identifiers for steps 4–6 instead of the placeholders. If they differ from the placeholders, pause and consult the operator before continuing: a rename may indicate the cluster is not the one this runbook was authored for.

- Operations namespace exists: the runbook assumes `${PARTITION}-operations`. Confirm:

  ```sh
  kubectl --context "${INFRA}" get ns "${PARTITION}-operations"
  ```

  If the namespace name differs, pause and consult the operator before continuing.
Any drift detected by these checks must be recorded in the Execution Log section under that environment’s entry, with a note explaining the operator decision that resolved it.
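The identifier comparison in the first drift check can be made mechanical. The sketch below assumes the operator has already read the writer and reader identifiers from RDS and passes them in as plain strings; `report_instance_drift` is a hypothetical helper name.

```sh
# Hypothetical helper: compare identifiers read from RDS with the runbook's
# placeholder names. Arguments: <cluster-id> <actual-writer> <actual-reader>.
# Runs in a subshell so its variables do not leak into the caller.
report_instance_drift() (
  cluster_id="$1"; actual_writer="$2"; actual_reader="$3"
  drift=0
  if [ "$actual_writer" != "${cluster_id}Writer" ]; then
    echo "DRIFT writer: expected ${cluster_id}Writer, got ${actual_writer}"
    drift=1
  fi
  if [ "$actual_reader" != "${cluster_id}Reader1" ]; then
    echo "DRIFT reader: expected ${cluster_id}Reader1, got ${actual_reader}"
    drift=1
  fi
  if [ "$drift" -eq 0 ]; then echo "no drift"; fi
  return "$drift"
)

report_instance_drift "Alpha002-dev-AuroraCluster" \
  "Alpha002-dev-AuroraClusterWriter" "Alpha002-dev-AuroraClusterReader1"
# prints: no drift
```

A non-zero exit status is the signal to pause, consult the operator, and record the drift in the Execution Log.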
Step 1 — Snapshot the cluster
```sh
SNAPSHOT_ID="${INFRA}-${PARTITION}-pdev479-pre-rollout"

aws rds create-db-cluster-snapshot \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 \
  --profile "${PROFILE}"

aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 \
  --profile "${PROFILE}"
```

Success check: the `wait` command returns 0 (typically 1–3 min). Confirm:

```sh
aws rds describe-db-cluster-snapshots \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBClusterSnapshots[0].Status'
# expect: "available"
```

Failure handling:

- `create-db-cluster-snapshot` fails with `DBClusterSnapshotAlreadyExistsFault`: a previous attempt left a snapshot with the same name. Delete it via `aws rds delete-db-cluster-snapshot --db-cluster-snapshot-identifier "${SNAPSHOT_ID}"` (after confirming it is not still needed as rollback insurance for a previous partition’s run), then retry step 1.
- `create-db-cluster-snapshot` fails with `InvalidDBClusterStateFault`: the cluster is mid-modification (e.g. a maintenance-window task is in flight). Wait 5 min, retry. If persistent, run `aws rds describe-db-clusters --db-cluster-identifier "${CLUSTER_ID}"` and check `Status`, then investigate whatever is in flight before proceeding.
- `wait` times out: snapshot is taking longer than expected. Re-run the `wait` command. Do not proceed to step 2 until the snapshot is `available`; without it, you have no rollback insurance.
- IAM `AccessDenied`: the AWS SSO session has expired or the wrong profile is selected. Re-run `aws sso login --profile "${PROFILE}"` and confirm `aws sts get-caller-identity --profile "${PROFILE}"` shows the expected account.
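The "wait, then retry" guidance above could be wrapped in a small helper. This is a minimal sketch: `retry` is a hypothetical function, and the attempt count and delay are operator choices rather than values the runbook mandates.

```sh
# Hypothetical helper: run a command up to <max> times, sleeping <delay>
# seconds between failed attempts. Usage: retry <max> <delay> <command...>
retry() {
  _retry_max="$1"; _retry_delay="$2"; shift 2
  _retry_n=1
  until "$@"; do
    if [ "$_retry_n" -ge "$_retry_max" ]; then
      echo "giving up after ${_retry_n} attempts: $*" >&2
      return 1
    fi
    _retry_n=$((_retry_n + 1))
    sleep "$_retry_delay"
  done
}

# Example (not executed here): retry the snapshot call 3 times, 5 min apart.
#   retry 3 300 aws rds create-db-cluster-snapshot \
#     --db-cluster-identifier "${CLUSTER_ID}" \
#     --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
#     --region us-east-1 --profile "${PROFILE}"
```

A blind retry only suits the transient `InvalidDBClusterStateFault` case; the other failures listed above need the operator action described for them first.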
Step 2 — Run amm.sh for the partition
From an up-to-date infrastructure working tree (must be on `main` at `929b3c8` or newer):

```sh
git -C /Users/jmp/code/arda/infrastructure status --short --branch
# expect: ## main...origin/main (and clean working tree)

cd /Users/jmp/code/arda/infrastructure
./amm.sh "${INFRA}" "${PARTITION}"
```

This single command does three things in sequence:

- Resolves `SENTRY_DSN` from 1Password and exports it (masked under `${GITHUB_ACTIONS}`; visible only in the local shell otherwise).
- Deploys the `${INFRA}` infrastructure stacks (networking, compute, ingress, and the new `${INFRA}-Secrets` stack). `--force` is on, so even no-op infra stacks resubmit a change set to CloudFormation; this is intentional so DSN rotations propagate. Stack-scoped `--parameters ${INFRA}-Secrets:SentryDsn=…` is appended to the invocation.
- Deploys the `${INFRA}-${PARTITION}` partition stacks (Authentication, BulkStores, ImageStorage, Aurora, Ingress, Compute, Dns). The Aurora cluster stack attaches the new parameter group at the cluster level.
Success check:
- `amm.sh` exits 0.
- The new `${INFRA}-Secrets` stack exists in CloudFormation:

  ```sh
  aws cloudformation describe-stacks \
    --stack-name "${INFRA}-Secrets" \
    --region us-east-1 --profile "${PROFILE}" \
    --query 'Stacks[0].StackStatus'
  # expect: "CREATE_COMPLETE" or "UPDATE_COMPLETE"
  ```

- The `${INFRA}-SentryDsn` Secrets Manager secret exists with a valid Sentry DSN value:

  ```sh
  # grep -q keeps the DSN itself out of the terminal scrollback.
  aws secretsmanager get-secret-value \
    --secret-id "${INFRA}-SentryDsn" \
    --region us-east-1 --profile "${PROFILE}" \
    --query 'SecretString' --output text | grep -qE '^https://[a-f0-9]+@.+\.sentry\.io/[0-9]+$' \
    && echo "DSN shape OK" || echo "DSN shape BAD — investigate"
  ```

- The cluster’s `DBClusterParameterGroup` is the newly created one:

  ```sh
  aws rds describe-db-clusters \
    --db-cluster-identifier "${CLUSTER_ID}" \
    --region us-east-1 --profile "${PROFILE}" \
    --query 'DBClusters[0].DBClusterParameterGroup'
  # expect a value that ends with -ClusterParameterGroup-<hash>, not the
  # engine default "default.aurora-postgresql16".
  ```
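The parameter-group check above reduces to "the attached group is not an engine default". The sketch below assumes only that engine-default cluster parameter groups carry a `default.` name prefix, as in `default.aurora-postgresql16`; both function names are hypothetical.

```sh
# Hypothetical helpers: classify a cluster parameter group name.
is_engine_default_group() {
  case "$1" in
    default.*) return 0 ;;
    *)         return 1 ;;
  esac
}

check_cluster_parameter_group() {
  if [ -z "$1" ] || is_engine_default_group "$1"; then
    echo "BAD: cluster still on '${1}'"
    return 1
  fi
  echo "OK: custom parameter group '${1}'"
}

check_cluster_parameter_group "default.aurora-postgresql16"
# prints: BAD: cluster still on 'default.aurora-postgresql16'
```

Pass it the value returned by the `describe-db-clusters` query (with `--output text` to drop the JSON quotes).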
Failure handling:
- `amm.sh` aborts at the 1P resolution step (`ERROR: Sentry DSN not available in 1Password`): confirm `op` is signed in (`op whoami`) and that the entry exists (`op item get 'be-sentry-dsn' --vault 'Arda-SystemsOAM'`). Re-run step 2 once `op read 'op://Arda-SystemsOAM/be-sentry-dsn/dsn'` prints the DSN. Nothing was deployed, so no rollback needed.
- `amm.sh` fails inside a CDK stack deploy (CloudFormation `UPDATE_FAILED` or `ROLLBACK_COMPLETE`):
  - `${INFRA}-Networking` / `${INFRA}-Compute` / `${INFRA}-Ingress` failure: unrelated to this rollout. CFN auto-rolled-back; the Aurora parameter group attach in the partition stack never ran (CDK aborts on first failure). Investigate the failing stack directly via `aws cloudformation describe-stack-events`, fix, rerun `amm.sh`. Snapshot from step 1 is still valid.
  - `${INFRA}-Secrets` stack failure: typically a parameter validation rejection (empty `SentryDsn` value or shape mismatch). The cluster has not yet been touched (Secrets is part of the infrastructure layer, deployed before the partition’s Aurora stack). Investigate, rerun. Snapshot still valid.
  - `${INFRA}-${PARTITION}-AuroraDBCluster` stack failure: this is where the cluster parameter group attach happens. CFN auto-rolls-back the stack to its pre-deploy state, which means the old parameter group is reattached. The cluster ends up in the same configuration it had before step 2, so no manual rollback is needed. Investigate via stack events, fix the construct/platforms change if applicable (in a hotfix PR), and rerun the rollout from step 1 on a clean main.
- The new `${INFRA}-Secrets` stack reports `CREATE_COMPLETE` but the secret value is empty: 1P read returned an empty string. Stop and investigate the 1P entry; the cluster has the new parameter group attached but the operations chart will be unable to source `SENTRY_DSN`. Treat as a stalled rollout: continue with steps 3–7 (the DB side is fine), then before step 8 verify `aws secretsmanager get-secret-value` returns a non-empty DSN; if not, restore the 1P entry and run `amm.sh` again to repopulate the secret.
- The cluster’s `DBClusterParameterGroup` query still shows the engine default after `amm.sh` returns 0: CDK didn’t synth the parameter group resource for this partition. This indicates the `platforms.ts` `databaseTuning` block is missing for the partition or the construct change wasn’t merged. Cross-check the `infrastructure` HEAD is at `929b3c8` or newer (`git -C /Users/jmp/code/arda/infrastructure rev-parse HEAD`). If yes, file a bug and stop the rollout.
Step 3 — Verify dynamic parameters propagated
Wait ~90 seconds after step 2 completes. Connect to the writer (via the existing bastion / port-forward) and confirm dynamic parameters are live:

```sql
SHOW log_min_duration_statement;  -- expect: 500ms
SHOW log_statement;               -- expect: ddl
SHOW log_lock_waits;              -- expect: on
SHOW log_temp_files;              -- expect: 0
SHOW shared_preload_libraries;    -- still old value (pending reboot)
SHOW max_connections;             -- still old value (pending reboot)
```

The static parameters (shared_preload_libraries, max_connections)
will still show their old values — those activate at step 5. If any
of the dynamic ones is wrong, stop and investigate before
rebooting; a reboot will not fix dynamic-parameter propagation issues.
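The four dynamic-parameter expectations above can be checked in one pass. This sketch assumes the values are fed in as `name|value` lines, for example from `psql -At -c "SELECT name, current_setting(name) FROM pg_settings WHERE name LIKE 'log_%'"` (connection arguments omitted); `check_dynamic_params` is a hypothetical helper, and it only compares the lines it is given, so it will not notice a parameter that is missing from the input.

```sh
# Hypothetical helper: compare "name|value" lines on stdin against the
# expected dynamic values from step 3. Unknown parameter names are ignored.
# Runs in a subshell so its variables do not leak into the caller.
check_dynamic_params() (
  rc=0
  while IFS='|' read -r name value; do
    case "$name" in
      log_min_duration_statement) expected="500ms" ;;
      log_statement)              expected="ddl" ;;
      log_lock_waits)             expected="on" ;;
      log_temp_files)             expected="0" ;;
      *) continue ;;
    esac
    if [ "$value" != "$expected" ]; then
      echo "MISMATCH ${name}: expected ${expected}, got ${value}"
      rc=1
    fi
  done
  return "$rc"
)

printf 'log_statement|ddl\nlog_lock_waits|off\n' | check_dynamic_params
# prints: MISMATCH log_lock_waits: expected on, got off
```

A non-zero exit is the "stop and investigate before rebooting" case.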
Failure handling:
- Dynamic params still show old values after 5 minutes: the cluster parameter group may not have associated correctly even though CFN reported success. Run `aws rds describe-db-cluster-parameters --db-cluster-parameter-group-name <name>` (use the name from step 2’s success check) and verify the expected values are in the parameter group itself. If yes, the cluster has not yet applied them. The status is reported per cluster member, so check `aws rds describe-db-clusters ... --query 'DBClusters[0].DBClusterMembers[].DBClusterParameterGroupStatus'` for `pending-reboot` (means it never propagated dynamically) vs `in-sync` (means propagation done). For `pending-reboot` on parameters that should be dynamic, the parameter group may have been built with the wrong `apply-type`: likely a construct bug, so file an issue and roll back via the section below.
- One dynamic parameter is wrong (e.g. `log_min_duration_statement` is `-1` instead of `500`): the parameter group was synthesised with an incorrect value. Treat as a partial-fix candidate: you can hot-patch the cluster parameter group in place with `aws rds modify-db-cluster-parameter-group --db-cluster-parameter-group-name <name> --parameters "ParameterName=...,ParameterValue=...,ApplyMethod=immediate"` to correct the value, then re-run step 3’s verify. This is faster than a full CDK re-deploy. Fix the construct/platforms code in a follow-up PR so the next deploy doesn’t drift back.
Step 4 — Reboot reader1
Reboot the reader first. This picks up the static parameters on the reader-side instance while the writer continues serving traffic.

```sh
aws rds reboot-db-instance \
  --db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"

aws rds wait db-instance-available \
  --db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"
```

Success check: `wait` returns 0 (~30–60s). The reader instance is
back to available. Application traffic is unaffected (writer is
still serving). For prod, this step does not swap the instance
class — that happens during the failover in step 5.
Failure handling:
- Reader stays in `rebooting` for >5 minutes: very rare. Run `aws rds describe-events --source-identifier "${CLUSTER_ID}Reader1" --source-type db-instance --duration 30` to see what AWS reports. If the instance is genuinely stuck, open an AWS support case. Do not proceed to step 5, because a failover with one unhealthy instance can leave the cluster writerless.
- Reader returns to `available` but with `pending-reboot` parameter group status still set: the reboot did not pick up the static parameters. Re-run step 4 once (single retry). If still `pending-reboot`, treat as a construct bug and roll back via the section below.
Step 5 — Failover to reader1
Explicitly promote reader1 to writer. The old writer is restarted
during the demotion and picks up the static parameters; for prod, it
also swaps from db.t3.medium to db.r7g.large during the restart.
```sh
aws rds failover-db-cluster \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --target-db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"

# Wait for both instances to settle.
aws rds wait db-instance-available \
  --db-instance-identifier "${CLUSTER_ID}Writer" \
  --region us-east-1 --profile "${PROFILE}"
aws rds wait db-instance-available \
  --db-instance-identifier "${CLUSTER_ID}Reader1" \
  --region us-east-1 --profile "${PROFILE}"
```

Application impact: ~10–15 seconds of writer endpoint disruption.
In-flight requests at the moment of the endpoint flip fail; HikariCP’s
connectionTimeout=30000 + validationTimeout=1000 absorb the gap;
new requests succeed once the new writer is up. For prod, total
window is ~15–25s because the demoted writer additionally provisions
the new instance class.
Success check (post-failover): the newly promoted reader1 is now the cluster writer:

```sh
aws rds describe-db-clusters \
  --db-cluster-identifier "${CLUSTER_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier'
# expect: ["${CLUSTER_ID}Reader1"]
```

For prod, also confirm both instances are now on db.r7g.large:

```sh
aws rds describe-db-instances \
  --filters "Name=db-cluster-id,Values=${CLUSTER_ID}" \
  --region us-east-1 --profile "${PROFILE}" \
  --query 'DBInstances[].DBInstanceClass'
# expect for prod: ["db.r7g.large","db.r7g.large"]
# expect for dev/demo/stage: ["db.t3.medium","db.t3.medium"]
```

Failure handling:
- `failover-db-cluster` returns `InvalidDBClusterStateFault`: the target reader is not in sync, or the cluster is mid-modification. Wait 2 min, retry. Application traffic continues on the original writer in the meantime.
- Failover initiates but the writer endpoint never resolves to the new node (>2 min): the cluster is in a degraded state.
  - Check `aws rds describe-db-clusters ... --query 'DBClusters[0].Status'`. `failing-over` is expected briefly; `inaccessible-encryption-credentials` or similar terminal states require an AWS support case.
  - Application is hard-down (writer unreachable). If the prior writer is healthy enough to take traffic back, you can re-fail-over to it: `aws rds failover-db-cluster --db-cluster-identifier "${CLUSTER_ID}" --target-db-instance-identifier "${CLUSTER_ID}Writer"` (note: this targets the original writer name, which is now the reader). That restores service while you debug.
- For prod only: failover succeeded but one of the two instances is still on `db.t3.medium` after both report `available`: the instance-class swap did not take. This indicates the `applyImmediately: false` deferred change did not apply during the reboot. Trigger an explicit reboot of the affected instance with `aws rds reboot-db-instance --db-instance-identifier <id> --force-failover` (note: that flag triggers a failover-style restart). Re-check the instance class after `available`. If it still does not change, the modification is pending another window; `aws rds describe-db-instances ... --query 'DBInstances[].PendingModifiedValues'` will show what is queued.
- Application sees prolonged 5xx after the failover (more than ~30s): HikariCP may have all connections wedged on the dead endpoint. `kubectl rollout restart deployment/operations --context "${INFRA}" -n "${PARTITION}-operations"` forces a pod refresh and re-establishes connections cleanly. Not needed if the application recovers on its own within the timeout window.
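The instance-class expectation from the Step 5 success check (db.r7g.large for prod, db.t3.medium elsewhere) can be encoded once and reused for every partition. A sketch with hypothetical function names; the class values are the ones stated in this runbook, and the observed classes are passed in as arguments rather than fetched.

```sh
# Hypothetical helpers: expected instance class per partition after step 5.
expected_instance_class() {
  case "$1" in
    prod)           echo "db.r7g.large" ;;
    dev|stage|demo) echo "db.t3.medium" ;;
    *) echo "unknown partition: $1" >&2; return 1 ;;
  esac
}

# Usage: instance_classes_ok <partition> <class> [<class>...]
# Runs in a subshell so its variables do not leak into the caller.
instance_classes_ok() (
  want="$(expected_instance_class "$1")" || return 1
  shift
  [ "$#" -gt 0 ] || return 1
  for class in "$@"; do
    if [ "$class" != "$want" ]; then
      echo "MISMATCH: expected ${want}, got ${class}"
      return 1
    fi
  done
  echo "all instances on ${want}"
)

instance_classes_ok prod db.r7g.large db.t3.medium
# prints: MISMATCH: expected db.r7g.large, got db.t3.medium
```

Feeding it the `describe-db-instances` result with `--output text` gives one argument per instance; a mismatch on prod corresponds to the "instance-class swap did not take" case above.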
Step 6 — Verify static parameters + pg_stat_statements
Reconnect to the writer (the cluster writer endpoint is the same DNS; your client must drop stale connections from the pre-failover writer).

```sql
SHOW shared_preload_libraries;
-- expect: pg_stat_statements

SHOW max_connections;
-- expect: 500 in prod, default-for-class elsewhere (around 800 on
-- t3.medium per Aurora's instance-derived defaults)

-- Confirm pg_stat_statements is functional (the verify query the
-- initializer would run):
SELECT 1 FROM pg_stat_statements LIMIT 1;
-- expect: a single row, NOT
-- "pg_stat_statements must be loaded via shared_preload_libraries"
```

If pg_stat_statements errors with the preload-missing message, the
failover did not pick up the new parameter group on the new writer.
Stop and investigate before deploying the operations chart in
step 8 — the new init container will fail-loud and the operations pod
will CrashLoopBackOff.
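`SHOW shared_preload_libraries` returns a comma-separated list, so a scripted version of the check above is safer as a membership test than as an exact string comparison, in case other libraries are preloaded alongside `pg_stat_statements` (an assumption; this runbook only expects the one). `preload_contains` is a hypothetical helper.

```sh
# Hypothetical helper: succeed if <library> appears in a comma-separated
# shared_preload_libraries value. Usage: preload_contains <value> <library>
# Runs in a subshell so the IFS and noglob changes stay local.
preload_contains() (
  wanted="$2"
  set -f      # no filename globbing while word-splitting the list
  IFS=','
  for item in $1; do
    item="$(printf '%s' "$item" | tr -d '[:space:]')"
    if [ "$item" = "$wanted" ]; then return 0; fi
  done
  return 1
)

if preload_contains "pg_stat_statements" pg_stat_statements; then
  echo "preload OK"
fi
# prints: preload OK
```

An empty value or a near-miss name fails the check, which matches the "stop and investigate before step 8" rule.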
Failure handling:
- `SHOW shared_preload_libraries` returns the empty / old value on the new writer:
  - Confirm the reader-promoted-to-writer was actually restarted. The failover restarts the demoted writer, not the promoted reader. Run `aws rds describe-db-instances --db-instance-identifier "${CLUSTER_ID}Reader1" --query 'DBInstances[0].DBParameterGroups[0].ParameterApplyStatus'` and look for `in-sync`. If `pending-reboot`, you need to reboot this instance too:

    ```sh
    aws rds reboot-db-instance \
      --db-instance-identifier "${CLUSTER_ID}Reader1" \
      --region us-east-1 --profile "${PROFILE}"
    ```

    This is now the cluster writer, so the reboot causes another ~10–15s writer disruption. Confirm static params after.
  - If still wrong, the parameter group itself is misconfigured. Inspect via `aws rds describe-db-cluster-parameters` and compare to the expected values from `platforms.ts`’s `sharedDatabaseParameters` block. Fix via direct `modify-db-cluster-parameter-group` for an emergency unblock, or roll back per the section below.
- `SELECT 1 FROM pg_stat_statements LIMIT 1` errors with `pg_stat_statements must be loaded via shared_preload_libraries`: same root cause as above. The schema objects are installed but the shared-memory hash is not allocated because the library wasn’t preloaded at server start. Apply the same investigate-then-reboot remedy.
- For prod only: `SHOW max_connections` still returns the t3.medium default (~800) instead of `500`: same family of failure as static params not picked up. The instance was restarted but did not read the new value. Re-run the reboot for whichever instance is wrong; if both are wrong, the parameter group itself is suspect.
Step 7 — Delete the snapshot
Only after step 6 is green and dashboards (Performance Insights, CloudWatch slow-query log) show normal traffic on the new writer:

```sh
aws rds delete-db-cluster-snapshot \
  --db-cluster-snapshot-identifier "${SNAPSHOT_ID}" \
  --region us-east-1 --profile "${PROFILE}"
```

Why now: the snapshot is rollback insurance for the rollout window only. Once the cluster is healthy on the new configuration, it is pay-for-storage overhead.
Failure handling:
- Delete fails with `InvalidDBClusterSnapshotStateFault`: snapshot is being copied or shared. Wait 1 min and retry. Cosmetic; does not block the rollout.
- If you discover an issue after deleting the snapshot but the cluster is otherwise healthy: roll back via direct `modify-db-cluster-parameter-group` to neutralise problematic parameter values without needing the snapshot. The snapshot is insurance for catastrophic cluster-state corruption only.
Step 8 — Deploy PR #170 to this partition
This is the gate between database-side and application-side. Do
not run step 8 until step 6 is green: the new init container’s
verify query will fail otherwise and the pod will enter
Init:CrashLoopBackOff.
The operations component is deployed via the standard Arda helm
upgrade path for operations-monolith-component. Build PR #170’s
chart and apply to the partition:
```sh
# (Substitute the exact command for your local operations workflow;
# see the operations repo README for the standard helm-upgrade
# invocation. The chart on PR #170's head ref is what should land.)

helm upgrade --install operations \
  ./src/main/helm \
  --namespace "${PARTITION}-operations" \
  --values src/main/helm/values-"${PARTITION}".yaml \
  --kube-context "${INFRA}"
```

Success check:
- Operations pod transitions through `Init` → `Running`:

  ```sh
  kubectl --context "${INFRA}" -n "${PARTITION}-operations" \
    get pods -l app=operations -o wide -w
  # expect: STATUS Running, READY 1/1, RESTARTS 0
  ```

- The init container ran the new initializer image and exited 0:

  ```sh
  kubectl --context "${INFRA}" -n "${PARTITION}-operations" \
    logs -l app=operations -c db-init --tail=50
  # expect: "Creating database … with role …" followed by the SELECT
  # against pg_stat_statements returning, no error stack.
  ```

- The Sentry SDK initialised against the new DSN:

  ```sh
  kubectl --context "${INFRA}" -n "${PARTITION}-operations" \
    logs -l app=operations -c operations --tail=200 | grep -iE 'sentry|dsn'
  # expect: a "Sentry SDK initialized" line or equivalent.
  # If you see "Sentry SDK disabled because no DSN was set", the
  # ExternalSecret has not yet synced — wait 30s and re-check.
  ```

- `pg_stat_statements` is queryable in the application database (not just the admin DB), exercising what the new init container created:

  ```sql
  \c <app-db-name>
  SELECT count(*) FROM pg_stat_statements;
  -- expect: an integer >= 0, NOT the preload-missing error
  ```
If any of those fail, do not proceed to the next environment; roll back per the section below and surface the failure.
Failure handling:
- Init container exits with `pg_stat_statements must be loaded via shared_preload_libraries`: the cluster did not actually activate the preload at step 6 even though `SHOW` reported it correctly. Re-run step 6’s verifications from a fresh psql session; the previous result may have been from a cached connection routed to a pre-failover instance. If the cluster genuinely has the preload, the init container’s verify query should not be erroring; open an issue against `postgres-database-initializer` with the container logs and roll back the operations chart via the section below.
- Init container exits with a `pg_trgm` / `btree_gin` failure: those extensions were already in the default floor before PR #30, so a failure here is unrelated to this rollout (cluster permissions, Postgres version mismatch, etc.). Treat as a regular operations incident.
- Operations pod runs but logs show `Sentry SDK disabled because no DSN was set`: the `ExternalSecret` has not synced from `${INFRA}-SentryDsn` to a Kubernetes secret yet.
  - Wait 60s and re-check.
  - If still missing: `kubectl --context "${INFRA}" -n "${PARTITION}-operations" describe externalsecret be-sentry-dsn` will show the sync status and any error.
  - Common cause: the partition’s IRSA role does not yet have the Secrets Manager read permission for `${INFRA}-SentryDsn`. The `${INFRA}-ReadSecrets` managed policy on the cluster’s IRSA role chain uses the wildcard `${INFRA}-*` and should already cover this, but if it doesn’t, follow up on the IAM side; the operations pod stays fail-soft (Sentry just stays off) in the meantime.
- Operations pod transitions through Running but then crashes / restarts: regular incident, not specific to this rollout. Roll back the chart per the section below to unblock the rollout sequence, then debug separately.
- HikariCP cannot acquire connections on startup (`Connection is not available`): often the prior step’s failover left the new writer with a cold connection pool that hasn’t recovered. Wait 30s; if persistent, re-check the cluster endpoint resolves to the new writer.
When to deploy PR #170 — quick reference
| Phase | dev | stage | demo | prod |
|---|---|---|---|---|
| 1. Snapshot | ✓ | | | |
| 2. amm.sh dev | ✓ | | | |
| 3. Verify dynamic params | ✓ | | | |
| 4. Reboot reader1 | ✓ | | | |
| 5. failover-db-cluster | ✓ | | | |
| 6. Verify static params + pg_stat_statements | ✓ | | | |
| 7. Delete snapshot | ✓ | | | |
| 8. Deploy PR #170 → dev | ✓ | | | |
| 9. Playwright application-health check + operator go-ahead | ✓ | | | |
| 10–17. Repeat 1–8 for stage | | ✓ | | |
| 18. Playwright application-health check + operator go-ahead | | ✓ | | |
| 19–26. Repeat 1–8 for demo | | | ✓ | |
| 27. Playwright application-health check + operator go-ahead | | | ✓ | |
| 28. Schedule prod window | | | | (pending) |
| 29–36. Repeat 1–8 for prod | | | | ✓ |
| 37. Merge PR #170 to operations main | — | — | — | ✓ |
The PR #170 merge to main happens only after all four
environments are running its chart. Until then, the chart on PR #170
is what is live in each environment that has completed steps 1–8.
Playwright application-health verification
After step 8 of each environment, run a Playwright-driven login + smoke test against the partition’s web frontend. This replaces the time-based soak: the gate is “operator confirms the app is healthy”, not a wall-clock interval.
Endpoint
The frontend host follows a fixed pattern, all lowercase:

```
https://${PARTITION}.${INFRA_LOWER}.app.arda.cards
```

where `${INFRA_LOWER}` is the lowercased infrastructure name.
| Environment | URL |
|---|---|
| dev | https://dev.alpha002.app.arda.cards |
| stage | https://stage.alpha002.app.arda.cards |
| demo | https://demo.alpha001.app.arda.cards |
| prod | https://prod.alpha001.app.arda.cards |
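The URL pattern can be derived from the placeholders already used in this runbook instead of being typed by hand. A small sketch; `frontend_url` is a hypothetical helper name.

```sh
# Hypothetical helper: build the frontend URL from <partition> and <infra>,
# lowercasing the infrastructure name as the pattern above requires.
frontend_url() {
  printf 'https://%s.%s.app.arda.cards\n' \
    "$1" "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
}

frontend_url dev Alpha002
# prints: https://dev.alpha002.app.arda.cards
```

With the per-environment variables set, `frontend_url "${PARTITION}" "${INFRA}"` yields the matching row of the table.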
Credentials
All test credentials live in the operator’s Private 1Password
vault. One item per environment:
| Environment | 1Password item (vault Private) |
|---|---|
| dev | Miguel-new-dev |
| stage | Arda-stage |
| demo | arda-demo |
| prod | Arda-live |
Read username and password via op read:
```sh
# Example for dev: substitute the per-environment item name.
OP_ITEM="Miguel-new-dev"
APP_USERNAME="$(op read "op://Private/${OP_ITEM}/username")"
APP_PASSWORD="$(op read "op://Private/${OP_ITEM}/password")"
```

If either op read returns an empty string, stop and confirm the
item’s field names with op item get "${OP_ITEM}" --vault Private
before retrying — some older items use email instead of username.
Playwright MCP procedure
Section titled “Playwright MCP procedure”The verification is driven through the mcp__playwright__* tools.
Per environment:
- Resolve credentials with the
op readsnippet above. Keep them out of the conversation transcript — pass them only to thebrowser_fill_form/browser_typetool calls below. - Navigate to the environment URL:
mcp__playwright__browser_navigatewithurlset to the entry from the table above. The unauthenticated landing should redirect to the login page. - Snapshot the page with
mcp__playwright__browser_snapshot to identify the username / password input refs and the submit button.
- Fill the login form with mcp__playwright__browser_fill_form, supplying the username and password values from step 1. Submit via browser_click on the login button (or browser_press_key with Enter if the form supports it).
- Wait for navigation with mcp__playwright__browser_wait_for (e.g. wait for a known post-login element, such as the main navigation shell or the user-menu avatar, to appear). A successful login lands on the application home; an authentication failure stays on /login with an error banner.
- Smoke check at least one authenticated page (typically the default landing route after login). Confirm the page renders without 5xx and that no console error in mcp__playwright__browser_console_messages references the operations backend (/api/... 5xx, Sentry-init failure, etc.).
- Record timings. This is a required measurement, not optional. The point of this rollout is the slow-query / Sentry observability work; the post-login timings are the headline before/after data point.
  - Render time: capture the time from the login-submit click (step 4/5) until the post-login landing’s “ready” signal, the same element that browser_wait_for resolved on. Read it from performance.timing / PerformanceNavigationTiming via mcp__playwright__browser_evaluate, e.g. performance.getEntriesByType('navigation')[0].domContentLoadedEventEnd - performance.getEntriesByType('navigation')[0].startTime, and also note the wall-clock seconds between the click and the browser_wait_for resolution.
  - Network request times: pull the full network log with mcp__playwright__browser_network_requests after the landing resolves. For every backend call (anything to the operations API under /api/), record the URL, HTTP status, and duration in ms. Flag any request slower than 500 ms (the new slow-query threshold) so the operator can correlate against pg_stat_statements.
  - Persist both measurements into the per-step log section for this environment under “Step 9 Playwright check” so the four environments are comparable side-by-side at the end of the rollout.
- Close the tab with mcp__playwright__browser_close so the next environment starts from a clean context.
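The threshold flagging in the network-timing step can be sketched as a small shell helper. This is a minimal illustration, not part of the runbook tooling: summarize_durations is a hypothetical name, and it assumes the Playwright network log has already been reduced to one request duration in milliseconds per line.

```shell
# Hypothetical helper (not part of amm.sh or the runbook tooling).
# Reads one request duration in ms per line on stdin and prints
# "total=<n> median=<ms> slow=<count above threshold>".
summarize_durations() {
  local threshold="${1:-500}"   # slow-request threshold in ms
  sort -n | awk -v t="$threshold" '
    { v[NR] = $1; if ($1 > t) slow++ }
    END {
      if (NR == 0) { print "total=0 median=0 slow=0"; exit }
      # Median: the middle value, or the mean of the two middle values.
      if (NR % 2) m = v[(NR + 1) / 2]
      else        m = (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "total=%d median=%s slow=%d\n", NR, m, slow + 0
    }'
}
```

For example, `printf '%s\n' 120 480 515 536 3463 | summarize_durations 500` reports five requests, a median of 515 ms, and three requests above the threshold. The resulting line can be pasted into the per-step log next to the raw request list.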
Pass / fail criteria
- Pass: login succeeds, the post-login landing renders, no backend 5xx in the network panel, and no operations-side error in the console. Operator confirms, then proceed to ask permission for the next environment’s step 1.
- Fail: login fails, the page errors out, or the backend returns 5xx. Treat as a stalled rollout for the affected environment: surface the failure with the captured snapshot + console log and follow step 8’s failure-handling guidance or roll back via the section below.
Rollback
Two levers, in order of preference:
- Revert at the operations layer (step 8), if the operations pod fails to start cleanly after step 8 of an environment:

      helm rollback operations <previous-revision> \
        --namespace "${PARTITION}-operations" \
        --kube-context "${INFRA}"

  The previous revision is the operations chart with the :2.3.0 init container and no SENTRY_DSN env. The cluster is already on the new parameter group at this point, but the old init container did not touch pg_stat_statements, so the rollback is safe even though the cluster has changed underneath.
- Restore the cluster from the step-1 snapshot, if the failure is at the database layer (steps 3–6) and cannot be diagnosed quickly:

      # Restore creates a new cluster identifier; the application chart
      # must be re-pointed to it. Coordinate with the operations chart
      # values for the partition.
      aws rds restore-db-cluster-from-snapshot \
        --db-cluster-identifier "${CLUSTER_ID}-restored" \
        --snapshot-identifier "${SNAPSHOT_ID}" \
        --engine aurora-postgresql \
        --region us-east-1 --profile "${PROFILE}"

  Use this only as a last resort: restoring is a multi-hour operation and disrupts everything in the partition.
Per-step log
Fill in as you go. Replicate this block per environment.
Alpha002-dev — operator: ____
- Step 1 snapshot: __:__ start / __:__ available
- Step 2 amm.sh: __:__ start / __:__ complete
- Step 3 dynamic verify: __:__ ok
- Step 4 reboot reader1: __:__ start / __:__ available
- Step 5 failover: __:__ start / __:__ both available
- Step 6 static verify: __:__ ok (pg_stat_statements returned __ rows)
- Step 7 snapshot drop: __:__
- Step 8 PR #170 deploy: __:__ start / __:__ Running
- Sentry first event: __:__
- Step 9 Playwright check:
  - login submit → landing ready: __ ms wall-clock / __ ms navigation
  - backend /api/ requests (URL, status, ms): ____
  - requests > 500ms (slow-query threshold): ____
- Anomalies: ____
Alpha002-stage — operator: ____
(same shape)
Alpha001-demo — operator: ____
(same shape)
Alpha001-prod — operator: ____ — window: ____
(same shape; note: failover step includes instance-class swap, expect +5–10 min vs other environments)
References
- infrastructure PR #456: Aurora parameter group + production sizing (PDEV-479), 35180de
- infrastructure PR #457: {Infra}-SentryDsn secret (PDEV-500), 929b3c8
- postgres-database-initializer PR #30: per-DB pg_stat_statements extension (PDEV-498), afd9798, released as v2.5.0
- operations-monolith-component PR #170: chart bump + Sentry wiring (PDEV-488)
- Project goal: _docs/analysis/db-configuration.md (parameter group rationale + activation theory)
- Project implementation plan: _docs/implementation/infrastructure/db-plan.md (multi-env rollout sequence that this runbook is the operator-facing version of)
- Per-environment status during rollout: append to the per-step log above
Execution Log
Operator: Miguel Pinilla (driven by Claude Code, session perf-upgrades).
All timestamps are local (America system clock); convert as needed.
Prerequisites — 2026-05-14
- infrastructure HEAD: 929b3c8c8654868e7629cfe888903596a1666f1b == origin/main. Tree clean. Ancestry check confirms 929b3c8 is included. PASS.
- ghcr.io/arda-cards/postgres-database-initializer:2.5.0: docker manifest inspect returned 0. PASS.
- 1Password op://Arda-SystemsOAM/be-sentry-dsn/dsn: initial read returned MISSING; a retry succeeded after the operator confirmed the entry exists (op whoami shows miguel@arda.cards signed in). PASS.
- AWS SSO: amm.sh self-manages; non-amm.sh aws commands will prompt the operator on session expiry.
Alpha002-dev
Operator: Miguel Pinilla. Driver: Claude Code.
Drift checks
- aws rds describe-db-clusters --db-cluster-identifier Alpha002-dev-AuroraCluster ... --profile Alpha002-Admin →
  - Writer: alpha002-dev-auroraclusterwriter
  - Reader1: alpha002-dev-auroraclusterreader1
  - RDS lowercases instance identifiers; cluster identifier is case-preserved on the lookup. Not a rename, so proceeding without operator consultation. Steps 4–6 will use the lowercased IDs.
- kubectl --context Alpha002 get ns dev-operations → namespace dev-operations exists, age 272d. Matches the placeholder convention. No drift.
- AWS SSO for Alpha002-Admin was expired on first attempt; operator ran aws sso login --profile Alpha002-Admin and confirmed.
Step 1 — Snapshot
- Snapshot ID: alpha002-dev-pdev479-pre-rollout
- aws rds create-db-cluster-snapshot ... at 11:09:55 PDT: accepted, status creating, snapshot create time 2026-05-14T18:09:56Z.
- aws rds wait db-cluster-snapshot-available ... returned at 11:13:56 PDT (≈4 min). Final status available, progress 100%.
- Step 1 PASS.
Step 2 — amm.sh attempt 1 — FAIL (1P)
- Start 11:15:29 PDT, end 11:16:33 PDT (~64s), exit 1.
- Log: infrastructure/scratch/amm-Alpha002-dev.log.
- Root cause: [ERROR] could not read secret 'op://Arda-SystemsOAM/Amplify_GitHub_AccessToken/password': error initializing client: authorization timeout.
- Note: amm.sh resolves a second 1P secret beyond the Sentry DSN called out in Plan prereq #4: the Amplify GitHub access token. The Plan’s prereq check is therefore insufficient to fully validate 1P access; operator should consider adding this item to a future revision of the Plan (deferred; Plan is locked).
- Post-failure CloudFormation state:
  - Alpha002-Secrets stack does not exist (ValidationError).
  - Sentry DSN secret value is empty / unread.
  - Cluster parameter group still default.aurora-postgresql16 (engine default, unchanged).
- Per runbook failure handling (Step 2, “amm.sh aborts at the 1P resolution step”): nothing was deployed, no rollback needed.
- Operator action required: re-sign into op (session expired; op whoami returns “account is not signed in”). Snapshot from Step 1 still valid as rollback insurance.
Step 2 — amm.sh attempt 2 — FAIL (Plan/script drift)
- Start 11:21:55 PDT, exit 1 within seconds.
- Root cause: amm.sh auto-derives the AWS profile name as Admin-${infrastructure} (line 240). For Alpha002 that produces Admin-Alpha002, which does not exist on this machine; the real profile is Alpha002-Admin (naming asymmetry vs. Admin-Alpha1). Also: aws_region defaulted to "".
- Operator decision: pass overrides on the command line, i.e. ./amm.sh --profile Alpha002-Admin --region us-east-1 Alpha002 dev. No script or AWS-config changes. Same approach will be used for the Alpha002-stage run.
- Plan implication (deferred; Plan is locked): the runbook’s Step-2 command should call out the override pattern for Alpha002, or the AWS config should grow an Admin-Alpha002 alias.
Step 2 — amm.sh attempt 3 — PARTIAL (1P timeout at Step 2.2.4)
- Start 11:22:25 PDT, end 11:34:33 PDT (~12 min), exit 1.
- Reached Step 2.2.4 (“Secrets”) after full CDK deploy succeeded.
- Failure: [ERROR] could not read secret 'op://Arda-DevOAM/ARDA-API-KEY/password': error initializing client: authorization timeout. This is a further 1P item beyond the Sentry DSN + Amplify GitHub token; it lives in the partition vault Arda-DevOAM. The 1P CLI re-prompted during the long run and the prompt was not surfaced (see operator note re: biometric-unlock-for-CLI prompts not appearing when op is invoked from inside a script).
- Rollout-critical artifacts already in place at this point:
  - Alpha002-Secrets stack CREATE_COMPLETE (18:23:33Z)
  - Alpha002-SentryDsn secret value shape OK
  - Cluster parameter group is alpha002-dev-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-dvlftis1srce (not engine default)
  - Alpha002-dev-AuroraDBCluster UPDATE_COMPLETE (18:27:22Z)
- Operator decision: retry amm.sh so it runs to completion; the remaining work is idempotent.
Step 2 — amm.sh attempt 4 — PASS
- Start 11:37:15 PDT, end 11:42:12 PDT (~5 min), exit 0, status succeeded.
- Re-runs of the earlier CDK deploys were no-ops. The previously-failing post-CDK secret push at Step 2.2.4 succeeded this time and the script reached Step 3: Done!.
- All four runbook success checks still green (re-validated after attempt 3; no further work needed).
- Step 2 PASS.
Step 3 — Verify dynamic parameters
- Verified ~9 min after amm.sh finish (well past the 90s requirement).
- Bastion module chosen: system.reference.item → DB dev-operations.item_db. Pod miguel-psql-show-dev-<pid> (postgres:16-alpine3.20, ephemeral, cleaned up via EXIT trap).
- Dynamic parameter values from psql against the writer endpoint:
  - log_min_duration_statement = 500ms ✅ (expected 500ms)
  - log_statement = ddl ✅ (expected ddl)
  - log_lock_waits = on ✅ (expected on)
  - log_temp_files = 0 ✅ (expected 0)
- Pre-reboot static-parameter readings:
  - shared_preload_libraries: permission denied. The module DB user (dev-operations.ItemDb) lacks pg_read_all_settings. Expected to still be the old value at this stage; will reattempt at Step 6 from the post-failover writer (may hit the same role limitation; flagged for that step).
  - max_connections = 401 (t3.medium default; runbook expected “still old value (pending reboot)”, which is consistent).
- Step 3 PASS.
Step 4 — Reboot reader1
- Target: alpha002-dev-auroraclusterreader1.
- aws rds reboot-db-instance at 11:57:20 PDT: accepted, status rebooting.
- aws rds wait db-instance-available returned at 11:58:23 PDT (~63s).
- Post-reboot status: available, parameter-apply status in-sync (no pending-reboot residue; static params were picked up).
- Writer (alpha002-dev-auroraclusterwriter) continued serving traffic throughout.
- Step 4 PASS.
Step 5 — Failover to reader1
- Target: alpha002-dev-auroraclusterreader1.
- aws rds failover-db-cluster accepted at 12:03:03 PDT.
- Initial wait db-instance-available calls returned within 1–2s because the failover had not yet transitioned instances out of available; the wait pattern in the runbook only catches the trailing edge of the state machine. Switched to a 5s-poll loop on cluster status / writer designation.
- 12:03:34 PDT: cluster status failing-over, writer already flipped to alpha002-dev-auroraclusterreader1.
- 12:07:36 PDT: cluster status returned to available.
- Final state:
  - Writer: alpha002-dev-auroraclusterreader1 (was reader).
  - Reader: alpha002-dev-auroraclusterwriter (was writer).
  - Both db.t3.medium, both ParamApply: in-sync.
- Plan implication (deferred; Plan is locked): Step 5’s success-check commands as written can race the failover. A 5s-poll on cluster status until available is more reliable than the wait calls.
- Step 5 PASS.
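A poll loop of the kind described in this step could look roughly like the following. It is a sketch under assumptions: wait_for_failover and cluster_state are hypothetical names, CLUSTER_ID and PROFILE are placeholders, and the exit condition (cluster available and writer equal to the failover target) is one reasonable reading of "cluster status / writer designation", not the exact loop used during the rollout.

```shell
# Hypothetical helper. get_state is any command that prints
# "<cluster-status> <writer-instance-id>" on one line.
wait_for_failover() {
  local get_state="$1" target="$2" interval="${3:-5}" max_polls="${4:-120}"
  local state status writer i=1
  while [ "$i" -le "$max_polls" ]; do
    state="$("$get_state")"
    # Split on any whitespace (the AWS CLI text output may use tabs).
    set -- $state
    status="$1"; writer="$2"
    # Done only when the cluster is available AND the writer has moved,
    # so an early "available" reading before the flip does not count.
    if [ "$status" = "available" ] && [ "$writer" = "$target" ]; then
      echo "settled after $i polls"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out (last state: $state)" >&2
  return 1
}

# Assumed state command built from two describe-db-clusters queries.
cluster_state() {
  local status writer
  status="$(aws rds describe-db-clusters \
    --db-cluster-identifier "${CLUSTER_ID}" \
    --query 'DBClusters[0].Status' --output text \
    --region us-east-1 --profile "${PROFILE}")"
  writer="$(aws rds describe-db-clusters \
    --db-cluster-identifier "${CLUSTER_ID}" \
    --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' \
    --output text --region us-east-1 --profile "${PROFILE}")"
  echo "$status $writer"
}
```

Usage would be along the lines of `wait_for_failover cluster_state alpha002-dev-auroraclusterreader1 5`. Taking the state command as a parameter keeps the loop testable without AWS access.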
Step 6 — Verify static parameters + pg_stat_statements
Bastion pod via system.reference.item module against the new writer (alpha002-dev-auroraclusterreader1):
- SHOW max_connections = 401. The cluster parameter group’s formula is LEAST({DBInstanceClassMemory/9531392},5000), which yields ~401 for db.t3.medium. This is coincidentally the same as the engine default for the class, so proceed via the AWS-API cross-check to confirm the source is the new group, not the default.
- SHOW shared_preload_libraries: permission denied (module user lacks pg_read_all_settings, same as Step 3). Falling back to functional + API checks below.
- SELECT count(*) FROM pg_stat_statements = 184 rows. Strongest possible verification: the view returns data only when the extension is loaded via shared_preload_libraries. No “must be loaded via shared_preload_libraries” error. PASS.
- AWS API cross-check against parameter group alpha002-dev-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-dvlftis1srce:
  - shared_preload_libraries = pg_stat_statements (ApplyType static)
  - max_connections = LEAST({DBInstanceClassMemory/9531392},5000) (ApplyType static)
  - Both ApplyMethod: pending-reboot (group-level apply policy for static params, not a current pending state on the instance, which is ParamApply: in-sync).
- Step 6 PASS.
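To make the max_connections reading concrete, the formula can be evaluated by hand. The sketch below assumes the division is floored, and the memory figure is illustrative: the log does not state DBInstanceClassMemory for db.t3.medium, only that the result is ~401, which under that assumption corresponds to any value from 3,822,088,192 to 3,831,619,583 bytes. The function name is hypothetical.

```shell
# Hypothetical helper mirroring LEAST({DBInstanceClassMemory/9531392},5000).
# Assumes integer (floored) division; memory is given in bytes.
max_connections_for() {
  local memory_bytes="$1" cap="${2:-5000}"
  local derived=$((memory_bytes / 9531392))
  # LEAST(derived, cap)
  if [ "$derived" -gt "$cap" ]; then derived="$cap"; fi
  echo "$derived"
}

# Illustrative input: 401 * 9531392 bytes, the smallest value that yields 401.
max_connections_for 3822088192
```

The cap only matters for much larger instance classes: with 128 GiB of class memory the division gives roughly 14,400, so the result is clamped to 5000.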
Step 7 — Delete snapshot
- Operator confirmed Performance Insights for the new writer (db-CA2DXHCIXASDGHILIGGULDT2HU) shows ~zero AAS: two tiny bars at 12:07–12:08 (HikariCP reconnects post-failover) then flat, consistent with dev idle load.
- aws rds delete-db-cluster-snapshot ... Alpha002-dev-pdev479-pre-rollout at 12:26:34 PDT: accepted, prior status available. Subsequent describe-db-cluster-snapshots returned DBClusterSnapshotNotFoundFault (deletion was effectively immediate).
- Step 7 PASS.
Step 8 — Deploy PR #170 (operations chart bump) — PASS (with restart)
- Deploy executed via operations CI workflow run https://github.com/Arda-cards/operations/actions/runs/25871797543 (deploy (dev) / dev: 2.24.0, 2m 7s, green). However, this ran ~2h before the Aurora rollout work today, and the running pods at the start of this step were 170m old.
- Pre-restart check on operations-7bd795b655-*:
  - Deployment spec had SENTRY_DSN env wired from secret be-sentry-dsn/dsn.
  - Running pods env did not have SENTRY_DSN populated, only SENTRY_RELEASE, SENTRY_ENVIRONMENT, SENTRY_TRACES_SAMPLE_RATE.
  - Root cause: pods rolled out at ~9:45 PDT, but Alpha002-SentryDsn was created at 11:23 PDT by amm.sh attempt 3. The ExternalSecret synced (Ready=True reason=SecretSynced) later, but the old pods never saw the new secret.
- Verified be-sentry-dsn k8s secret exists in dev-operations and contains the expected DSN value.
- kubectl rollout restart deployment/operations issued at ~12:36 PDT. Rollout completed cleanly (two new replicas up, two old replicas terminated).
- Post-restart verification on operations-59d5d94b99-*:
  - Both pods Running 1/1, RESTARTS 0.
  - SENTRY_DSN env now present in operations container.
  - All 6 init containers (init-create-{order,businessaffiliate,item,facility,kanban,station}-db) exited 0, image postgres-database-initializer:2.5.0. Each ran CREATE EXTENSION IF NOT EXISTS pg_stat_statements (no-op, already present) and the new fail-loud PERFORM 1 FROM pg_stat_statements LIMIT 1 verify; all succeeded.
- Plan implication (deferred; Plan is locked): when the CI helm deploy runs before amm.sh, the resulting pods come up before Alpha002-SentryDsn exists and do not pick up the DSN. Future revisions should either (a) sequence helm deploy strictly after amm.sh for the partition, or (b) add an explicit kubectl rollout restart deployment/operations as the last action in Step 8.
- Linear PDEV-509 updated with an observation from this rollout (zero AAS on the demoted instance post-failover) and a worked-out options menu for read/write split via two HikariCP pools.
- Step 8 PASS.
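The ordering problem in this step (pods started before the secret existed) can be checked mechanically by comparing two timestamps. The following is a sketch only: needs_restart is a hypothetical helper, it assumes both values are UTC ISO-8601 strings of the same shape (which sort correctly as plain text), and the kubectl lookups in the comments are one assumed way to obtain them.

```shell
# Hypothetical helper: succeeds (exit 0) when the pod started strictly
# before the secret was created, i.e. the pod never saw the secret.
# Assumed sources for the two values:
#   kubectl get pod <pod> -n dev-operations -o jsonpath='{.status.startTime}'
#   kubectl get secret be-sentry-dsn -n dev-operations \
#     -o jsonpath='{.metadata.creationTimestamp}'
needs_restart() {
  local pod_started="$1" secret_created="$2" earliest
  [ "$pod_started" = "$secret_created" ] && return 1
  # Same-format UTC ISO-8601 timestamps order correctly under a byte sort.
  earliest="$(printf '%s\n%s\n' "$pod_started" "$secret_created" \
    | LC_ALL=C sort | head -n 1)"
  [ "$earliest" = "$pod_started" ]
}
```

With values approximating this log (pods at about 16:45Z, secret at 18:23:33Z), `needs_restart 2026-05-14T16:45:00Z 2026-05-14T18:23:33Z` succeeds, which is the case where a kubectl rollout restart deployment/operations is warranted.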
Step 9 — Playwright application-health check
- URL: https://dev.alpha002.app.arda.cards → redirected to /signin then auto-completed to /items?justSignedIn=true via browser-saved credentials (no explicit submit click needed; the auth flow ran end-to-end).
- Credentials used: op://Private/Miguel-new-dev/{username,password} → resolved successfully (login succeeded, account miguel-new-dev@arda.cards, role Account Admin).
- Post-login landing: Items list with 3 rows rendered; sidebar (Dashboard, Items, Order Queue, Receiving) populated; user menu shows correct identity.
Page timings (Performance Navigation Timing on /items):
| Metric | Value |
|---|---|
| responseEnd | 4193 ms |
| domContentLoaded | 4400 ms |
| loadEvent | 4848 ms |
| first-paint | 4424 ms |
| first-contentful-paint | 8824 ms |
| total duration | 4848 ms |
| resource entries | 106 |
Backend request times (22 /api/ + Cognito calls):
- Total: 22 requests. Median: 536 ms. All returned 200 except one 401 /api/pylon/email-hash that immediately retried to 200.
- Requests >500 ms (the slow-query threshold the new parameter group is configured to log): 12 of 22, including 8 over 2 s.
  - 3463 ms POST /api/arda/kanban/kanban-card/details/requesting
  - 3384 ms POST /api/auth/secret-hash
  - 3289 ms POST /api/auth/secret-hash
  - 2751 ms POST /api/arda/kanban/kanban-card/details/requesting
  - 2663 ms POST /api/arda/kanban/kanban-card/details/requested
  - 2579 ms POST /api/arda/items/query-ssrm
  - 2449 ms POST /api/arda/kanban/kanban-card/details/in-process
  - 2444 ms POST /api/arda/kanban/kanban-card/details/requested
  - 1203 ms POST cognito-idp.us-east-1.amazonaws.com/ (Cognito auth)
  - 574 ms POST /api/arda/kanban/kanban-card/query-details-by-item
  - 536 ms POST /api/arda/kanban/kanban-card/query-details-by-item
  - 515 ms GET /api/arda/kanban/kanban-card/query-by-item?eId=...
- These are pre-existing slow paths, surfaced now by the rollout’s slow-query logging at the log_min_duration_statement=500ms threshold. The repeated kanban-card detail endpoints at ~2.4–3.5s are likely the most impactful starting point for the performance investigation this rollout enables. Expect them to appear in the Postgres slow-query log on the new writer (cluster Alpha002-dev-AuroraCluster).
Console errors (9 total):
- 1 × 401 /api/pylon/email-hash, retried to 200 in subsequent call.
- 7 × [WorkspaceSwitcher] Zero workspaces returned — this should never happen for an authenticated user. Appears unrelated to this rollout (frontend error condition; the page rendered with data despite the message). Worth a follow-up ticket if not already tracked.
Pass / fail: PASS (functional). Login flow works, page renders, backend returns 200 for all real calls. The slow-request count is expected to drop once PDEV-509 (graceful degradation / read split) and follow-up query tuning land; tracked separately.
- Step 9 PASS.
Alpha002-dev — rollout complete
All 9 steps green. Pausing for operator go-ahead before starting Alpha002-stage.
Alpha002-stage
Operator: Miguel Pinilla. Driver: Claude Code. Operator authorized start at ~12:43 PDT.
Drift checks
- aws rds describe-db-clusters --db-cluster-identifier Alpha002-stage-AuroraCluster ... --profile Alpha002-Admin →
  - Writer: alpha002-stage-auroraclusterwriter
  - Reader1: alpha002-stage-auroraclusterreader1
  - Same lowercase pattern as dev. No drift.
- kubectl --context Alpha002 get ns stage-operations → exists, age 279d. Matches placeholder convention. No drift.
- Pre-rollout Aurora cluster param group: default.aurora-postgresql16 (engine default), which confirms stage has not yet been touched.
- op whoami → “account is not signed in”. 1P session expired again; operator action required before Step 2.
Step 1 — Snapshot
- Snapshot ID: alpha002-stage-pdev479-pre-rollout
- Created at 12:49:13 PDT (run in background while resolving the 1P session + Linear ticket draft for PDEV-513).
- aws rds wait db-cluster-snapshot-available returned at 12:54:46 PDT (~5.5 min). Final status available, progress 100%.
- Step 1 PASS.
Step 2 — amm.sh PASS (first attempt)
- Start 14:34:03 PDT, end 14:42:27 PDT (~8 min), exit 0, status: succeeded. No retries needed (1P session was warm in this shell after op signin re-authorized the parent process).
- Success checks (all green):
  - Alpha002-Secrets stack CREATE_COMPLETE (already created during dev’s amm.sh, not re-created).
  - Cluster parameter group is alpha002-stage-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-gnhrcc7nclxh (not engine default).
  - Both instances db.t3.medium, ParamApply: in-sync.
- Notable CFN behavior observed: amm.sh’s CDK deploy automatically rebooted both instances during the partition stack update (writer 14:36:59 → 14:39:05, reader1 14:39:06 → 14:40:10). Investigation against dev’s CFN history (attempt 3) shows the same behavior occurred there too (writer 11:28:06 → 11:30:13, reader1 11:30:14 → 11:31:19), so Steps 4 and 5 in the runbook were effectively redundant for dev.
- Root cause / Plan implication (deferred; Plan is locked): infrastructure/src/main/cdk/platforms.ts sets databaseTuning.applyImmediately per partition. For dev / stage / demo: applyImmediately: true → CDK propagates the cluster param-group attach to each AWS::RDS::DBInstance modification with applyImmediately=true → AWS reboots immediately. For prod: applyImmediately: false → CDK queues changes; manual reboot required, plus an extra modify-db-instance --apply-immediately step to flush the queued instance-class swap (db.t3.medium → db.r7g.large). The runbook conflates “param group attached” with “param group live” and over-prescribes for the applyImmediately=true partitions.
- Step 2 PASS.
Step 3 — Verify dynamic parameters
Bastion pod via system.reference.item module against the writer (alpha002-stage-auroraclusterwriter at the time of check):
- log_min_duration_statement = 500ms ✅
- log_statement = ddl ✅
- log_lock_waits = on ✅
- log_temp_files = 0 ✅
Static-param probe (since CDK already rebooted, these should be live):
- max_connections = 401 (consistent with new param group’s LEAST({DBInstanceClassMemory/9531392},5000) formula for db.t3.medium).
- SELECT count(*) FROM pg_stat_statements → ERROR: relation "pg_stat_statements" does not exist. The extension was not yet created in stage-operations.item_db. Expected: stage operations is still on the old chart (init 2.3.0); the 2.5.0 init container has not run yet. Will be installed by Step 8 (helm upgrade) and re-verified afterward.
- Step 3 PASS (dynamic params correct; pg_stat_statements extension creation deferred to Step 8).
Step 4 — Reboot reader1 — SKIPPED
Per operator decision after the Plan-vs-CDK reconciliation above: both stage instances were already rebooted by amm.sh’s CDK deploy during Step 2 (writer 14:36:59–14:39:05, reader1 14:39:06–14:40:10). An additional manual reboot would be redundant.
Step 5 — Failover to reader1 (smoke test)
Run as a smoke test per operator direction. It is not strictly required for parameter activation (CDK already covered that), but it is useful as a validation that failover works in stage.
- aws rds failover-db-cluster ... --target-db-instance-identifier alpha002-stage-auroraclusterreader1 at 14:54:59 PDT.
- Polling cluster status / writer designation every 10s:
  - 14:55:02 cluster=available, writer=...writer (pre-flip)
  - 14:55:13 cluster=failing-over, writer=...writer
  - 14:55:25 cluster=failing-over, writer=...reader1 ← flipped
  - 14:55:37 cluster=available, writer=...reader1 ← settled
- Total failover window: ~38s. Faster than dev (~4 min cluster-status return), probably because stage has lighter ambient load.
- Step 5 PASS.
Step 8 — Deploy PR #170 (operations chart 2.24.0) — PASS
- Operator approved the queued stage deploy in the operations CI run.
- Rollout watched live:
  - New ReplicaSet operations-5dd7cc9546 (image operations:2.24.0, init postgres-database-initializer:2.5.0) spun up at 14:58 PDT.
  - Old ReplicaSet operations-7c894bf4b9 (was running operations:2.23.0) terminated.
  - kubectl rollout status returned successfully rolled out at 14:59:00 PDT.
  - Both new pods (-m2vbt, -tgrg7) Running 1/1, RESTARTS 0.
- No race condition this time: stage deploy happened after amm.sh had already created Alpha002-Secrets (during dev’s run), so SENTRY_DSN was wired correctly from the start; no manual rollout-restart needed.
- Verifications on new pods:
  - All 6 init containers (init-create-{order,businessaffiliate,item,facility,kanban,station}-db) exited 0.
  - SENTRY_DSN populated, SENTRY_ENVIRONMENT=Alpha002-stage.
- Step 8 PASS.
Step 6 — pg_stat_statements verify (post-Step-8)
Re-ran the bastion SELECT against the stage writer:
- SELECT count(*) FROM pg_stat_statements = 659 rows. Significantly more than dev’s 184; stage has been collecting query stats since amm.sh rebooted both instances ~20 min before this check (reboots finished 14:40:10; this check ran after the 14:59 rollout).
- Step 6 PASS (deferred from earlier, now satisfied).
Step 7 — Delete snapshot — PASS
- aws rds delete-db-cluster-snapshot ... Alpha002-stage-pdev479-pre-rollout at 15:02:13 PDT: accepted, prior status available.
- Step 7 PASS.
Step 9 — Playwright application-health check
- URL: https://stage.alpha002.app.arda.cards → /signin → typed credentials from op://Private/Arda-stage/{username,password} → Sign in → landed on /items?justSignedIn=true.
- User: miguel@arda.cards (resolved from 1P).
Page timings:
| Metric | Stage | Dev (for comparison) |
|---|---|---|
| responseEnd | 598 ms | 4193 ms |
| domContentLoaded | 696 ms | 4400 ms |
| loadEvent | 810 ms | 4848 ms |
| first-paint | 728 ms | 4424 ms |
| first-contentful-paint | 1260 ms | 8824 ms |
Backend request times (22 calls, median 399 ms):
- Requests > 500 ms: 7 of 22 (dev: 12 of 22). Peak 1099 ms on Cognito (expected auth overhead). Application API peak 593 ms on POST /api/arda/kanban/kanban-card/query-details-by-item.
- Slowest 8 backend calls all between 475–593 ms (vs dev’s 2444–3463 ms range on the same kanban endpoints).
- Stage is materially faster than dev, likely a combination of warmer caches, fewer items, and a more recently rebooted writer with no accumulated noisy queries.
Console errors: 0 (dev: 9, including 7× WorkspaceSwitcher
“Zero workspaces returned”). The absence of the WorkspaceSwitcher
error on stage with a different user account reinforces that the
issue tracked in PDEV-513 is data-/account-state driven, not a
generic code bug. Updated PDEV-513 with this observation
implicitly via this log; explicit Linear comment skipped to avoid
churn.
- Step 9 PASS.
Alpha002-stage — rollout complete
All 9 steps green. Pausing for operator go-ahead before starting Alpha001-demo.
Alpha001-demo
Operator: Miguel Pinilla. Driver: Claude Code. Operator authorized
start after running aws sso login --profile Admin-Alpha1.
Drift checks
- aws rds describe-db-clusters --db-cluster-identifier Alpha001-demo-AuroraCluster ... --profile Admin-Alpha1 →
  - Writer: alpha001-demo-auroraclusterwriter
  - Reader1: alpha001-demo-auroraclusterreader1
  - Same lowercase pattern as dev/stage. No drift.
- kubectl --context Alpha001 get ns demo-operations → exists, age 87d. Matches placeholder convention. No drift.
- Pre-rollout Aurora cluster param group: default.aurora-postgresql16 (engine default), which confirms demo has not yet been touched.
- 1P paths all reachable: op://Arda-SystemsOAM/be-sentry-dsn/dsn OK, op://Arda-SystemsOAM/Amplify_GitHub_AccessToken/password OK, op://Arda-DemoOAM/ARDA-API-KEY/password OK.
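A reachability check like the one recorded for the 1P paths can be scripted as a preflight so that an expired session is caught before a long amm.sh run rather than at Step 2.2.4. This is a sketch, not what the operator ran: preflight_1p is a hypothetical name, and the reader command is passed in as a parameter (intended to be the 1Password CLI's op read) so the logic can be exercised without a live session.

```shell
# Hypothetical preflight: try to read every 1Password reference and report
# which ones fail. The first argument is the reader command (for example
# "op read"); the remaining arguments are op:// references.
preflight_1p() {
  local read_secret="$1"; shift
  local ref failed=0
  for ref in "$@"; do
    # $read_secret is intentionally unquoted so "op read" splits into words.
    # Output is discarded so secret values never reach the terminal or logs.
    if $read_secret "$ref" >/dev/null 2>&1; then
      echo "OK      $ref"
    else
      echo "MISSING $ref"
      failed=1
    fi
  done
  return "$failed"
}

# Assumed usage with the three demo paths from the drift check:
# preflight_1p "op read" \
#   op://Arda-SystemsOAM/be-sentry-dsn/dsn \
#   op://Arda-SystemsOAM/Amplify_GitHub_AccessToken/password \
#   op://Arda-DemoOAM/ARDA-API-KEY/password \
#   && echo "1P ready"
```

A passing preflight only shows the session was valid at that moment; as the dev and demo logs show, the desktop app can re-lock during a 12-minute run, so this narrows the failure window rather than closing it.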
Step 1 — Snapshot
- Snapshot ID: alpha001-demo-pdev479-pre-rollout
- Created at 16:34:57 PDT.
- aws rds wait db-cluster-snapshot-available returned at 16:38:30 PDT (~3.5 min). Final status available, progress 100%.
- Step 1 PASS.
Step 2 — amm.sh attempt 1 — PARTIAL (1P timeout at Step 2.2.4)
- Start 17:01:29 PDT, end 17:13:31 PDT (~12 min), exit 1.
- Same failure mode as dev attempt 3: [ERROR] could not read secret 'op://Arda-DemoOAM/ARDA-API-KEY/password': error initializing client: authorization timeout. Operator: “I never got a prompt”; the desktop integration’s prompt did not surface during the long-running script. Confirmed later that even an interactive op signin from a fresh shell could time out silently when the desktop app is locked.
- Rollout-critical artifacts already in place at failure point:
  - Alpha001-Secrets stack CREATE_COMPLETE (17:02:33 PDT)
  - Alpha001-SentryDsn Secrets Manager value shape OK
  - Cluster param group is alpha001-demo-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-v3sluztqcnrj
  - CDK rebooted both instances during the partition deploy (writer 17:07:10–17:09:16, reader1 17:09:17–17:10:22), the same applyImmediately=true behavior as dev and stage.
Step 2 — amm.sh retry 1 — FAIL (op signin timeout)
- 17:45:24 PDT: op signin itself returned authorization timeout after 60s because the desktop app had re-locked and the prompt did not surface. amm.sh never ran (short-circuited by &&).
Step 2 — amm.sh retry 2 — PASS
- Operator unlocked desktop 1Password, biometric prompt surfaced for op signin, approval granted.
- Start 17:47:31 PDT, end 17:52:55 PDT (~5.5 min), exit 0, status: succeeded.
- All re-runs of CDK deploys were no-ops; the previously-failing post-CDK secret push at Step 2.2.4 succeeded.
- Step 2 PASS.
Step 3 — Verify dynamic parameters
Bastion via system.reference.item against the demo writer:
- log_min_duration_statement = 500ms ✅
- log_statement = ddl ✅
- log_lock_waits = on ✅
- log_temp_files = 0 ✅
- max_connections = 401 (correct for db.t3.medium per the new param group’s LEAST({DBInstanceClassMemory/9531392},5000) formula).
- SELECT count(*) FROM pg_stat_statements → “relation does not exist”. Expected; demo operations is still on the OLD chart (init postgres-database-initializer:2.3.0, app operations:2.23.0, no SENTRY_DSN env, no be-sentry-dsn secret yet). Will be installed by Step 8.
- Step 3 PASS.
Step 4 — Reboot reader1 — SKIPPED
CDK already rebooted both instances during Step 2’s partition deploy (writer 17:07:10–17:09:16, reader1 17:09:17–17:10:22). Same as dev and stage; an additional manual reboot would be redundant.
Step 5 — Failover to reader1 — SKIPPED
Operator decision: stage already validated the failover path (stage Step 5, 14:54:59–14:55:37 PDT, ~38s); no additional smoke test value in repeating on demo.
Step 6 — Verify pg_stat_statements — DEFERRED
Pending Step 8: demo operations is still on init container 2.3.0
which does not create the pg_stat_statements extension. Will
re-run the SELECT after the operations chart upgrade.
Step 8 — Deploy operations chart 2.24.0 — PASS
- Operator triggered the demo deploy.
- Rollout watched: new ReplicaSet operations-579bdd455d spun up at ~18:08 PDT, kubectl rollout status returned success at 18:09:26 PDT (~30s rollout). Old operations-56684688fb pods terminated.
- New pods both Running 1/1, RESTARTS 0.
- New images: init postgres-database-initializer:2.5.0, app operations:2.24.0 (replaced 2.3.0 / 2.23.0).
- All 6 init containers (init-create-{order,businessaffiliate,item,facility,kanban,station}-db) exited 0.
- SENTRY_DSN populated, SENTRY_ENVIRONMENT=Alpha001-demo, SENTRY_TRACES_SAMPLE_RATE=0.1 (note: demo uses 10% sampling vs 100% on dev/stage, an expected per-env override).
- ExternalSecret be-sentry-dsn reports Ready=True, reason=SecretSynced.
- Step 8 PASS.
Step 6 — pg_stat_statements verify (post-Step-8)
- `SELECT count(*) FROM pg_stat_statements` against the demo writer via `system.reference.item` bastion = 839 rows.
- Step 6 PASS.
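The two outcomes seen for this probe on demo (an error before Step 8, a row count after it) can be told apart mechanically. The following is a sketch only: `classify_pgss_probe` is a hypothetical helper, and it assumes the probe output is either a bare integer or an error message containing "does not exist".

```shell
# Hypothetical helper: read the probe output on stdin and print a verdict.
classify_pgss_probe() {
  local out
  out=$(cat)
  case "$out" in
    *"does not exist"*) echo "extension-missing" ;;  # extension not created in this database yet
    ''|*[!0-9]*)        echo "unknown" ;;            # empty or non-numeric output
    *)                  echo "collecting:$out" ;;    # bare row count
  esac
}
```

A possible invocation, assuming a connection string in a variable named `DSN` (not defined anywhere in this log), is `psql "$DSN" -Atc 'SELECT count(*) FROM pg_stat_statements' 2>&1 | classify_pgss_probe`.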
Step 7 — Delete snapshot — PASS
- `aws rds delete-db-cluster-snapshot ... Alpha001-demo-pdev479-pre-rollout` at 18:19:13 PDT — accepted, prior status `available`.
- Step 7 PASS.
Step 9 — Playwright application-health check
- URL: `https://demo.alpha001.app.arda.cards` → initial navigation showed “Your session has expired. Please sign in again.” (stale cookie from a prior session) → redirected to `/signin?next=%2F` → typed credentials from `op://Private/arda-demo/{username,password}` → Sign in → landed on `/items?justSignedIn=true`.
- User: `miguel@arda.cards` (Account Admin). Items grid empty (no rows in demo). “For Trial Use Only” banner visible.
Page timings:
| Metric | Demo | Stage | Dev |
|---|---|---|---|
| responseEnd | 4919 ms | 598 ms | 4193 ms |
| domContentLoaded | 5137 ms | 696 ms | 4400 ms |
| loadEvent | 5404 ms | 810 ms | 4848 ms |
| first-paint | 5152 ms | 728 ms | 4424 ms |
| first-contentful-paint | 9604 ms | 1260 ms | 8824 ms |
Backend request times (19 calls, median 462 ms):
- Requests > 500 ms: 9 of 19 (47%). Profile closer to dev than to stage.
- Slowest 8:
  - 3599 ms `POST /api/auth/secret-hash`
  - 3439 ms `POST /api/auth/secret-hash`
  - 2563 ms `POST /api/arda/kanban/kanban-card/details/requested`
  - 2010 ms `POST /api/arda/user-account/query`
  - 1545 ms `POST /api/arda/items/query-ssrm`
  - 1206 ms `POST /api/arda/kanban/kanban-card/details/in-process`
  - 1190 ms `POST /api/arda/kanban/kanban-card/details/requesting`
  - 876 ms `POST cognito-idp.us-east-1.amazonaws.com/`
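The "requests over threshold" figure can be recomputed from a list of durations. This is a rough sketch, assuming one duration in milliseconds per input line; `over_threshold` is a hypothetical helper. Only the eight slowest calls are recorded in this log, so the full 19-call figure cannot be reproduced from it.

```shell
# Hypothetical helper: over_threshold THRESHOLD_MS
# Reads one duration (ms) per line on stdin and prints "<count> of <total> (<pct>%)".
over_threshold() {
  awk -v t="$1" '
    NF { total++; if ($1 + 0 > t + 0) over++ }
    END {
      if (total == 0) { print "0 of 0"; exit }
      printf "%d of %d (%.0f%%)\n", over, total, 100 * over / total
    }'
}
```

Fed the eight durations listed above with a 2000 ms threshold, it prints `4 of 8 (50%)`.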
Console errors (14):
- 1 × `401 /api/pylon/email-hash` (transient, retry pattern).
- 3 × `400` from `cognito-idp` plus `[CLIENT] Token refresh failed` / `Authentication token has expired` — side effect of the stale session cookie that triggered the “session expired” toast on initial navigation. Resolved after fresh sign-in.
- 7 × AG Grid Enterprise “License Key Not Found” notice — cosmetic, trial mode. Demo-specific (not seen on dev/stage). Worth a separate ticket if not already tracked: production-bound charts should ship with a real license to avoid the watermark / banner.
- No WorkspaceSwitcher errors (PDEV-513 is dev-account-specific, confirmed across stage and now demo).
Pass / fail: PASS. App functional, all real API calls return 200 modulo the transient pylon 401 (auto-retried). Performance profile is “slow path” like dev rather than “fast path” like stage — likely a cold-cache or scale-of-data effect that’s separate from this rollout.
- Step 9 PASS.
Alpha001-demo — rollout complete
All 9 steps green. Pausing for operator go-ahead before starting Alpha001-prod.
Alpha001-prod
Operator: Miguel Pinilla. Driver: Claude Code. Window: 2026-05-14 23:00 PDT (planned failover moment). Actual failover fired at 23:16:22 PDT after recovering from a hard CFN failure (see Step 2 attempt 1 below).
Drift checks
- AWS profile `Admin-Alpha1` SSO confirmed.
- Aurora `Alpha001-prod-AuroraCluster` writer `alpha001-prod-auroraclusterwriter`, reader1 `alpha001-prod-auroraclusterreader1`. Same lowercase pattern as the other partitions.
- `kubectl --context Alpha001 get ns prod-operations` → exists, age 265d.
- Pre-rollout cluster: param group `default.aurora-postgresql16`, both instances `db.t3.medium`, `PerformanceInsightsRetentionPeriod=465`, `DatabaseInsightsMode=advanced`. The Advanced mode + 465-day retention turned out to be a drift versus the CDK construct (which models `DatabaseInsightsMode=undefined` ⇒ standard and `performanceInsightRetention=DEFAULT` ⇒ 7 days). Will reconcile during Step 2.
- 1P paths all reachable after `op signin`.
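The Insights drift noted in the last-but-one bullet is the kind of condition a pre-flight check could flag before the deploy. The sketch below is illustrative: both function names are hypothetical, and the 465-day floor for Advanced mode is taken from this log rather than verified independently.

```shell
# Hypothetical pre-flight helper: exit status 0 means "conflict", i.e. the live
# cluster is in advanced mode and the template wants a retention below 465 days.
pi_retention_conflict() {
  local mode="$1" desired_days="$2"
  [ "$mode" = "advanced" ] && [ "$desired_days" -lt 465 ]
}

# Read the live mode and retention for a cluster (prints two tab-separated values).
live_insights_settings() {
  aws rds describe-db-clusters --db-cluster-identifier "$1" \
    --query 'DBClusters[0].[DatabaseInsightsMode,PerformanceInsightsRetentionPeriod]' \
    --output text
}
```

With the values recorded above, `pi_retention_conflict advanced 7` succeeds, which signals that the drift should be reconciled before CloudFormation touches the cluster.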
Step 1 — Snapshot
- Snapshot ID: `alpha001-prod-pdev479-pre-rollout`
- Created 22:25:06 PDT, available 22:30:44 PDT (~5.5 min).
- Step 1 PASS.
Step 2 — amm.sh attempt 1 — HARD FAIL (Advanced Insights drift)
- Start 22:30:44 PDT, end 22:33:28 PDT (~3 min), exit 1.
- Root cause: `Alpha001-prod-AuroraDBCluster` CFN stack tried to modify the `AWS::RDS::DBCluster` to set `PerformanceInsightsRetentionPeriod` to 7 (CDK default), but the live cluster had `DatabaseInsightsMode=advanced` which mandates ≥ 465 days. AWS rejected, CFN tried to roll back, the rollback ALSO failed with the same error, leaving the stack in `UPDATE_ROLLBACK_FAILED`.
- Stack audit at this point:
  - `Alpha001-Networking`, `Alpha001-Compute`, `Alpha001-Ingress`, `Alpha001-CloudWatchLog`, `Alpha001-Secrets` — all `UPDATE_COMPLETE` / `CREATE_COMPLETE` (from earlier deploys).
  - `Alpha001-prod-Imported/Compute/Authentication/BulkStores/ImageStorage` — all `UPDATE_COMPLETE` (succeeded earlier in this attempt before the Aurora failure).
  - `Alpha001-prod-AuroraDBCluster` — `UPDATE_ROLLBACK_FAILED`.
  - `Alpha001-prod-Ingress`, `Alpha001-prod-DnsConfiguration` — NOT redeployed (CDK chain stopped at Aurora).
- Cluster + instances functionally unchanged; customer traffic unaffected.
Step 2 — Recovery: disable Advanced Insights + continue rollback
- `aws rds modify-db-cluster --database-insights-mode standard --apply-immediately` at 22:40:28 PDT.
- Cluster + both instances transitioned to `standard` mode immediately. PI retention now mutable.
- `continue-update-rollback` retry 1 at 22:42:20 PDT failed with `DB cluster isn't available for modification with status configuring-enhanced-monitoring` (transient race with the manual modify above).
- Waited for cluster `available`, retried `continue-update-rollback`, succeeded — stack reached `UPDATE_ROLLBACK_COMPLETE` at 22:55:56 PDT.
- Final state: PIRetention dropped to 7, cluster on `default.aurora-postgresql16`, no Advanced Insights, ready for re-deploy.
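The recovery above could be scripted so that the wait between the manual modify and the rollback retry is explicit, which avoids the transient race hit on retry 1. This is a sketch under assumptions, not the procedure as executed: `run` and `recover_rollback_failed` are hypothetical helpers, and `DRY_RUN=1` only prints the commands.

```shell
# run: execute a command, or just print it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

# recover_rollback_failed CLUSTER_ID STACK_NAME (hypothetical helper)
recover_rollback_failed() {
  local cluster="$1" stack="$2"
  # 1. Drop Advanced Insights so the retention period becomes mutable again.
  run aws rds modify-db-cluster --db-cluster-identifier "$cluster" \
    --database-insights-mode standard --apply-immediately || return 1
  # 2. Wait out transient states such as configuring-enhanced-monitoring.
  run aws rds wait db-cluster-available --db-cluster-identifier "$cluster" || return 1
  # 3. Resume the failed rollback and wait for UPDATE_ROLLBACK_COMPLETE.
  run aws cloudformation continue-update-rollback --stack-name "$stack" || return 1
  run aws cloudformation wait stack-rollback-complete --stack-name "$stack"
}
```

A dry run with the identifiers from this log would be `DRY_RUN=1 recover_rollback_failed Alpha001-prod-AuroraCluster Alpha001-prod-AuroraDBCluster`. The waiter in step 2 may return while the cluster is still settling, so a retry of step 3 can still be needed.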
Step 2 — amm.sh retry — PASS
- Start 22:57:25 PDT, end 23:04:15 PDT (~7 min), exit 0.
- `Alpha001-prod-AuroraDBCluster` `UPDATE_COMPLETE` at 22:59:43.
- Notable CFN behavior (different from dev/stage/demo): the partition Aurora stack’s DBInstance modifications returned `UPDATE_COMPLETE` in ~30s each (no reboot), with the changes queued as `PendingModifiedValues` rather than applied immediately. This matches `applyImmediately=false` in `platforms.ts` for prod.
- `Alpha001-prod-Ingress` UPDATE_COMPLETE 22:59:43 (was skipped in attempt 1).
- `Alpha001-prod-DnsConfiguration` UPDATE_COMPLETE 23:00:03 (also was skipped in attempt 1).
- Post-amm state:
  - Cluster param group: `alpha001-prod-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-nanabjqq5qcd`
  - Both instances `db.t3.medium`, `ParamApply: in-sync`, `PendingModifiedValues.DBInstanceClass: db.r7g.large` queued.
- Step 2 PASS.
Step 3 — Verify dynamic parameters
- Bastion via `system.reference.item` against the (still-original) writer:
  - `log_min_duration_statement=500ms` ✅
  - `log_statement=ddl` ✅
  - `log_lock_waits=on` ✅
  - `log_temp_files=0` ✅
- `max_connections=401` (pre-class-swap, t3.medium default).
- Step 3 PASS.
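The four dynamic-parameter checks above lend themselves to a small loop. A minimal sketch, assuming `psql` can reach the writer through the bastion with a connection string passed as the first argument; `check_param` and `verify_params` are hypothetical helpers.

```shell
# check_param NAME EXPECTED ACTUAL: print a verdict, fail on mismatch.
check_param() {
  if [ "$2" = "$3" ]; then
    echo "$1=$3 OK"
  else
    echo "$1: expected $2, got $3"
    return 1
  fi
}

# verify_params DSN: run SHOW for each expected parameter and compare.
verify_params() {
  local dsn="$1" rc=0 name want got
  while read -r name want; do
    got=$(psql "$dsn" -Atc "SHOW $name" </dev/null) || return 1
    check_param "$name" "$want" "$got" || rc=1
  done <<'EOF'
log_min_duration_statement 500ms
log_statement ddl
log_lock_waits on
log_temp_files 0
EOF
  return "$rc"
}
```

The loop keeps going after a mismatch so that one run reports every parameter that is off, and the exit status is non-zero if any of them differ.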
Step 4 — Flush reader1 pending mods (apply-immediately)
- Prod-specific equivalent of “reboot reader1”. For `applyImmediately=false` the runbook’s plain reboot would not flush the queued class swap; an explicit `modify-db-instance --apply-immediately` is required.
- 23:06:15 PDT — `aws rds modify-db-instance --db-instance-identifier alpha001-prod-auroraclusterreader1 --apply-immediately`.
- 23:07:45 — reader1 enters `modifying` with pending class swap.
- 23:12:34 — reader1 visible as `db.r7g.large`, `configuring-enhanced-monitoring`.
- 23:13:37 — reader1 `available`, `db.r7g.large`, no pending. Total ~7 min.
- Step 4 PASS.
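Because reader1 stayed `available` for about 90 seconds after the modify call before entering `modifying`, a status-only waiter could return too early. One option is to poll the pending queue itself. The sketch below assumes that approach; `wait_no_pending` is a hypothetical helper, and the 30-second interval is arbitrary.

```shell
# wait_no_pending INSTANCE_ID [MAX_TRIES]
# Polls until PendingModifiedValues.DBInstanceClass is empty ("None" in text output).
wait_no_pending() {
  local instance="$1" tries="${2:-40}" pending
  while [ "$tries" -gt 0 ]; do
    pending=$(aws rds describe-db-instances --db-instance-identifier "$instance" \
      --query 'DBInstances[0].PendingModifiedValues.DBInstanceClass' \
      --output text) || return 1
    [ "$pending" = "None" ] && return 0
    tries=$((tries - 1))
    sleep 30
  done
  return 1
}
```

After it returns, `aws rds wait db-instance-available --db-instance-identifier <id>` covers the trailing `configuring-enhanced-monitoring` phase seen at 23:12:34.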
Step 5 — Failover to reader1
- 23:16:22 PDT — `aws rds failover-db-cluster --db-cluster-identifier Alpha001-prod-AuroraCluster --target-db-instance-identifier alpha001-prod-auroraclusterreader1`.
- 23:16:35 — cluster `failing-over`, writer still old (in flight).
- 23:16:46 — writer flipped to `alpha001-prod-auroraclusterreader1`.
- 23:16:57 — cluster `available`. Total failover window ~35s.
- Post-failover state:
  - Writer: `alpha001-prod-auroraclusterreader1` on `db.r7g.large`, `ParamApply: in-sync`, no pending.
  - Reader (the demoted instance): `alpha001-prod-auroraclusterwriter` still on `db.t3.medium` with `PendingModifiedValues.DBInstanceClass: db.r7g.large` — failover-induced restart did not auto-flush the queue (confirming the runbook’s prediction was over-optimistic for `applyImmediately=false`).
- Step 5 PASS (writer flipped; class swap on demoted writer handled below).
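The failover window above was read off timestamps by hand. It could also be measured by polling which member holds the writer role. This is a sketch only: `current_writer` and `failover_and_time` are hypothetical helpers, and the 2-second poll interval limits the precision of the result.

```shell
# current_writer CLUSTER_ID: print the instance that currently holds the writer role.
current_writer() {
  aws rds describe-db-clusters --db-cluster-identifier "$1" \
    --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter].DBInstanceIdentifier | [0]' \
    --output text
}

# failover_and_time CLUSTER_ID TARGET_INSTANCE_ID
failover_and_time() {
  local cluster="$1" target="$2" tries=60 start
  start=$(date +%s)
  aws rds failover-db-cluster --db-cluster-identifier "$cluster" \
    --target-db-instance-identifier "$target" >/dev/null || return 1
  # Poll until the target reports as writer, giving up after about 2 minutes.
  while [ "$(current_writer "$cluster")" != "$target" ]; do
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] || return 1
    sleep 2
  done
  echo "writer=$target after $(( $(date +%s) - start ))s"
}
```

The function reports when the writer role flips, not when the cluster returns to `available`. In this rollout those two moments were about 11 seconds apart.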
Step 5b — Flush demoted writer pending mods
- 23:18:31 PDT — `modify-db-instance --apply-immediately alpha001-prod-auroraclusterwriter`.
- 23:18:50 — demoted writer `modifying`.
- 23:24:59 — class swap to `db.r7g.large` complete; `configuring-enhanced-monitoring`.
- 23:26:03 — `available`, no pending. Total ~7.5 min.
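The "Total" figures in these step logs are differences between wall-clock timestamps. A small helper can recompute them. The sketch assumes both times fall on the same day and are formatted as HH:MM:SS; `elapsed_s` is a hypothetical name.

```shell
# elapsed_s START END: seconds between two same-day HH:MM:SS timestamps.
elapsed_s() {
  echo "$1 $2" | awk -F'[: ]' \
    '{ print ($4 * 3600 + $5 * 60 + $6) - ($1 * 3600 + $2 * 60 + $3) }'
}
```

For this step, `elapsed_s 23:18:31 23:26:03` prints 452, that is 7 min 32 s, which matches the ~7.5 min recorded above.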
Final cluster state after failover + both class swaps
- Cluster param group: `alpha001-prod-auroradbcluster-aurorapostgresclusterclusterparametergrouped1eb2f9-nanabjqq5qcd`
- Writer: `alpha001-prod-auroraclusterreader1` (promoted)
- Reader: `alpha001-prod-auroraclusterwriter` (demoted)
- Both instances `db.r7g.large`, `ParamApply: in-sync`, `Status: available`.
- `SHOW max_connections` = `500` (prod’s explicit override) ✅.
Step 8 — Deploy operations chart 2.24.0 to prod
- Operator triggered the deploy via the operations CI workflow.
- Rollout watched: new ReplicaSet `operations-f5956475d` spun up at 23:30:05 PDT, `rollout status` returned `successfully rolled out` at 23:31:15 PDT (~70s). Both new pods `Running 1/1`, `RESTARTS 0`.
- New images: init `postgres-database-initializer:2.5.0`, app `operations:2.24.0` (replaced 2.3.0 / 2.23.0).
- All 6 init containers exited 0.
- `SENTRY_DSN` populated, `SENTRY_ENVIRONMENT=Alpha001-prod`, `SENTRY_TRACES_SAMPLE_RATE=0.1` (same conservative sampling as demo).
- ExternalSecret `be-sentry-dsn` `Ready=True, reason=SecretSynced`.
- Step 8 PASS.
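The checks in this step map onto a handful of `kubectl` calls. The sketch below is illustrative rather than the commands actually run: the helper name is hypothetical, and the Deployment name `operations` is an assumption inferred from the ReplicaSet prefix. The context `Alpha001` and namespace `prod-operations` are taken from the drift checks.

```shell
# verify_operations_rollout CONTEXT NAMESPACE [DEPLOYMENT]
verify_operations_rollout() {
  local ctx="$1" ns="$2" deploy="${3:-operations}"
  # Block until the new ReplicaSet is fully rolled out (or time out).
  kubectl --context "$ctx" -n "$ns" rollout status "deployment/$deploy" \
    --timeout=300s || return 1
  # One row per pod: name, init image(s), app image(s), restart count(s).
  kubectl --context "$ctx" -n "$ns" get pods -o \
    custom-columns='NAME:.metadata.name,INIT:.spec.initContainers[*].image,APP:.spec.containers[*].image,RESTARTS:.status.containerStatuses[*].restartCount' || return 1
  # Reason attached to the Ready condition of the ExternalSecret.
  kubectl --context "$ctx" -n "$ns" get externalsecret be-sentry-dsn \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}{"\n"}'
}
```

An invocation for this partition would be `verify_operations_rollout Alpha001 prod-operations`. The pod listing covers every pod in the namespace, so a label selector may be needed if other workloads share it.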
Step 6 — pg_stat_statements verify (post-Step-8)
- `SELECT count(*) FROM pg_stat_statements` against prod writer via `system.reference.item` = 769 rows.
- Step 6 PASS.
Step 7 — Delete snapshot — PASS
- `aws rds delete-db-cluster-snapshot ... Alpha001-prod-pdev479-pre-rollout` at 23:35:44 PDT — accepted, prior status `available`.
- Step 7 PASS.
Step 9 — Playwright application-health check
- URL: `https://prod.alpha001.app.arda.cards` → `/signin` → typed credentials from `op://Private/Arda-live/{username,password}` → landed on `/items?justSignedIn=true`.
- User: `miguel@arda.cards` (Account Admin).
Page timings (fastest of all four environments):
| Metric | Prod | Stage | Demo | Dev |
|---|---|---|---|---|
| responseEnd | 449 ms | 598 ms | 4919 ms | 4193 ms |
| domContentLoaded | 577 ms | 696 ms | 5137 ms | 4400 ms |
| loadEvent | 742 ms | 810 ms | 5404 ms | 4848 ms |
| first-paint | 608 ms | 728 ms | 5152 ms | 4424 ms |
| first-contentful-paint | 1176 ms | 1260 ms | 9604 ms | 8824 ms |
Backend request times (24 calls, median 577 ms):
- Requests > 500 ms: 13 of 24 (54%), but the slowest 8 are all in the 644–794 ms range — much tighter than dev/demo’s multi-second tail.
- Peak: 794 ms `POST /api/arda/kanban/kanban-card/query-details-by-item` (vs dev’s 3463 ms on the same path). The `db.r7g.large` upgrade is showing.
- Cognito at 746 ms (auth overhead).
Console errors: 0. No WorkspaceSwitcher, no AG Grid license
warnings, no auth retries.
- Step 9 PASS.
Alpha001-prod — rollout complete
All 9 steps green. Customer-visible failover at 23:16:22 PDT,
~35-second writer-endpoint disruption window. Cluster now on
db.r7g.large with the new parameter group; operations component
on chart 2.24.0 / init 2.5.0 with Sentry wired to
Alpha001-SentryDsn. pg_stat_statements collecting (769 rows
within first minutes of the new init container running).
Copyright: © Arda Systems 2025-2026, All rights reserved