PDEV-479 — Aurora cluster configuration: implementation plan
Implementation plan for PDEV-479.
The plan has two phases:
- Code + PR phase: land the CDK change behind a single PR. CloudFormation
  alone won’t restart writer/reader, so the deploy is safe-by-default for
  prod (applyImmediately: false); demo/stage/dev are immediate.
- Rollout phase: per-environment, operator-driven activation (snapshot → CDK deploy → verify dynamic params → reboot → verify all params → drop snapshot). One operations note per environment.
PDEV-498 (pg_stat_statements per-DB CREATE EXTENSION) lands separately on
its own PR. It is a hard prerequisite for the data to be queryable, but
its deploy is independent — it can land before or after this one. The
init container’s fail-loud PERFORM count(*) FROM pg_stat_statements
verify-query means a redeploy after the parameter group is in place is
when the extension actually becomes queryable.
Phase A — Code + PR
All work happens inside the existing worktree at
projects/product-slow-responses-worktrees/infrastructure/ on its
existing branch.
A.1 Branch state
This worktree already exists with a branch tracked through the project. No new branch needed. Verify before starting:

    git -C /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure \
      status --short --branch

Expect a clean working tree on the project branch.
A.2 Implementation order
Apply the changes from db-updates.md in this order, committing
incrementally so each step compiles cleanly on its own:

1. Construct (aurora-postgres-cluster.ts): add DbConfiguration + ParameterGroupSettings interfaces, internal parameter-group build, widened defaultDbProps signature. Backwards-compatible; no caller changes yet.
   - Commit: feat(rds): aurora construct accepts dbConfiguration sub-interface
2. Stack (purpose-storage.ts): add dbConfiguration? to its Configuration, pass through.
   - Commit: feat(rds): purpose-storage stack threads dbConfiguration
3. Platform model (platforms.ts): extend PartitionInfo / Partition with dbConfiguration?, populate the four production partitions (prod, demo, stage, dev) with their per-env values.
   - Commit: feat(rds): platform partitions declare dbConfiguration
4. App wiring (apps/Al1x/partition.ts): forward partition.dbConfiguration into the storage stack invocation.
   - Commit: feat(rds): partition app forwards dbConfiguration into storage stack
Four commits, all on one branch. Whether a fifth “tests” commit is needed depends on the existing test files; expect yes (see A.3).
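To make the first step concrete, here is a minimal sketch of what the two interfaces and the internal parameter-map build might look like. The field names (`sharedPreloadLibraries`, `logMinDurationStatementMs`, and so on) and the helper `buildClusterParameters` are assumptions for illustration; the plan names the interfaces and the parameter keys, but the real shapes live in db-updates.md. The sketch deliberately avoids CDK imports so it can be read on its own.

```typescript
// Hypothetical field names: the plan only names the interfaces themselves.
interface ParameterGroupSettings {
  sharedPreloadLibraries: string[];
  logMinDurationStatementMs: number;
  logStatement: "none" | "ddl" | "mod" | "all";
  logLockWaits: boolean;
  logTempFilesKb: number;
  pgStatStatementsTrack: "none" | "top" | "all";
  maxConnections?: number; // omitted => keep the instance-derived default
}

interface DbConfiguration {
  instanceClass: string;
  applyImmediately: boolean;
  parameterGroupSettings?: ParameterGroupSettings;
}

// Flatten the typed settings into the string map a parameter group expects.
function buildClusterParameters(
  settings: ParameterGroupSettings,
): Record<string, string> {
  const params: Record<string, string> = {
    shared_preload_libraries: settings.sharedPreloadLibraries.join(","),
    log_min_duration_statement: String(settings.logMinDurationStatementMs),
    log_statement: settings.logStatement,
    log_lock_waits: settings.logLockWaits ? "1" : "0",
    log_temp_files: String(settings.logTempFilesKb),
    "pg_stat_statements.track": settings.pgStatStatementsTrack,
  };
  // max_connections is emitted only when the caller sets it.
  if (settings.maxConnections !== undefined) {
    params["max_connections"] = String(settings.maxConnections);
  }
  return params;
}

// Values taken from this plan: 500 ms threshold, ddl logging, 500 connections.
const prodSettings: ParameterGroupSettings = {
  sharedPreloadLibraries: ["pg_stat_statements"],
  logMinDurationStatementMs: 500,
  logStatement: "ddl",
  logLockWaits: true,
  logTempFilesKb: 0,
  pgStatStatementsTrack: "all",
  maxConnections: 500,
};

const prodDbConfiguration: DbConfiguration = {
  instanceClass: "db.r7g.large",
  applyImmediately: false,
  parameterGroupSettings: prodSettings,
};
```

Keeping `maxConnections` optional is what lets non-prod partitions omit it without the construct writing a value into the group.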
A.3 Test updates
Run the existing test suite incrementally; snapshot drifts are expected and intended:

    cd /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure
    npm run build
    npm test

Snapshot updates:

- purpose-storage stack snapshots for Alpha001-prod, Alpha001-demo, Alpha002-stage, Alpha002-dev all gain a new AWS::RDS::DBClusterParameterGroup resource and the cluster’s DBClusterParameterGroupName ref.
- Alpha001-prod snapshot additionally shows DBInstanceClass: db.r7g.large on both writer + reader1, and ApplyImmediately: false.
- Alpha001-demo, Alpha002-stage, Alpha002-dev snapshots remain on db.t3.medium with ApplyImmediately: true.

Add a focused construct-level unit test (or fixture-based snapshot) that asserts:

- Parameter map keys (shared_preload_libraries, log_min_duration_statement, log_statement, log_lock_waits, log_temp_files, pg_stat_statements.track) are present with the expected values when parameterGroupSettings is supplied.
- max_connections is present only when the caller sets it (verifies prod-vs-non-prod divergence in platforms.ts shows up correctly).
- AWS::RDS::DBClusterParameterGroup is absent when parameterGroupSettings is omitted (preserves construct back-compat).
Commit: test(rds): snapshot + unit coverage for parameter group.
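The three assertions above can be sketched against a plain template object. This is an illustration only: the helper names (`resourcesOfType`, `parameterGroupParameters`) and both fixture templates are hypothetical, and in the real suite the template would more likely come from the CDK assertions module (for example `Template.fromStack(stack).toJSON()`) rather than a hand-written literal.

```typescript
// Minimal structural view of a synthesized CloudFormation template.
interface CfnResource {
  Type: string;
  Properties?: Record<string, unknown>;
}
interface CfnTemplate {
  Resources: Record<string, CfnResource>;
}

function resourcesOfType(template: CfnTemplate, type: string): CfnResource[] {
  return Object.values(template.Resources).filter((r) => r.Type === type);
}

// Returns the Parameters map of the first cluster parameter group,
// or undefined when the template declares none.
function parameterGroupParameters(
  template: CfnTemplate,
): Record<string, string> | undefined {
  const first = resourcesOfType(
    template,
    "AWS::RDS::DBClusterParameterGroup",
  )[0];
  if (first === undefined) {
    return undefined;
  }
  const parameters = first.Properties?.["Parameters"];
  return (parameters ?? {}) as Record<string, string>;
}

// Hypothetical fixtures standing in for synthesized stacks.
const withSettings: CfnTemplate = {
  Resources: {
    ParamGroup: {
      Type: "AWS::RDS::DBClusterParameterGroup",
      Properties: {
        Parameters: {
          shared_preload_libraries: "pg_stat_statements",
          log_statement: "ddl",
        },
      },
    },
    Cluster: { Type: "AWS::RDS::DBCluster" },
  },
};

const withoutSettings: CfnTemplate = {
  Resources: { Cluster: { Type: "AWS::RDS::DBCluster" } },
};
```

Returning `undefined` for the missing group keeps the back-compat case distinct from a group that exists with an empty map.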
A.4 Local pre-push checks
Run every check before pushing, per the workspace rule “Always run checks before push”:

    cd /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure
    npm run lint
    npm run typecheck   # or whatever the repo's tsc gate is; check package.json
    npm test
    npm run synth       # synth all targets; confirms templates render

If any synth target fails, fix locally before the push: CI’s synth job is the same code path and will fail identically.
A.5 Changelog
The infrastructure repo uses direct-edit changelog (per the
workspace changelog-every-pr rule). Add a single entry near the top
of CHANGELOG.md:

    ## [Unreleased]

    ### Changed

    - Aurora cluster construct now accepts a `dbConfiguration` sub-interface on its `Configuration`. The construct builds a custom DB cluster parameter group internally from the supplied settings (covering `shared_preload_libraries`, slow-query logging, lock-wait and temp-file logging, and `pg_stat_statements` tracking). Per-partition values live in `platforms.ts`. Production gains `db.r7g.large` instances and `max_connections=500`; `applyImmediately` is false in production so the operator triggers the writer/reader reboot explicitly. See Arda PDEV-479.

Category is Changed (per the user’s “only Changed/Removed for
API-breaking” rule: the construct’s interface change is API-impacting
for callers of AuroraPostgresCluster.Configuration).
A.6 Push and open PR
Push the branch and open a draft PR:

    git -C /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure push -u origin <branch>
    gh pr create -R Arda-cards/infrastructure --draft \
      --title "PDEV-479 Aurora parameter group + prod sizing" \
      --body-file <body>

PR body must include:

- Summary of the construct change + per-partition values.
- Closes PDEV-479.
- The deployment + reboot sequence (link to _docs/analysis/db-configuration.md).
- The Arda attribution block:

      > [!note]
      > Authored by Claude Opus for jmpicnic

Leave as draft until tests are green and the user opts in to review. After the PR is open, start the steward in the background:

    /pr-steward <PR-URL>

The pr-steward agent watches CI, triages reviewer comments, and never merges (per memory).
A.7 Decision-log comment on PDEV-479 (Q16 (a))
After the PR is opened, add a comment on PDEV-479 capturing the instance-class decision so the rationale stays alongside the ticket:

> Decision: r7g.large for production
>
> Selected db.r7g.large over r6i.large / r7i.large for the prod writer + reader on Aurora PostgreSQL 16.6.
>
> Reasoning:
>
> - Graviton-3 (r7g) gives ~20 % better price/perf than equivalent x86 r-class instances and avoids needing to revisit sizing for a year-plus of expected load growth.
> - 16 GB RAM is comfortably above today’s working set; eliminates the buffer-pool pressure that the t3.medium causes during failover warmup.
> - r7g.large supports the max_connections=500 ceiling without sliding into instance-derived defaults that t3.medium would have forced lower.
> - Both writer + reader sized identically to keep failover behaviour symmetric.
Phase B — Per-environment rollout
Rollout order is dev → demo → stage → prod. Each environment uses the same six-step sequence (B.1 to B.6) and closes with an operations note (B.7). The CDK deploy is the only AWS-CDK moment; everything else is AWS CLI snapshot/reboot/delete.
B.0 Pre-flight (each environment)
Before deploying to an environment, confirm:

- PDEV-498’s PR #30 is merged + container image published in the postgres-database-initializer repo, AND the operations chart in that environment has been redeployed at least once with the new initializer image after the cluster parameter group is in place. (If the order is inverted, the verify-query will fail-loud and the init pod will crash until the parameter group is reattached and the cluster rebooted. This is by design, but worth scheduling so it doesn’t surprise the operator.)
- The previous environment’s rollout (where applicable) completed cleanly. Don’t pipeline; verify each step.
B.1 Snapshot the cluster
Resolve the AWS profile (per memory: Admin-Alpha1 for prod/demo,
Alpha002-Admin for dev/stage; --profile flag at end). Cluster
identifier follows the existing naming (Alpha00X-PurposeAurora…);
discover via aws rds describe-db-clusters:

    aws rds create-db-cluster-snapshot \
      --db-cluster-identifier <cluster-id> \
      --db-cluster-snapshot-identifier <cluster-id>-pdev479-pre-restart \
      --region us-east-1 \
      --profile <profile>

    aws rds wait db-cluster-snapshot-available \
      --db-cluster-snapshot-identifier <cluster-id>-pdev479-pre-restart \
      --region us-east-1 \
      --profile <profile>

Wait for available before proceeding.
B.2 CDK deploy
Run the deploy from the worktree:

    cd /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure
    # Follow the existing deploy entry point (amm.sh + cdk deploy);
    # the actual command shape lives in dev-workflows.md.
    ./scripts/deploy.sh Alpha00X <env>   # placeholder; use real entry

For dev/demo/stage this attaches the new parameter group and
applies it immediately to the cluster instances (since
applyImmediately: true). Static parameters
(shared_preload_libraries, max_connections) are now in
pending-reboot state on the instances; dynamic parameters propagate
on their own within ~1 minute.
For prod (applyImmediately: false) the parameter group is attached
at the cluster level but the instance-level effect waits for
the operator’s explicit reboot in B.4. Static and dynamic
parameters at the instance level are deferred.
Watch CloudFormation events; expect the
AWS::RDS::DBClusterParameterGroup resource to create cleanly. The
cluster + instance resources should update without replacement.
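The static/dynamic split described in this section can be written down as data, which makes the expected state after the deploy (and before any reboot) easy to reason about. The sketch below encodes this plan's stated expectations, not independently verified AWS behaviour; the names `PARAMETER_RULES` and `expectedLiveBeforeReboot` are hypothetical, and it assumes `pg_stat_statements` was not already preloaded before the rollout.

```typescript
type ApplyType = "static" | "dynamic";

interface ParameterRule {
  applyType: ApplyType;
  // True when the setting only resolves once its library is preloaded.
  needsPreloadedLibrary: boolean;
}

// Classification as described in this plan.
const PARAMETER_RULES: Record<string, ParameterRule> = {
  shared_preload_libraries: { applyType: "static", needsPreloadedLibrary: false },
  max_connections: { applyType: "static", needsPreloadedLibrary: false },
  log_min_duration_statement: { applyType: "dynamic", needsPreloadedLibrary: false },
  log_statement: { applyType: "dynamic", needsPreloadedLibrary: false },
  log_lock_waits: { applyType: "dynamic", needsPreloadedLibrary: false },
  log_temp_files: { applyType: "dynamic", needsPreloadedLibrary: false },
  "pg_stat_statements.track": { applyType: "dynamic", needsPreloadedLibrary: true },
};

// Which parameters should already be observable after the CDK deploy,
// before the operator reboots anything.
function expectedLiveBeforeReboot(
  parameterNames: string[],
  applyImmediately: boolean,
): string[] {
  if (!applyImmediately) {
    return []; // prod: everything waits for the explicit reboot
  }
  return parameterNames.filter((parameterName) => {
    const rule = PARAMETER_RULES[parameterName];
    return (
      rule !== undefined &&
      rule.applyType === "dynamic" &&
      !rule.needsPreloadedLibrary
    );
  });
}
```

Calling it with every key and `applyImmediately = true` yields the four logging parameters; with `false` it yields an empty list, matching the prod description above.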
B.3 Verify dynamic parameters (dev/demo/stage only)
Wait ~90 seconds, then connect to the writer (via the existing bastion / port-forward) and verify the dynamic parameters are live without a reboot:

    SHOW log_min_duration_statement;  -- expect: 500ms
    SHOW log_statement;               -- expect: ddl
    SHOW log_lock_waits;              -- expect: on
    SHOW log_temp_files;              -- expect: 0
    SHOW pg_stat_statements.track;    -- expect: all (errors until reboot
                                      --   if shared_preload_libraries
                                      --   is still pending)
    SHOW shared_preload_libraries;    -- expect: still old value (static)
    SHOW max_connections;             -- expect: still old value (static)

If the dynamic ones are wrong, stop and investigate before rebooting: a reboot won’t fix dynamic-parameter propagation problems.
For prod (applyImmediately: false) this verification happens
after the reboot in B.4 — there is no in-between state where
dynamic parameters are live but static aren’t.
B.4 Reboot writer + reader
Reboot reader1 first, wait for available, then reboot the writer
(which doubles as a failover). This minimises connection drop time —
the reader is the half that bears the bitemporal scan load, and
gating it first means the application is talking to a writer with the
old parameter group while the reader comes up with the new one. The
writer then reboots once the reader is back, which triggers a failover
to the (already-rebooted) reader; the new writer ends up on the new
parameter group too.

    # Reader first
    aws rds reboot-db-instance \
      --db-instance-identifier <cluster-id>-AuroraClusterReader1 \
      --region us-east-1 \
      --profile <profile>
    aws rds wait db-instance-available \
      --db-instance-identifier <cluster-id>-AuroraClusterReader1 \
      --region us-east-1 \
      --profile <profile>

    # Then writer (this triggers failover)
    aws rds reboot-db-instance \
      --db-instance-identifier <cluster-id>-AuroraClusterWriter \
      --region us-east-1 \
      --profile <profile>
    aws rds wait db-instance-available \
      --db-instance-identifier <cluster-id>-AuroraClusterWriter \
      --region us-east-1 \
      --profile <profile>

HikariCP’s connectionTimeout=30000 rides out the brief failover
window (verified up front; the operations chart needs no change).
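Since the same four commands repeat for every environment with only the cluster identifier and profile changing, the ordering can be captured in a small generator. This is a sketch under assumptions: `RebootTarget`, `rebootCommands`, and `rebootPlan` are hypothetical names, and the instance-identifier suffixes are copied from the commands above. It only builds command strings; it does not execute anything.

```typescript
interface RebootTarget {
  clusterId: string;
  region: string;
  profile: string;
}

// Reboot one instance, then block until it reports available.
function rebootCommands(instanceId: string, target: RebootTarget): string[] {
  const tail =
    `--db-instance-identifier ${instanceId} ` +
    `--region ${target.region} --profile ${target.profile}`;
  return [
    `aws rds reboot-db-instance ${tail}`,
    `aws rds wait db-instance-available ${tail}`,
  ];
}

// Reader first, writer second: the order this section prescribes.
function rebootPlan(target: RebootTarget): string[] {
  return [
    ...rebootCommands(`${target.clusterId}-AuroraClusterReader1`, target),
    ...rebootCommands(`${target.clusterId}-AuroraClusterWriter`, target),
  ];
}
```

Emitting the wait immediately after each reboot keeps the "don't touch the writer until the reader is back" rule structural rather than something the operator has to remember.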
B.5 Verify static parameters and the extension
After the reboot completes, reconnect to the writer:

    SHOW shared_preload_libraries;            -- expect: pg_stat_statements
    SHOW max_connections;                     -- expect: 500 (prod) or default (others)
    SELECT count(*) FROM pg_stat_statements;  -- expect: a row count, not an error

A non-error response from pg_stat_statements confirms the
shared-memory hash is allocated. If it errors with
pg_stat_statements must be loaded via shared_preload_libraries, the
reboot didn’t pick up the parameter group — escalate before
proceeding.
Per-application-database verification (after the initializer redeploy):

    \c <app-db-name>
    SELECT count(*) FROM pg_stat_statements;

Should succeed for every application database the
postgres-database-initializer provisions.
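When the SHOW outputs are pasted into the operations note, a mechanical comparison against the expected values helps avoid misreading a long list. The following is a minimal sketch; `ShowResults` and `mismatches` are hypothetical names, and the expected map for prod uses only the values stated in this plan.

```typescript
// Parameter name -> value as reported by SHOW.
type ShowResults = Record<string, string>;

// List every expected parameter whose observed value differs or is missing.
function mismatches(expected: ShowResults, actual: ShowResults): string[] {
  return Object.keys(expected)
    .filter((key) => actual[key] !== expected[key])
    .map(
      (key) =>
        `${key}: expected ${expected[key]}, got ${actual[key] ?? "<missing>"}`,
    );
}

// Post-reboot expectations for prod, as stated in B.5.
const expectedProdStatic: ShowResults = {
  shared_preload_libraries: "pg_stat_statements",
  max_connections: "500",
};
```

An empty result means every expected parameter matched; anything else is a line to paste under Anomalies.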
B.6 Delete the snapshot
Once the cluster is healthy, the dashboards are clean, and the application is serving traffic on the new sizing:

    aws rds delete-db-cluster-snapshot \
      --db-cluster-snapshot-identifier <cluster-id>-pdev479-pre-restart \
      --region us-east-1 \
      --profile <profile>

Per Q10: snapshot is the rollback insurance for the rollout window only; once green, it goes away to avoid pay-for-storage drag.
B.7 Operations note
For each environment, write one file under
documentation/src/content/docs/process/operation-notes/ named
YYYYMMDD-<env>-aurora-parameter-group-restart.md (per Q16 (b)).
Suggested skeleton:

    ---
    title: "<env> Aurora parameter group rollout - <date>"
    description: "Operations log for the PDEV-479 parameter group + (prod only) sizing rollout against the Alpha00X-<env> Aurora cluster."
    tags: [process, operation-notes, aurora]
    domain: process
    maturity: published
    author: "Miguel Pinilla"
    ---

    ## Summary

    - Cluster: `<cluster-id>`
    - Environment: `<env>` (Alpha00X)
    - Trigger: PDEV-479 (Aurora parameter group + slow-query observability)
    - Operator: <name>
    - Start: <UTC timestamp>
    - End: <UTC timestamp>

    ## Sequence

    1. Snapshot created: `<snapshot-id>` at <UTC ts>
    2. CDK deploy started: <UTC ts>
    3. CDK deploy completed: <UTC ts>
    4. Dynamic parameter verification (if applicable): <UTC ts>
    5. Reader reboot: <UTC ts> → available <UTC ts>
    6. Writer reboot (failover): <UTC ts> → available <UTC ts>
    7. Static parameter + pg_stat_statements verification: <UTC ts>
    8. Snapshot deleted: <UTC ts>

    ## Verifications

    (paste `SHOW <param>;` outputs and the `SELECT count(*) FROM
    pg_stat_statements;` results)

    ## Anomalies

    (record anything unexpected: failover duration, app errors,
    dashboard blips. If nothing, say so.)

    ## References

    - PDEV-479
    - `_docs/analysis/db-configuration.md`
    - `_docs/pdev-479/implementation/infrastructure/db-updates.md`
    - `_docs/pdev-479/implementation/infrastructure/db-plan.md`

The directory does not exist yet; create it on the first operation note (per the user’s clarification). Commit to the documentation worktree on the project branch and roll it into the documentation PR at project completion.
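The file-name convention for the operations note is simple enough to derive mechanically, which avoids four slightly different spellings across environments. A minimal sketch follows; the function name `operationNoteFileName` and the choice to take the date in UTC (to match the UTC timestamps in the skeleton) are assumptions, not something this plan specifies.

```typescript
type RolloutEnv = "dev" | "demo" | "stage" | "prod";

// Build YYYYMMDD-<env>-aurora-parameter-group-restart.md from a UTC date.
function operationNoteFileName(env: RolloutEnv, when: Date): string {
  const yyyy = String(when.getUTCFullYear());
  const mm = String(when.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(when.getUTCDate()).padStart(2, "0");
  return `${yyyy}${mm}${dd}-${env}-aurora-parameter-group-restart.md`;
}
```

For example, a dev rollout on 5 January 2026 (UTC) would produce `20260105-dev-aurora-parameter-group-restart.md`.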
Phase C — Rollout schedule
| Env | Partition | When | applyImmediately |
|---|---|---|---|
| dev | Alpha002 | Day 0 (after PR merge) | true |
| demo | Alpha001 | Day 0 or +1 | true |
| stage | Alpha002 | After demo soaks ≥24h | true |
| prod | Alpha001 | Window TBD post-demo | false |
Per Q9, the prod window is determined after demo is observed healthy. The CDK deploy to prod can happen earlier (it’s safe: no instance-level effect until the operator reboots), but the snapshot + reboot sequence is gated on the explicit window.
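The ordering and the demo soak gate from the schedule can be expressed as two small checks. This is only an illustration of the rules in the table: the names `ROLLOUT_ORDER`, `nextEnvironment`, and `stageMayStart` are hypothetical, and the prod window is left out because it is decided by the operator rather than by a timer.

```typescript
const ROLLOUT_ORDER = ["dev", "demo", "stage", "prod"] as const;
type ScheduledEnv = (typeof ROLLOUT_ORDER)[number];

// First environment in rollout order that has not completed yet.
function nextEnvironment(
  completed: readonly ScheduledEnv[],
): ScheduledEnv | undefined {
  return ROLLOUT_ORDER.find((env) => completed.indexOf(env) === -1);
}

const DEMO_SOAK_MS = 24 * 60 * 60 * 1000; // "after demo soaks >= 24h"

// Stage may start only once demo has soaked for at least 24 hours.
function stageMayStart(demoCompletedAt: Date, now: Date): boolean {
  return now.getTime() - demoCompletedAt.getTime() >= DEMO_SOAK_MS;
}
```

With dev and demo complete, `nextEnvironment` returns `"stage"`, and `stageMayStart` then decides whether the soak has elapsed.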
Phase D — Rollback
Two rollback levers, in order of preference:

1. Detach the parameter group (reversal of the CDK change).
   - Revert the PR (or a follow-up PR that empties dbConfiguration on the offending partition).
   - CDK deploy reattaches Aurora’s default parameter group.
   - For static parameters to revert, the writer + reader need to reboot again, using the same B.4 sequence.
   - For prod, also reverts the instance class from r7g.large back to t3.medium; expect ~5-10 min instance-replace per node. Plan a second outage window if rolling back sizing.
2. Restore from snapshot (last resort).
   - Use the <cluster-id>-pdev479-pre-restart snapshot.
   - This involves a new cluster identifier and an application reconnect; coordinate with operations chart redeploy to point at the restored cluster.
Lever 1 is the default; lever 2 only if lever 1 doesn’t recover the fault.
Phase E — Project completion
After the prod rollout completes:

- Move _docs/analysis/db-configuration.md and these two implementation files (db-updates.md, db-plan.md) into the project’s completed documentation tree per the workspace project lifecycle.
- Update PDEV-479 to Done in Linear and post the close-out comment summarising what shipped and pointing at the four operation-notes files.
- Confirm PDEV-498 is also closed (separate PR but the two are coupled in user-visible behaviour).
- Remove the infrastructure worktree per the worktree cleanup procedure once the PR is merged and pushed.
Cross-references
- _docs/analysis/db-configuration.md: the goal/spec for PDEV-479.
- _docs/analysis/db-init.md: PDEV-498 spec for the pg_stat_statements per-DB extension (sub-issue).
- _docs/analysis/infrastructure-improvements.md § 2 + § 3: umbrella scope narrative.
- _docs/pdev-479/implementation/infrastructure/db-updates.md: change specification companion to this plan.
Copyright: © Arda Systems 2025-2026, All rights reserved