PDEV-479 — Aurora cluster configuration: implementation plan

Implementation plan for PDEV-479.

The plan has two phases:

- Code + PR phase: land the CDK change behind a single PR. CloudFormation alone won’t restart writer/reader, so the deploy is safe-by-default for prod (applyImmediately: false); demo/stage/dev are immediate.
- Rollout phase: per-environment, operator-driven activation (snapshot → CDK deploy → verify dynamic params → reboot → verify all params → drop snapshot), with one operations note per environment.

PDEV-498 (pg_stat_statements per-DB CREATE EXTENSION) lands separately on its own PR. It is a hard prerequisite for the data to be queryable, but its deploy is independent — it can land before or after this one. The init container’s fail-loud PERFORM count(*) FROM pg_stat_statements verify-query means a redeploy after the parameter group is in place is when the extension actually becomes queryable.

Phase A — Code + PR

All work happens inside the existing worktree at projects/product-slow-responses-worktrees/infrastructure/ on its existing branch.

A.1 Branch state

This worktree already exists with a branch tracked through the project. No new branch needed. Verify before starting:

git -C /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure \
  status --short --branch

Expect a clean working tree on the project branch.

A.2 Implementation order

Apply the changes from db-updates.md in this order, committing incrementally so each step compiles cleanly on its own:

1. Construct (aurora-postgres-cluster.ts): add DbConfiguration + ParameterGroupSettings interfaces, internal parameter-group build, widened defaultDbProps signature. Backwards-compatible; no caller changes yet.
   - Commit: feat(rds): aurora construct accepts dbConfiguration sub-interface
2. Stack (purpose-storage.ts): add dbConfiguration? to its Configuration, pass through.
   - Commit: feat(rds): purpose-storage stack threads dbConfiguration
3. Platform model (platforms.ts): extend PartitionInfo / Partition with dbConfiguration?, populate the four partitions (prod, demo, stage, dev) with their per-env values.
   - Commit: feat(rds): platform partitions declare dbConfiguration
4. App wiring (apps/Al1x/partition.ts): forward partition.dbConfiguration into the storage stack invocation.
   - Commit: feat(rds): partition app forwards dbConfiguration into storage stack
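As a rough illustration of what the per-partition values in step 3 might look like, here is a minimal sketch. Only the names DbConfiguration, ParameterGroupSettings, dbConfiguration and applyImmediately come from this plan; every field name inside the interfaces, and the `dbConfigurationByEnv` table itself, is hypothetical. The real shapes are whatever db-updates.md specifies. The values are the ones this plan expects to observe during rollout.

```typescript
// Sketch only: field names are assumptions, not the construct's real API.
interface ParameterGroupSettings {
  sharedPreloadLibraries: string[];
  logMinDurationStatementMs: number;
  logStatement: "none" | "ddl" | "mod" | "all";
  logLockWaits: boolean;
  logTempFilesKb: number;
  pgStatStatementsTrack: "none" | "top" | "all";
  maxConnections?: number; // set only where a partition overrides the default
}

interface DbConfiguration {
  instanceClass: string;
  applyImmediately: boolean;
  parameterGroupSettings?: ParameterGroupSettings;
}

type Env = "prod" | "demo" | "stage" | "dev";

// Settings shared by every environment.
const baseSettings: ParameterGroupSettings = {
  sharedPreloadLibraries: ["pg_stat_statements"],
  logMinDurationStatementMs: 500,
  logStatement: "ddl",
  logLockWaits: true,
  logTempFilesKb: 0,
  pgStatStatementsTrack: "all",
};

// demo, stage and dev keep the current sizing and apply immediately.
const nonProd: DbConfiguration = {
  instanceClass: "db.t3.medium",
  applyImmediately: true,
  parameterGroupSettings: baseSettings,
};

const dbConfigurationByEnv: Record<Env, DbConfiguration> = {
  // prod alone gains the larger instance class and max_connections=500,
  // and defers activation to the operator's explicit reboot.
  prod: {
    instanceClass: "db.r7g.large",
    applyImmediately: false,
    parameterGroupSettings: { ...baseSettings, maxConnections: 500 },
  },
  demo: nonProd,
  stage: nonProd,
  dev: nonProd,
};
```

Keeping the shared settings in one object and spreading them into prod makes the prod-only divergence (sizing, max_connections, applyImmediately) visible at a glance in review.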

Four commits, all on one branch. Whether a fifth “tests” commit is needed depends on the state of the existing test files; expect yes (see A.3).

A.3 Test updates

Run the existing test suite incrementally; snapshot drifts are expected and intended:

cd /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure
npm run build
npm test

Snapshot updates:

- purpose-storage stack snapshots for Alpha001-prod, Alpha001-demo, Alpha002-stage, Alpha002-dev all gain a new AWS::RDS::DBClusterParameterGroup resource and the cluster’s DBClusterParameterGroupName ref.
- Alpha001-prod snapshot additionally shows DBInstanceClass: db.r7g.large on both writer + reader1, and ApplyImmediately: false.
- Alpha001-demo, Alpha002-stage, Alpha002-dev snapshots remain on db.t3.medium with ApplyImmediately: true.

Add a focused construct-level unit test (or fixture-based snapshot) that asserts:

- Parameter map keys (shared_preload_libraries, log_min_duration_statement, log_statement, log_lock_waits, log_temp_files, pg_stat_statements.track) are present with the expected values when parameterGroupSettings is supplied.
- max_connections is present only when the caller sets it (verifies prod-vs-non-prod divergence in platforms.ts shows up correctly).
- AWS::RDS::DBClusterParameterGroup is absent when parameterGroupSettings is omitted (preserves construct back-compat).
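The three assertions above can be prototyped against a plain function before wiring them into the construct test. The sketch below is an assumption-heavy stand-in: `toParameterMap`, `buildParameterGroup` and the settings field names are hypothetical, and encoding the boolean as "1"/"0" is an assumption about how the parameter group expects it. It only shows the behaviour the test should pin down, not how the construct builds the group.

```typescript
// Hypothetical settings shape; the real one lives in aurora-postgres-cluster.ts.
interface ParameterGroupSettings {
  sharedPreloadLibraries: string[];
  logMinDurationStatementMs: number;
  logStatement: string;
  logLockWaits: boolean;
  logTempFilesKb: number;
  pgStatStatementsTrack: string;
  maxConnections?: number;
}

// Translate typed settings into the string map a parameter group takes.
function toParameterMap(s: ParameterGroupSettings): Record<string, string> {
  const params: Record<string, string> = {
    shared_preload_libraries: s.sharedPreloadLibraries.join(","),
    log_min_duration_statement: String(s.logMinDurationStatementMs),
    log_statement: s.logStatement,
    log_lock_waits: s.logLockWaits ? "1" : "0", // assumed boolean encoding
    log_temp_files: String(s.logTempFilesKb),
    "pg_stat_statements.track": s.pgStatStatementsTrack,
  };
  // max_connections appears only when the caller sets it.
  if (s.maxConnections !== undefined) {
    params["max_connections"] = String(s.maxConnections);
  }
  return params;
}

// No settings means no parameter group at all (back-compat path).
function buildParameterGroup(
  s?: ParameterGroupSettings,
): Record<string, string> | undefined {
  return s === undefined ? undefined : toParameterMap(s);
}
```

In the real test the same three checks would run against the synthesized template rather than a plain map.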

Commit: test(rds): snapshot + unit coverage for parameter group.

A.4 Local pre-push checks

Run every check before pushing — per the workspace rule “Always run checks before push”:

cd /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure
npm run lint
npm run typecheck   # or whatever the repo's tsc gate is — check package.json
npm test
npm run synth       # synth all targets; confirms templates render

If any synth target fails, fix locally before the push — CI’s synth job is the same code path and will fail identically.

A.5 Changelog

The infrastructure repo uses direct-edit changelog (per the workspace changelog-every-pr rule). Add a single entry near the top of CHANGELOG.md:

## [Unreleased]

### Changed

- Aurora cluster construct now accepts a `dbConfiguration` sub-interface
  on its `Configuration`. The construct builds a custom DB cluster
  parameter group internally from the supplied settings (covering
  `shared_preload_libraries`, slow-query logging, lock-wait and
  temp-file logging, and `pg_stat_statements` tracking). Per-partition
  values live in `platforms.ts`. Production gains
  `db.r7g.large` instances and `max_connections=500`;
  `applyImmediately` is false in production so the operator triggers
  the writer/reader reboot explicitly. See Arda PDEV-479.

Category is Changed (per the user’s “only Changed/Removed for API-breaking” rule — the construct’s interface change is API-impacting for callers of AuroraPostgresCluster.Configuration).

A.6 Push and open PR

git -C /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure push -u origin <branch>
gh -R Arda-cards/infrastructure pr create --draft \
   --title "PDEV-479 Aurora parameter group + prod sizing" \
   --body-file <body>

PR body must include:

- Summary of the construct change + per-partition values.
- Closes PDEV-479.
- The deployment + reboot sequence (link to _docs/analysis/db-configuration.md).
- The Arda attribution block:

> [!note]
> Authored by Claude Opus for jmpicnic

Leave as draft until tests are green and the user opts in to review. After the PR is open:

/pr-steward <PR-URL>

Run it in the background. The pr-steward agent watches CI, triages reviewer comments, and never merges (per memory).

A.7 Decision-log comment on PDEV-479 (Q16 (a))

After the PR is opened, add a comment on PDEV-479 capturing the instance-class decision so the rationale stays alongside the ticket:

Decision: r7g.large for production

Selected db.r7g.large over r6i.large / r7i.large for the prod writer + reader on Aurora PostgreSQL 16.6.

Reasoning:

Graviton-3 (r7g) gives ~20 % better price/perf than equivalent x86 r-class instances and avoids needing to revisit sizing for a year-plus of expected load growth.

16 GB RAM is comfortably above today’s working set; eliminates the buffer-pool pressure that the t3.medium causes during failover warmup.

r7g.large supports the max_connections=500 ceiling without sliding into instance-derived defaults that t3.medium would have forced lower.

Both writer + reader sized identically to keep failover behaviour symmetric.
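The max_connections point can be sanity-checked with a little arithmetic. The sketch below assumes Aurora PostgreSQL derives its default as LEAST(DBInstanceClassMemory / 9531392, 5000); that formula is an assumption to confirm against the AWS documentation for the engine version in use. It also substitutes nominal RAM for DBInstanceClassMemory, so the results are upper bounds and the real defaults are somewhat lower. The function name is hypothetical.

```typescript
// Assumed default formula: LEAST(DBInstanceClassMemory / 9531392, 5000).
const BYTES_PER_CONNECTION = 9531392;
const HARD_CAP = 5000;
const GIB = 1024 * 1024 * 1024;

// Upper bound on the instance-derived default, using nominal RAM in GiB.
function defaultMaxConnectionsUpperBound(nominalRamGiB: number): number {
  const derived = Math.floor((nominalRamGiB * GIB) / BYTES_PER_CONNECTION);
  return Math.min(derived, HARD_CAP);
}

const t3MediumBound = defaultMaxConnectionsUpperBound(4); // 450
const r7gLargeBound = defaultMaxConnectionsUpperBound(16); // 1802
```

Under these assumptions even the optimistic bound for t3.medium sits below 500, while r7g.large leaves ample headroom above it.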

Phase B — Per-environment rollout

Rollout order is dev → demo → stage → prod. Each environment uses the same sequence, B.1 through B.6 (prod skips the pre-reboot check in B.3 and verifies everything in B.5). The CDK deploy is the only step that runs AWS CDK; everything else is AWS CLI snapshot/reboot/delete.

B.0 Pre-flight (each environment)

Before deploying to an environment, confirm:

- PDEV-498’s PR #30 is merged and its container image is published in the postgres-database-initializer repo, AND the operations chart in that environment has been redeployed at least once with the new initializer image after the cluster parameter group is in place. (If the order is inverted, the verify-query fails loudly and the init pod crashes until the parameter group is reattached and the cluster rebooted. This is by design, but worth scheduling so it doesn’t surprise the operator.)
- The previous environment’s rollout (where applicable) completed cleanly. Don’t pipeline; verify each step.

B.1 Snapshot the cluster

Resolve the AWS profile (per memory: Admin-Alpha1 for prod/demo, Alpha002-Admin for dev/stage; --profile flag at end). Cluster identifier follows the existing naming (Alpha00X-PurposeAurora…) — discover via aws rds describe-db-clusters:

aws rds create-db-cluster-snapshot \
  --db-cluster-identifier <cluster-id> \
  --db-cluster-snapshot-identifier <cluster-id>-pdev479-pre-restart \
  --region us-east-1 \
  --profile <profile>

aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier <cluster-id>-pdev479-pre-restart \
  --region us-east-1 \
  --profile <profile>

Wait for available before proceeding.
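If the snapshot step ends up scripted rather than typed by hand, the profile choice and snapshot naming can be captured in one place. This is a minimal sketch: the helper names (`profileFor`, `snapshotId`, `snapshotCommands`) are hypothetical, while the profile names, region, snapshot suffix and CLI flags are the ones used in this section. It only builds argument lists and executes nothing.

```typescript
type Env = "prod" | "demo" | "stage" | "dev";

// Admin-Alpha1 covers prod/demo; Alpha002-Admin covers dev/stage.
function profileFor(env: Env): string {
  return env === "prod" || env === "demo" ? "Admin-Alpha1" : "Alpha002-Admin";
}

// Snapshot identifier convention used throughout this rollout.
function snapshotId(clusterId: string): string {
  return `${clusterId}-pdev479-pre-restart`;
}

// Returns the create + wait invocations as argv arrays, profile flag last.
function snapshotCommands(env: Env, clusterId: string): string[][] {
  const id = snapshotId(clusterId);
  const tail = ["--region", "us-east-1", "--profile", profileFor(env)];
  return [
    [
      "aws", "rds", "create-db-cluster-snapshot",
      "--db-cluster-identifier", clusterId,
      "--db-cluster-snapshot-identifier", id,
      ...tail,
    ],
    [
      "aws", "rds", "wait", "db-cluster-snapshot-available",
      "--db-cluster-snapshot-identifier", id,
      ...tail,
    ],
  ];
}
```

Returning argv arrays instead of a shell string avoids quoting problems if a cluster identifier ever contains unexpected characters.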

B.2 CDK deploy

cd /Users/jmp/code/arda/projects/product-slow-responses-worktrees/infrastructure
# Follow the existing deploy entry point (amm.sh + cdk deploy);
# the actual command shape lives in dev-workflows.md.
./scripts/deploy.sh Alpha00X <env>      # placeholder — use real entry

For dev/demo/stage this attaches the new parameter group and applies it immediately to the cluster instances (since applyImmediately: true). Static parameters (shared_preload_libraries, plus max_connections on any partition that sets it) are now in pending-reboot state on the instances; dynamic parameters propagate on their own within ~1 minute.

For prod (applyImmediately: false) the parameter group is attached at the cluster level but the instance-level effect waits for the operator’s explicit reboot in B.4. Static and dynamic parameters at the instance level are deferred.

Watch CloudFormation events; expect the AWS::RDS::DBClusterParameterGroup resource to create cleanly. The cluster + instance resources should update without replacement.

B.3 Verify dynamic parameters (dev/demo/stage only)

Wait ~90 seconds, then connect to the writer (via the existing bastion / port-forward) and verify the dynamic parameters are live without a reboot:

SHOW log_min_duration_statement;   -- expect: 500ms
SHOW log_statement;                 -- expect: ddl
SHOW log_lock_waits;                -- expect: on
SHOW log_temp_files;                -- expect: 0
SHOW pg_stat_statements.track;      -- expect: all (errors until reboot
                                    --   if shared_preload_libraries
                                    --   is still pending)
SHOW shared_preload_libraries;      -- expect: still old value (static)
SHOW max_connections;               -- expect: still old value (static)

If the dynamic ones are wrong, stop and investigate before rebooting — a reboot won’t fix dynamic-parameter propagation problems.
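The stop-or-proceed decision can be made mechanical by comparing the pasted SHOW outputs against the expected values. The sketch below is one possible shape; `expectedDynamic` and `mismatchedParameters` are hypothetical names. pg_stat_statements.track is left out on purpose, since it may error until shared_preload_libraries takes effect.

```typescript
// Expected pre-reboot values for the dynamic parameters in B.3.
const expectedDynamic: Record<string, string> = {
  log_min_duration_statement: "500ms",
  log_statement: "ddl",
  log_lock_waits: "on",
  log_temp_files: "0",
};

// Returns the names of parameters whose observed value is missing or wrong.
function mismatchedParameters(observed: Record<string, string>): string[] {
  return Object.keys(expectedDynamic).filter(
    (name) => observed[name]?.trim() !== expectedDynamic[name],
  );
}
```

An empty result means it is safe to move on to the reboot; any returned name is a reason to stop and investigate first.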

For prod (applyImmediately: false) this verification happens after the reboot in B.4 — there is no in-between state where dynamic parameters are live but static aren’t.

B.4 Reboot writer + reader

Reboot reader1 first, wait for available, then reboot the writer (which doubles as a failover). This minimises connection drop time — the reader is the half that bears the bitemporal scan load, and gating it first means the application is talking to a writer with the old parameter group while the reader comes up with the new one. The writer then reboots once the reader is back, which triggers a failover to the (already-rebooted) reader; the new writer ends up on the new parameter group too.

# Reader first
aws rds reboot-db-instance \
  --db-instance-identifier <cluster-id>-AuroraClusterReader1 \
  --region us-east-1 \
  --profile <profile>
aws rds wait db-instance-available \
  --db-instance-identifier <cluster-id>-AuroraClusterReader1 \
  --region us-east-1 \
  --profile <profile>

# Then writer (this triggers failover)
aws rds reboot-db-instance \
  --db-instance-identifier <cluster-id>-AuroraClusterWriter \
  --region us-east-1 \
  --profile <profile>
aws rds wait db-instance-available \
  --db-instance-identifier <cluster-id>-AuroraClusterWriter \
  --region us-east-1 \
  --profile <profile>

HikariCP’s connectionTimeout=30000 rides out the brief failover window (verified up front; the operations chart needs no change).

B.5 Verify static parameters and the extension

After the reboot completes, reconnect to the writer:

SHOW shared_preload_libraries;      -- expect: pg_stat_statements
SHOW max_connections;               -- expect: 500 (prod) or default (others)
SELECT count(*) FROM pg_stat_statements;   -- expect: a row count, not an error

A non-error response from pg_stat_statements confirms the shared-memory hash is allocated. If it errors with “pg_stat_statements must be loaded via shared_preload_libraries”, the reboot didn’t pick up the parameter group; escalate before proceeding.

Per-application-database verification (after the initializer redeploy):

\c <app-db-name>
SELECT count(*) FROM pg_stat_statements;

Should succeed for every application database the postgres-database-initializer provisions.

B.6 Delete the snapshot

Once the cluster is healthy, the dashboards are clean, and the application is serving traffic on the new sizing:

aws rds delete-db-cluster-snapshot \
  --db-cluster-snapshot-identifier <cluster-id>-pdev479-pre-restart \
  --region us-east-1 \
  --profile <profile>

Per Q10 — snapshot is the rollback insurance for the rollout window only; once green, it goes away to avoid pay-for-storage drag.

B.7 Operations note

For each environment, write one file under documentation/src/content/docs/process/operation-notes/ named YYYYMMDD-<env>-aurora-parameter-group-restart.md (per Q16 (b)). Suggested skeleton:

---
title: "<env> Aurora parameter group rollout — <date>"
description: "Operations log for the PDEV-479 parameter group + (prod only) sizing rollout against the Alpha00X-<env> Aurora cluster."
tags: [process, operation-notes, aurora]
domain: process
maturity: published
author: "Miguel Pinilla"
---

## Summary

- Cluster: `<cluster-id>`
- Environment: `<env>` (Alpha00X)
- Trigger: PDEV-479 (Aurora parameter group + slow-query observability)
- Operator: <name>
- Start: <UTC timestamp>
- End: <UTC timestamp>

## Sequence

1. Snapshot created: `<snapshot-id>` at <UTC ts>
2. CDK deploy started: <UTC ts>
3. CDK deploy completed: <UTC ts>
4. Dynamic parameter verification (if applicable): <UTC ts>
5. Reader reboot: <UTC ts> → available <UTC ts>
6. Writer reboot (failover): <UTC ts> → available <UTC ts>
7. Static parameter + pg_stat_statements verification: <UTC ts>
8. Snapshot deleted: <UTC ts>

## Verifications

(paste `SHOW <param>;` outputs and the `SELECT count(*) FROM
pg_stat_statements;` results)

## Anomalies

(record anything unexpected — failover duration, app errors,
dashboard blips. If nothing, say so.)

## References

- PDEV-479
- `_docs/analysis/db-configuration.md`
- `_docs/pdev-479/implementation/infrastructure/db-updates.md`
- `_docs/pdev-479/implementation/infrastructure/db-plan.md`

The directory does not exist yet — create it on the first operation note (per the user’s clarification). Commit to the documentation worktree on the project branch and roll it into the documentation PR at project completion.
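To keep the four note filenames consistent, the naming rule can be expressed as a small helper. This is a sketch under one assumption: the date prefix is taken in UTC, matching the UTC timestamps the skeleton asks for. `operationNotePath` is a hypothetical name; the directory and filename pattern are the ones given above.

```typescript
type Env = "prod" | "demo" | "stage" | "dev";

const NOTES_DIR = "documentation/src/content/docs/process/operation-notes";

// Builds YYYYMMDD-<env>-aurora-parameter-group-restart.md under NOTES_DIR.
function operationNotePath(env: Env, date: Date): string {
  // toISOString() is always UTC: "2025-01-15T..." becomes "20250115".
  const yyyymmdd = date.toISOString().slice(0, 10).replace(/-/g, "");
  return `${NOTES_DIR}/${yyyymmdd}-${env}-aurora-parameter-group-restart.md`;
}
```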

Phase C — Rollout schedule

Env	Partition	When	applyImmediately
dev	Alpha002	Day 0 (after PR merge)	true
demo	Alpha001	Day 0 or +1	true
stage	Alpha002	After demo soaks ≥24h	true
prod	Alpha001	Window TBD post-demo	false

Per Q9, the prod window is determined after demo is observed healthy. The CDK deploy to prod can happen earlier (it’s safe: no instance-level effect until the operator reboots), but the snapshot + reboot sequence is gated on the explicit window.
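Combining the schedule with the Phase B steps gives a per-environment checklist. The sketch below shows one way to derive it; the step labels, `schedule` and `rolloutSteps` are hypothetical names, and the only logic encoded is that environments with applyImmediately: false skip the pre-reboot dynamic-parameter check and verify everything after the reboot.

```typescript
type Env = "dev" | "demo" | "stage" | "prod";
type Step =
  | "snapshot"
  | "cdk-deploy"
  | "verify-dynamic"
  | "reboot"
  | "verify-all"
  | "delete-snapshot";

// Rollout order and flags, as in the schedule table.
const schedule: { env: Env; partition: string; applyImmediately: boolean }[] = [
  { env: "dev", partition: "Alpha002", applyImmediately: true },
  { env: "demo", partition: "Alpha001", applyImmediately: true },
  { env: "stage", partition: "Alpha002", applyImmediately: true },
  { env: "prod", partition: "Alpha001", applyImmediately: false },
];

// With applyImmediately false there is no state in which dynamic
// parameters are live before the reboot, so that check is skipped.
function rolloutSteps(applyImmediately: boolean): Step[] {
  const steps: Step[] = ["snapshot", "cdk-deploy"];
  if (applyImmediately) {
    steps.push("verify-dynamic");
  }
  steps.push("reboot", "verify-all", "delete-snapshot");
  return steps;
}

const plan = schedule.map((s) => ({
  env: s.env,
  steps: rolloutSteps(s.applyImmediately),
}));
```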

Phase D — Rollback

Two rollback levers, in order of preference:

1. Detach the parameter group (reversal of the CDK change).
   - Revert the PR (or a follow-up PR that empties dbConfiguration on the offending partition).
   - CDK deploy reattaches Aurora’s default parameter group.
   - For static parameters to revert, the writer + reader need to reboot again, using the same B.4 sequence.
   - For prod, also reverts the instance class from r7g.large back to t3.medium; expect ~5-10 min instance-replace per node. Plan a second outage window if rolling back sizing.
2. Restore from snapshot (last resort).
   - Use the <cluster-id>-pdev479-pre-restart snapshot.
   - This involves a new cluster identifier and an application reconnect; coordinate with operations chart redeploy to point at the restored cluster.

Lever 1 is the default; lever 2 only if lever 1 doesn’t recover the fault.

Phase E — Project completion

After the prod rollout completes:

- Move _docs/analysis/db-configuration.md and these two implementation files (db-updates.md, db-plan.md) into the project’s completed documentation tree per the workspace project lifecycle.
- Update PDEV-479 to Done in Linear and post the close-out comment summarising what shipped and pointing at the four operation-notes files.
- Confirm PDEV-498 is also closed (separate PR but the two are coupled in user-visible behaviour).
- Remove the infrastructure worktree per the worktree cleanup procedure once the PR is merged and pushed.

Cross-references

_docs/analysis/db-configuration.md — the goal/spec for PDEV-479.
_docs/analysis/db-init.md — PDEV-498 spec for the pg_stat_statements per-DB extension (sub-issue).
_docs/analysis/infrastructure-improvements.md § 2 + § 3 — umbrella scope narrative.
_docs/pdev-479/implementation/infrastructure/db-updates.md — change specification companion to this plan.