PDEV-479 — Aurora DB configuration: parameter group, slow-query logging, prod sizing

Updated goal for PDEV-479. Supersedes the original three-finding ticket description; that text was written before the project’s later decisions on per-cluster scope, per-env sizing, and the PDEV-498 sub-issue split. This doc is the source of truth for PDEV-479 scope.

Why this is happening

During the PDEV-442 investigation we found that the database is not the current bottleneck for /items slowness (Performance Insights: 0.49 avg active sessions, writer CPU 19–22 %), but the prod Aurora cluster has three configuration gaps that will bite us as soon as the operations component starts scaling under HPA (PDEV-488), and will keep us blind during the SQL-optimization work in PDEV-490:

No query-level observability. pg_stat_statements is not in shared_preload_libraries on any cluster. Without it we can’t identify slow queries, compute per-query latency trends, or detect regressions after deploys. PDEV-490’s whole SQL-optimization workstream depends on this data being available.
Slow-query log is effectively off. log_min_duration_statement is unset on the active parameter group; the postgres log group received zero events in the 3 h window we inspected (most recent stream ~2 days old). A working slow-query log is the complementary signal to pg_stat_statements — it captures the actual statement text and parameters as they happen, while pg_stat_statements aggregates by normalized text.
Prod runs on a burstable instance class (db.t3.medium, 2 vCPU / 4 GiB). Today this is over-provisioned for actual load (CPU credit balance flat at 533, no burst-drain), but it is not the right class for an OLTP workload that’s about to fan out from 2 operations pods to up to 8 under HPA. One bad N+1 or a sustained business-hour spike would exhaust burst credits and throttle. Coupled with the connection-count growth from HPA + the future accounts component, max_connections also needs a deliberate value rather than the instance-class-derived default.

The three items share one outage window — shared_preload_libraries is a static parameter that requires an instance restart, and the instance-class change requires a restart too — so they bundle into a single coordinated change.

What this ticket delivers

1. Custom DB cluster parameter group — all four Aurora clusters

A new CDK-managed DB cluster parameter group, attached to all four Aurora clusters (Alpha001 prod, Alpha001 demo, Alpha002 stage, Alpha002 dev). Same parameter set across envs; per-env tuning of the slow-query threshold is the only divergence considered, and it’s deferred until we have real signal.

Parameter	Value	Apply type	Rationale
`shared_preload_libraries`	`pg_stat_statements`	static (instance restart required)	Allocates the shared-memory hash at server start. Required for `pg_stat_statements` queries to return data.
`pg_stat_statements.track`	`all`	dynamic	Track top-level and nested statements (e.g., inside functions). The granularity we want for diagnosing application queries.
`pg_stat_statements.max`	(default `5000`, leave as-is unless we hit eviction)	dynamic	Tunable later if Performance Insights shows aging-out of unique-query slots.
`log_min_duration_statement`	`500` (ms)	dynamic	Log every statement that takes ≥500 ms. Tune downward if log volume is too low to be useful, upward if it’s noisy.
`log_statement`	`ddl`	dynamic	Capture schema changes (CREATE, ALTER, DROP). Cheap; useful for change auditing.
`log_lock_waits`	`on`	dynamic	Log when a session waits longer than `deadlock_timeout` on a lock. Lights up contention issues that don’t surface in query duration alone.
`log_temp_files`	`0`	dynamic	Log every temp file created by a query. Surfaces queries that spill to disk — a common cause of slow OLTP queries that look fast in plan output but are slow in practice.

The shared_preload_libraries change is the only restart-requiring parameter; the rest take effect on next session.

2. Prod instance class + `max_connections` — prod only

Resize prod only; dev / stage / demo keep their current instance classes. The parameter group from §1 still attaches to all four — cluster-level parameters cost nothing extra and keep pg_stat_statements available everywhere for cross-env consistency.

Aurora cluster	Instance class today	Instance class after	`max_connections`
Alpha001 prod	`db.t3.medium`	`db.r6g.large` (or `db.r7g.large` if regionally available — decision recorded in this ticket’s decision-log entry)	`500`
Alpha001 demo	`db.t3.medium`	unchanged	default (instance-class-derived)
Alpha002 stage	`db.t3.medium`	unchanged	default
Alpha002 dev	`db.t3.medium`	unchanged	default

max_connections = 500 on prod gives ~3× headroom over the projected HPA-max steady state (8 operations pods × 20-pool + future accounts + ad-hoc ≈ 340 conns). Lives in the parameter group, not as a per-instance override.

Per-cluster sizing parameterisation. Instance class and max_connections are cluster-specific decisions. Define them in each cluster’s per-instance CDK file (e.g. src/main/cdk/instances/Alpha001/prod.ts) and pass them as parameters into the shared CDK construct that provisions the Aurora cluster. Don’t hardcode either value inside the shared construct — keep all per-env divergence at the instance-file boundary, consistent with the rest of the CDK architecture.

3. Coordination with PDEV-498

This ticket installs pg_stat_statements at the cluster level (§1’s parameter group). It does not create the extension inside any database — that’s the per-DB step, and it lives in PDEV-498 in the postgres-database-initializer repo.

Deploy ordering:

PDEV-479 (this ticket) lands the parameter group on each cluster and applies the restart. The shared-memory hash is now allocated; queries are tracked cluster-wide.
PDEV-498’s new image release lands on each cluster’s operations chart. The init container runs CREATE EXTENSION IF NOT EXISTS pg_stat_statements in the application database during pod startup. The view becomes queryable from the app DB.

If the order reverses (PDEV-498’s image deploys before PDEV-479’s parameter group lands on a given cluster), the init container fails loud (pg_stat_statements must be loaded via shared_preload_libraries) and the operations pod enters Init:CrashLoopBackOff. Fix is to apply the parameter group and let K8s retry. Fully recoverable. PDEV-498 documents this fail-loud contract.

Out of scope for this ticket

CREATE EXTENSION pg_stat_statements — PDEV-498’s responsibility.
RDS Proxy decision — deferred to PDEV-499 (evaluation only; gated on PDEV-488 in stage producing real HikariCP metrics).
Aurora reader-instance routing for the bitemporal read workload — possible future optimization; not part of this ticket.
pg_stat_statements.max tuning — leave at the default 5000 for now; revisit if Performance Insights shows query-slot eviction.

Acceptance criteria

A new DB cluster parameter group exists in CDK, attached to all four Aurora clusters (prod, demo, stage, dev), with the parameter set above.
The prod Aurora cluster is on db.r6g.large (or db.r7g.large if chosen) with the parameter group applied and an instance restart completed.
Per-cluster sizing values (instance class, max_connections) are defined in the per-instance CDK files (src/main/cdk/instances/{Infra}/{partition}.ts) and passed as parameters to the shared RDS construct. No hardcoded per-env values in the construct itself.
SHOW shared_preload_libraries; on each cluster’s writer returns a string containing pg_stat_statements.
SHOW max_connections; on the prod writer returns 500. The other three clusters return their instance-class default.
/aws/rds/cluster/{cluster-name}/postgresql log group on each cluster receives at least one event per hour under normal load, within an hour of the change.
After PDEV-498’s image deploys to a cluster (separate ticket), SELECT count(*) FROM pg_stat_statements from any application DB on that cluster returns a non-zero row count within minutes (under typical query load).
The instance-class decision (db.r6g.large vs db.r7g.large) is recorded as a decision-log entry under this ticket — pick whichever is regionally available and note the choice.

Outage window

This is the only planned-outage window in the entire slow-responses project’s infrastructure scope. The prod Aurora instance restart for shared_preload_libraries + the instance-class change happen together. Notify product before the window. Non-prod restarts (parameter group attach on dev / stage / demo) are routine and don’t require coordination.

Estimated downtime: a few seconds of writer unavailability per restart on Aurora’s “fast failover” path; failover is to the reader, then back. If we have HA active in prod (recommended; verify against current state), the user-visible window is single-digit seconds.

Deploy and activation sequence

The CDK deploy alone does not activate every parameter. Aurora splits parameters into two apply types, and the static ones queue as pending-reboot until an explicit operator action. Sequence per cluster:

Step 1 — CDK deploy (`amm.sh` or equivalent)

New DB cluster parameter group is created and attached to the cluster.
Instance-class change (prod only) is applied to the writer/reader instance resources.

Effects after step 1 completes:

Parameter	Apply type	Active after step 1?
`log_min_duration_statement = 500`	dynamic	Yes, within ~1 min
`log_statement = ddl`	dynamic	Yes, within ~1 min
`log_lock_waits = on`	dynamic	Yes, within ~1 min
`log_temp_files = 0`	dynamic	Yes, within ~1 min
`pg_stat_statements.track = all`	dynamic	Yes, within ~1 min
`shared_preload_libraries = pg_stat_statements`	static	No — pending-reboot
`max_connections = 500` (prod only)	static	No — pending-reboot
Instance class change (prod only)	replacement	No — pending-reboot/replacement

Aurora’s aws rds describe-db-instances returns PendingModifiedValues populated and the parameter-application-status flag flipped for each affected instance. The cluster keeps running on the old static values until the operator acts.

Verify the dynamic-parameter activation before moving on (see § Verification commands below — the log_min_duration_statement and log_statement checks). The CloudWatch slow-query log group should also start receiving events within the hour.

Step 2 — manual activation of the static parameters + instance-class change

For prod, this is the one planned-outage window of the project (see § Outage window above). Coordinate before triggering:

# Single command activates both the static parameters and the
# instance-class change in one writer failover.
aws rds reboot-db-instance \
  --db-instance-identifier <writer-instance-id> \
  --profile Admin-Alpha1

# After failover settles, reboot the reader so it matches.
aws rds reboot-db-instance \
  --db-instance-identifier <reader-instance-id> \
  --profile Admin-Alpha1

(Aurora may also trigger the activation automatically at the next preferred maintenance window — verify the cluster’s PreferredMaintenanceWindow and decide whether to wait or push.)

For dev / stage / demo, step 2 is the parameter-group reboot only (no instance-class change). Routine — can be done back-to-back with step 1.

Step 3 — confirm

After the reboot completes, run the verification commands in the next section against each cluster. Specifically:

SHOW shared_preload_libraries; returns a string containing pg_stat_statements.
SHOW max_connections; returns 500 on prod, default elsewhere.
The instance class on the prod writer matches the new choice (db.r6g.large or whichever was selected) per aws rds describe-db-instances.

Why the explicit-reboot model is correct here

Aurora does support an “apply immediately” mode (--apply-immediately on modify operations) that triggers the reboot at deploy time rather than queueing pending-reboot. Deliberately not using that here:

It bundles the deploy step and the maintenance window into one transaction. We want them separated — the deploy lands during normal work hours; the failover happens in a scheduled window with the team available.
Failing the reboot half of an apply-immediately deploy leaves the cluster in an awkward state. Separating the two operations keeps failure modes localized.

The trade-off is one extra command per cluster (the explicit reboot). Worth it.

Verification commands

After the change lands on a cluster, run from a bastion pod connected to the writer (substitute the cluster name in the log-group path):

-- Prerequisite: library loaded
SHOW shared_preload_libraries;
-- Expect a string containing 'pg_stat_statements'

-- Prod only
SHOW max_connections;
-- Expect '500' on prod; instance-class default elsewhere

-- Slow-query logging is active
SHOW log_min_duration_statement;
-- Expect '500ms'

After PDEV-498’s image has rolled through, from a connected app DB:

-- Extension is queryable in the app DB
SELECT count(*) FROM pg_stat_statements;
-- Expect a non-zero row count under typical load

# CloudWatch log group receives events
aws --profile <profile> logs filter-log-events \
  --log-group-name /aws/rds/cluster/<cluster>/postgresql \
  --start-time $(date -u -v-1H +%s)000 \
  --query 'length(events)' --output text
# Expect ≥ 1

References

_docs/analysis/db-init.md — PDEV-498’s implementation proposal for the matching CREATE EXTENSION work in the postgres-database-initializer image.
_docs/analysis/infrastructure-improvements.md § 2 (parameter group), § 3 (sizing + max_connections) — umbrella narrative.
_docs/analysis/be-findings.md — original Performance Insights capture that motivated the slow-query-logging finding.
_docs/analysis/pod_capacity.md § Recommendations — operations pod count and HikariCP pool-size assumptions feeding into the max_connections = 500 choice.