PDEV-479 — Aurora DB configuration: parameter group, slow-query logging, prod sizing
Updated goal for PDEV-479. Supersedes the original three-finding ticket description; that text was written before the project’s later decisions on per-cluster scope, per-env sizing, and the PDEV-498 sub-issue split. This doc is the source of truth for PDEV-479 scope.
Why this is happening
Section titled “Why this is happening”During the PDEV-442 investigation we found that the database is not
the current bottleneck for /items slowness (Performance Insights:
0.49 avg active sessions, writer CPU 19–22 %), but the prod Aurora
cluster has three configuration gaps that will bite us as soon as the
operations component starts scaling under HPA (PDEV-488), and will
keep us blind during the SQL-optimization work in
PDEV-490:
- No query-level observability.
pg_stat_statementsis not inshared_preload_librarieson any cluster. Without it we can’t identify slow queries, compute per-query latency trends, or detect regressions after deploys. PDEV-490’s whole SQL-optimization workstream depends on this data being available. - Slow-query log is effectively off.
log_min_duration_statementis unset on the active parameter group; the postgres log group received zero events in the 3 h window we inspected (most recent stream ~2 days old). A working slow-query log is the complementary signal topg_stat_statements— it captures the actual statement text and parameters as they happen, whilepg_stat_statementsaggregates by normalized text. - Prod runs on a burstable instance class (
db.t3.medium, 2 vCPU / 4 GiB). Today this is over-provisioned for actual load (CPU credit balance flat at 533, no burst-drain), but it is not the right class for an OLTP workload that’s about to fan out from 2 operations pods to up to 8 under HPA. One bad N+1 or a sustained business-hour spike would exhaust burst credits and throttle. Coupled with the connection-count growth from HPA + the future accounts component,max_connectionsalso needs a deliberate value rather than the instance-class-derived default.
The three items share one outage window — shared_preload_libraries
is a static parameter that requires an instance restart, and the
instance-class change requires a restart too — so they bundle into a
single coordinated change.
What this ticket delivers
Section titled “What this ticket delivers”1. Custom DB cluster parameter group — all four Aurora clusters
Section titled “1. Custom DB cluster parameter group — all four Aurora clusters”A new CDK-managed DB cluster parameter group, attached to all four Aurora clusters (Alpha001 prod, Alpha001 demo, Alpha002 stage, Alpha002 dev). Same parameter set across envs; per-env tuning of the slow-query threshold is the only divergence considered, and it’s deferred until we have real signal.
| Parameter | Value | Apply type | Rationale |
|---|---|---|---|
shared_preload_libraries | pg_stat_statements | static (instance restart required) | Allocates the shared-memory hash at server start. Required for pg_stat_statements queries to return data. |
pg_stat_statements.track | all | dynamic | Track top-level and nested statements (e.g., inside functions). The granularity we want for diagnosing application queries. |
pg_stat_statements.max | (default 5000, leave as-is unless we hit eviction) | dynamic | Tunable later if Performance Insights shows aging-out of unique-query slots. |
log_min_duration_statement | 500 (ms) | dynamic | Log every statement that takes ≥500 ms. Tune downward if log volume is too low to be useful, upward if it’s noisy. |
log_statement | ddl | dynamic | Capture schema changes (CREATE, ALTER, DROP). Cheap; useful for change auditing. |
log_lock_waits | on | dynamic | Log when a session waits longer than deadlock_timeout on a lock. Lights up contention issues that don’t surface in query duration alone. |
log_temp_files | 0 | dynamic | Log every temp file created by a query. Surfaces queries that spill to disk — a common cause of slow OLTP queries that look fast in plan output but are slow in practice. |
The shared_preload_libraries change is the only restart-requiring
parameter; the rest take effect on next session.
2. Prod instance class + max_connections — prod only
Section titled “2. Prod instance class + max_connections — prod only”Resize prod only; dev / stage / demo keep their current
instance classes. The parameter group from §1 still attaches to all
four — cluster-level parameters cost nothing extra and keep
pg_stat_statements available everywhere for cross-env consistency.
| Aurora cluster | Instance class today | Instance class after | max_connections |
|---|---|---|---|
| Alpha001 prod | db.t3.medium | db.r6g.large (or db.r7g.large if regionally available — decision recorded in this ticket’s decision-log entry) | 500 |
| Alpha001 demo | db.t3.medium | unchanged | default (instance-class-derived) |
| Alpha002 stage | db.t3.medium | unchanged | default |
| Alpha002 dev | db.t3.medium | unchanged | default |
max_connections = 500 on prod gives ~3× headroom over the projected
HPA-max steady state (8 operations pods × 20-pool + future accounts +
ad-hoc ≈ 340 conns). Lives in the parameter group, not as a
per-instance override.
Per-cluster sizing parameterisation. Instance class and
max_connections are cluster-specific decisions. Define them in each
cluster’s per-instance CDK file (e.g.
src/main/cdk/instances/Alpha001/prod.ts) and pass them as
parameters into the shared CDK construct that provisions the Aurora
cluster. Don’t hardcode either value inside the shared construct —
keep all per-env divergence at the instance-file boundary, consistent
with the rest of the CDK architecture.
3. Coordination with PDEV-498
Section titled “3. Coordination with PDEV-498”This ticket installs pg_stat_statements at the cluster level
(§1’s parameter group). It does not create the extension inside
any database — that’s the per-DB step, and it lives in
PDEV-498
in the postgres-database-initializer repo.
Deploy ordering:
- PDEV-479 (this ticket) lands the parameter group on each cluster and applies the restart. The shared-memory hash is now allocated; queries are tracked cluster-wide.
- PDEV-498’s new image release lands on each cluster’s operations
chart. The init container runs
CREATE EXTENSION IF NOT EXISTS pg_stat_statementsin the application database during pod startup. The view becomes queryable from the app DB.
If the order reverses (PDEV-498’s image deploys before PDEV-479’s
parameter group lands on a given cluster), the init container fails
loud (pg_stat_statements must be loaded via shared_preload_libraries)
and the operations pod enters Init:CrashLoopBackOff. Fix is to
apply the parameter group and let K8s retry. Fully recoverable. PDEV-498
documents this fail-loud contract.
Out of scope for this ticket
Section titled “Out of scope for this ticket”CREATE EXTENSION pg_stat_statements— PDEV-498’s responsibility.- RDS Proxy decision — deferred to PDEV-499 (evaluation only; gated on PDEV-488 in stage producing real HikariCP metrics).
- Aurora reader-instance routing for the bitemporal read workload — possible future optimization; not part of this ticket.
pg_stat_statements.maxtuning — leave at the default 5000 for now; revisit if Performance Insights shows query-slot eviction.
Acceptance criteria
Section titled “Acceptance criteria”- A new DB cluster parameter group exists in CDK, attached to all four Aurora clusters (prod, demo, stage, dev), with the parameter set above.
- The prod Aurora cluster is on
db.r6g.large(ordb.r7g.largeif chosen) with the parameter group applied and an instance restart completed. - Per-cluster sizing values (instance class,
max_connections) are defined in the per-instance CDK files (src/main/cdk/instances/{Infra}/{partition}.ts) and passed as parameters to the shared RDS construct. No hardcoded per-env values in the construct itself. SHOW shared_preload_libraries;on each cluster’s writer returns a string containingpg_stat_statements.SHOW max_connections;on the prod writer returns500. The other three clusters return their instance-class default./aws/rds/cluster/{cluster-name}/postgresqllog group on each cluster receives at least one event per hour under normal load, within an hour of the change.- After PDEV-498’s image deploys to a cluster (separate ticket),
SELECT count(*) FROM pg_stat_statementsfrom any application DB on that cluster returns a non-zero row count within minutes (under typical query load). - The instance-class decision (
db.r6g.largevsdb.r7g.large) is recorded as a decision-log entry under this ticket — pick whichever is regionally available and note the choice.
Outage window
Section titled “Outage window”This is the only planned-outage window in the entire
slow-responses project’s infrastructure scope. The prod Aurora
instance restart for shared_preload_libraries + the
instance-class change happen together. Notify product before the
window. Non-prod restarts (parameter group attach on dev / stage /
demo) are routine and don’t require coordination.
Estimated downtime: a few seconds of writer unavailability per restart on Aurora’s “fast failover” path; failover is to the reader, then back. If we have HA active in prod (recommended; verify against current state), the user-visible window is single-digit seconds.
Deploy and activation sequence
Section titled “Deploy and activation sequence”The CDK deploy alone does not activate every parameter. Aurora
splits parameters into two apply types, and the static ones queue as
pending-reboot until an explicit operator action. Sequence per
cluster:
Step 1 — CDK deploy (amm.sh or equivalent)
Section titled “Step 1 — CDK deploy (amm.sh or equivalent)”- New DB cluster parameter group is created and attached to the cluster.
- Instance-class change (prod only) is applied to the writer/reader instance resources.
Effects after step 1 completes:
| Parameter | Apply type | Active after step 1? |
|---|---|---|
log_min_duration_statement = 500 | dynamic | Yes, within ~1 min |
log_statement = ddl | dynamic | Yes, within ~1 min |
log_lock_waits = on | dynamic | Yes, within ~1 min |
log_temp_files = 0 | dynamic | Yes, within ~1 min |
pg_stat_statements.track = all | dynamic | Yes, within ~1 min |
shared_preload_libraries = pg_stat_statements | static | No — pending-reboot |
max_connections = 500 (prod only) | static | No — pending-reboot |
| Instance class change (prod only) | replacement | No — pending-reboot/replacement |
Aurora’s aws rds describe-db-instances returns PendingModifiedValues
populated and the parameter-application-status flag flipped for each
affected instance. The cluster keeps running on the old static
values until the operator acts.
Verify the dynamic-parameter activation before moving on (see
§ Verification commands below — the log_min_duration_statement and
log_statement checks). The CloudWatch slow-query log group should
also start receiving events within the hour.
Step 2 — manual activation of the static parameters + instance-class change
Section titled “Step 2 — manual activation of the static parameters + instance-class change”For prod, this is the one planned-outage window of the project (see § Outage window above). Coordinate before triggering:
# Single command activates both the static parameters and the# instance-class change in one writer failover.aws rds reboot-db-instance \ --db-instance-identifier <writer-instance-id> \ --profile Admin-Alpha1
# After failover settles, reboot the reader so it matches.aws rds reboot-db-instance \ --db-instance-identifier <reader-instance-id> \ --profile Admin-Alpha1(Aurora may also trigger the activation automatically at the next
preferred maintenance window — verify the cluster’s
PreferredMaintenanceWindow and decide whether to wait or push.)
For dev / stage / demo, step 2 is the parameter-group reboot only (no instance-class change). Routine — can be done back-to-back with step 1.
Step 3 — confirm
Section titled “Step 3 — confirm”After the reboot completes, run the verification commands in the next section against each cluster. Specifically:
SHOW shared_preload_libraries;returns a string containingpg_stat_statements.SHOW max_connections;returns500on prod, default elsewhere.- The instance class on the prod writer matches the new choice
(
db.r6g.largeor whichever was selected) peraws rds describe-db-instances.
Why the explicit-reboot model is correct here
Section titled “Why the explicit-reboot model is correct here”Aurora does support an “apply immediately” mode (--apply-immediately
on modify operations) that triggers the reboot at deploy time rather
than queueing pending-reboot. Deliberately not using that here:
- It bundles the deploy step and the maintenance window into one transaction. We want them separated — the deploy lands during normal work hours; the failover happens in a scheduled window with the team available.
- Failing the reboot half of an apply-immediately deploy leaves the cluster in an awkward state. Separating the two operations keeps failure modes localized.
The trade-off is one extra command per cluster (the explicit reboot). Worth it.
Verification commands
Section titled “Verification commands”After the change lands on a cluster, run from a bastion pod connected to the writer (substitute the cluster name in the log-group path):
-- Prerequisite: library loadedSHOW shared_preload_libraries;-- Expect a string containing 'pg_stat_statements'
-- Prod onlySHOW max_connections;-- Expect '500' on prod; instance-class default elsewhere
-- Slow-query logging is activeSHOW log_min_duration_statement;-- Expect '500ms'After PDEV-498’s image has rolled through, from a connected app DB:
-- Extension is queryable in the app DBSELECT count(*) FROM pg_stat_statements;-- Expect a non-zero row count under typical load# CloudWatch log group receives eventsaws --profile <profile> logs filter-log-events \ --log-group-name /aws/rds/cluster/<cluster>/postgresql \ --start-time $(date -u -v-1H +%s)000 \ --query 'length(events)' --output text# Expect ≥ 1References
Section titled “References”_docs/analysis/db-init.md— PDEV-498’s implementation proposal for the matchingCREATE EXTENSIONwork in thepostgres-database-initializerimage._docs/analysis/infrastructure-improvements.md§ 2 (parameter group), § 3 (sizing +max_connections) — umbrella narrative._docs/analysis/be-findings.md— original Performance Insights capture that motivated the slow-query-logging finding._docs/analysis/pod_capacity.md§ Recommendations — operations pod count and HikariCP pool-size assumptions feeding into themax_connections = 500choice.
Copyright: © Arda Systems 2025-2026, All rights reserved