Goal: PDEV-490 Operations Performance Improvements

Lifecycle: Completed

PDEV-490 is the operations-side strand of PDEV-442 (“product is slow”). It eliminates the bitemporal-read bottlenecks on the two endpoints that the BFF fans out per items-page row — cardsForItem and listWithDetails — and rebuilds the operations component’s persistence layer on the AWS Advanced JDBC Wrapper so that read-only transactions auto-route to Aurora reader instances and transient failures degrade gracefully. The work covers common-module (a single coordinated, strictly-additive release with scheme-detected wrapper opt-in), operations (one combined PR covering Flyway migrations, K12 cleanup, common-module pin bump, and application.conf opt-in to the wrapper), and documentation (architecture pages + runbooks; can proceed in parallel with the operations PR). A controlled synthetic-failover test on dev gates promotion beyond dev.

Adjacent project context (not in this scope, listed for situational awareness):

  • Parent: PDEV-442 — Product is slow.
  • Shipped prerequisites: PDEV-479 (Aurora parameter group + sizing), PDEV-488 (operations chart 2.24.0), PDEV-498 (pg_stat_statements), PDEV-500 (Sentry DSN secret).
  • Active sibling: PDEV-489 — Front-End Performance Improvements. Consumes the K02 composite index from PDEV-490 as its hard back-end prerequisite; otherwise out of scope here.
  • Closed: PDEV-499 (RDS Proxy evaluation) — won’t-do, superseded by the JDBC wrapper.

| Repository | Role | Planned Changes |
| --- | --- | --- |
| common-module | Required | Single coordinated release (Wave 2, ships first): scheme-detected AWS Advanced JDBC Wrapper integration (software.amazon.jdbc:aws-advanced-jdbc-wrapper:4.0.1 dependency; wrapper-plugin pipeline auroraInitialConnection, failover2, efm2, readWriteSplitting configured only when consumer-supplied jdbcUrl starts with jdbc:aws-wrapper:); new AppError.Transient branch with HTTP 503 + Retry-After rendering; retry policy at the inTransaction boundary with new PoolConfig.maxAttempts (default 2) and PoolConfig.backoffMs (default 300) fields; removal of the decorative tenantId.index("TENANT_ID_INDEX") declaration in AbstractScopedUniverse; Flyway mixed=true enablement on DbMigration; two integration tests on the existing ContainerizedPostgres harness. Strictly additive: existing consumers on jdbc:postgresql: URLs see zero behavior change. |
| operations | Required | One combined PR (Waves 1 + 3, ships after Wave 2): the original Wave 1 scope (Flyway migrations on kanban_card adding composite bitemporal indexes plus the missing idx_kanban_card_tenant_id index, the coupled two-line Kotlin K12 change in kanban/service/ServiceImpl.kt to drop the wasted COUNT in cardsForItem, and a Flyway migration on item adding composite bitemporal indexes) AND the original Wave 3 scope (common-module version pin bump in gradle/libs.versions.toml; application.conf dataSource.jdbcUrl switched to the jdbc:aws-wrapper:postgresql://… scheme; new dataSource.pool.maxAttempts = 2 and dataSource.pool.backoffMs = 300 knobs added). One combined CHANGELOG entry. Branch: product-slow-pdev-490. |
| documentation | Required | One PR: new architecture pages on the AWS Advanced JDBC Wrapper, the bitemporal-indexing pattern, the Flyway-authoritative-for-indexes convention, and an updated HTTP 503 + Retry-After error-contract page; three operational runbooks for the wrapper deploy, the wrapper troubleshooting, and the synthetic-failover test procedure. Can proceed in parallel with Wave 3 (operations consumer PR); both depend on the spec’s stable shape, neither depends on the other’s merge. Must land before the Wave 5 dev failover test. Branch: product-slow-pdev-490. |
| infrastructure | Not modified | Aurora-cluster-level changes are out of scope (handled by PDEV-479, shipped). |
| arda-frontend-app | Not modified | Items-page front-end consolidation is tracked on PDEV-489. PDEV-490 ships the K02 composite index as the back-end prerequisite; otherwise no contribution from this project. |
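
The retry policy named in the common-module row can be pictured with a minimal sketch. It assumes a fixed (non-exponential) backoff and that only failures already classified as transient are retried; neither detail is stated in this document. PoolConfig here is a hypothetical mirror of the two new fields, and TransientDbException and withRetry are illustrative names, not common-module's API.

```kotlin
import java.sql.SQLException

// Hypothetical mirror of the two new PoolConfig fields; defaults follow the table above.
data class PoolConfig(val maxAttempts: Int = 2, val backoffMs: Long = 300)

// Hypothetical marker for failures that a classifier has already judged transient.
class TransientDbException(cause: Throwable) : RuntimeException(cause)

// Runs `block` up to maxAttempts times, sleeping backoffMs between attempts.
// Only TransientDbException is retried; any other exception propagates immediately.
fun <T> withRetry(
    config: PoolConfig,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: (attempt: Int) -> T
): T {
    require(config.maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var lastError: TransientDbException? = null
    for (attempt in 1..config.maxAttempts) {
        try {
            return block(attempt)
        } catch (e: TransientDbException) {
            lastError = e
            // Back off only if another attempt is still available.
            if (attempt < config.maxAttempts) sleep(config.backoffMs)
        }
    }
    // All attempts failed: surface the last transient failure to the caller.
    throw checkNotNull(lastError)
}

fun main() {
    val slept = mutableListOf<Long>()
    val result = withRetry(PoolConfig(), sleep = { slept += it }) { attempt ->
        if (attempt == 1) throw TransientDbException(SQLException("connection lost"))
        "ok on attempt $attempt"
    }
    println(result) // ok on attempt 2
    println(slept)  // [300]
}
```

With the defaults (2 attempts, 300 ms), a single transient failure costs one extra attempt and one backoff pause before the caller sees a result. A second consecutive failure is rethrown for the error renderer to handle.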

Verifiable conditions defining “done”. Detailed verification protocols live in the sibling specification/verification.md.

  1. Items-page operations-side latency falls measurably against the post-PDEV-488 baseline in Sentry (platform-be project, Alpha001-prod). Verifiable from before/after pg_stat_statements ranking by total time and from Sentry transaction-latency percentiles.
  2. Transient Aurora failures surface as HTTP 503 + Retry-After instead of HTTP 500. The dev synthetic-failover test confirms the 5xx-window shape: HTTP 503 dominates, HTTP 500 is essentially zero, and most failover-window failures are absorbed by the in-process retry. Promotion to demo / stage / prod proceeds with each environment’s standard soak window only after dev passes.
  3. Failover detection latency drops from ~30 s (JVM DNS cache–bound) to ~2–5 s (Aurora topology API via the wrapper). Verifiable from the dev synthetic-failover test.
  4. Bitemporal SELECTs on kanban_card and item plan against the new composite indexes — EXPLAIN ANALYZE shows an index scan or index-only scan on the inner subquery, not a sequential scan or generic single-column lookup. Verifiable from EXPLAIN and from pg_stat_statements rankings before / after.
  5. Read-only transactions route to Aurora reader instances. Verifiable from pod-side wrapper logging during a representative items-page workload: read-only transactions land on the reader endpoint; writes land on the writer endpoint.
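
Criteria 2 and 3 are judged from the shape of the 5xx window during the synthetic-failover test. As a rough illustration of what "shape" means, the sketch below summarises a list of observed responses into a window length and a status breakdown. The Sample and WindowSummary types are hypothetical, and the real verification protocol is the one in specification/verification.md, not this code.

```kotlin
// One observed response during a failover run (hypothetical shape).
data class Sample(val epochMs: Long, val status: Int)

// Length of the 5xx window plus a breakdown by status code.
data class WindowSummary(val windowMs: Long, val count503: Int, val count500: Int, val other5xx: Int)

// The window is measured from the first to the last 5xx response observed.
fun summarise5xx(samples: List<Sample>): WindowSummary {
    val errors = samples.filter { it.status in 500..599 }
    if (errors.isEmpty()) return WindowSummary(0, 0, 0, 0)
    val windowMs = errors.maxOf { it.epochMs } - errors.minOf { it.epochMs }
    val c503 = errors.count { it.status == 503 }
    val c500 = errors.count { it.status == 500 }
    return WindowSummary(windowMs, c503, c500, errors.size - c503 - c500)
}

fun main() {
    val run = listOf(
        Sample(0, 200),
        Sample(1_000, 503),
        Sample(2_500, 503),
        Sample(4_000, 503),
        Sample(5_000, 200)
    )
    println(summarise5xx(run))
    // WindowSummary(windowMs=3000, count503=3, count500=0, other5xx=0)
}
```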

PDEV-490 enters with most of its upstream prerequisites already shipped:

  • PDEV-479 (Aurora parameter group + production sizing) is live. Prod is on db.r7g.large (2 vCPU / 16 GiB) with max_connections = 500; slow-query, lock-wait, and temp-file logging is active in every partition.
  • PDEV-498 (pg_stat_statements) is installed in every application database in every partition. Index decisions are data-driven against live cumulative query stats, not theoretical.
  • PDEV-488 (operations chart 2.24.0) is live. JVM observability is on (GC logs, container-aware memory, JFR, Sentry OTel Java agent); HPA is enabled. Sentry performance data flows into platform-be.
  • PDEV-500 ({Infra}-SentryDsn Secrets Manager secret) is live and underpins the operations Sentry wiring.

The two endpoints in scope, both surfaced as N+1 fan-outs by the items-page workload:

| Route | Service method | Issue today |
| --- | --- | --- |
| GET /v1/kanban/kanban-card/for-item/{itemEId} | KanbanCardService.cardsForItem | Two SQL statements per call (an unused COUNT plus the SELECT); wide-row hydration on kanban_card. |
| POST /v1/kanban/kanban-card/details | KanbanCardService.listWithDetails | Chunked fan-out to item per kanban-card chunk; full KanbanCardDetails hydration per row. |

Both queries follow the bitemporal latest-version-per-eId pattern (common-module’s Persistence.kt), where the inner correlated subquery is the dominant cost and plan quality depends entirely on index coverage. Today the only indexes on kanban_card are three single-column indexes (eid, effective_as_of, recorded_as_of); the same gap exists on item. The composite indexes PDEV-490 adds give Postgres a single-index path through the correlated subquery on both tables.
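
To make the latest-version-per-eId pattern concrete, here is an in-memory sketch of the selection the correlated subquery performs. It is an assumption about the semantics (greatest effective_as_of, ties broken by greatest recorded_as_of, among rows visible at the requested instants); the actual SQL lives in Persistence.kt and is not reproduced in this document. Version and latestPerEid are hypothetical names.

```kotlin
import java.time.Instant

// Hypothetical in-memory stand-in for one bitemporal row version.
data class Version(
    val eid: String,
    val effectiveAsOf: Instant,
    val recordedAsOf: Instant,
    val payload: String
)

// For each eid, keeps the version with the greatest (effectiveAsOf, recordedAsOf)
// among the versions visible at the given as-of instants.
fun latestPerEid(rows: List<Version>, effectiveAt: Instant, recordedAt: Instant): Map<String, Version> {
    val order = compareBy<Version>({ it.effectiveAsOf }, { it.recordedAsOf })
    return rows
        .filter { !it.effectiveAsOf.isAfter(effectiveAt) && !it.recordedAsOf.isAfter(recordedAt) }
        .groupBy { it.eid }
        .mapValues { (_, versions) -> versions.sortedWith(order).last() }
}

fun main() {
    val t1 = Instant.parse("2026-01-01T00:00:00Z")
    val t2 = Instant.parse("2026-02-01T00:00:00Z")
    val t3 = Instant.parse("2026-03-01T00:00:00Z")
    val rows = listOf(
        Version("card-1", t1, t1, "v1"),
        Version("card-1", t2, t2, "v2"),
        Version("card-1", t2, t3, "v2-corrected"),
        Version("card-2", t1, t1, "only")
    )
    println(latestPerEid(rows, t3, t3).getValue("card-1").payload) // v2-corrected
    println(latestPerEid(rows, t3, t2).getValue("card-1").payload) // v2
}
```

The comparator orders by the same columns, in the same order, as the (eid, effective_as_of DESC, recorded_as_of DESC) composite index. That is why a single index can serve the per-eId lookup without a separate sort.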

The design phase ran in the project workbook (alternatives, decisions, audits, empirical checks). The most consequential conclusions:

  • Adopt the AWS Advanced JDBC Wrapper rather than implement a dual-HikariCP-pool design for read/write splitting. The wrapper intercepts Connection.setReadOnly(boolean) so business code stays at transaction(readOnly = true) with no awareness of pool selection. Eliminates the data-store-abstraction leak that the dual-pool design would have introduced, subsumes the JVM DNS TTL change (wrapper bypasses DNS for failover detection), and re-shapes the transient-SQL classifier as an adapter over the wrapper’s three typed exception classes (FailoverSuccessSQLException, TransactionStateUnknownSQLException, FailoverFailedSQLException).
  • Composite bitemporal indexes ship as plain (full-table) indexes, with no WHERE retired = FALSE partial-index predicate. The shape is usable by any query against (tenant_id, item_reference_entity_id, eid, effective_as_of, recorded_as_of), including history reads and audit / replay tooling. Storage cost is small.
  • Flyway is the authoritative source for database indexes. The decorative tenantId.index("TENANT_ID_INDEX") declaration in AbstractScopedUniverse.kt is removed; Exposed-level .index(...) calls do not carry runtime guarantees because Exposed’s schema-emit path is not invoked in any deploy environment. The tenant_id audit found one consumer (kanban_card) missing the index; that’s consolidated into the same Wave 1 Flyway migration. Other missing-index modules (facility, station, procurement/orders) are deferred to a future per-module hygiene pass.
  • The items-page N+1 fan-out is resolved on the front end, on PDEV-489, by consolidating onto the already-existing /v1/kanban/kanban-card/query route with Filter.In(item_reference_entity_id, [eIds…]). PDEV-490 contributes the K02 composite index as the back-end prerequisite for the new SQL plan; no new aggregate route is added.
  • cardsForItem’s wasted COUNT is a coupled two-line change. The flag flip (withTotal = true → false) and the deletion of the flatMap { … when (pg.totalCount) … } block must land in the same change; splitting them inverts a dead-code branch into a 100%-failure regression.
  • Operations-side SQLException-handler audit returned zero hits — no handler intercepts the new 503 contract; the canonical StatusPages handler in common-module is the sole renderer.

The full reasoning trail, including alternatives weighed and rejected, is in the project workbook (see Reference Documents).
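
The classifier-as-adapter idea and the 503 contract from the conclusions above can be sketched together. This is a simplified illustration, not common-module's implementation: it matches the wrapper's three exception types by simple class name so the example runs without the wrapper on the classpath, declares a local stand-in exception for the demo, and uses a Retry-After of 1 second, a value this document does not specify. AppError.Internal, HttpResponse, classify, and render are hypothetical names.

```kotlin
import java.sql.SQLException

// Simple names of the wrapper's three typed failover exceptions, as listed above.
val wrapperFailoverTypes = setOf(
    "FailoverSuccessSQLException",
    "TransactionStateUnknownSQLException",
    "FailoverFailedSQLException"
)

// Trimmed-down error model; only the Transient branch is named in this document.
sealed class AppError {
    data class Transient(val retryAfterSeconds: Int) : AppError()
    data class Internal(val message: String) : AppError()
}

// Hypothetical response shape, used here instead of a real HTTP framework.
data class HttpResponse(val status: Int, val headers: Map<String, String>)

// Local stand-in with the same simple name as the wrapper's class, for the demo only.
class FailoverFailedSQLException(message: String) : SQLException(message)

// Walks the cause chain looking for one of the wrapper's failover exceptions.
fun classify(t: Throwable, retryAfterSeconds: Int = 1): AppError {
    val transient = generateSequence(t) { it.cause }
        .take(16) // guard against cyclic cause chains
        .any { it is SQLException && it.javaClass.simpleName in wrapperFailoverTypes }
    return if (transient) AppError.Transient(retryAfterSeconds)
    else AppError.Internal(t.message ?: t.javaClass.simpleName)
}

// Transient failures render as 503 + Retry-After; everything else stays a 500.
fun render(error: AppError): HttpResponse = when (error) {
    is AppError.Transient ->
        HttpResponse(503, mapOf("Retry-After" to error.retryAfterSeconds.toString()))
    is AppError.Internal -> HttpResponse(500, emptyMap())
}

fun main() {
    val failover = RuntimeException("tx failed", FailoverFailedSQLException("writer unavailable"))
    println(render(classify(failover)))                      // HttpResponse(status=503, headers={Retry-After=1})
    println(render(classify(SQLException("syntax error"))))  // HttpResponse(status=500, headers={})
}
```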

The workbook tracks individual change candidates as Knn identifiers. The set referenced from this goal and from the spec artefacts:

| ID | Candidate | PDEV-490 disposition |
| --- | --- | --- |
| K01 | New summary/for-items aggregate route on the kanban service. | Cancelled. The PDEV-489 front-end consolidation onto the existing /v1/kanban/kanban-card/query route makes the new aggregate redundant. |
| K02 | Composite bitemporal index on kanban_card, including (eid, effective_as_of DESC, recorded_as_of DESC) and (tenant_id, item_reference_entity_id, eid, effective_as_of DESC, recorded_as_of DESC). | In scope: operations combined PR (Waves 1+3). Hard back-end prerequisite for PDEV-489. |
| K03 | Composite bitemporal index on item. | In scope: operations combined PR (Waves 1+3). |
| K04 | Scheme-detected AWS Advanced JDBC Wrapper integration in common-module (software.amazon.jdbc:aws-advanced-jdbc-wrapper:4.0.1). | In scope: Wave 2 common-module release. Replaces dual-pool / DNS-TTL alternatives. Additive: opt-in via consumer-supplied jdbcUrl scheme. |
| K05 | Read/write splitting plugin configuration on the wrapper (auto-route read-only transactions to Aurora reader instances). | In scope: Wave 2 common-module release, same coordinated release as K04. |
| K12 | cardsForItem cleanup: drop the wasted COUNT via the coupled withTotal = true → false flag flip + flatMap/when block removal. | In scope: operations combined PR (Waves 1+3). Two-line coupled commit; constraint #2 covers the splitting risk. |
| K15 | Decorative tenantId.index("TENANT_ID_INDEX") declaration removal from AbstractScopedUniverse. | In scope: Wave 2 common-module release, same coordinated release as K04 and K05. |
| K16 | Flyway mixed=true enablement on common-module’s DbMigration. | In scope: Wave 2 common-module release. Prerequisite for any non-transactional migration (K02 / K03 use CREATE INDEX CONCURRENTLY) to coexist with the existing transactional migration tree on fresh test-container DBs. Discovered 2026-05-19 during W1.3 first-of-kind validation. |
| K17 | DataSource.close() + ContainerizedPostgres.stop() pool teardown: close every HikariDataSource produced by newSqlDataSource() when the owning ContainerizedPostgres instance is stopped. | In scope: Wave 2 common-module release. Fixes a latent test-infrastructure leak (HikariCP housekeeper / connection-adder daemon threads survive across test classes, retrying dead Testcontainer ports) that hangs Gradle’s default parallel test execution on multi-core developer machines. Discovered 2026-05-19 during W1.3 verification. |

The K-ID identifiers above are reference shorthand only; the underlying rationale, alternatives, audit results, and decision trace live in the project workbook (see Reference Documents). The formal project artefacts (this goal, analysis, requirements, specification, verification) are self-contained and need no workbook access for execution or verification.

In scope:

  • common-module Aurora connectivity overhaul on the AWS Advanced JDBC Wrapper, including wrapper plugin configuration, HikariCP.exceptionOverrideClassName integration, and the jdbcUrl scheme migration.
  • common-module AppError.Transient error branch, HTTP 503 + Retry-After rendering, and the retry policy at the inTransaction boundary.
  • common-module removal of the decorative tenantId.index("TENANT_ID_INDEX") declaration on AbstractScopedUniverse.
  • common-module integration tests for (a) forced transient SQL failure → HTTP 503 + Retry-After and (b) one-shot transient failure absorbed by the retry.
  • operations Flyway migrations: composite bitemporal indexes on kanban_card and item; the missing tenant_id index on kanban_card consolidated into the same kanban migration.
  • operations cardsForItem Kotlin cleanup — drop the wasted COUNT (coupled two-line change in ServiceImpl.kt).
  • operations operations-component consumer wiring — common-module pin bump and application.conf updates.
  • documentation site pages — architecture (wrapper), bitemporal-indexing pattern, Flyway-authoritative-for-indexes convention, HTTP 503 contract update.
  • documentation operational runbooks — synthetic-failover test procedure, wrapper deploy notes, wrapper troubleshooting.
  • Synthetic-failover test execution on dev as the rollout gate.

Out of scope:

  • Front-end items-page consolidation — tracked on PDEV-489. PDEV-490 contributes K02 (composite index) as the hard back-end prerequisite for that work; no further front-end coordination.
  • DB-side configuration (sizing, parameter group, instance class) — handled by PDEV-479, shipped.
  • pg_stat_statements provisioning — handled by PDEV-498, shipped.
  • JVM DNS TTL helm change — eliminated; the JDBC wrapper bypasses DNS for failover detection.
  • HPA maxReplicas reduction — dropped under the wrapper design; reader-endpoint routing absorbs the writer-pool ceiling pressure that the reduction was a fallback for.
  • Connection-pool size tuning on item and kanban DBs — re-scoped to “not needed”. Sentry shows zero connection-timeout pressure in the last four days across all environments; only prod runs at maxReplicas = 8 (stage / demo / dev at 4); the wrapper’s read/write split removes the binding constraint. Revisit if a future workload changes the picture.
  • transactionIsolation evaluation (REPEATABLE_READ → READ_COMMITTED on read-only paths) — filed as Linear PDEV-534 to run after PDEV-490 ships.
  • RDS Proxy evaluation — closed as won’t-do (PDEV-499); the JDBC wrapper is incompatible with RDS Proxy by design.
  • Service-level read cache — deferred; revisit only after the new indexes have soaked and pg_stat_statements still shows headroom.
  • cardsForItem bulk-handler cleanup — three items-page bulk handlers (handleDeleteMultipleItems, handlePrintSelectedCards, handlePreviewSelectedCards) still loop per selected item against cardsForItem. User-initiated; latency tolerable. Could be collapsed onto Filter.In against /query in a future ticket.
  • Long-term DB query observability tooling — handled separately by PDEV-512.
  • Missing-tenant_id-index migrations on FACILITY_TABLE, STATION_TABLE, ORDER_HEADER_TABLE — the audit surfaced these; they are deliberately deferred to a future per-module hygiene pass.

Constraints:

  1. Wave 2 ships before the operations PR. Sequencing is Wave 2 (common-module release) → Waves 1 + 3 combined (operations all-in-one) → Wave 4 (documentation) → Wave 5 (dev failover test). The operations PR collapses the original Wave 1 (migrations + K12) and Wave 3 (consumer wiring) into a single PR because the common-module pin bump, which the migrations need for Flyway mixed=true, couples logically with the application.conf consumer wiring. The operations PR can be drafted, branched, pushed, and opened for review at any time, using Gradle includeBuild to test against the local common-module worktree. The blocking gate is at merge time: the PR must not merge until the common-module release artifact has been published to the package registry that operations/gradle/libs.versions.toml resolves against.
  2. cardsForItem cleanup must be coupled. The withTotal = true → false flag flip and the deletion of the surrounding flatMap { … when (pg.totalCount) … } block must land in the same change. Splitting them inverts a dead-code branch into a 100%-failure regression.
  3. Documentation runbooks land before the synthetic-failover test. The test procedure follows the runbook; the Wave 4 documentation PR must merge before the Wave 5 dev failover test executes. The runbook is drafted against the spec’s shape and may be refined post-Wave-3 deploy if any deployed detail diverges from the spec; Wave 3 and Wave 4 are independent and can proceed in parallel.
  4. No write-path isolation changes. The wrapper changes connection routing; the transactionIsolation level remains REPEATABLE_READ for both writes and reads in PDEV-490. Evaluating READ_COMMITTED for read-only paths is the scope of PDEV-534.
  5. Single coordinated common-module release. The scheme-detected wrapper adoption (K04), the read/write splitting plugin configuration (K05), the decorative-declaration cleanup (K15), the Flyway mixed=true enablement (K16 — required so the next operations migration, a CREATE INDEX CONCURRENTLY, can coexist with the existing transactional migration tree on fresh test-container DBs), and the DataSource.close() + ContainerizedPostgres.stop() pool teardown fix (K17 — required so the operations test JVM doesn’t hang under Gradle parallel test execution on multi-core machines) ship in one release tag, not separately.
  6. Promotion beyond dev requires the synthetic-failover test to pass. Acceptance: HTTP 5xx window collapses from ~30 s (DNS-cache-bound today) to ~3–5 s; HTTP 503 dominates the 5xx window; HTTP 500 is essentially zero.
  7. The common-module release is strictly additive. Wrapper integration is scheme-detected: common-module’s DataSource constructs the HikariCP wrapper-plugin pipeline only when the consumer-supplied jdbcUrl starts with jdbc:aws-wrapper:. Consumers on jdbc:postgresql://… URLs see zero behavior change — the wrapper code path stays inert. Consumers opt into wrapper routing by changing only their application.conf URL on their own deploy schedule. Called out in the common-module CHANGELOG as Added, not Changed.
  8. No partial-index predicates on the new bitemporal indexes. Plain (full-table) indexes per the workbook decision; no WHERE retired = FALSE clause.
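
The scheme-detected opt-in in constraint 7 comes down to a prefix check on the consumer-supplied URL. The sketch below shows that shape only. wrapperPropertiesFor is a hypothetical helper, the host names are placeholders, and how common-module actually passes the plugin list to HikariCP is not shown in this document. The wrapperPlugins key is the wrapper's connection property for the plugin list.

```kotlin
// Prefix that opts a consumer into the wrapper, per constraint 7.
val wrapperScheme = "jdbc:aws-wrapper:"

// Plugin pipeline named in the planned changes, in the order listed there.
val wrapperPluginPipeline = listOf("auroraInitialConnection", "failover2", "efm2", "readWriteSplitting")

// Returns the extra data-source properties to apply for a given URL.
// Plain jdbc:postgresql: URLs get none, so the wrapper code path stays inert.
fun wrapperPropertiesFor(jdbcUrl: String): Map<String, String> =
    if (jdbcUrl.startsWith(wrapperScheme)) {
        mapOf("wrapperPlugins" to wrapperPluginPipeline.joinToString(","))
    } else {
        emptyMap()
    }

fun main() {
    // Placeholder hosts; real URLs come from each consumer's application.conf.
    println(wrapperPropertiesFor("jdbc:postgresql://db.example.internal:5432/operations"))
    // {}
    println(wrapperPropertiesFor("jdbc:aws-wrapper:postgresql://db.example.internal:5432/operations"))
    // {wrapperPlugins=auroraInitialConnection,failover2,efm2,readWriteSplitting}
}
```

Because the decision depends only on the URL, a consumer opts in by editing its application.conf and nothing else, which is what lets the release be listed as Added rather than Changed.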

| # | Deliverable | Location |
| --- | --- | --- |
| 1 | Wave 2, common-module release (ships first): scheme-detected AWS Advanced JDBC Wrapper integration, AppError.Transient + retry, decorative-declaration cleanup, Flyway mixed=true, integration tests. Strictly additive. | common-module repo: single PR + release tag |
| 2 | Waves 1 + 3, operations combined PR (ships after Wave 2 publishes): Flyway migrations on kanban_card (composite bitemporal indexes + tenant_id index) and item (composite bitemporal indexes); cardsForItem K12 cleanup; common-module pin bump; application.conf jdbcUrl flip to jdbc:aws-wrapper:postgresql://… + retry knobs. One combined CHANGELOG. Branch: product-slow-pdev-490. | operations repo |
| 3 | Wave 4, documentation PR: architecture pages, convention pages, error-contract update, and three operational runbooks. | documentation repo |
| 4 | Wave 5, synthetic-failover test execution on dev: operational gate, no PR. | dev environment |
| 5 | Project-side specification, requirements, plan, and verification documents. | roadmap/completed/product-slow-responses/pdev-490/{specification,plan}/*.md |

The detailed design phase ran in the project workbook (workbooks/ repo, notebooks/product-slow-responses/pdev-490/). The workbook is the source-of-process; the documents below are the source-of-product, derived from it. Workbook artefacts are referenced by path rather than clickable link because they live outside the documentation site.

  • workbooks/notebooks/product-slow-responses/pdev-490/conclusions.md — the implementation spec derived from the design phase. Per-repo / per-module change list, five-wave sequencing, per-PR CHANGELOG entries, documentation updates, and notes for other projects. This goal restates the conclusions at goal-doc altitude; the workbook conclusions remain available for any detail the formal project artefacts elide.
  • workbooks/notebooks/product-slow-responses/pdev-490/index.md — reader’s guide for the workbook directory.
  • workbooks/notebooks/product-slow-responses/decisions/ — formal decision records (T22 wrapper adoption, T17 no partial-predicate, T23 Flyway authoritative, K12-commit, T18-resolved-by-K12, C23-C24 test location, items-page-collapse).
  • workbooks/notebooks/product-slow-responses/threads/T20-items-page-summary-aggregation/frontend-analysis.md — basis for the PDEV-489 hand-off and the rationale for the K01 cancellation.
  • ../goal.md — the umbrella product-slow-responses project goal (PDEV-442).
  • ../analysis/ — code-dig and findings documents that drove PDEV-442 and its sub-issues (browse the directory in the repo for the per-area files: be-findings.md, fe-findings.md, code-dig.md, operations-bottlenecks.md, etc.).
  • ../../../../process/operation-notes/20260514-aurora-parameter-group-and-operations-bump-rollout.md — empirical data from the PDEV-479 + PDEV-488 rollout; the entry-state baseline against which PDEV-490’s success criteria measure improvement.

Related Linear issues:

  • PDEV-490: this project.
  • PDEV-509: graceful failover (absorbed into PDEV-490).
  • PDEV-534: transactionIsolation evaluation (follow-up, runs after PDEV-490 ships).
  • PDEV-489: items-page front-end consolidation (sibling, out of scope here).
  • PDEV-442: parent.

Copyright: (c) Arda Systems 2025-2026, All rights reserved