Analysis: PDEV-490 Operations Performance Improvements

Author: Claude Opus for jmpicnic | Date: 2026-05-19 | Status: Draft

Entry-state analysis of the operations component and common-module persistence layer, scoped to the two endpoints that PDEV-490 targets and the persistence-layer surfaces those endpoints depend on. Establishes the empirical baseline against which requirements define improvement targets, and surfaces the gaps that the specification closes.

Summary

The operations component issues bitemporal SELECTs (latest-version-per-eId) against kanban_card and item. Two route handlers fan out from the items-page workload: cardsForItem (single-item) and listWithDetails (page-scoped with per-chunk item-side fan-out). At the measured 2026-05-19 baseline, cardsForItem runs at p50 1,113 ms / p95 2,911 ms on Alpha001-prod; listWithDetails runs at p50 289 ms / p95 2,035 ms. The dominant costs are:

Inner bitemporal subqueries on kanban_card and item planning against only three single-column indexes each — no composite covering the (tenant_id, item_reference_entity_id, eid, effective_as_of DESC, recorded_as_of DESC) access pattern.
A wasted COUNT issued by cardsForItem because the kanban service requests withTotal = true and uses the result only as a non-null sanity check.
A naive JDBC stack — HikariCP wired directly to a single Aurora cluster endpoint, no reader routing, no failover-aware retry. Aurora failovers translate to ~30 s of HTTP 500s while the JVM DNS cache holds the dead endpoint.

The project replaces the JDBC stack with the AWS Advanced JDBC Wrapper (read/write splitting + topology-driven failover + retry-on-typed-exception) in common-module; adds composite bitemporal indexes on kanban_card and item in operations; drops the wasted COUNT in cardsForItem; and surfaces transient failures as HTTP 503 with Retry-After. The bitemporal SELECTs are auto-routed to Aurora reader instances afterward.

Scope

This analysis covers:

The two target routes (cardsForItem, listWithDetails) — handlers, service methods, and the SQL they emit.
The common-module persistence layer that backs them — Persistence.kt, AbstractUniverse.kt, AbstractScopedUniverse.kt, DataSource.kt, the inTransaction boundary, and the StatusPages-installed HTTP error contract.
The Flyway migration trees that govern index coverage on kanban_card and item, plus a tenant-id index audit across every ScopedTable consumer in operations.
The measured performance baseline for both routes against Alpha001-prod, Alpha002-stage, and Alpha002-dev via Sentry over the trailing five-day window.

It does not cover:

Front-end consumers — the items-page front-end consolidation is tracked separately on PDEV-489. PDEV-490 ships the composite index that the front-end work depends on, but the front-end change itself is out of scope here.
Aurora cluster configuration (instance class, parameter group, max-connections) — handled by PDEV-479 and already shipped.
pg_stat_statements provisioning — handled by PDEV-498 and already shipped.
Long-term DB query observability tooling — tracked separately by PDEV-512.

Current state

Target endpoints

The two routes in scope are declared in operations/src/main/kotlin/cards/arda/operations/resources/kanban/api/rest/KanbanCardEndpoint.kt and implemented in operations/src/main/kotlin/cards/arda/operations/resources/kanban/service/ServiceImpl.kt:

Route	Service method	Workload shape
`GET /v1/kanban/kanban-card/for-item/{itemEId}`	`KanbanCardService.cardsForItem(itemRef, asOf)` (`ServiceImpl.kt:276-293`)	One bitemporal SELECT on `kanban_card` with `Filter.Eq(item.eId)`, plus an unused COUNT. Returns up to 1,000 cards.
`POST /v1/kanban/kanban-card/details`	`KanbanCardService.listWithDetails(query, asOf)` (`ServiceImpl.kt:322-350`)	One bitemporal SELECT on `kanban_card` followed by a chunked per-chunk SELECT on `item` (25-card chunks, `flatMapMerge(concurrency = 25)`). Hydrates `KanbanCardDetails` with full `Item.Entity` per card.

`cardsForItem` — the wasted COUNT

cardsForItem issues universe.list(query, asOf, withTotal = true). Tracing through common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractUniverse.kt:152-180:

With withTotal = true, the underlying persistence layer issues a COUNT(*) against the same predicate in addition to the row-returning SELECT.
The Kotlin caller (ServiceImpl.kt:287-290) uses the resulting totalCount only as a non-null sanity check (when (pg.totalCount) { null -> Result.failure(AppError.IncompatibleState(...)); else -> Result.success(pg) }). The value never propagates to the HTTP response.
The null arm of that when is dead code under today’s withTotal = true — AbstractUniverse.list always materialises a non-null Long into pg.totalCount when withTotal is true. Dropping the flag without also removing the when would invert the dead branch into a 100%-failure regression.

Net per cardsForItem invocation today: 2 SQL statements (1 COUNT + 1 SELECT) on the kanban DB.
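The coupling between the flag and the null check can be shown with a minimal sketch. The `Page` shape and both functions below are simplified, hypothetical stand-ins for the real `AbstractUniverse.list` and `ServiceImpl.cardsForItem`, and the failure type is a placeholder for `AppError.IncompatibleState`.

```kotlin
// Simplified stand-in for the real page type (hypothetical shape).
data class Page<T>(val results: List<T>, val totalCount: Long?)

// Models the behaviour described above: totalCount is populated only
// when withTotal is true, otherwise it stays null.
fun <T> list(rows: List<T>, withTotal: Boolean): Page<T> =
    Page(rows, if (withTotal) rows.size.toLong() else null)

// Models today's caller: a null totalCount is treated as a failure.
fun <T> cardsForItemToday(rows: List<T>, withTotal: Boolean): Result<Page<T>> {
    val pg = list(rows, withTotal)
    return when (pg.totalCount) {
        null -> Result.failure(IllegalStateException("totalCount missing"))
        else -> Result.success(pg)
    }
}
```

In this model, `cardsForItemToday(rows, withTotal = true)` always succeeds and `cardsForItemToday(rows, withTotal = false)` always fails, regardless of the rows returned, which is the inversion the paragraph above warns about.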

`listWithDetails` — chunked per-chunk fan-out

listWithDetails runs an outer kanban SELECT followed by a per-chunk inner item SELECT:

listEntities(query, asOf).flatMap { pageRs ->
  pageRs.results.chunked(25).asFlow().flatMapMerge(concurrency = 25) { chunk ->
    flow {
      val targetItems = chunk.map { it.payload.item.eId }.toSet().toList()
      itemService.listEntities(
        Query(Filter.In(ITEM_TABLE.eId.name, targetItems), Pagination(0, chunk.size)),
        asOf
      ).map { it.results.associate { it.payload.eId to it.payload } }
       .onSuccess { itMap -> emitAll(chunk.asFlow().map { composeDetails(asOf, it, itMap[it.payload.item.eId]) }) }
       .onFailure { emit(Result.failure(it)) }
    }
  }
  // …
}

Net per listWithDetails invocation today: 1 SELECT on kanban_card + ⌈N/25⌉ SELECTs on item (where N is the kanban result size). For a 25-row page that’s 2 SQL statements; for 200 rows that’s 9.
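The statement count above is simple ceiling arithmetic. A small sketch, with a hypothetical function name and the chunk size of 25 taken from the excerpt:

```kotlin
// Statements issued by one listWithDetails call: one kanban_card SELECT
// plus one item SELECT per chunk of `chunkSize` cards.
fun statementsPerCall(cardCount: Int, chunkSize: Int = 25): Int {
    require(cardCount >= 0 && chunkSize > 0) { "cardCount must be >= 0 and chunkSize > 0" }
    val itemSelects = (cardCount + chunkSize - 1) / chunkSize // integer ceil(N / chunkSize)
    return 1 + itemSelects
}
```

`statementsPerCall(25)` gives 2 and `statementsPerCall(200)` gives 9, matching the figures above; an empty kanban result issues only the outer SELECT.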

Bitemporal SQL pattern

Both routes ultimately emit the same SQL shape via common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/bitemporal/Persistence.kt. For the kanban_card SELECT:

SELECT bt.*           -- ~30 wide columns
FROM kanban_card bt
WHERE bt.id IN (
  SELECT sq.id
  FROM kanban_card sq
  WHERE <user condition>
    AND <tenant constraint>
    AND sq.effective_as_of <= <asOf.effective>
    AND sq.recorded_as_of  <= <asOf.recorded>
    AND bt.eId = sq.eId          -- correlated to outer row
    AND bt.retired = FALSE
  ORDER BY sq.effective_as_of DESC, sq.recorded_as_of DESC
  LIMIT 1
)
ORDER BY bt.recorded_as_of DESC, bt.effective_as_of DESC, bt.id ASC
OFFSET 0 LIMIT 1000

This is the “latest version of each entity at an asOf coordinate” bitemporal pattern. The correlated subquery (bt.eId = sq.eId) forces Postgres to either re-execute the inner query per outer row or unroll it via a hash/merge plan. Plan quality depends entirely on whether a composite index covers the inner-subquery predicate.
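For readers less familiar with the pattern, the selection rule can be modelled in memory. This is a sketch under stated assumptions, not the emitter in Persistence.kt: the `Version` and `AsOf` types are hypothetical, the user condition and tenant constraint are omitted, and the retired check is applied to the winning version, which is how the SQL above reads.

```kotlin
import java.time.Instant

// Hypothetical, simplified row shape; the real table is much wider.
data class Version(
    val id: Long,
    val eId: String,
    val effectiveAsOf: Instant,
    val recordedAsOf: Instant,
    val retired: Boolean
)

data class AsOf(val effective: Instant, val recorded: Instant)

// Newest first by effective time, then by recorded time (mirrors the
// inner ORDER BY ... DESC LIMIT 1 by taking the maximum).
private val newest = compareBy<Version>({ it.effectiveAsOf }, { it.recordedAsOf })

// For each eId, pick the newest version visible at `asOf`, then drop
// entities whose newest visible version is retired.
fun latestVersions(rows: List<Version>, asOf: AsOf): List<Version> =
    rows.filter { it.effectiveAsOf <= asOf.effective && it.recordedAsOf <= asOf.recorded }
        .groupBy { it.eId }
        .values
        .mapNotNull { versions -> versions.maxWithOrNull(newest) }
        .filter { !it.retired }
```

The `groupBy` step is the in-memory analogue of the correlation on `eId`: the database has to find, for every entity, the single newest qualifying row, which is the lookup a composite index ordered by `(eid, effective_as_of DESC, recorded_as_of DESC)` can serve directly.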

Index coverage

Current indexes on kanban_card, from operations/src/main/resources/resources/kanban/database/migrations/V001__kanban.sql:50-52:

CREATE INDEX idx_kanban_card_eid             ON kanban_card (eid);
CREATE INDEX idx_kanban_card_effective_as_of ON kanban_card (effective_as_of);
CREATE INDEX idx_kanban_card_recorded_as_of  ON kanban_card (recorded_as_of);

Three single-column indexes. Subsequent migrations V002–V006 add columns but no further indexes on kanban_card. The tenant_id column exists but is not indexed in the Flyway tree; the AbstractScopedUniverse.kt:27 declaration tenantId.index("TENANT_ID_INDEX") is decorative (Exposed’s schema-emit path is not invoked in any deploy environment — Flyway is authoritative).

Current indexes on item follow the same pattern (single-column eid, effective_as_of, recorded_as_of), with the tenant_id index here actually present via reference/item/database/migrations/V012__bt_indexes.sql:8.

Tenant-id audit across ScopedTable consumers (audit completed 2026-05-18, full results below for reference):

Module	Table	`tenant_id` index status
`reference/item`	`ITEM_TABLE`	Present (`V012__bt_indexes.sql:8` — `idx_item_tenant`)
`reference/business-affiliate`	`BUSINESS_AFFILIATE_TABLE`	Present (`V001__biz_affiliates.sql:89` — `idx_ba_tenant_id`)
`system/batch`	`BATCH_JOB_TABLE`	Present, but the migration lives in `reference/item/V012__bt_indexes.sql:12` (the misplaced location is left as-is)
`resources/kanban`	`KANBAN_CARD_TABLE`	Missing
`resources/facility`	`FACILITY_TABLE`	Missing — deferred (out of scope)
`resources/station`	`STATION_TABLE`	Missing — deferred (out of scope)
`procurement/orders`	`ORDER_HEADER_TABLE`	Missing — deferred (out of scope)

The audit found exactly one PDEV-490-actionable gap: kanban_card is missing its tenant_id index. The migration that adds the composite bitemporal indexes on kanban_card will also add (tenant_id) as a separate index in the same file. The three modules deferred (facility, station, procurement/orders) are deliberately out of scope for PDEV-490; they are candidates for a future per-module hygiene pass.

Connection pool and JDBC stack

operations/src/main/resources/application.conf:45-58:

dataSource {
  pool {
    minIdle = 1
    maxPoolSize = 10
    maxLifetime = 1800000
    connectionTimeout = 30000
    validationTimeout = 1000
    idleTimeout = 600000
    initializationFailTimeout = 1
    isAutoCommit = true
    keepAliveTime = 600000
    transactionIsolation = "TRANSACTION_REPEATABLE_READ"
  }
}

The JDBC stack today (common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/DataSource.kt):

HikariCP as the application-level pool, one pool per module DB (six pools in operations: kanban, item, businessaffiliates, facility, station, batch).
jdbcUrl of the form jdbc:postgresql://<aurora-cluster-writer-endpoint>:<port>/<db>.
driverClassName = "org.postgresql.Driver".
No read/write splitting — every transaction lands on the writer endpoint.
No failover-aware behavior — when Aurora promotes a different writer instance, the JVM DNS cache continues to resolve the cluster endpoint to the previously-promoted instance for ~30 s (per the JVM’s default networkaddress.cache.ttl).

Connection.setReadOnly(true) is propagated by Exposed when callers pass readOnly = true to transaction(...). Today this flag is set but unused at the JDBC layer — it’s a no-op against a writer-endpoint connection.

Error rendering

common-module/lib/src/main/kotlin/cards/arda/common/lib/api/rest/types/HttpResponses.kt:233-250 defines the canonical appErrorResponse mapping. AppError.Internal subtypes (Implementation, Infrastructure, InternalService, IncompatibleState, InternalTimeout, ExternalService) all render as HTTP 500 with the exception message in the body. There is no AppError.Transient branch, no HTTP 503 contract, and no Retry-After header.

Operations-side SQLException handler audit (completed 2026-05-18): grep -rnE 'SQLException|ExposedSQLException|PSQLException' src/main/kotlin/ against the operations worktree returned 0 hits. The canonical StatusPages handler in common-module is the sole HTTP renderer for SQL exceptions.

Failover behavior

Today, an Aurora failover triggers the following sequence:

The previously-promoted writer instance becomes unavailable.
HikariCP detects connection failure on the next acquire and starts retrying within the 30 s connectionTimeout window.
The JVM continues to resolve the cluster endpoint to the dead IP for up to 30 s (DNS cache).
Connections continue to fail. HikariCP exhausts its retry budget; transactions surface as org.postgresql.util.PSQLException / org.jetbrains.exposed.exceptions.ExposedSQLException.
The StatusPages handler maps these to HTTP 500 (AppError.Implementation).
The user-visible 5xx window is ~30 s long, all HTTP 500.

There is no graceful-degradation path, no retry-on-transient at the inTransaction boundary, and no Aurora-topology awareness — the JVM does not know that Aurora has promoted a different writer until the DNS cache expires.

Measured baseline

Sentry transaction durations on platform-be, trailing 5 days, all environments:

Route	Env	Count	p50	p95	p99
`GET /v1/kanban/.../kanban-card/for-item/{item-eid}`	`Alpha001-prod`	4,375	1,113 ms	2,911 ms	3,677 ms
`GET /v1/kanban/.../kanban-card/for-item/{item-eid}`	`Alpha002-dev`	1,140	553 ms	1,610 ms	2,173 ms
`GET /v1/kanban/.../kanban-card/for-item/{item-eid}`	`Alpha002-stage`	142	694 ms	1,725 ms	1,854 ms
`POST /v1/kanban/.../kanban-card/details`	`Alpha001-prod`	24,755	289 ms	2,035 ms	3,215 ms
`POST /v1/kanban/.../kanban-card/details`	`Alpha002-dev`	761	1,213 ms	2,081 ms	2,672 ms
`POST /v1/kanban/.../kanban-card/details`	`Alpha002-stage`	70	680 ms	2,110 ms	2,182 ms

For reference, lighter sibling kanban-card routes on Alpha001-prod (no fan-out, no wide-row hydration):

Route	p50	p95
`POST .../kanban-card/details/{status}`	119 ms	195 ms
`POST .../kanban-card/query`	76 ms	145 ms
`GET .../kanban-card/{entity-id}`	6 ms	7 ms

These sibling routes establish what the kanban-card SQL surface looks like when the inner subquery isn’t the dominant cost — single-digit-millisecond simple lookups, ~100–200 ms for filtered listings without per-row hydration.

Connection-timeout signal: zero Sentry events for connectionTimeout, SQLTransientConnectionException, HikariPool, or the broader connection term across the errors and logs datasets in the trailing 4 days (sanity check: 1,120,365 spans on Alpha001-prod over the same window confirm that instrumentation is live). The writer-side connection pool is not under saturation pressure today.

HPA configuration (operations/src/main/helm/values-*.yaml, working tree at origin/main 2026-05-19):

Environment	`minReplicas`	`maxReplicas`
`values-prod.yaml`	2	8
`values-stage.yaml`	2	4
`values-demo.yaml`	2	4
`values-dev.yaml`	2	4
`values-local.yaml`	1	2
chart default	2	4

Only prod runs at the upper maxReplicas = 8.

Target state

PDEV-490 changes the persistence layer and the kanban-side SQL surface in coordinated steps:

JDBC stack — AWS Advanced JDBC Wrapper

common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/DataSource.kt wires HikariCP through the AWS Advanced JDBC Wrapper (software.amazon.jdbc:aws-advanced-jdbc-wrapper:4.0.1):

jdbcUrl template changes from jdbc:postgresql://… to jdbc:aws-wrapper:postgresql://….
driverClassName = "software.amazon.jdbc.Driver".
Plugin pipeline: auroraInitialConnection, failover2, efm2, readWriteSplitting.
HikariConfig.exceptionOverrideClassName = "software.amazon.jdbc.util.HikariCPSQLException" so HikariCP cooperates with wrapper-emitted failover exceptions instead of evicting healthy connections.
Aurora-tuning properties: failoverClusterTopologyRefreshRateMs = 2000, failoverReaderConnectTimeoutMs = 5000, failoverWriterReconnectIntervalMs = 2000, loadBalanceReadOnlyTraffic = true.

After this lands, Connection.setReadOnly(true) (which Exposed already calls on transaction(readOnly = true)) becomes meaningful — the wrapper’s readWriteSplitting plugin routes read-only physical connections to an Aurora reader instance; writes land on the writer instance. The application-level HikariCP pool, its size, and its caller-facing surface are unchanged.
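The shape of the new URL can be sketched with a small builder. The function name, host, port, and database below are hypothetical. `wrapperPlugins` is the wrapper's connection property for the plugin pipeline; whether DataSource.kt passes it in the URL or as a HikariCP data-source property is not stated here, so treat the query-string form as one possible wiring.

```kotlin
// Builds a wrapper-scheme JDBC URL. Host, port, and database are
// placeholders supplied by the caller; plugin codes are passed through
// unchanged and joined in pipeline order.
fun wrapperJdbcUrl(host: String, port: Int, database: String, plugins: List<String>): String {
    val base = "jdbc:aws-wrapper:postgresql://$host:$port/$database"
    return if (plugins.isEmpty()) base
    else "$base?wrapperPlugins=${plugins.joinToString(",")}"
}
```

Called with the pipeline listed above, this yields a URL ending in `?wrapperPlugins=auroraInitialConnection,failover2,efm2,readWriteSplitting`. The plugin codes are copied from this document and should be checked against the documentation for the wrapper version actually pinned.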

Composite bitemporal indexes

Two new Flyway migrations:

operations/src/main/resources/resources/kanban/database/migrations/V007__kanban_card_bitemporal_indexes.sql — adds three indexes on kanban_card in a single file: the two composite bitemporal indexes ((eid, effective_as_of DESC, recorded_as_of DESC) and (tenant_id, item_reference_entity_id, eid, effective_as_of DESC, recorded_as_of DESC)) plus the missing (tenant_id) index. None carry a WHERE retired = FALSE partial predicate.
operations/src/main/resources/reference/item/database/migrations/V*__item_bitemporal_indexes.sql — adds the composite bitemporal index on item matching the same shape. Existing idx_item_tenant stands.

All indexes use CREATE INDEX CONCURRENTLY, which means each statement must run outside a Flyway transaction (one statement per migration file or executeInTransaction = false on the migration).

`cardsForItem` cleanup

ServiceImpl.kt:276-293 collapses to:

override suspend fun cardsForItem(itemReference: ItemReference, asOf: TimeCoordinates)
        : Result<Page<KanbanCard, KanbanCardMetadata>> = inTransaction(db, readOnly = true) {
  universe.list(
    Query(Filter.Eq(KANBAN_CARD_TABLE.item.eId.name, itemReference.eId), Pagination(0, 1000)),
    asOf,
    includeDeleted = false,
    withTotal = false
  )()
}

Two coupled changes that must land together: flip withTotal = true → false AND delete the flatMap { … when (pg.totalCount) … } block.

`AppError.Transient` + HTTP 503

common-module gains:

A new AppError.Transient sealed branch under AppError.Internal, with three subtypes wrapping the wrapper’s typed exceptions: FailoverSucceeded (over FailoverSuccessSQLException), TransactionStateUnknown (over TransactionStateUnknownSQLException), FailoverFailed (over FailoverFailedSQLException).
New branches on the existing Throwable.normalizeToAppError() extension (at common-module/lib/src/main/kotlin/cards/arda/common/lib/lang/errors/AppError.kt:192) that walk the cause chain (unwrapping ExposedSQLException and HikariCP wrapping) to detect the three wrapper exception classes. No separate adapter class is introduced; classification stays in the canonical normalizeToAppError function.
StatusPages rendering of AppError.Transient as HTTP 503 with header Retry-After: 2.
A retry policy at the inTransactionAsync / inTransactionSync boundary that catches the three transient types, retries up to PoolConfig.maxAttempts - 1 additional times with PoolConfig.backoffMs ms between attempts, and surfaces AppError.Transient once retries exhaust.
New PoolConfig fields maxAttempts (default 2) and backoffMs (default 300).
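The classification and rendering steps in the list above can be sketched as follows. Everything here is a simplified model: the three exception classes are local stand-ins for the wrapper's typed exceptions, `Transient` stands in for the planned `AppError.Transient` branch, and `render` stands in for the StatusPages handler. The real implementation would live in `normalizeToAppError` and is not shown in this document.

```kotlin
import java.sql.SQLException

// Local stand-ins for the wrapper's typed exceptions. Hypothetical: real
// code would match on the classes shipped in the wrapper library.
class FailoverSuccessSQLException : SQLException("failover succeeded")
class TransactionStateUnknownSQLException : SQLException("transaction state unknown")
class FailoverFailedSQLException : SQLException("failover failed")

// Simplified model of the transient branch and its three subtypes.
sealed class Transient(val cause: Throwable) {
    class FailoverSucceeded(cause: Throwable) : Transient(cause)
    class TransactionStateUnknown(cause: Throwable) : Transient(cause)
    class FailoverFailed(cause: Throwable) : Transient(cause)
}

// Maps a single throwable to a transient subtype, or null if it is not one.
fun asTransient(t: Throwable): Transient? = when (t) {
    is FailoverSuccessSQLException -> Transient.FailoverSucceeded(t)
    is TransactionStateUnknownSQLException -> Transient.TransactionStateUnknown(t)
    is FailoverFailedSQLException -> Transient.FailoverFailed(t)
    else -> null
}

// Walks the cause chain from the outermost exception inward and returns the
// first match. The depth bound guards against cyclic cause chains.
fun classifyTransient(error: Throwable, maxDepth: Int = 16): Transient? =
    generateSequence(error) { it.cause }
        .take(maxDepth)
        .mapNotNull { asTransient(it) }
        .firstOrNull()

// Minimal model of the HTTP contract: 503 with Retry-After for transient
// failures, 500 for everything else.
data class Rendered(val status: Int, val headers: Map<String, String>)

fun render(error: Throwable): Rendered =
    if (classifyTransient(error) != null) Rendered(503, mapOf("Retry-After" to "2"))
    else Rendered(500, emptyMap())
```

Walking the chain rather than checking only the top-level exception matters because the wrapper's exception typically arrives wrapped, for example inside an ExposedSQLException.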

Operations consumes the new release by:

Bumping the common-module pin in operations/gradle/libs.versions.toml.
Updating application.conf dataSource.jdbcUrl to the jdbc:aws-wrapper:postgresql://… scheme.
Adding the explicit dataSource.pool.maxAttempts = 2 and dataSource.pool.backoffMs = 300 knobs in application.conf (defaults match common-module; explicit values document the env contract).
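How the two knobs interact can be shown with a blocking retry loop. This is a sketch only: `RetryConfig` is a hypothetical mirror of the two new PoolConfig fields, `withTransientRetry` is an illustrative name, and the real boundary (`inTransactionAsync` in particular) would presumably suspend rather than call `Thread.sleep`.

```kotlin
// Hypothetical mirror of the two new PoolConfig fields, using the
// defaults stated in this document.
data class RetryConfig(val maxAttempts: Int = 2, val backoffMs: Long = 300)

// Runs `block` up to maxAttempts times in total, sleeping backoffMs between
// attempts. Only errors accepted by `isTransient` are retried; any other
// error, and the last transient error, is rethrown to the caller.
fun <T> withTransientRetry(
    config: RetryConfig = RetryConfig(),
    isTransient: (Throwable) -> Boolean,
    block: () -> T
): T {
    require(config.maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var lastError: Exception? = null
    for (attempt in 1..config.maxAttempts) {
        try {
            return block()
        } catch (e: Exception) {
            if (!isTransient(e)) throw e
            lastError = e
            // No sleep after the final attempt; the error surfaces immediately.
            if (attempt < config.maxAttempts) Thread.sleep(config.backoffMs)
        }
    }
    throw checkNotNull(lastError)
}
```

With the defaults, a transient failure costs at most one extra attempt and 300 ms of backoff before the caller sees the error, which is where the 503 rendering takes over.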

Decorative declaration removal

common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractScopedUniverse.kt:27 — the tenantId.index("TENANT_ID_INDEX") call is removed; the column declaration becomes plain uuid(ScopedMetadata.COLUMN_TENANT_ID). No runtime change (Exposed’s schema-emit was never relied on); the decorative declaration is removed so future readers don’t infer a guarantee that doesn’t exist. Flyway is the single authoritative source for indexes.

Gap analysis

Area	Current	Target	Gap closed by
Bitemporal SELECT plan on `kanban_card`	Sequential / single-column index lookup on the correlated subquery	Index scan on the composite `(tenant_id, item_reference_entity_id, eid, effective_as_of DESC, recorded_as_of DESC)`	Wave 1 kanban Flyway PR
Bitemporal SELECT plan on `item`	Sequential / single-column index lookup	Index scan on the composite	Wave 1 item Flyway PR
`cardsForItem` SQL count	2 statements (1 COUNT + 1 SELECT)	1 statement (SELECT only)	Wave 1 kanban Kotlin change
Tenant-id index on `kanban_card`	Missing	Present	Wave 1 kanban Flyway PR (consolidated with the bitemporal-index migration)
Decorative `TENANT_ID_INDEX` declaration	Present at `AbstractScopedUniverse.kt:27`	Removed	Wave 2 common-module release
Read/write splitting	None — all transactions hit writer	Read-only transactions auto-route to Aurora reader instance via wrapper’s `readWriteSplitting` plugin	Wave 2 common-module release
Failover detection latency	~30 s (JVM DNS cache–bound)	~2–5 s (Aurora topology API via wrapper’s `failover2` plugin)	Wave 2 common-module release
Transient SQL HTTP contract	HTTP 500 with raw exception body	HTTP 503 with `Retry-After: 2`	Wave 2 common-module release
Retry on transient	None	In-process retry with `maxAttempts=2`, `backoffMs=300` at the `inTransaction` boundary	Wave 2 common-module release
Operations consumer wiring	Default JDBC scheme, no retry knobs	`jdbc:aws-wrapper:postgresql://…` scheme, explicit retry knobs in `application.conf`	Wave 3 operations PR
Documentation	No site pages on wrapper / bitemporal-index pattern / Flyway-authoritative convention / 503 contract; no runbooks for the wrapper deploy or the synthetic-failover test	All four site pages and all three runbooks present	Wave 4 documentation PR
Synthetic-failover acceptance test	Not exercised	Procedure documented; passes on dev before promotion	Wave 5 dev failover test

Out-of-scope surfaces

These adjacent surfaces are deliberately untouched by PDEV-490:

The items-page front-end consumer of listWithDetails. Tracked on PDEV-489. The front-end resolution path (consolidate the two per-row backend calls into one page-level /v1/kanban/kanban-card/query call with Filter.In(item_reference_entity_id, [eIds…])) does not require any new back-end route — it uses an existing one. PDEV-490 ships the composite kanban-card index that the new front-end SQL plan needs, but the front-end implementation itself is not part of this project.
listWithDetails chunked-fan-out refactor. A previously proposed refactor (listWithDetails collapses the per-chunk inTransaction into a single up-front Filter.In fetch) was cancelled when the front-end resolution moved off this route entirely. Remaining callers (ItemDetailsPanel.fetchCards, ManageCardsPanel.fetchCards) are single-item flows where the chunk-vs-fetch tradeoff has no forcing function.
A new summary/for-items aggregate route on the kanban service. Cancelled. The front-end consolidation onto the existing /v1/kanban/kanban-card/query route renders the new aggregate redundant.
Pool-size tuning on item and kanban DBs. The wrapper’s read/write split removes the writer-pool ceiling pressure that would have driven a tuning pass. The current maxPoolSize = 10 stays. Sentry shows zero connection-timeout pressure in the trailing 4 days across all environments.
HPA maxReplicas reduction. Was a fallback under the originally-considered writer-pool budget pressure; the wrapper’s read/write split removes the budget pressure. No change to HPA.
JVM DNS TTL helm chart change. Was relevant under the original DNS-cache-bound failover detection; the wrapper bypasses DNS for failover detection (uses the Aurora topology API). The chart-level networkaddress.cache.ttl override is not added.
transactionIsolation evaluation (REPEATABLE_READ → READ_COMMITTED on read-only paths). Filed as Linear PDEV-534 to run after PDEV-490 ships, so the post-wrapper, post-index baseline is the reference point.
RDS Proxy adoption. Closed as won’t-do (Linear PDEV-499); the wrapper is incompatible with RDS Proxy by design.
Service-level read cache on kanban_card / item. Deferred; revisit only after the new indexes have soaked and pg_stat_statements still shows headroom.
cardsForItem bulk-handler cleanup on the items page. Three items-page bulk handlers (handleDeleteMultipleItems, handlePrintSelectedCards, handlePreviewSelectedCards) still loop per selected item against cardsForItem; user-initiated, latency tolerable. Future ticket.
Per-module tenant_id Flyway migrations for FACILITY_TABLE, STATION_TABLE, ORDER_HEADER_TABLE. The audit surfaced these; deferred to a future per-module hygiene pass. The misplaced BATCH_JOB migration (declared in the item module’s tree) is also accepted as-is.

Risks and constraints

PDEV-490 is low risk by construction — most changes are additive (new indexes, new error branch) or coupled by design (the cardsForItem two-line change). Failure modes worth pinning:

Coupled K12 regression. If the withTotal = true → false flag flip ships without removing the surrounding flatMap { … when (pg.totalCount) … } block, every cardsForItem call returns HTTP 500 (the previously dead Result.failure(AppError.IncompatibleState) arm becomes the only live branch). Mitigation: the change is documented as a coupled two-line change; verification exercises the cleaned-up path for both zero-row and multi-row results.
Wrapper jdbcUrl scheme regression. The jdbcUrl scheme change is breaking. If a consumer of common-module (today only operations; future: accounts-component) bumps the common-module pin without updating its jdbcUrl, the wrapper driver does not accept the old jdbc:postgresql://… URL and the pod fails on startup. Mitigation: the change is documented in the common-module release CHANGELOG as Changed with explicit “Consumers must update jdbcUrl”; the operations consumer PR ships both the pin bump and the scheme change together.
Reader-endpoint topology discovery. The wrapper’s topology cache is built lazily on first connection. The first request after a pod cold start may pay a topology-discovery cost. Mitigation: auroraInitialConnection plugin in the pipeline; failoverClusterTopologyRefreshRateMs = 2000 keeps the cache fresh post-discovery.
CREATE INDEX CONCURRENTLY on busy tables. The kanban-card and item migrations use CONCURRENTLY so the index build does not block writes to the table. If the build fails (for example on a deadlock), Postgres leaves the index behind marked invalid (pg_index.indisvalid = false). Mitigation: ship to dev first; on failure, drop the invalid index manually and rerun the migration. CREATE INDEX IF NOT EXISTS alone does not make the rerun safe, because it skips any existing index of the same name, including an invalid one.
Wrapper compatibility with Exposed. The wrapper hooks into the JDBC Connection.setReadOnly lifecycle. Exposed at version 0.60.0 (pinned in common-module/gradle/libs.versions.toml:11) sets readOnly before autoCommit, which is the ordering the wrapper expects. The two-line ordering was verified by source inspection of ThreadLocalTransactionManager.kt:131-161 and JdbcConnectionImpl.kt:46-50 during the design phase.

The dev synthetic-failover test gates promotion beyond dev; demo / stage / prod each take a standard per-environment soak window after that.

Source references

operations/src/main/kotlin/cards/arda/operations/resources/kanban/api/rest/KanbanCardEndpoint.kt — route declarations.
operations/src/main/kotlin/cards/arda/operations/resources/kanban/service/ServiceImpl.kt:276-293 (cardsForItem), ServiceImpl.kt:322-350 (listWithDetails) — service implementations.
operations/src/main/kotlin/cards/arda/operations/resources/kanban/persistence/KanbanCardPersistence.kt:24-34 — KANBAN_CARD_TABLE declaration including the item_reference component.
operations/src/main/kotlin/cards/arda/operations/reference/item/domain/persistence/ItemReferenceComponent.kt:24 — the item_reference_entity_id column declaration consumed by Filter.In(KANBAN_CARD_TABLE.item.eId.name, …).
operations/src/main/resources/resources/kanban/database/migrations/V001__kanban.sql:50-52 — current indexes on kanban_card.
operations/src/main/resources/reference/item/database/migrations/V012__bt_indexes.sql:8 — current idx_item_tenant; V012__bt_indexes.sql:12 — misplaced idx_batch_job_tenant (out of scope to fix).
operations/src/main/resources/application.conf:45-58 — dataSource.pool block.
operations/src/main/helm/values-prod.yaml:14-15 — prod HPA minReplicas, maxReplicas.
common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/bitemporal/Persistence.kt — bitemporal SQL emitter (self-alias bt at line 88; selection condition at lines 214-215; asOfCondition helper at lines 92-95).
common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractUniverse.kt:152-180 — list(…, withTotal) method with the COUNT + SELECT logic.
common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractScopedUniverse.kt:27 — decorative tenantId.index("TENANT_ID_INDEX") declaration.
common-module/lib/src/main/kotlin/cards/arda/common/lib/api/rest/types/HttpResponses.kt:233-250 — appErrorResponse and internalErrorResponse mapping.
common-module/gradle/libs.versions.toml:11 — Exposed version pin (0.60.0).

References

PDEV-490 goal — PDEV-490 goal and success criteria.
requirements.md — functional and non-functional requirements that derive from this analysis.
specification.md — phased implementation plan.
verification.md — traceability matrix and verification protocols.
Umbrella project goal — umbrella product-slow-responses project goal (PDEV-442).
Aurora parameter group + operations bump rollout — PDEV-479 + PDEV-488 rollout; entry-state baseline for PDEV-490.
