Analysis: PDEV-490 Operations Performance Improvements
Author: Claude Opus for jmpicnic | Date: 2026-05-19 | Status: Draft
Entry-state analysis of the operations component and common-module persistence layer, scoped to the two endpoints that PDEV-490 targets and the persistence-layer surfaces those endpoints depend on. Establishes the empirical baseline against which requirements define improvement targets, and surfaces the gaps that the specification closes.
Summary
The operations component issues bitemporal SELECTs (latest-version-per-eId) against kanban_card and item. Two route handlers fan out from the items-page workload: cardsForItem (single-item) and listWithDetails (page-scoped with per-chunk item-side fan-out). At the measured 2026-05-19 baseline, cardsForItem runs at p50 1,113 ms / p95 2,911 ms on Alpha001-prod; listWithDetails runs at p50 289 ms / p95 2,035 ms. The dominant costs are:
- Inner bitemporal subqueries on kanban_card and item planning against only three single-column indexes each, with no composite covering the (tenant_id, item_reference_entity_id, eid, effective_as_of DESC, recorded_as_of DESC) access pattern.
- A wasted COUNT issued by cardsForItem because the kanban service requests withTotal = true and uses the result only as a non-null sanity check.
- A naive JDBC stack: HikariCP wired directly to a single Aurora cluster endpoint, no reader routing, no failover-aware retry. Aurora failovers translate to ~30 s of HTTP 500s while the JVM DNS cache holds the dead endpoint.
The project replaces the JDBC stack with the AWS Advanced JDBC Wrapper (read/write splitting + topology-driven failover + retry-on-typed-exception) in common-module; adds composite bitemporal indexes on kanban_card and item in operations; drops the wasted COUNT in cardsForItem; and surfaces transient failures as HTTP 503 with Retry-After. The bitemporal SELECTs are auto-routed to Aurora reader instances afterward.
This analysis covers:
- The two target routes (cardsForItem, listWithDetails): handlers, service methods, and the SQL they emit.
- The common-module persistence layer that backs them: Persistence.kt, AbstractUniverse.kt, AbstractScopedUniverse.kt, DataSource.kt, the inTransaction boundary, and the StatusPages-installed HTTP error contract.
- The Flyway migration trees that govern index coverage on kanban_card and item, plus a tenant-id index audit across every ScopedTable consumer in operations.
- The measured performance baseline for both routes against Alpha001-prod, Alpha002-stage, and Alpha002-dev via Sentry over the trailing five-day window.
It does not cover:
- Front-end consumers: the items-page front-end consolidation is tracked separately on PDEV-489. PDEV-490 ships the composite index that the front-end work depends on, but the front-end change itself is out of scope here.
- Aurora cluster configuration (instance class, parameter group, max-connections): handled by PDEV-479 and already shipped.
- pg_stat_statements provisioning: handled by PDEV-498 and already shipped.
- Long-term DB query observability tooling: tracked separately by PDEV-512.
Current state
Target endpoints
The two routes in scope are declared in operations/src/main/kotlin/cards/arda/operations/resources/kanban/api/rest/KanbanCardEndpoint.kt and implemented in operations/src/main/kotlin/cards/arda/operations/resources/kanban/service/ServiceImpl.kt:
| Route | Service method | Workload shape |
|---|---|---|
| GET /v1/kanban/kanban-card/for-item/{itemEId} | KanbanCardService.cardsForItem(itemRef, asOf) (ServiceImpl.kt:276-293) | One bitemporal SELECT on kanban_card with Filter.Eq(item.eId), plus an unused COUNT. Returns up to 1,000 cards. |
| POST /v1/kanban/kanban-card/details | KanbanCardService.listWithDetails(query, asOf) (ServiceImpl.kt:322-350) | One bitemporal SELECT on kanban_card followed by a chunked per-chunk SELECT on item (25-card chunks, flatMapMerge(concurrency = 25)). Hydrates KanbanCardDetails with full Item.Entity per card. |
cardsForItem — the wasted COUNT
cardsForItem issues universe.list(query, asOf, withTotal = true). Tracing through common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractUniverse.kt:152-180:
- With withTotal = true, the underlying persistence layer issues a COUNT(*) against the same predicate in addition to the row-returning SELECT.
- The Kotlin caller (ServiceImpl.kt:287-290) uses the resulting totalCount only as a non-null sanity check (when (pg.totalCount) { null -> Result.failure(AppError.IncompatibleState(...)); else -> Result.success(pg) }). The value never propagates to the HTTP response.
- The null arm of that when is dead code under today’s withTotal = true: AbstractUniverse.list always materialises a non-null Long into pg.totalCount when withTotal is true. Dropping the flag without also removing the when would invert the dead branch into a 100%-failure regression.
Net per cardsForItem invocation today: 2 SQL statements (1 COUNT + 1 SELECT) on the kanban DB.
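The coupling between the flag and the guard can be illustrated with a reduced model. This is only a sketch: Page, list, and legacyGuard are hypothetical stand-ins that mimic the behavior described above (a total is materialised only when withTotal is true), not the actual common-module types, and IllegalStateException stands in for AppError.IncompatibleState.

```kotlin
// Hypothetical, reduced page type; the real Page carries metadata as well.
data class Page(val results: List<String>, val totalCount: Long?)

// Mimics the described behavior of AbstractUniverse.list:
// totalCount is non-null only when withTotal is true.
fun list(rows: List<String>, withTotal: Boolean): Page =
    Page(rows, if (withTotal) rows.size.toLong() else null)

// Mimics the guard currently wrapped around the call in cardsForItem.
fun legacyGuard(pg: Page): Result<Page> = when (pg.totalCount) {
    null -> Result.failure<Page>(IllegalStateException("IncompatibleState: total missing"))
    else -> Result.success(pg)
}
```

In this model, legacyGuard(list(rows, withTotal = true)) always succeeds, while legacyGuard(list(rows, withTotal = false)) always fails, which is the regression that removing the guard together with the flag avoids.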
listWithDetails — chunked per-chunk fan-out
listWithDetails runs an outer kanban SELECT followed by a per-chunk inner item SELECT:
```kotlin
listEntities(query, asOf).flatMap { pageRs ->
    pageRs.results.chunked(25).asFlow().flatMapMerge(concurrency = 25) { chunk ->
        flow {
            val targetItems = chunk.map { it.payload.item.eId }.toSet().toList()
            itemService.listEntities(
                Query(Filter.In(ITEM_TABLE.eId.name, targetItems), Pagination(0, chunk.size)),
                asOf
            ).map { it.results.associate { it.payload.eId to it.payload } }
                .onSuccess { itMap ->
                    emitAll(chunk.asFlow().map { composeDetails(asOf, it, itMap[it.payload.item.eId]) })
                }
                .onFailure { emit(Result.failure(it)) }
        }
    }
    // …
}
```
Net per listWithDetails invocation today: 1 SELECT on kanban_card + ⌈N/25⌉ SELECTs on item (where N is the kanban result size). For a 25-row page that’s 2 SQL statements; for 200 rows that’s 9.
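The statement count can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration (statementsPerCall is not a function in the codebase); it assumes only the 25-card chunk size shown in the handler.

```kotlin
// One kanban_card SELECT plus one item SELECT per chunk of cards.
// chunkSize = 25 mirrors the chunked(25) call in listWithDetails.
fun statementsPerCall(kanbanRows: Int, chunkSize: Int = 25): Int {
    require(kanbanRows >= 0 && chunkSize > 0) { "row count must be >= 0 and chunk size > 0" }
    val itemSelects = (kanbanRows + chunkSize - 1) / chunkSize // integer ceil(N / chunkSize)
    return 1 + itemSelects
}
```

statementsPerCall(25) gives 2 and statementsPerCall(200) gives 9, matching the figures above; an empty kanban result still costs the single outer SELECT.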
Bitemporal SQL pattern
Both routes ultimately emit the same SQL shape via common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/bitemporal/Persistence.kt. For the kanban_card SELECT:
```sql
SELECT bt.*                      -- ~30 wide columns
FROM kanban_card bt
WHERE bt.id IN (
    SELECT sq.id
    FROM kanban_card sq
    WHERE <user condition>
      AND <tenant constraint>
      AND sq.effective_as_of <= <asOf.effective>
      AND sq.recorded_as_of <= <asOf.recorded>
      AND bt.eId = sq.eId        -- correlated to outer row
      AND bt.retired = FALSE
    ORDER BY sq.effective_as_of DESC, sq.recorded_as_of DESC
    LIMIT 1
)
ORDER BY bt.recorded_as_of DESC, bt.effective_as_of DESC, bt.id ASC
OFFSET 0 LIMIT 1000
```
This is the “latest version of each entity at an asOf coordinate” bitemporal pattern. The correlated subquery (bt.eId = sq.eId) forces Postgres to either re-execute the inner query per outer row or unroll it via a hash/merge plan. Plan quality depends entirely on whether a composite index covers the inner-subquery predicate.
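The selection semantics of this pattern can be modelled in memory. The sketch below is an illustration only: Version and latestAsOf are hypothetical names, the row shape is heavily reduced, and the user and tenant predicates are left out.

```kotlin
import java.time.Instant

// Hypothetical, reduced row shape; the real kanban_card row has ~30 columns.
data class Version(
    val id: Long,
    val eId: String,
    val effectiveAsOf: Instant,
    val recordedAsOf: Instant,
    val retired: Boolean = false,
)

// For each eId, pick the newest version visible at the asOf coordinate
// (ordered by effective_as_of, then recorded_as_of) and drop the entity
// when that newest version is retired.
fun latestAsOf(rows: List<Version>, effective: Instant, recorded: Instant): List<Version> =
    rows.filter { it.effectiveAsOf <= effective && it.recordedAsOf <= recorded }
        .groupBy { it.eId }
        .mapNotNull { (_, versions) ->
            versions.maxWithOrNull(compareBy<Version>({ it.effectiveAsOf }, { it.recordedAsOf }))
        }
        .filterNot { it.retired }
```

The in-memory version groups once and scans each group; the SQL form has to answer the same "newest row per eId" question per outer row, which is why an index ordered by eid and the two timestamps matters for plan quality.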
Index coverage
Current indexes on kanban_card, from operations/src/main/resources/resources/kanban/database/migrations/V001__kanban.sql:50-52:
```sql
CREATE INDEX idx_kanban_card_eid ON kanban_card (eid);
CREATE INDEX idx_kanban_card_effective_as_of ON kanban_card (effective_as_of);
CREATE INDEX idx_kanban_card_recorded_as_of ON kanban_card (recorded_as_of);
```
Three single-column indexes. Subsequent migrations V002–V006 add columns but no further indexes on kanban_card. The tenant_id column exists but is not indexed in the Flyway tree; the AbstractScopedUniverse.kt:27 declaration tenantId.index("TENANT_ID_INDEX") is decorative (Exposed’s schema-emit path is not invoked in any deploy environment; Flyway is authoritative).
Current indexes on item follow the same pattern (single-column eid, effective_as_of, recorded_as_of), with the tenant_id index here actually present via reference/item/database/migrations/V012__bt_indexes.sql:8.
Tenant-id audit across ScopedTable consumers (audit completed 2026-05-18, full results below for reference):
| Module | Table | tenant_id index status |
|---|---|---|
| reference/item | ITEM_TABLE | Present (V012__bt_indexes.sql:8, idx_item_tenant) |
| reference/business-affiliate | BUSINESS_AFFILIATE_TABLE | Present (V001__biz_affiliates.sql:89, idx_ba_tenant_id) |
| system/batch | BATCH_JOB_TABLE | Present, but the migration lives in reference/item/V012__bt_indexes.sql:12 (the misplaced location is left as-is) |
| resources/kanban | KANBAN_CARD_TABLE | Missing |
| resources/facility | FACILITY_TABLE | Missing; deferred (out of scope) |
| resources/station | STATION_TABLE | Missing; deferred (out of scope) |
| procurement/orders | ORDER_HEADER_TABLE | Missing; deferred (out of scope) |
The audit found exactly one PDEV-490-actionable gap: kanban_card is missing its tenant_id index. The migration that adds the composite bitemporal indexes on kanban_card will also add (tenant_id) as a separate index in the same file. The three modules deferred (facility, station, procurement/orders) are deliberately out of scope for PDEV-490; they are candidates for a future per-module hygiene pass.
Connection pool and JDBC stack
operations/src/main/resources/application.conf:45-58:
```hocon
dataSource {
  pool {
    minIdle = 1
    maxPoolSize = 10
    maxLifetime = 1800000
    connectionTimeout = 30000
    validationTimeout = 1000
    idleTimeout = 600000
    initializationFailTimeout = 1
    isAutoCommit = true
    keepAliveTime = 600000
    transactionIsolation = "TRANSACTION_REPEATABLE_READ"
  }
}
```
The JDBC stack today (common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/DataSource.kt):
- HikariCP as the application-level pool, one pool per module DB (six pools in operations: kanban, item, businessaffiliates, facility, station, batch).
- jdbcUrl of the form jdbc:postgresql://<aurora-cluster-writer-endpoint>:<port>/<db>.
- driverClassName = "org.postgresql.Driver".
- No read/write splitting: every transaction lands on the writer endpoint.
- No failover-aware behavior: when Aurora promotes a different writer instance, the JVM DNS cache continues to resolve the cluster endpoint to the previously-promoted instance for ~30 s (per the JVM’s default networkaddress.cache.ttl).
Connection.setReadOnly(true) is propagated by Exposed when callers pass readOnly = true to transaction(...). Today this flag is set but unused at the JDBC layer — it’s a no-op against a writer-endpoint connection.
Error rendering
common-module/lib/src/main/kotlin/cards/arda/common/lib/api/rest/types/HttpResponses.kt:233-250 defines the canonical appErrorResponse mapping. AppError.Internal subtypes (Implementation, Infrastructure, InternalService, IncompatibleState, InternalTimeout, ExternalService) all render as HTTP 500 with the exception message in the body. There is no AppError.Transient branch, no HTTP 503 contract, and no Retry-After header.
Operations-side SQLException handler audit (completed 2026-05-18): grep -rnE 'SQLException|ExposedSQLException|PSQLException' src/main/kotlin/ against the operations worktree returned 0 hits. The canonical StatusPages handler in common-module is the sole HTTP renderer for SQL exceptions.
Failover behavior
Today, an Aurora failover triggers the following sequence:
- The previously-promoted writer instance becomes unavailable.
- HikariCP detects connection failure on the next acquire and starts retrying within the 30 s connectionTimeout window.
- The JVM continues to resolve the cluster endpoint to the dead IP for up to 30 s (DNS cache).
- Connections continue to fail. HikariCP exhausts its retry budget; transactions surface as org.postgresql.util.PSQLException / org.jetbrains.exposed.exceptions.ExposedSQLException.
- The StatusPages handler maps these to HTTP 500 (AppError.Implementation).
- The user-visible 5xx window is ~30 s long, all HTTP 500.
There is no graceful-degradation path, no retry-on-transient at the inTransaction boundary, and no Aurora-topology awareness — the JVM does not know that Aurora has promoted a different writer until the DNS cache expires.
Measured baseline
Sentry transaction durations on platform-be, trailing 5 days, all environments:
| Route | Env | Count | p50 | p95 | p99 |
|---|---|---|---|---|---|
| GET /v1/kanban/.../kanban-card/for-item/{item-eid} | Alpha001-prod | 4,375 | 1,113 ms | 2,911 ms | 3,677 ms |
| GET /v1/kanban/.../kanban-card/for-item/{item-eid} | Alpha002-dev | 1,140 | 553 ms | 1,610 ms | 2,173 ms |
| GET /v1/kanban/.../kanban-card/for-item/{item-eid} | Alpha002-stage | 142 | 694 ms | 1,725 ms | 1,854 ms |
| POST /v1/kanban/.../kanban-card/details | Alpha001-prod | 24,755 | 289 ms | 2,035 ms | 3,215 ms |
| POST /v1/kanban/.../kanban-card/details | Alpha002-dev | 761 | 1,213 ms | 2,081 ms | 2,672 ms |
| POST /v1/kanban/.../kanban-card/details | Alpha002-stage | 70 | 680 ms | 2,110 ms | 2,182 ms |
For reference, lighter sibling kanban-card routes on Alpha001-prod (no fan-out, no wide-row hydration):
| Route | p50 | p95 |
|---|---|---|
| POST .../kanban-card/details/{status} | 119 ms | 195 ms |
| POST .../kanban-card/query | 76 ms | 145 ms |
| GET .../kanban-card/{entity-id} | 6 ms | 7 ms |
These sibling routes establish what the kanban-card SQL surface looks like when the inner subquery isn’t the dominant cost — single-digit-millisecond simple lookups, ~100–200 ms for filtered listings without per-row hydration.
Connection-timeout signal: zero Sentry events for connectionTimeout, SQLTransientConnectionException, HikariPool, or the broader connection term across errors and logs datasets in the trailing 4 days (sanity check: 1,120,365 spans on Alpha001-prod over the same window confirms instrumentation is live). The writer-side connection pool is not under saturation pressure today.
HPA configuration (operations/src/main/helm/values-*.yaml, working tree at origin/main 2026-05-19):
| Environment | minReplicas | maxReplicas |
|---|---|---|
| values-prod.yaml | 2 | 8 |
| values-stage.yaml | 2 | 4 |
| values-demo.yaml | 2 | 4 |
| values-dev.yaml | 2 | 4 |
| values-local.yaml | 1 | 2 |
| chart default | 2 | 4 |
Only prod runs at the upper maxReplicas = 8.
Target state
PDEV-490 changes the persistence layer and the kanban-side SQL surface in coordinated steps:
JDBC stack — AWS Advanced JDBC Wrapper
common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/DataSource.kt wires HikariCP through the AWS Advanced JDBC Wrapper (software.amazon.jdbc:aws-advanced-jdbc-wrapper:4.0.1):
- jdbcUrl template changes from jdbc:postgresql://… to jdbc:aws-wrapper:postgresql://….
- driverClassName = "software.amazon.jdbc.Driver".
- Plugin pipeline: auroraInitialConnection, failover2, efm2, readWriteSplitting.
- HikariConfig.exceptionOverrideClassName = "software.amazon.jdbc.util.HikariCPSQLException" so HikariCP cooperates with wrapper-emitted failover exceptions instead of evicting healthy connections.
- Aurora-tuning properties: failoverClusterTopologyRefreshRateMs = 2000, failoverReaderConnectTimeoutMs = 5000, failoverWriterReconnectIntervalMs = 2000, loadBalanceReadOnlyTraffic = true.
After this lands, Connection.setReadOnly(true) (which Exposed already calls on transaction(readOnly = true)) becomes meaningful — the wrapper’s readWriteSplitting plugin routes read-only physical connections to an Aurora reader instance; writes land on the writer instance. The application-level HikariCP pool, its size, and its caller-facing surface are unchanged.
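The URL scheme change and the property set can be sketched with the standard library alone. This is an illustration under assumptions, not the DataSource.kt implementation: toWrapperUrl and wrapperProperties are hypothetical helper names, the plugin codes and tuning values are copied from the list above, and passing the pipeline under a wrapperPlugins connection property is an assumption about how the wrapper is configured.

```kotlin
import java.util.Properties

// Rewrites a plain PostgreSQL JDBC URL to the wrapper scheme.
fun toWrapperUrl(jdbcUrl: String): String {
    val plain = "jdbc:postgresql://"
    require(jdbcUrl.startsWith(plain)) { "expected a $plain URL, got: $jdbcUrl" }
    return "jdbc:aws-wrapper:postgresql://" + jdbcUrl.removePrefix(plain)
}

// Collects the plugin pipeline and Aurora-tuning values named in this section.
// ASSUMPTION: the pipeline is supplied via the "wrapperPlugins" property.
fun wrapperProperties(): Properties = Properties().apply {
    setProperty("wrapperPlugins", "auroraInitialConnection,failover2,efm2,readWriteSplitting")
    setProperty("failoverClusterTopologyRefreshRateMs", "2000")
    setProperty("failoverReaderConnectTimeoutMs", "5000")
    setProperty("failoverWriterReconnectIntervalMs", "2000")
    setProperty("loadBalanceReadOnlyTraffic", "true")
}
```

Failing fast on an unexpected scheme keeps a misconfigured jdbcUrl from being passed through silently after the driver class changes.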
Composite bitemporal indexes
Two new Flyway migrations:
- operations/src/main/resources/resources/kanban/database/migrations/V007__kanban_card_bitemporal_indexes.sql adds three indexes on kanban_card in a single file: the two composite bitemporal indexes ((eid, effective_as_of DESC, recorded_as_of DESC) and (tenant_id, item_reference_entity_id, eid, effective_as_of DESC, recorded_as_of DESC)) plus the missing (tenant_id) index. None carry a WHERE retired = FALSE partial predicate.
- operations/src/main/resources/reference/item/database/migrations/V*__item_bitemporal_indexes.sql adds the composite bitemporal index on item matching the same shape. Existing idx_item_tenant stands.
All indexes use CREATE INDEX CONCURRENTLY, which means each statement must run outside a Flyway transaction (one statement per migration file or executeInTransaction = false on the migration).
cardsForItem cleanup
ServiceImpl.kt:276-293 collapses to:
```kotlin
override suspend fun cardsForItem(itemReference: ItemReference, asOf: TimeCoordinates)
    : Result<Page<KanbanCard, KanbanCardMetadata>> = inTransaction(db, readOnly = true) {
    universe.list(
        Query(Filter.Eq(KANBAN_CARD_TABLE.item.eId.name, itemReference.eId), Pagination(0, 1000)),
        asOf,
        includeDeleted = false,
        withTotal = false
    )()
}
```
Two coupled changes that must land together: flip withTotal = true → false AND delete the flatMap { … when (pg.totalCount) … } block.
AppError.Transient + HTTP 503
common-module gains:
- A new AppError.Transient sealed branch under AppError.Internal, with three subtypes wrapping the wrapper’s typed exceptions: FailoverSucceeded (over FailoverSuccessSQLException), TransactionStateUnknown (over TransactionStateUnknownSQLException), FailoverFailed (over FailoverFailedSQLException).
- New branches on the existing Throwable.normalizeToAppError() extension (at common-module/lib/src/main/kotlin/cards/arda/common/lib/lang/errors/AppError.kt:192) that walk the cause chain (unwrapping ExposedSQLException and HikariCP wrapping) to detect the three wrapper exception classes. No separate adapter class is introduced; classification stays in the canonical normalizeToAppError function.
- StatusPages rendering of AppError.Transient as HTTP 503 with header Retry-After: 2.
- A retry policy at the inTransactionAsync / inTransactionSync boundary that catches the three transient types, retries up to PoolConfig.maxAttempts - 1 additional times with PoolConfig.backoffMs ms between attempts, and surfaces AppError.Transient once retries exhaust.
- New PoolConfig fields maxAttempts (default 2) and backoffMs (default 300).
Operations consumes the new release by:
- Bumping the common-module pin in operations/gradle/libs.versions.toml.
- Updating application.conf dataSource.jdbcUrl to the jdbc:aws-wrapper:postgresql://… scheme.
- Adding the explicit dataSource.pool.maxAttempts = 2 and dataSource.pool.backoffMs = 300 knobs in application.conf (defaults match common-module; explicit values document the env contract).
Decorative declaration removal
In common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractScopedUniverse.kt:27, the tenantId.index("TENANT_ID_INDEX") call is removed; the column declaration becomes plain uuid(ScopedMetadata.COLUMN_TENANT_ID). No runtime change (Exposed’s schema-emit was never relied on); the decorative declaration is removed so future readers don’t infer a guarantee that doesn’t exist. Flyway is the single authoritative source for indexes.
Gap analysis
| Area | Current | Target | Gap closed by |
|---|---|---|---|
| Bitemporal SELECT plan on kanban_card | Sequential / single-column index lookup on the correlated subquery | Index scan on the composite (tenant_id, item_reference_entity_id, eid, effective_as_of DESC, recorded_as_of DESC) | Wave 1 kanban Flyway PR |
| Bitemporal SELECT plan on item | Sequential / single-column index lookup | Index scan on the composite | Wave 1 item Flyway PR |
| cardsForItem SQL count | 2 statements (1 COUNT + 1 SELECT) | 1 statement (SELECT only) | Wave 1 kanban Kotlin change |
| Tenant-id index on kanban_card | Missing | Present | Wave 1 kanban Flyway PR (consolidated with the bitemporal-index migration) |
| Decorative TENANT_ID_INDEX declaration | Present at AbstractScopedUniverse.kt:27 | Removed | Wave 2 common-module release |
| Read/write splitting | None — all transactions hit writer | Read-only transactions auto-route to Aurora reader instance via wrapper’s readWriteSplitting plugin | Wave 2 common-module release |
| Failover detection latency | ~30 s (JVM DNS cache–bound) | ~2–5 s (Aurora topology API via wrapper’s failover2 plugin) | Wave 2 common-module release |
| Transient SQL HTTP contract | HTTP 500 with raw exception body | HTTP 503 with Retry-After: 2 | Wave 2 common-module release |
| Retry on transient | None | In-process retry with maxAttempts=2, backoffMs=300 at the inTransaction boundary | Wave 2 common-module release |
| Operations consumer wiring | Default JDBC scheme, no retry knobs | jdbc:aws-wrapper:postgresql://… scheme, explicit retry knobs in application.conf | Wave 3 operations PR |
| Documentation | No site pages on wrapper / bitemporal-index pattern / Flyway-authoritative convention / 503 contract; no runbooks for the wrapper deploy or the synthetic-failover test | All four site pages and all three runbooks present | Wave 4 documentation PR |
| Synthetic-failover acceptance test | Not exercised | Procedure documented; passes on dev before promotion | Wave 5 dev failover test |
Out-of-scope surfaces
These adjacent surfaces are deliberately untouched by PDEV-490:
- The items-page front-end consumer of listWithDetails. Tracked on PDEV-489. The front-end resolution path (consolidate the two per-row backend calls into one page-level /v1/kanban/kanban-card/query call with Filter.In(item_reference_entity_id, [eIds…])) does not require any new back-end route; it uses an existing one. PDEV-490 ships the composite kanban-card index that the new front-end SQL plan needs, but the front-end implementation itself is not part of this project.
- listWithDetails chunked-fan-out refactor. A previously proposed refactor (listWithDetails collapses the per-chunk inTransaction into a single up-front Filter.In fetch) was cancelled when the front-end resolution moved off this route entirely. Remaining callers (ItemDetailsPanel.fetchCards, ManageCardsPanel.fetchCards) are single-item flows where the chunk-vs-fetch tradeoff has no forcing function.
- A new summary/for-items aggregate route on the kanban service. Cancelled. The front-end consolidation onto the existing /v1/kanban/kanban-card/query route renders the new aggregate redundant.
- Pool-size tuning on item and kanban DBs. The wrapper’s read/write split removes the writer-pool ceiling pressure that would have driven a tuning pass. The current maxPoolSize = 10 stays. Sentry shows zero connection-timeout pressure in the trailing 4 days across all environments.
- HPA maxReplicas reduction. Was a fallback under the originally-considered writer-pool budget pressure; the wrapper’s read/write split removes the budget pressure. No change to HPA.
- JVM DNS TTL helm chart change. Was relevant under the original DNS-cache-bound failover detection; the wrapper bypasses DNS for failover detection (uses the Aurora topology API). The chart-level networkaddress.cache.ttl override is not added.
- transactionIsolation evaluation (REPEATABLE_READ → READ_COMMITTED on read-only paths). Filed as Linear PDEV-534 to run after PDEV-490 ships, so the post-wrapper, post-index baseline is the reference point.
- RDS Proxy adoption. Closed as won’t-do (Linear PDEV-499); the wrapper is incompatible with RDS Proxy by design.
- Service-level read cache on kanban_card / item. Deferred; revisit only after the new indexes have soaked and pg_stat_statements still shows headroom.
- cardsForItem bulk-handler cleanup on the items page. Three items-page bulk handlers (handleDeleteMultipleItems, handlePrintSelectedCards, handlePreviewSelectedCards) still loop per selected item against cardsForItem; user-initiated, latency tolerable. Future ticket.
- Per-module tenant_id Flyway migrations for FACILITY_TABLE, STATION_TABLE, ORDER_HEADER_TABLE. The audit surfaced these; deferred to a future per-module hygiene pass. The misplaced BATCH_JOB migration (declared in the item module’s tree) is also accepted as-is.
Risks and constraints
PDEV-490 is low risk by construction: most changes are additive (new indexes, new error branch) or coupled by design (the cardsForItem two-line change). Failure modes worth pinning:
- Coupled K12 regression. If the withTotal = true → false flag flip ships without removing the surrounding flatMap { … when (pg.totalCount) … } block, every cardsForItem call returns HTTP 500 (the previously-dead Result.failure(AppError.IncompatibleState) arm becomes the live branch). Mitigation: the change is documented as a coupled two-line change; verification asserts both arms cover zero-row and multi-row cases.
- Wrapper jdbcUrl scheme regression. The jdbcUrl scheme change is breaking. If a consumer of common-module (today only operations; future: accounts-component) bumps the common-module pin without updating its jdbcUrl, the new driver class cannot resolve and the pod fails on startup. Mitigation: the change is documented in the common-module release CHANGELOG as Changed with explicit “Consumers must update jdbcUrl”; operations consumer PR ships both the pin bump and the scheme change in the same PR.
- Reader-endpoint topology discovery. The wrapper’s topology cache is built lazily on first connection. The first request after a pod cold start may pay a topology-discovery cost. Mitigation: auroraInitialConnection plugin in the pipeline; failoverClusterTopologyRefreshRateMs = 2000 keeps the cache fresh post-discovery.
- CREATE INDEX CONCURRENTLY on busy tables. The kanban-card and item migrations use CONCURRENTLY so the index build does not block writes to the table. On a sufficiently active table the build can fail and leave behind an invalid index (pg_index.indisvalid = false) that requires manual cleanup. Mitigation: ship to dev first; on failure, drop the invalid index and rerun. CREATE INDEX IF NOT EXISTS keeps the migration idempotent when the index name is unique, but it also skips an existing invalid index of the same name, so the drop has to come first.
- Wrapper compatibility with Exposed. The wrapper hooks into the JDBC Connection.setReadOnly lifecycle. Exposed at version 0.60.0 (pinned in common-module/gradle/libs.versions.toml:11) sets readOnly before autoCommit, which is the ordering the wrapper expects. The two-line ordering was verified by source inspection of ThreadLocalTransactionManager.kt:131-161 and JdbcConnectionImpl.kt:46-50 during the design phase.
The dev synthetic-failover test gates promotion beyond dev; demo / stage / prod each take a standard per-environment soak window after that.
Source references
- operations/src/main/kotlin/cards/arda/operations/resources/kanban/api/rest/KanbanCardEndpoint.kt: route declarations.
- operations/src/main/kotlin/cards/arda/operations/resources/kanban/service/ServiceImpl.kt:276-293 (cardsForItem), ServiceImpl.kt:322-350 (listWithDetails): service implementations.
- operations/src/main/kotlin/cards/arda/operations/resources/kanban/persistence/KanbanCardPersistence.kt:24-34: KANBAN_CARD_TABLE declaration including the item_reference component.
- operations/src/main/kotlin/cards/arda/operations/reference/item/domain/persistence/ItemReferenceComponent.kt:24: the item_reference_entity_id column declaration consumed by Filter.In(KANBAN_CARD_TABLE.item.eId.name, …).
- operations/src/main/resources/resources/kanban/database/migrations/V001__kanban.sql:50-52: current indexes on kanban_card.
- operations/src/main/resources/reference/item/database/migrations/V012__bt_indexes.sql:8: current idx_item_tenant; V012__bt_indexes.sql:12: misplaced idx_batch_job_tenant (out of scope to fix).
- operations/src/main/resources/application.conf:45-58: dataSource.pool block.
- operations/src/main/helm/values-prod.yaml:14-15: prod HPA minReplicas, maxReplicas.
- common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/bitemporal/Persistence.kt: bitemporal SQL emitter (self-alias bt at line 88; selection condition at lines 214-215; asOfCondition helper at lines 92-95).
- common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractUniverse.kt:152-180: list(…, withTotal) method with the COUNT + SELECT logic.
- common-module/lib/src/main/kotlin/cards/arda/common/lib/persistence/universe/AbstractScopedUniverse.kt:27: decorative tenantId.index("TENANT_ID_INDEX") declaration.
- common-module/lib/src/main/kotlin/cards/arda/common/lib/api/rest/types/HttpResponses.kt:233-250: appErrorResponse and internalErrorResponse mapping.
- common-module/gradle/libs.versions.toml:11: Exposed version pin (0.60.0).
References
- PDEV-490 goal: PDEV-490 goal and success criteria.
- requirements.md: functional and non-functional requirements that derive from this analysis.
- specification.md: phased implementation plan.
- verification.md: traceability matrix and verification protocols.
- Umbrella project goal: umbrella product-slow-responses project goal (PDEV-442).
- Aurora parameter group + operations bump rollout: PDEV-479 + PDEV-488 rollout; entry-state baseline for PDEV-490.
Copyright: (c) Arda Systems 2025-2026, All rights reserved