Skip to content

PDEV-442 — Operations + common-module code dig

This document records a close reading of the operations and common-module worktrees in projects/product-slow-responses-worktrees/. Focus is the two endpoints the BFF fans out per /items row:

  • GET /v1/kanban/kanban-card/for-item/{itemEId}service.cardsForItem(...)
  • POST /v1/kanban/kanban-card/detailsservice.listWithDetails(...)

All file paths below are relative to the respective worktree roots.

The kanban-card resource exposes (relevant subset):

RouteService methodNotes
POST /v1/kanban/kanban-card-summary/generatesummaryByStatus(filter, asOf, ...)Already returns per-item aggregate (SUM of cardQuantity by status) via a single SQL query
POST /v1/kanban/kanban-card-summary/requested/generatesummaryByStatus(..., REQUESTED)Same, hard-coded to one status
POST /v1/kanban/kanban-card-summary/in-process/generatesummaryByStatus(..., IN_PROCESS)Same
POST /v1/kanban/kanban-card/details/generatelistWithDetails(query, asOf)The N+1 source — per-item full-row fetch
POST /v1/kanban/kanban-card/details/{status}/by-statuslistWithDetails(query+status, asOf)Same path, filter by status
GET /v1/kanban/kanban-card/for-item/{itemEId}cardsForItem(itemRef, asOf)Returns all (up to 1000) cards for an item, without item details
GET /v1/kanban/kanban-card/{eid}/detailsdetailsFor(cardId, asOf)Single-card details (deprecated route)

Key existing capability: summaryByStatus is already a single-SQL GROUP-BY-aggregation endpoint that returns one row per (item, status, quantity-unit) with summed cardQuantity.amount. It does not return counts of cards, only summed quantities — but the SQL shape is exactly what an aggregated items-page endpoint needs. See §5 for how to extend this.

override suspend fun cardsForItem(itemReference: ItemReference, asOf: TimeCoordinates)
: Result<Page<KanbanCard, KanbanCardMetadata>> = inTransaction(db, readOnly = true) {
val rs = universe.list(
Query(
Filter.Eq(KANBAN_CARD_TABLE.item.eId.name, itemReference.eId),
paginate = Pagination(0, 1000)
),
asOf,
includeDeleted = false,
withTotal = true // ← see below
)().flatMap { pg ->
when (pg.totalCount) {
null -> Result.failure(AppError.IncompatibleState("..."))
else -> Result.success(pg)
}
}
rs
}

Per call:

  • withTotal = true forces two SQL queries (one COUNT + one SELECT) per invocation. The totalCount is only used as a non-null sanity check — not propagated to the response. The COUNT is wasted work.
  • paginate(0, 1000) returns all cards for the item up to 1,000. No page cursor used. Wide row select (all ~30 kanban_card columns).
  • inTransaction(db, readOnly = true) borrows a connection from the HikariCP pool (size 10 per pod — see §6).

3. listWithDetails — analysis (the heavy one)

Section titled “3. listWithDetails — analysis (the heavy one)”
override suspend fun listWithDetails(query: Query, asOf: TimeCoordinates) =
listEntities(query, asOf).flatMap { pageRs -> // 1 inTransaction, 1 SQL
pageRs.results.chunked(25).asFlow().flatMapMerge(concurrency = 25) { chunk ->
flow {
val targetItems = chunk.map { it.payload.item.eId }.toSet().toList()
itemService.listEntities( // 1 inTransaction PER CHUNK
Query(Filter.In(ITEM_TABLE.eId.name, targetItems),
paginate = Pagination(0, chunk.size)),
asOf
).map { it.results.associate { it.payload.eId to it.payload } }
.onSuccess { itMap ->
emitAll(chunk.asFlow().map { composeDetails(asOf, it, itMap[it.payload.item.eId]) })
}
}
}.toList().collectAll().map { PageResult(..., it, pageRs.totalCount) }
}

Per call (typical case: BFF filters by ITEM_REFERENCE_entity_id = X, so all returned cards reference the same item):

  • listEntities → 1 inTransaction opening a read-only Exposed transaction → 1 SQL on kanban_card (no COUNT because listEntities passes withTotal = false to universe.list).
  • pageRs.results.chunked(25) → 1 chunk if ≤ 25 cards, more otherwise.
  • For each chunk, a separate itemService.listEntities call → another inTransaction1 SQL on item.
    • targetItems = chunk.map { it.payload.item.eId }.toSet() so the item id is deduplicated within a chunk. But if multiple chunks contain the same item id (which they will in the per-item BFF call case), we fetch the same item once per chunk.
  • Each chunk’s items are mapped into a Map and composeDetails runs per card to build the wide KanbanCardDetails payload.
  • composeDetails calls lookupUrlId(...) which is pure string formatting (CARD_URL_PATH_PATTERN.format(...)) — no DB call. Good; not an issue.

Minimum SQL count for a typical BFF call (≤ 25 cards, all referencing one item): 2 SELECT statements (kanban_card + item). Each requires borrowing a DB connection.

Larger card counts compound: 60 cards → 3 chunks → 1 kanban_card SELECT + 3 item SELECTs (each fetching the same single item) → 4 SQL statements. That’s wasted, but still small in absolute terms.

4. The bitemporal SQL pattern (common-module/.../bitemporal/Persistence.kt)

Section titled “4. The bitemporal SQL pattern (common-module/.../bitemporal/Persistence.kt)”

This is where the real per-call cost lives. universe.list(query, asOf) ultimately runs:

SELECT bt.* -- ~30 wide columns per row
FROM kanban_card bt
WHERE bt.id IN (
SELECT sq.id
FROM kanban_card sq
WHERE
<user condition> -- e.g. item_reference_entity_id = X
AND <tenant constraint> -- e.g. tenant_id = T
AND sq.effective_as_of <= <asOf.effective>
AND sq.recorded_as_of <= <asOf.recorded>
AND bt.eId = sq.eId -- correlated to outer row
AND bt.retired = FALSE
ORDER BY sq.effective_as_of DESC, sq.recorded_as_of DESC
LIMIT 1
)
ORDER BY bt.recorded_as_of DESC, bt.effective_as_of DESC, bt.id ASC
OFFSET 0 LIMIT 1000

This is the classic “return the latest version of each entity” bitemporal pattern. The correlated subquery (bt.eId = sq.eId) on bt’s side of the IN clause forces Postgres to re-execute (or unroll) the inner query per outer row in the worst case. Postgres can often rewrite this into a semi-join, but the rewrite quality depends entirely on what indexes are available.

The COUNT variant runs:

SELECT COUNT(DISTINCT bt.eId) FROM kanban_card bt WHERE <selection>

— effectively the same heavy WHERE plus a DISTINCT.

Index coverage on kanban_card (migrations V001..V006)

Section titled “Index coverage on kanban_card (migrations V001..V006)”
create index idx_kanban_card_eid on kanban_card (eid);
create index idx_kanban_card_effective_as_of on kanban_card (effective_as_of);
create index idx_kanban_card_recorded_as_of on kanban_card (recorded_as_of);

What’s there:

  • Three single-column indexes.

What’s missing for the bitemporal pattern:

  • A composite index that supports the “most-recent version per eId”: (eId, effective_as_of DESC, recorded_as_of DESC) — enables an index-only scan for the inner subquery and removes the per-row sort.
  • A predicate index for the items-page workload: (tenant_id, item_reference_entity_id, eId, effective_as_of DESC, recorded_as_of DESC) WHERE retired = FALSE — collapses the WHERE in §2/§3 into a single index seek.
  • No index at all on tenant_id or item_reference_entity_id. The ScopedTable base in common-module declares tenantId.index("TENANT_ID_INDEX"), but the actual Flyway migration (V001) does not create it — Flyway is the source of truth, so the index does not exist in prod.

These index gaps multiply: every kanban-card query that filters by (tenant_id, item_reference_entity_id) falls back to a sequential scan with timestamp-based filtering, then a sort. For a tenant with a modest card history, plan choice degrades quickly.

  • count uses DISTINCT (bt.select(bt.eId).where(...).withDistinct(true).count()), which forces a sort/hash even when the result is small.
  • count() is invoked separately from list() when withTotal = true (see AbstractUniverse.list). They share the same WHERE but Postgres runs them as two independent statements. The optimizer can sometimes reuse plan caches but the work is duplicated.
  • find(op).orderBy(...).limit(...) returns a wide row read — every column on kanban_card (~30 fields) is serialised by Exposed into the Kotlin entity, then by kotlinx.serialization into JSON in the REST layer.

5. Cross-referencing with the frontend N+1

Section titled “5. Cross-referencing with the frontend N+1”
StepWhat runs
Browser opens /itemsBFF fetches /api/arda/items/query-ssrm (BFF cache hit, ~440 ms). 6 rows return.
Each of 6 rows mounts QuickActionsCellBFF fires POST /api/arda/kanban/kanban-card/query-details-by-item (×6) + GET /api/arda/kanban/kanban-card/query-by-item?eId=… (×6)
BFF route handlerJWT verify → fetch upstream → return
API Gatewaypassthrough
Operations pod, per details calllistWithDetails → 1 SELECT on kanban_card (bitemporal subquery) + 1 SELECT on item (bitemporal subquery)
Operations pod, per for-item callcardsForItem → 1 COUNT + 1 SELECT on kanban_card
Auroraabsorbs 12 × (~3 statements) = ~36 statements per /items load — none individually expensive, but each uses a connection pool slot (size 10)

The 1:1 BFF↔upstream ratio still holds, but inside operations, each upstream call is itself running 2–4 SQL statements. So Aurora is processing ~30 statements per /items load, all hitting the same two pods × 0.5 vCPU.

Why query-ssrm is fast and the others are slow

Section titled “Why query-ssrm is fast and the others are slow”

query-ssrm is fast not because the SQL is faster, but because it runs the SQL once per 5-minute TTL and serves all subsequent requests from memory. The kanban routes have no equivalent caching in the BFF or in operations.

src/main/resources/application.conf (operations):

dataSource {
pool {
minIdle = 1
maxPoolSize = 10
maxLifetime = 1800000
connectionTimeout = 30000
validationTimeout = 1000
idleTimeout = 600000
keepAliveTime = 600000
transactionIsolation = "TRANSACTION_REPEATABLE_READ"
}
}
  • Pool size: 10 per pod. With 2 pods, the component-wide pool is 20 connections.
  • Each inTransaction (and there are 2+ per BFF call) borrows one slot.
  • With 12 concurrent BFF requests per /items load × 2 inTransactions each = 24 concurrent connection demands → queueing.
  • TRANSACTION_REPEATABLE_READ is appropriate for the bitemporal reads but isolation level affects nothing here since these are read-only transactions.
  • Each module has its own DB and therefore its own pool — the kanban module uses prod-operations.kanban_db, item module uses prod-operations.item_db. So listWithDetails touches two pools per call (one for kanban_card, one for item).

7. Per-call work inventory (one /api/arda/kanban/kanban-card/query-details-by-item call)

Section titled “7. Per-call work inventory (one /api/arda/kanban/kanban-card/query-details-by-item call)”

For a typical 6-row tenant call (1 item, ≤ 25 cards):

  1. Ktor route dispatch + JWT/header parsing + ApplicationContext setup.
  2. inTransaction #1 (kanban_db pool):
    • Build Exposed query
    • Execute SELECT bt.* FROM kanban_card bt WHERE bt.id IN (... bitemporal subquery ...) ORDER BY ... LIMIT 1000
    • Materialise rows into KanbanCardRecord entities (Exposed ORM hydration)
    • Map each to BitemporalEntity<KanbanCard, KanbanCardMetadata> via asEntity
  3. inTransaction #2 (item_db pool):
    • targetItems = chunk.map { it.payload.item.eId }.toSet() — one item id in this case
    • Execute SELECT bt.* FROM item bt WHERE bt.id IN (... bitemporal subquery ...) AND bt.eId IN (...)
    • Same ORM hydration into Item.Entity
  4. composeDetails per card: allocate KanbanCardDetails, call lookupUrlId (string format), return EntityRecord.
  5. Serialise PageResult<KanbanCardDetails, KanbanCardMetadata> to JSON. Each card includes the full duplicated Item payload (~30 fields).
  6. Return through Ktor / API Gateway / BFF.

Steps 2, 3, 4, 5 are CPU-bound and run on the pod’s 0.5 vCPU. Step 5 is non-trivial: with 25 cards × ~50 fields × 30-field Item duplication per card, the response can be tens of KB of JSON.

  1. No composite bitemporal index on kanban_card. The single-column indexes force costly plans on the inner subquery for any non-trivial data set. Items table compounded by missing (tenant_id, item_reference_entity_id) index.
  2. Wide row hydration + JSON serialisation per request. Each card carries 30+ columns and the response duplicates Item details per card. Most of these fields are unused by the items-page row.
  3. cardsForItem runs an unused COUNT query (withTotal = true) per invocation. Pure waste.
  4. Multiple inTransaction blocks per service call in listWithDetails. Each opens a Suspended transaction on Dispatchers.IO and borrows from a 10-slot pool. Under N+1 load the pool queues.
  5. No service-level caching: every BFF call hits Aurora. The bitemporal data has natural read-mostly characteristics (events are rare relative to reads); an in-memory or Redis tier per pod would absorb the items-page load.
  6. HTTP 500 on “no cards”: silent error contract (page.tsx compensates by swallowing). Operations should return 200 {results: []}. This was P5 in the frontend code-dig.
  7. KanbanCardSummary.quantity sums cardQuantity.amount but not card count. The existing summary endpoint is close to what the items page needs but doesn’t carry count() or printStatus aggregates.

9. Concrete remediation proposals (operations side)

Section titled “9. Concrete remediation proposals (operations side)”

In order of effort × impact, lowest-effort first.

OP1 — Add composite indexes (XS, ship today)

Section titled “OP1 — Add composite indexes (XS, ship today)”

New Flyway migration V007__kanban_card_bitemporal_indexes.sql:

-- Supports the bitemporal "latest version per eId" subquery
CREATE INDEX CONCURRENTLY idx_kanban_card_eid_temporal
ON kanban_card (eid, effective_as_of DESC, recorded_as_of DESC);
-- Supports the items-page filter (per-tenant + per-item lookups)
CREATE INDEX CONCURRENTLY idx_kanban_card_tenant_item_temporal
ON kanban_card (tenant_id, item_reference_entity_id, eid,
effective_as_of DESC, recorded_as_of DESC)
WHERE retired = FALSE;

Use CREATE INDEX CONCURRENTLY to avoid table-lock during the migration; this prevents Flyway running inside a transaction, so it needs to be in its own file or split appropriately.

Same treatment needed for item table — check its current indexes and add equivalent composite if missing.

Acceptance: EXPLAIN ANALYZE on the bitemporal subquery shows “Index Only Scan using idx_kanban_card_eid_temporal”, and cardsForItem p95 drops from 2 s to a few hundred ms.

OP2 — Drop withTotal = true from cardsForItem (XS)

Section titled “OP2 — Drop withTotal = true from cardsForItem (XS)”

The totalCount is only used as a non-null sanity check. Pass withTotal = false. Remove the IncompatibleState branch.

Acceptance: pod log shows one SQL statement per for-item/{itemId} call instead of two.

OP3 — Return 200 + empty results instead of 500 for “no cards” (S)

Section titled “OP3 — Return 200 + empty results instead of 500 for “no cards” (S)”

Trace the failure path inside listWithDetails / cardsForItem and ensure a query with zero matches returns Result.success(emptyPage) not Result.failure. Pair with the frontend change (P5 in code-dig.md) to remove the silent 500-swallow.

Acceptance: pod log shows near-zero 5xx rate on POST /v1/kanban/kanban-card/details.

OP4 — Slim down the response shape (S, breaking change)

Section titled “OP4 — Slim down the response shape (S, breaking change)”

Two options:

  1. Add a ?fields= query param that selectively projects columns (Exposed supports column projection via .select(...cols)).
  2. Add a new /v1/kanban/kanban-card/light-details route that returns just the fields the items page needs: {eId, status, printStatus, cardQuantity, asOf} per card. Sized at ~5 fields × ≤25 cards = one-tenth the response.

The second is cleaner since the wide details endpoint has other consumers (e.g. ItemDetailsPanel).

Acceptance: payload for a typical 25-card response drops from ~50 KB to ~5 KB.

OP5 — Extend KanbanCardSummary to include counts + printStatus aggregation (M)

Section titled “OP5 — Extend KanbanCardSummary to include counts + printStatus aggregation (M)”

The existing summaryByStatus endpoint already does aggregate SQL. Add to the aggregation:

  • COUNT(*) as cardCount
  • COUNT(*) FILTER (WHERE print_status = 'PRINTED') as printedCount
  • (and any other small aggregates the items page needs)

Then expose a new route:

POST /v1/kanban/kanban-card-summary/for-items
Request: { items: [<eId1>, <eId2>, ...] }
Response: { perItem: { "<eId>": { totalCount, requestingCount,
inProcessCount, fulfilledCount,
printedCount, oldestFulfilledCardEId? } } }

This is the operations side of the frontend’s P3 proposal in code-dig.md. With this and OP1, the items page goes from “N+1 wide row fetches” to “1 aggregated GROUP-BY SQL”.

Acceptance: the BFF’s /items load issues exactly one /kanban-card-summary/for-items request and zero per-row kanban requests. Pod log shows 1 SQL per /items load instead of ~30.

OP6 — Tune connection pool size (S, ops change)

Section titled “OP6 — Tune connection pool size (S, ops change)”

maxPoolSize = 10 per pod is conservative. With 0.5 vCPU per pod (today) and small per-query work, 10 is roughly adequate.

Once the pod CPU is bumped (per the EKS findings), raise the pool to 20–25 per pod. Coordinate with Aurora max_connectionsdb.t3.medium default is ~145 total; budgeting 25 × 2 pods × 6 module DBs = 300 exceeds the limit. Either consolidate the per-module pools or move to a larger Aurora instance class (see PDEV-479).

OP7 — Service-level read cache (M, optional)

Section titled “OP7 — Service-level read cache (M, optional)”

listWithDetails and cardsForItem results for a (tenantId, asOf) key change only on mutations (which already fire DataAuthorityNotification.NewEntity via notify { ... } in newCard). A per-pod Caffeine cache keyed on (tenantId, itemEId, asOf-bucket) with an invalidation listener on DataAuthorityNotification would absorb steady-state load.

For multi-pod consistency, layer a small Redis tier between the pod and Aurora — but that’s a step-up the team may not want yet.

Pair each with the frontend/back ticket counterparts where relevant.

StepOperations workEffortImpact
1OP1 — composite bitemporal indexesXSper-call SQL p95 drops ~5–10×; needs Flyway PR + verify in stage
2OP2 — drop unused COUNT in cardsForItemXShalves SQL count on for-item route
3OP3 — fix 500 == no cards contractSunblocks observability; partners with frontend P5
4OP4 — slim response shapeSsmaller JSON, faster serialise, smaller network bytes
5OP5 — summary-for-items aggregate endpointMenables frontend P3 (one BFF call replaces N+1)
6OP6 — pool sizing reviewSunblocks higher pod CPU per the EKS recs
7OP7 — service-level cacheMoptional; revisit after OP1+OP5 ship

11. Open questions for the operations / common-module owners

Section titled “11. Open questions for the operations / common-module owners”
  • Is the ScopedTable.tenantId.index("TENANT_ID_INDEX") declaration in AbstractScopedUniverse.kt intended to be created by Exposed at startup, or is Flyway authoritative? V001 doesn’t create it. Either way, the index is missing in prod.
  • Is transactionIsolation = TRANSACTION_REPEATABLE_READ necessary for these read-only calls? READ_COMMITTED would reduce snapshot pressure on Aurora for read-heavy workloads.
  • Is there a reason listWithDetails re-fetches item per chunk instead of once up front? With the typical per-row BFF call, the chunking only creates duplicate queries with no benefit.
  • Why does cardsForItem request withTotal = true when the caller doesn’t propagate the total?
  • What’s the right home for the new aggregate endpoint — the existing kanban-card-summary family or a new kanban-card-summary/for-items route? The existing summary endpoint hard-codes “sum cardQuantity by status” — extending it might be cleaner as a new sibling.