PDEV-442 — Code dig of arda-frontend-app for BFF slowness and N+1 root causes

This document records a close reading of the arda-frontend-app worktree at projects/product-slow-responses-worktrees/arda-frontend-app/, focused on the two endpoints we identified as the N+1 sources for /items:

  • GET /api/arda/kanban/kanban-card/query-by-item
  • POST /api/arda/kanban/kanban-card/query-details-by-item

and the cheap reference points that don’t show the same pathology:

  • POST /api/arda/items/query-ssrm
  • POST /api/arda/kanban/kanban-card/details/{requesting,requested,in-process,fulfilled}
  • GET /api/arda/kanban/kanban-card/[eId]

All file references below are relative to the arda-frontend-app worktree root.

src/app/items/page.tsx (2,265 lines) is the items page. It renders an AG Grid via src/app/items/ItemTableAGGrid.tsx, with the per-row Quick Actions column defined in src/components/table/columnPresets.tsx.

Per-row data flow:

  1. AG Grid mounts and requests rows via SSRM (POST /api/arda/items/query-ssrm).
  2. For every visible row, the QuickActionsCell renderer calls ensureCardsForItem(item.entityId) on mount.
  3. ensureCardsForItem (page.tsx, line 437) calls getKanbanCardsForItem (page.tsx, line 306), which fetches POST /api/arda/kanban/kanban-card/query-details-by-item with a filter on the item’s entity id and paginate {index: 0, size: 100}.
  4. Independently, QuickActionsCell (columnPresets.tsx, line 116) also calls queryCandidateCard, which fetches GET /api/arda/kanban/kanban-card/query-by-item?eId=<entityId> to find the oldest FULFILLED card. This is a separate per-row request.
  5. Both per-row promises run in parallel (no batching, no de-dup beyond cardFetchPromisesRef which only prevents a duplicate concurrent call for the same item).

Result: every visible row generates exactly 2 backend round-trips (query-details-by-item + query-by-item). For the 6-row test tenant that’s 12 extra requests per /items entry; for a 60-row tenant it becomes 120.
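The fan-out arithmetic can be written down as a small helper. This is an illustrative sketch only: `perRowRequests` and `extraRequestsPerPageEntry` are hypothetical names, not code from the worktree, and it assumes every visible row mounts its QuickActionsCell exactly once.

```typescript
// The two per-row calls described above, named after their route paths.
const perRowRequests = ['query-details-by-item', 'query-by-item'] as const;

// Extra backend round-trips triggered by one /items page entry,
// assuming each visible row fires both calls once on mount.
function extraRequestsPerPageEntry(visibleRows: number): number {
  return visibleRows * perRowRequests.length;
}

console.log(extraRequestsPerPageEntry(6));  // 12
console.log(extraRequestsPerPageEntry(60)); // 120
```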

What the row actually uses from the response

The huge per-card payload (KanbanCardResult in src/types/kanban.ts, ~50 fields including a nested duplicated itemDetails block of ~30 fields) is consumed for five aggregates only at the row level:

| Use | Source |
| --- | --- |
| safeCards.length (badge: total cards) | query-details-by-item |
| inOrderQueueCount (cards where payload.status === 'REQUESTING') | query-details-by-item |
| printedCount (cards where payload.printStatus === 'PRINTED') | query-details-by-item |
| List of card.payload.eId (only used when bulk-printing this row) | query-details-by-item |
| candidateCard.payload.eId (oldest FULFILLED, only when user clicks “Add to order queue”) | query-by-item |

The row’s steady state only needs the three small counts and the total. Per-card eId lists and the FULFILLED candidate are only needed on user action — they don’t have to be fetched on mount.

The current implementation ships every field of every card on every page entry, even though the row consumes 3–4 numbers.
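To make the mismatch concrete, the row-level aggregates can be derived from a card list with a few lines. This is a sketch, not the code in page.tsx: `CardLike`, `RowSummary` and `summarizeCards` are hypothetical names, and the card shape is reduced to the three payload fields the row reads.

```typescript
// Minimal card shape: only the fields the row reads. The real
// KanbanCardResult in src/types/kanban.ts carries far more.
interface CardLike {
  payload: { eId: string; status: string; printStatus: string };
}

interface RowSummary {
  total: number;
  inOrderQueueCount: number;
  printedCount: number;
}

// Reduce a (possibly missing) card list to the numbers the row displays.
function summarizeCards(cards: CardLike[] | null | undefined): RowSummary {
  const safeCards = cards ?? [];
  return {
    total: safeCards.length,
    inOrderQueueCount: safeCards.filter((c) => c.payload.status === 'REQUESTING').length,
    printedCount: safeCards.filter((c) => c.payload.printStatus === 'PRINTED').length,
  };
}
```

Everything else in the payload is carried over the wire and then discarded at the row level.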

The query-by-item route handler is a linear path: JWT verify → build headers → fetch upstream /v1/kanban/kanban-card/for-item/{eId} → parse → forward.

Only one verbose log statement: console.log('Fetch request completed, status:', upstream.status); on success. Error path logs three lines. No body dumps, no header dumps.

query-details-by-item/route.ts (146 lines)

Same control flow, but the implementation is heavy:

  • 6 console.log statements per call before the upstream returns (lines 52–100): request body, user context, request headers (including the full Bearer token — see PDEV-478), the X-Author header, the X-Tenant-Id header, and a truncated Authorization header.
  • After the upstream returns, 4 more console.log statements (lines 110–124): “response received”, status, success, and the full response body text (text).

Each console.log in Lambda writes synchronously to stdout via the runtime’s bridge. With a payload of N cards (~50 fields each, much of it duplicated itemDetails), the response-body log can be tens of KB per request. The time spent on that CloudWatch ingest is included in the request’s wall-clock duration.

This is the single most concrete BFF-internal cost contributor I found in the code. The details/{state} routes (requesting, requested, in-process, fulfilled) are by contrast 61–64 lines, with zero verbose logging beyond a console.error on the catch branch.

details/route.ts and details/{state}/route.ts (61–64 lines, clean)

JWT verify → fetch upstream /v1/kanban/kanban-card/details[/{state}]?effectiveasof=<now>&recordedasof=<now> → forward.

These are the routes Order Queue uses, and they show p95 0.8–1.0 s in Sentry. No per-call logging overhead.

Critically, the query-ssrm route is not a proxy. It calls into src/app/api/arda/items/_lib/cachedItems.ts, which uses next/cache.unstable_cache to keep a per-tenant cache of every item the tenant has, fetched in 500-row pages from /v1/item/item/query, with a 5-minute TTL. The SSRM endpoint reads from cache, filters and sorts in-process, and returns one page.

That explains why query-ssrm is consistently fast (~200–440 ms in our browser samples; the route does no upstream call on cache hits).
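The "read from cache, filter and sort in-process, return one page" step can be sketched as below. This is not the code in cachedItems.ts or the route: `ItemRow`, `SsrmRequest` and `pageFromCache` are hypothetical, the single name filter and name sort stand in for whatever filter and sort models the real route supports, and `endRow` is treated as exclusive.

```typescript
interface ItemRow {
  entityId: string;
  name: string;
}

interface SsrmRequest {
  startRow: number;
  endRow: number;               // treated as exclusive in this sketch
  nameContains?: string;        // hypothetical single filter, for illustration
  sortByName?: 'asc' | 'desc';  // hypothetical single sort, for illustration
}

// Serve one page from an already-cached item list, with no upstream call.
function pageFromCache(cached: readonly ItemRow[], req: SsrmRequest) {
  let rows = [...cached];
  if (req.nameContains) {
    const needle = req.nameContains.toLowerCase();
    rows = rows.filter((r) => r.name.toLowerCase().includes(needle));
  }
  if (req.sortByName) {
    const dir = req.sortByName === 'asc' ? 1 : -1;
    rows.sort((a, b) => dir * a.name.localeCompare(b.name));
  }
  // rowCount is the size of the filtered set, not of the returned page.
  return { rows: rows.slice(req.startRow, req.endRow), rowCount: rows.length };
}
```

On a cache hit the cost is bounded by an in-memory filter and sort over the tenant's items, which is consistent with the sub-second timings above.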

GET /api/arda/kanban/kanban-card/[eId] is a single-card fetch by entity id. It is not on the critical path for /items, but its BFF transaction p95 (5.8 s) vs. upstream http.client p95 (0.31 s) is the cleanest piece of evidence that the “BFF gap” is real on cheap routes, most plausibly due to Lambda cold start.

3. Working hypotheses for the BFF slowness

Counts are close to 1:1 between BFF inbound and outbound on query-by-item (2,846 ÷ 2,921 ≈ 0.97), so the BFF is not looping. The gap between BFF transaction p95 (158 s) and upstream http.client p95 (13.5 s) is ~145 s. The working hypotheses below are ranked by plausibility, given the code we read and the EKS/Aurora data from the prior comments:

H1. Operations pod CPU is the dominant ground-truth bottleneck

Already established from the EKS dig: 0.5 vCPU effective per pod × 2 pods = 1 vCPU total. With the /items fan-out putting 12 concurrent upstream requests on 2 pods, each upstream call serializes against 1 vCPU and degrades to 1–2 s steady-state, much worse under peak load. Code-side evidence corroborates: the upstream pod log shows the same operation taking 800–2,050 ms server-side.

H2. Verbose logging in query-details-by-item route handler

10 console.log calls per invocation, one of which dumps the entire upstream response body. On Lambda, each log call has measurable overhead and serialised execution. The full-body dump is the largest single contributor — the response can be ~3 KB per card × dozens of cards on bigger tenants. This is the most concrete BFF-side cost we can attribute from code.

H3. Lambda cold start (confirmed in REPORT lines)

We already captured one REPORT line with Init Duration: 2181.28 ms. Cold-start cost accounts for the 5–8 s gap on otherwise-cheap routes (kanban-card/[eId], items/lookup-locations, tenant/[tenantId]) where upstream is sub-300 ms. This is independent of how the route handler is written.

H4. Amplify SSR per-container concurrency / queueing

Not directly observable from code, but consistent with the size of the tail (the 158 s p95 is far above any per-invocation cost we can construct from H1–H3). When concurrency exceeds the container’s slot count, new requests wait; the wait time is included in the BFF transaction span. This is the only candidate I have for the p99 long tail beyond ~30 s, given the absence of in-handler retries.

H5. fetch defaults and a leaking keep-alive

All ARDA route handlers send Connection: keep-alive to the upstream but also use cache: 'no-store' and the default global agent. On Lambda, the global keep-alive agent is short-lived (the container itself is short-lived), so most upstream calls open a fresh TLS connection per invocation. TLS handshakes to a private API Gateway endpoint add ~50–200 ms. Small in isolation; relevant when piled on top of cold starts.

H6. JWT verification cost

processJWTForArda (src/lib/jwt.ts) uses aws-jwt-verify, which caches the Cognito JWKS. Per-request signature verification is sub-millisecond. Not a meaningful contributor.

H7. Hard-coded User-Agent: PostmanRuntime/7.45.0

Cosmetic but worth fixing — looks like copy-paste from a Postman example. Not a performance cause.

The proposals below are listed in order of disruption (lowest first). Each is specific enough to scope to a PR.

P1. Remove dead logging from query-details-by-item

Effort: trivial. Impact: measurable BFF-side wall-clock reduction per invocation; removes the security finding (PDEV-478) along the way.

Delete lines 52–55, 56, 75–88, 89–96, 97–100, 110–124 of src/app/api/arda/kanban/kanban-card/query-details-by-item/route.ts. Replace with a single one-line console.info on success and the existing console.error on failure. Sweep the same pattern across other ARDA proxy routes (the verbose [ARDA …] log calls).

Acceptance: no log lines containing Authorization, Bearer, or the full response body across app/api/arda/**. Existing tests still pass.
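A minimal sketch of what the replacement logging could look like, assuming we want the status and payload size but never the token or the body. `SENSITIVE_HEADERS`, `redactHeaders` and `successLogLine` are hypothetical helpers, not existing code in the worktree, and the header list is an assumption to adjust.

```typescript
// Header names whose values must never reach the logs (assumed list).
const SENSITIVE_HEADERS = new Set(['authorization', 'cookie', 'x-api-key']);

// Copy a header map, masking sensitive values, for the rare case
// where headers still need to be logged while debugging.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    out[name] = SENSITIVE_HEADERS.has(name.toLowerCase()) ? '[redacted]' : value;
  }
  return out;
}

// One line per call: route, status and body length, never the body itself.
function successLogLine(route: string, status: number, bodyText: string): string {
  return `${route} status=${status} chars=${bodyText.length}`;
}
```

In the route handler this would reduce the success path to a single `console.info(successLogLine('query-details-by-item', upstream.status, text))`.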

P2. Defer the per-row queryCandidateCard call to user action

Effort: small. Impact: halves the per-row backend round-trip count (12 → 6 for the 6-row tenant; 120 → 60 for 60 rows).

Today QuickActionsCell (columnPresets.tsx, line 116) calls queryCandidateCard unconditionally on mount, but the result is only used when the user clicks “Add to order queue”. Delay the call until the button is pressed (or hovered, with a short debounce). The current implementation includes signal?.aborted handling so the deferred call fits naturally.

Acceptance: no query-by-item requests are issued during a normal /items page load. The Add-to-order-queue button issues one query-by-item request only when clicked.
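One way to structure the deferral is a loader that does nothing until it is invoked from the click (or hover) handler, and that shares a single in-flight promise per item. This is a sketch under assumptions: `createLazyCandidateLoader` and `CandidateFetcher` are hypothetical, the fetcher stands in for queryCandidateCard, and abort-signal handling is left out for brevity.

```typescript
// Stand-in for queryCandidateCard: resolves to the candidate card eId, or null.
type CandidateFetcher = (itemEId: string) => Promise<string | null>;

function createLazyCandidateLoader(fetchCandidate: CandidateFetcher) {
  // In-flight requests by item, so a double click does not double-fetch.
  const inFlight = new Map<string, Promise<string | null>>();

  return function loadOnDemand(itemEId: string): Promise<string | null> {
    let pending = inFlight.get(itemEId);
    if (!pending) {
      pending = (async () => {
        try {
          return await fetchCandidate(itemEId);
        } finally {
          // Forget the entry once settled so the next click re-queries.
          inFlight.delete(itemEId);
        }
      })();
      inFlight.set(itemEId, pending);
    }
    return pending;
  };
}
```

Creating the loader on mount costs nothing; the request is only issued when `loadOnDemand(item.entityId)` runs inside the button handler.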

P3. Replace query-details-by-item per-row with a per-table aggregate endpoint

Effort: medium (needs an operations-side addition). Impact: eliminates the remaining N+1 entirely.

Add an operations route POST /v1/kanban/kanban-card/summary-for-items that takes a list of item entity ids and returns a compact aggregate per item:

    // Request
    { "items": ["<eid1>", "<eid2>", ...] }

    // Response
    {
      "results": {
        "<eid1>": {
          "total": 12,
          "requestingCount": 3,
          "printedCount": 8,
          "fulfilledCount": 1,
          "oldestFulfilledCardEId": "...",  // optional; powers P2
          "cardEIds": ["...", "..."]        // optional; powers bulk print
        }
      }
    }

Then the BFF route POST /api/arda/kanban/kanban-card/summary-for-items proxies it, with the same JWT / X-Tenant / X-Author handling as the existing routes. The frontend SSRM endpoint (items/query-ssrm/route.ts) is the natural place to call this once per page request: after getCachedMappedItems returns the row slice, fetch summary data for just the visible row eIds in a single upstream call, and merge the counts into the SSRM response so AG Grid gets everything in one round-trip.

Acceptance: one upstream summary request per /items page entry, regardless of row count. The QuickActions cell renders from row data, no per-cell useEffect fetch.
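The merge step in the SSRM route could look roughly like this. It is a sketch against the proposed response shape above: `mergeSummaries`, `EMPTY_SUMMARY` and the `cardSummary` row field are hypothetical names, and the optional `oldestFulfilledCardEId` and `cardEIds` fields are omitted.

```typescript
interface CardSummary {
  total: number;
  requestingCount: number;
  printedCount: number;
  fulfilledCount: number;
}

interface SummaryResponse {
  results: Record<string, CardSummary>;
}

// Items missing from the response are treated as having no cards.
const EMPTY_SUMMARY: CardSummary = {
  total: 0,
  requestingCount: 0,
  printedCount: 0,
  fulfilledCount: 0,
};

// Attach the per-item summary to each row of the current SSRM page.
function mergeSummaries<T extends { entityId: string }>(
  rows: T[],
  summary: SummaryResponse,
): Array<T & { cardSummary: CardSummary }> {
  return rows.map((row) => ({
    ...row,
    cardSummary: summary.results[row.entityId] ?? EMPTY_SUMMARY,
  }));
}
```

With the counts on the row itself, the QuickActions cell can render from props and needs no fetch of its own.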

P4. Cache the summary endpoint with the same pattern as query-ssrm

Effort: small (after P3). Impact: reduces backend load and BFF cold-start sensitivity.

Re-use the unstable_cache pattern from cachedItems.ts:

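    import { unstable_cache } from 'next/cache';

    // unstable_cache returns a wrapped function; the cached value comes from calling it.
    const cachedFetch = unstable_cache(
      () => fetchSummary(itemEids, userContext, requestId),
      ['kanban-summary', userContext.tenantId, itemEids.join(',')],
      { tags: [kanbanSummaryCacheTag(userContext.tenantId)], revalidate: 60 },
    );
    const summary = await cachedFetch();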

Invalidate the tag on any kanban-card mutation (/event/request, /event/accept, /event/start-processing, /event/fulfill, etc.) by adding a revalidateTag call to those route handlers — same approach query-ssrm already uses against itemsCacheTag(tenantId).

Acceptance: after the first /items load, subsequent loads within the TTL return the kanban summary from cache with no upstream call. Mutations correctly invalidate the cache for the affected tenant.
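One detail worth deciding up front is key stability: if the key is built from the eIds in row order, re-sorting the grid produces a different key for the same set of items. A possible normalisation is sketched below. `kanbanSummaryCacheTag` is the helper name used in the proposal, but its body here is an assumption, and `kanbanSummaryKeyParts` is hypothetical.

```typescript
// Assumed tag format, mirroring the itemsCacheTag(tenantId) idea.
function kanbanSummaryCacheTag(tenantId: string): string {
  return `kanban-summary:${tenantId}`;
}

// Order-insensitive key parts: de-duplicate and sort the eIds so the
// same set of items always maps to the same cache entry.
function kanbanSummaryKeyParts(tenantId: string, itemEids: string[]): string[] {
  const normalized = Array.from(new Set(itemEids)).sort();
  return ['kanban-summary', tenantId, normalized.join(',')];
}
```

Because the summary response is keyed by eId, the order of ids in the request does not affect the result, so normalising the key is safe.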

P5. Fix the “500 == no cards” contract in operations and the silent client swallowing

Effort: small backend, small frontend. Impact: correctness; unblocks proper error monitoring.

Today the operations endpoint POST /v1/kanban/kanban-card/details returns HTTP 500 when an item has no kanban cards (page.tsx:388-396 comments: “Don’t log 500 errors - they might be expected for items without cards”). This corrupts every observability signal — Sentry, alarms, and the 37 % error rate we measured.

  • Operations: return 200 with {results: []} for the empty case.
  • Frontend getKanbanCardsForItem: remove the special-case at lines 387-396 and treat 500 as a genuine error.

Acceptance: the SERVER-role aggregate in pod logs shows a near-zero 5xx rate for kanban-card/details; Sentry stops being polluted by these.
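On the frontend side, the intended contract can be captured in a small pure function. This is a sketch, not the current getKanbanCardsForItem logic: `CardsOutcome` and `interpretCardsResponse` are hypothetical, and it assumes the fixed backend answers 200 with `{results: []}` for items without cards.

```typescript
type CardsOutcome<T> =
  | { kind: 'ok'; cards: T[] }
  | { kind: 'error'; status: number };

// After the contract fix: an empty list is a normal 200 response,
// and any non-2xx status (including 500) is reported as an error.
function interpretCardsResponse<T>(
  status: number,
  body: { results?: T[] } | null,
): CardsOutcome<T> {
  if (status >= 200 && status < 300) {
    return { kind: 'ok', cards: body?.results ?? [] };
  }
  return { kind: 'error', status };
}
```

The caller can then log or report every `error` outcome without a special case for 500.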

P6. Single batched summary-for-items upgrade path on the operations side

Effort: medium-large; coordinate with operations. Impact: order-of-magnitude reduction in operations CPU for /items page loads.

Today, even if the BFF made a single batched call, operations would internally execute the same per-item bitemporal query in a loop (judging by the per-item SQL pattern seen in Performance Insights). The summary endpoint added in P3 should be implemented in operations as a single SQL query with WHERE item_reference_entity_id = ANY($1) plus aggregation per item id — not as a for-loop calling the existing per-item code.

Acceptance: operations pod log shows a single SERVER request per /items load (not N). Aurora PI shows a single aggregate query, not N per-item ones.

| Step | Ticket size | Risk | Expected user-visible effect |
| --- | --- | --- | --- |
| P1: strip verbose logs in BFF route handlers | XS | low | small (saves 50–200 ms / call) + closes PDEV-478 |
| P2: defer queryCandidateCard | XS | low | halves per-row round trips (saves ~1 s on cold load) |
| P3: summary-for-items BFF + operations endpoint | M | medium | eliminates N+1; /items load drops from ~5 s to ~1 s on the test tenant |
| P4: cache the summary endpoint | S | low | second visits within TTL near-instant |
| P5: fix the 500 == no cards contract | S | low | unblocks observability |
| P6: implement summary in operations as a single SQL | M | medium | reduces operations CPU by ~6× on /items |

P1 and P2 can ship today without any backend coordination. P3, P5, P6 require coordinated changes with the operations component owner — but each is independently valuable.

  • Are there other call sites of getKanbanCardsForItem? Found in ItemDetailsPanel.tsx and ManageCardsPanel.tsx — those are user-action contexts (opening a detail panel), not page-mount, so they’re acceptable as-is. Verify before refactoring.
  • Does the existing next/cache setup work correctly under Amplify SSR cold starts? unstable_cache is per-Lambda-container memory. Each cold container has an empty cache and pays the upstream cost on its first call. With P3 + P4, an HPA-driven scale-out would mean more cold caches; combine with longer TTL or warm pools.
  • What is the connection pool size in operations per pod? Open question from the pod_capacity findings — once CPU is freed by the N+1 fix, the DB connection pool may become the next bottleneck.
  1. Hard-coded User-Agent: PostmanRuntime/7.45.0 in every proxy route handler under app/api/arda/**. Should be Arda BFF/<version> or similar — already correct in cachedItems.ts:39.
  2. Mixed JWT handling: getKanbanCardsForItem reads the JWT from localStorage and inlines it into the Authorization header (page.tsx:309-340) but the route handler ignores that header for upstream auth and uses env.ARDA_API_KEY instead. Worth consolidating in a single auth helper to prevent confusion.
  3. devLog chains in getKanbanCardsForItem: even if behind a dev flag, the strings are constructed on every call (template literals, JSON.stringify). Wrap in if (isDev) blocks or rely on a logger that doesn’t evaluate args when disabled.
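For the devLog point, a logger that takes a thunk avoids building the string at all when logging is off. A minimal sketch follows; `createDevLog` is a hypothetical helper, and the real devLog signature in the worktree may differ.

```typescript
// Build a dev-only logger. The message is passed as a function so that
// template literals and JSON.stringify calls only run when isDev is true.
function createDevLog(isDev: boolean, sink: (line: string) => void = console.log) {
  return function devLog(message: () => string): void {
    if (!isDev) return;
    sink(message());
  };
}

// Usage: the stringify cost is only paid in dev.
const devLog = createDevLog(false);
devLog(() => `cards for item: ${JSON.stringify({ eId: 'example' })}`);
```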