PDEV-442 — Code dig of arda-frontend-app for BFF slowness and N+1 root causes
This document records a close reading of the arda-frontend-app worktree
at projects/product-slow-responses-worktrees/arda-frontend-app/,
focused on the two endpoints we identified as the N+1 sources for
/items:
- GET /api/arda/kanban/kanban-card/query-by-item
- POST /api/arda/kanban/kanban-card/query-details-by-item
and the cheap reference points that don’t show the same pathology:
- POST /api/arda/items/query-ssrm
- POST /api/arda/kanban/kanban-card/details/{requesting,requested,in-process,fulfilled}
- GET /api/arda/kanban/kanban-card/[eId]
All file references below are relative to the arda-frontend-app
worktree root.
1. The data flow on /items
Client side
src/app/items/page.tsx (2,265 lines) is the items page. It renders an
AG Grid via src/app/items/ItemTableAGGrid.tsx, with the per-row Quick
Actions column defined in src/components/table/columnPresets.tsx.
Per-row data flow:
- AG Grid mounts and requests rows via SSRM (POST /api/arda/items/query-ssrm).
- For every visible row, the QuickActionsCell renderer calls ensureCardsForItem(item.entityId) on mount. ensureCardsForItem (page.tsx, line 437) calls getKanbanCardsForItem (page.tsx, line 306), which fetches POST /api/arda/kanban/kanban-card/query-details-by-item with a filter on the item’s entity id and paginate {index: 0, size: 100}.
- Independently, QuickActionsCell (columnPresets.tsx, line 116) also calls queryCandidateCard, which fetches GET /api/arda/kanban/kanban-card/query-by-item?eId=<entityId> to find the oldest FULFILLED card. This is a separate per-row request.
- Both per-row promises run in parallel (no batching, no de-dup beyond cardFetchPromisesRef, which only prevents a duplicate concurrent call for the same item).
Result: every visible row generates exactly 2 backend round-trips
(query-details-by-item + query-by-item). For the 6-row test tenant
that’s 12 extra requests per /items entry; for a 60-row tenant it
becomes 120.
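The fan-out arithmetic above can be written down as a small helper. This is only a sketch: the function name is made up for illustration, and the default of two calls per row is taken from the two per-row requests described above.

```typescript
// Hypothetical helper: extra backend round-trips caused by per-row fetches.
// callsPerRow = 2 reflects query-details-by-item + query-by-item.
function extraRequestsPerPageEntry(visibleRows: number, callsPerRow: number = 2): number {
  if (!Number.isInteger(visibleRows) || visibleRows < 0) {
    throw new RangeError("visibleRows must be a non-negative integer");
  }
  return visibleRows * callsPerRow;
}

console.log(extraRequestsPerPageEntry(6));  // 12
console.log(extraRequestsPerPageEntry(60)); // 120
```

The cost grows linearly with the number of visible rows, which is why the proposals in section 4 aim for a request count that does not depend on row count.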
What the row actually uses from the response
The huge per-card payload (KanbanCardResult in src/types/kanban.ts,
~50 fields including a nested duplicated itemDetails block of ~30
fields) is consumed for five aggregates only at the row level:
| Use | Source |
|---|---|
| safeCards.length (badge: total cards) | query-details-by-item |
| inOrderQueueCount (cards where payload.status === 'REQUESTING') | query-details-by-item |
| printedCount (cards where payload.printStatus === 'PRINTED') | query-details-by-item |
| List of card.payload.eId (only used when bulk-printing this row) | query-details-by-item |
| candidateCard.payload.eId (oldest FULFILLED, only when user clicks “Add to order queue”) | query-by-item |
The row’s steady state only needs the three small counts and the
total. Per-card eId lists and the FULFILLED candidate are only needed
on user action — they don’t have to be fetched on mount.
The current implementation ships every field of every card on every page entry, even though the row consumes 3–4 numbers.
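To make the mismatch concrete, the reduction the row performs on the card list is roughly the following. This is a sketch under assumptions: CardLike is a trimmed-down stand-in for KanbanCardResult, and summarizeCards is a hypothetical name, not a function in the codebase. Only the two status comparisons come from the table above.

```typescript
// Illustrative, trimmed-down card shape: only the fields the row reads.
// The real KanbanCardResult in src/types/kanban.ts has ~50 fields.
interface CardLike {
  payload: { eId: string; status: string; printStatus: string };
}

interface RowCardSummary {
  total: number;
  inOrderQueueCount: number;
  printedCount: number;
}

// Hypothetical helper: reduce a card list to the numbers the row renders.
function summarizeCards(cards: readonly CardLike[] | null | undefined): RowCardSummary {
  const safeCards = cards ?? [];
  return {
    total: safeCards.length,
    inOrderQueueCount: safeCards.filter((c) => c.payload.status === "REQUESTING").length,
    printedCount: safeCards.filter((c) => c.payload.printStatus === "PRINTED").length,
  };
}
```

Everything else in each card is discarded at the row level, which is what makes a server-side aggregate (P3) a natural replacement.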
2. Reading the BFF route handlers
query-by-item/route.ts (79 lines, clean)
Linear path: JWT verify → build headers → fetch upstream
/v1/kanban/kanban-card/for-item/{eId} → parse → forward.
Only one verbose log statement: console.log('Fetch request completed, status:', upstream.status);
on success. Error path logs three lines. No body dumps, no header
dumps.
query-details-by-item/route.ts (146 lines)
Same control flow, but the implementation is heavy:
- 6 console.log statements per call before the upstream returns (lines 52–100): request body, user context, request headers (including the full Bearer token; see PDEV-478), the X-Author header, the X-Tenant-Id header, and a truncated Authorization header.
- After the upstream returns, 4 more console.log statements (lines 110–124): “response received”, status, success, and the full response body text (text).
Each console.log in Lambda writes synchronously to stdout via the
runtime’s bridge. With a payload of N cards (~50 fields each, much of
it duplicated itemDetails), the response-body log can be tens of KB
per request. The time spent emitting these logs for CloudWatch is counted
in the request’s wall-clock duration.
This is the single most concrete BFF-internal cost contributor I found
in the code. The details/{state} routes (requesting, requested,
in-process, fulfilled) are by contrast 61–64 lines, with zero
verbose logging beyond a console.error on the catch branch.
details/route.ts and details/{state}/route.ts (61–64 lines, clean)
JWT verify → fetch upstream /v1/kanban/kanban-card/details[/{state}]?effectiveasof=<now>&recordedasof=<now> → forward.
These are the routes Order Queue uses, and they show p95 0.8–1.0 s in Sentry. No per-call logging overhead.
items/query-ssrm/route.ts (130 lines)
Critically, this is not a proxy. It calls into
src/app/api/arda/items/_lib/cachedItems.ts, which uses
next/cache.unstable_cache to keep a per-tenant cache of every item
the tenant has, fetched in 500-row pages from /v1/item/item/query,
with a 5-minute TTL. The SSRM endpoint reads from cache, filters and
sorts in-process, and returns one page.
That explains why query-ssrm is consistently fast (~200–440 ms in our
browser samples; the route does no upstream call on cache hits).
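The "read from cache, filter and sort in-process, return one page" step can be pictured with the sketch below. It is not the code in query-ssrm/route.ts: the row shape, the single name filter and the pageFromCache name are assumptions made for illustration, and startRow/endRow simply mirror the fields AG Grid's SSRM requests carry.

```typescript
// Illustrative row shape; the real mapped item has many more fields.
interface CachedItem {
  entityId: string;
  name: string;
}

// Simplified page request: one substring filter and one sort direction.
interface PageRequest {
  startRow: number;
  endRow: number;
  nameContains?: string;
  sortDir?: "asc" | "desc";
}

// Hypothetical helper: serve one SSRM page from an already-cached item list.
function pageFromCache(
  all: readonly CachedItem[],
  req: PageRequest,
): { rows: CachedItem[]; rowCount: number } {
  const needle = req.nameContains?.toLowerCase();
  // Copy before sorting so the cached array is never mutated.
  const filtered = needle
    ? all.filter((item) => item.name.toLowerCase().includes(needle))
    : [...all];
  if (req.sortDir) {
    const sign = req.sortDir === "asc" ? 1 : -1;
    filtered.sort((a, b) => sign * a.name.localeCompare(b.name));
  }
  return { rows: filtered.slice(req.startRow, req.endRow), rowCount: filtered.length };
}
```

On a cache hit, work of this kind is pure in-memory array handling, which is consistent with the latencies quoted above.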
[eId]/route.ts
Single-card fetch by entity id. Not on the critical path for /items,
but its BFF transaction p95 (5.8 s) vs. upstream http.client p95
(0.31 s) is the cleanest piece of evidence for the “BFF gap” being
real on cheap routes — most plausibly Lambda cold start.
3. Working hypotheses for the BFF slowness
Counts are roughly 1:1 between BFF inbound and outbound on query-by-item
(2,846 ÷ 2,921 ≈ 0.97), so the BFF is not looping. The gap between BFF
transaction p95 (158 s) and upstream http.client p95 (13.5 s) is
~145 s. Working hypotheses ranked by plausibility, given the code we
read and the EKS/Aurora data from the prior comments:
H1. Operations pod CPU is the dominant ground-truth bottleneck
Already established from the EKS dig: 0.5 vCPU effective per pod ×
2 pods = 1 vCPU total. With the /items fan-out putting 12 concurrent
upstream requests on 2 pods, each upstream call serializes against
1 vCPU and degrades to 1–2 s steady-state, much worse under peak load.
Code-side evidence corroborates: the upstream pod log shows the same
operation taking 800–2,050 ms server-side.
H2. Verbose logging in query-details-by-item route handler
10 console.log calls per invocation, one of which dumps the entire
upstream response body. On Lambda, each log call has measurable
overhead and serialised execution. The full-body dump is the largest
single contributor — the response can be ~3 KB per card × dozens of
cards on bigger tenants. This is the most concrete BFF-side cost we
can attribute from code.
H3. Lambda cold start (confirmed in REPORT lines)
We already captured one REPORT line with Init Duration: 2181.28 ms.
Cold-start cost accounts for the 5–8 s gap on otherwise-cheap routes
(kanban-card/[eId], items/lookup-locations, tenant/[tenantId])
where upstream is sub-300 ms. This is independent of how the route
handler is written.
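To quantify cold starts across more invocations, the Init Duration field can be pulled out of REPORT lines with a small parser. The sketch below assumes the usual Lambda REPORT layout, where Init Duration appears only on cold invocations; parseInitDurationMs is a hypothetical name, and every value in the sample line other than the 2181.28 ms figure is invented.

```typescript
// Returns the cold-start init time in milliseconds, or null when the
// REPORT line has no Init Duration field (i.e. a warm invocation).
function parseInitDurationMs(reportLine: string): number | null {
  const match = /Init Duration:\s*([\d.]+)\s*ms/.exec(reportLine);
  return match ? Number(match[1]) : null;
}

// Sample line: only the Init Duration value comes from the captured log.
const coldLine =
  "REPORT RequestId: example Duration: 312.40 ms Billed Duration: 313 ms " +
  "Memory Size: 1024 MB Max Memory Used: 210 MB Init Duration: 2181.28 ms";

console.log(parseInitDurationMs(coldLine)); // 2181.28
```

Counting how many REPORT lines return a non-null value gives a cold-start rate per route, which would show how much of the gap on the cheap routes H3 can really explain.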
H4. Amplify SSR per-container concurrency / queueing
Not directly observable from code, but consistent with the tail-event extent (the 158 s p95 is far above any per-invocation cost we can construct from H1–H3). When concurrency exceeds the container’s slot count, new requests wait; the wait time is included in the BFF transaction span. This is the only candidate I have for the p99 long tail beyond ~30 s, given the absence of in-handler retries.
H5. fetch defaults and a leaking keep-alive
All ARDA route handlers send Connection: keep-alive to the upstream
but also use cache: 'no-store' and the default global agent. On
Lambda, the global keep-alive agent is short-lived (the container
itself is short-lived), so most upstream calls open a fresh
TLS connection per invocation. TLS handshakes to a private API
Gateway endpoint add ~50–200 ms. Small in isolation; relevant when
piled on top of cold starts.
H6. JWT verification
processJWTForArda (src/lib/jwt.ts) uses aws-jwt-verify, which
caches the Cognito JWKS. Per-request signature verification is
sub-millisecond. Not a meaningful contributor.
H7. Hard-coded User-Agent: PostmanRuntime/7.45.0
Cosmetic but worth fixing: it looks like a copy-paste from a Postman example. Not a performance cause.
4. Concrete N+1 elimination proposals
Listed in order of disruption (lowest first). Each proposal is specific enough to scope to a PR.
P1. Remove dead logging from query-details-by-item
Effort: trivial. Impact: measurable BFF-side wall-clock reduction per invocation; removes the security finding (PDEV-478) along the way.
Delete lines 52–55, 56, 75–88, 89–96, 97–100, 110–124 of
src/app/api/arda/kanban/kanban-card/query-details-by-item/route.ts.
Replace with a single one-line console.info on success and the
existing console.error on failure. Sweep the same pattern across
other ARDA proxy routes (the verbose [ARDA …] log calls).
Acceptance: no log lines containing Authorization, Bearer, or
the full response body across app/api/arda/**. Existing tests still
pass.
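One possible shape for the single success log line is sketched below. The interface, the function name and the "upstream" prefix are all illustrative; the point is only that the line carries sizes and timings rather than headers or the body.

```typescript
// Hypothetical summary of one proxied call; field names are illustrative.
interface UpstreamCallSummary {
  route: string;
  status: number;
  durationMs: number;
  bodyBytes: number;
}

// Builds the single success log line. It records sizes and timings only:
// no headers, no token, no response body.
function formatUpstreamSummary(s: UpstreamCallSummary): string {
  return `upstream ${s.route} status=${s.status} durationMs=${s.durationMs} bodyBytes=${s.bodyBytes}`;
}

console.info(
  formatUpstreamSummary({
    route: "kanban-card/query-details-by-item",
    status: 200,
    durationMs: 840,
    bodyBytes: 31250,
  }),
);
```

Logging bodyBytes keeps the payload-size signal that the full body dump provided implicitly, at a fixed cost per call.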
P2. Defer the per-row queryCandidateCard call to user action
Effort: small. Impact: halves the per-row backend round-trip count (12 → 6 for the 6-row tenant; 120 → 60 for 60 rows).
Today QuickActionsCell (columnPresets.tsx, line 116) calls
queryCandidateCard unconditionally on mount, but the result is only
used when the user clicks “Add to order queue”. Delay the call until
the button is pressed (or hovered, with a short debounce). The current
implementation includes signal?.aborted handling so the deferred call
fits naturally.
Acceptance: no query-by-item requests are issued during a normal
/items page load. The Add-to-order-queue button issues one
query-by-item request only when clicked.
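A minimal sketch of the deferral, kept independent of React so it can be read on its own. createLazyCandidateLoader and its fetcher parameter are hypothetical names; in the real component the fetcher would wrap queryCandidateCard and the loader would be invoked from the button's click handler.

```typescript
// Any function that fetches the candidate card for one item.
type CandidateFetcher<T> = (itemEId: string) => Promise<T>;

// Hypothetical helper: nothing is fetched at creation (i.e. on mount).
// Calls for the same item that overlap in time share one in-flight request.
function createLazyCandidateLoader<T>(fetchCandidate: CandidateFetcher<T>) {
  const inFlight = new Map<string, Promise<T>>();

  return function load(itemEId: string): Promise<T> {
    const existing = inFlight.get(itemEId);
    if (existing) return existing;

    // Forget the entry once settled so a later click re-fetches fresh data.
    const promise = fetchCandidate(itemEId).finally(() => {
      inFlight.delete(itemEId);
    });
    inFlight.set(itemEId, promise);
    return promise;
  };
}
```

The result is deliberately not cached after it settles: the oldest FULFILLED card can change between clicks, so only concurrent duplicates are collapsed.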
P3. Replace query-details-by-item per-row with a per-table aggregate endpoint
Effort: medium (needs an operations-side addition). Impact: eliminates the remaining N+1 entirely.
Add an operations route
POST /v1/kanban/kanban-card/summary-for-items that takes a list of
item entity ids and returns a compact aggregate per item:
    // Request
    { "items": ["<eid1>", "<eid2>", ...] }

    // Response
    {
      "results": {
        "<eid1>": {
          "total": 12,
          "requestingCount": 3,
          "printedCount": 8,
          "fulfilledCount": 1,
          "oldestFulfilledCardEId": "...", // optional; powers P2
          "cardEIds": ["...", "..."]       // optional; powers bulk print
        }
      }
    }

Then the BFF route POST /api/arda/kanban/kanban-card/summary-for-items
proxies it, with the same JWT / X-Tenant / X-Author handling as the
existing routes. The frontend SSRM endpoint
(items/query-ssrm/route.ts) is the natural place to call this once
per page request: after getCachedMappedItems returns the row slice,
fetch summary data for just the visible row eIds in a single upstream
call, and merge the counts into the SSRM response so AG Grid gets
everything in one round-trip.
Acceptance: one upstream summary request per /items page entry,
regardless of row count. The QuickActions cell renders from row data,
no per-cell useEffect fetch.
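The merge step inside the SSRM endpoint could look like the sketch below. The count fields follow the response shape proposed above; SsrmRow, the cardSummary property and mergeSummaries are names invented for this example, and treating a missing entry as all zeros is an assumption about how items without cards should render.

```typescript
// Per-item counts, following the proposed summary-for-items response.
interface CardSummary {
  total: number;
  requestingCount: number;
  printedCount: number;
  fulfilledCount: number;
}

// Illustrative SSRM row: only entityId matters for the merge.
interface SsrmRow {
  entityId: string;
  [field: string]: unknown;
}

type RowWithSummary = SsrmRow & { cardSummary: CardSummary };

const EMPTY_SUMMARY: CardSummary = {
  total: 0,
  requestingCount: 0,
  printedCount: 0,
  fulfilledCount: 0,
};

// Hypothetical helper: attach summary counts to each row of the page slice.
// Items absent from the results are assumed to have no cards.
function mergeSummaries(
  rows: readonly SsrmRow[],
  results: Readonly<Record<string, CardSummary | undefined>>,
): RowWithSummary[] {
  return rows.map((row) => ({
    ...row,
    cardSummary: results[row.entityId] ?? { ...EMPTY_SUMMARY },
  }));
}
```

With the counts on the row itself, the QuickActions cell can render from props alone, which is what the acceptance criterion asks for.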
P4. Cache the summary endpoint with the same pattern as query-ssrm
Effort: small (after P3). Impact: reduces backend load and BFF cold-start sensitivity.
Re-use the unstable_cache pattern from cachedItems.ts:
    const cachedFetch = unstable_cache(
      () => fetchSummary(itemEids, userContext, requestId),
      ['kanban-summary', userContext.tenantId, itemEids.join(',')],
      { tags: [kanbanSummaryCacheTag(userContext.tenantId)], revalidate: 60 },
    );

Invalidate the tag on any kanban-card mutation
(/event/request, /event/accept, /event/start-processing,
/event/fulfill, etc.) by adding a revalidateTag call to those
route handlers — same approach query-ssrm already uses against
itemsCacheTag(tenantId).
Acceptance: after the first /items load, subsequent loads within
the TTL return the kanban summary from cache with no upstream call.
Mutations correctly invalidate the cache for the affected tenant.
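One detail worth checking when implementing this: a key built with itemEids.join(',') depends on the order of the ids, so the same set of rows in a different sort order would miss the cache. If that turns out to matter, the key parts could be normalized first. The helper below is a sketch with a hypothetical name, not existing code.

```typescript
// Hypothetical helper: build order-independent key parts for the summary cache.
// Duplicates are dropped and ids are sorted so equal sets give equal keys.
function summaryCacheKeyParts(tenantId: string, itemEids: readonly string[]): string[] {
  const normalized = Array.from(new Set(itemEids)).sort();
  return ["kanban-summary", tenantId, normalized.join(",")];
}

console.log(summaryCacheKeyParts("tenant-1", ["b", "a"]));
// [ 'kanban-summary', 'tenant-1', 'a,b' ]
```

The trade-off is that a cached result then has to be looked up by id rather than by position, which the keyed response shape proposed in P3 already allows.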
P5. Fix the “500 == no cards” contract in operations and the silent client swallowing
Effort: small backend, small frontend. Impact: correctness; unblocks proper error monitoring.
Today the operations endpoint POST /v1/kanban/kanban-card/details
returns HTTP 500 when an item has no kanban cards
(page.tsx:388-396 comments: “Don’t log 500 errors - they might be
expected for items without cards”). This corrupts every observability
signal — Sentry, alarms, and the 37 % error rate we measured.
- Operations: return 200 with {results: []} for the empty case.
- Frontend getKanbanCardsForItem: remove the special-case at lines 387-396 and treat 500 as a genuine error.
Acceptance: the SERVER-role aggregate in pod logs shows a
near-zero 5xx rate for kanban-card/details; Sentry stops being
polluted by these.
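On the frontend side, the corrected contract reduces to a simple rule: a 2xx with an empty results array means "no cards", and anything else is an error that should surface. The sketch below illustrates that rule only; cardsFromResponse and the body type are hypothetical, and the real getKanbanCardsForItem has more to do around it.

```typescript
// Illustrative response body under the corrected contract.
interface CardsQueryBody {
  results?: unknown[];
}

// Hypothetical helper: 2xx yields the (possibly empty) card list,
// any other status is thrown so monitoring can see it.
function cardsFromResponse(status: number, body: CardsQueryBody | null): unknown[] {
  if (status >= 200 && status < 300) {
    return body?.results ?? [];
  }
  throw new Error(`kanban-card query failed with HTTP ${status}`);
}
```

Once a 500 is no longer swallowed as "probably no cards", the measured error rate starts to reflect real failures.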
P6. Single batched summary-for-items upgrade path on the operations side
Effort: medium-large; coordinate with operations. Impact:
order-of-magnitude reduction in operations CPU for /items page
loads.
Today, even if the BFF made a single batched call, operations would
internally execute the same per-item bitemporal query in a loop
(judging by the per-item SQL pattern seen in Performance Insights).
The summary endpoint added in P3 should be implemented in operations
as a single SQL query with WHERE item_reference_entity_id = ANY($1)
plus aggregation per item id — not as a for-loop calling the existing
per-item code.
Acceptance: operations pod log shows a single SERVER request per
/items load (not N). Aurora PI shows a single aggregate query, not N
per-item ones.
5. Suggested execution order
| Step | Ticket size | Risk | Expected user-visible effect |
|---|---|---|---|
| P1: strip verbose logs in BFF route handlers | XS | low | small (–50–200 ms / call) + closes PDEV-478 |
| P2: defer queryCandidateCard | XS | low | halves per-row round trips (–~1 s on cold load) |
| P3: summary-for-items BFF + operations endpoint | M | medium | eliminates N+1; /items load drops from ~5 s to ~1 s on the test tenant |
| P4: cache the summary endpoint | S | low | second visits within TTL near-instant |
| P5: fix the 500 == no cards contract | S | low | unblocks observability |
| P6: implement summary in operations as a single SQL | M | medium | reduces operations CPU by ~6× on /items |
P1 and P2 can ship today without any backend coordination. P3, P5, P6 require coordinated changes with the operations component owner — but each is independently valuable.
6. Code-side open questions to verify
- Are there other call sites of getKanbanCardsForItem? Found in ItemDetailsPanel.tsx and ManageCardsPanel.tsx; those are user-action contexts (opening a detail panel), not page-mount, so they’re acceptable as-is. Verify before refactoring.
- Does the existing next/cache setup work correctly under Amplify SSR cold starts? unstable_cache is per-Lambda-container memory. Each cold container has an empty cache and pays the upstream cost on its first call. With P3 + P4, an HPA-driven scale-out would mean more cold caches; combine with longer TTL or warm pools.
- What is the connection pool size in operations per pod? Open question from the pod_capacity findings: once CPU is freed by the N+1 fix, the DB connection pool may become the next bottleneck.
7. Bycatch (worth filing if not already)
- Hard-coded User-Agent: PostmanRuntime/7.45.0 in every proxy route handler under app/api/arda/**. Should be Arda BFF/<version> or similar (already correct in cachedItems.ts:39).
- Mixed JWT handling: getKanbanCardsForItem reads the JWT from localStorage and inlines it into the Authorization header (page.tsx:309-340) but the route handler ignores that header for upstream auth and uses env.ARDA_API_KEY instead. Worth consolidating in a single auth helper to prevent confusion.
- devLog chains in getKanbanCardsForItem: even if behind a dev flag, the strings are constructed on every call (template literals, JSON.stringify). Wrap in if (isDev) blocks or rely on a logger that doesn’t evaluate args when disabled.
Copyright: © Arda Systems 2025-2026, All rights reserved