Investigation: PDEV-610 Stale-Data Banner Cross-User Signal

Working notebook for the PDEV-610 sub-project. Captures problem framing, options, current-state findings, and live discussion threads. Promoted content lands in goal.md, the inline Decision Log section of design.md, or a future specification.md.

Problem

The items detail panel renders cards via useFreshRead, which on mount snapshots the set of rId values it sees and only flips isStale=true when its own next refresh finds a different set. The banner is wired to that flag.
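The snapshot-and-diff behavior described above can be sketched as follows. This is a minimal illustration, not the actual useFreshRead code: the CardRef shape and both function names are assumptions made for the example.

```typescript
// Hypothetical card shape; only `rId` matters for staleness detection.
interface CardRef {
  rId: string;
}

// Capture the set of rIds visible at mount time.
function snapshotRIds(cards: readonly CardRef[]): Set<string> {
  return new Set(cards.map((card) => card.rId));
}

// True when a later refresh exposes a different set of rIds than the snapshot.
// Equal size plus "every refreshed rId is in the snapshot" means the sets match.
function isStaleAgainst(snapshot: Set<string>, refreshed: readonly CardRef[]): boolean {
  const next = snapshotRIds(refreshed);
  return next.size !== snapshot.size || refreshed.some((card) => !snapshot.has(card.rId));
}
```

The point of the sketch is that the diff only runs when a refresh happens; without a refresh trigger, the flag has no way to flip.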

Flow today:

User A edits → A’s save handler refreshes A’s cache → A’s snapshot diff trips → A sees the banner. ✅
User B has the same item open in another browser → nothing in B’s process is told anything happened → B’s useFreshRead never refreshes, never sees an rId change → no banner. ❌

The gap is not in the banner or in useFreshRead; both work correctly given their inputs. The gap is that no transport at all carries “this item moved” from A to B. No BroadcastChannel, no polling, no SSE, no WebSocket. The refactor that introduced the banner exposed this gap; it didn’t create it.

Every option below is some flavor of “give User B a reason to revalidate.”

Options, least → most architectural impact

1. Manual “Refresh” affordance

Where: arda-frontend-app only, one component.
Mechanism: User B clicks Refresh → existing refreshCardsForItems() runs → useFreshRead’s next diff trips the banner if rIds changed.
Pros: trivial; zero new infrastructure; zero recurring cost.
Cons: doesn’t actually solve the reported bug — User B still has to know to refresh. UX bandage.
Verdict: acceptable only as a fallback paired with another option.

2. Window-focus / visibility revalidation

Where: arda-frontend-app only. Listen for visibilitychange and focus in ItemCardsProvider (or a tiny hook on the detail panel); on visible-again, call enqueueStaleRefresh for currently-mounted items.
Mechanism: B switches tabs/windows and comes back → FE refetches → useFreshRead trips the banner.
Pros: zero backend change; no long-lived connections; very small surface; matches “I came back from somewhere else” mental model.
Cons: doesn’t help while B is actively staring at the screen without switching focus. Misses the two-screens-side-by-side case.
Verdict: cheap baseline; usually combined with one of the polling options.
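A rough sketch of the focus / visibility hook, assuming the provider can list its currently-mounted eids. The dependency names (doc, win, isVisible, mountedEids) are hypothetical and injected so the logic stays independent of DOM typings; in the browser they would map to document, window, and document.visibilityState.

```typescript
// Minimal structural type so the sketch does not depend on DOM typings.
interface ListenerTarget {
  addEventListener(type: string, listener: () => void): void;
  removeEventListener(type: string, listener: () => void): void;
}

interface RevalidationDeps {
  doc: ListenerTarget;                         // receives "visibilitychange"
  win: ListenerTarget;                         // receives "focus"
  isVisible: () => boolean;                    // e.g. () => document.visibilityState === "visible"
  mountedEids: () => string[];                 // hypothetical accessor for mounted items
  enqueueStaleRefresh: (eid: string) => void;  // existing provider entry point
}

// Registers both listeners and returns a cleanup function for unmount.
function attachVisibilityRevalidation(deps: RevalidationDeps): () => void {
  const onResume = (): void => {
    if (!deps.isVisible()) return; // ignore the "tab became hidden" transition
    for (const eid of deps.mountedEids()) deps.enqueueStaleRefresh(eid);
  };
  deps.doc.addEventListener("visibilitychange", onResume);
  deps.win.addEventListener("focus", onResume);
  return () => {
    deps.doc.removeEventListener("visibilitychange", onResume);
    deps.win.removeEventListener("focus", onResume);
  };
}
```

Both events can fire for a single return to the tab; the provider's coalescer is expected to absorb the duplicate enqueue when they land in the same tick.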

3. BroadcastChannel for same-browser tabs

Where: arda-frontend-app only. ItemCardsProvider posts {type: "item-changed", entityId, rId} on its own writes; on receive, invokes enqueueStaleRefresh for that entity.
Mechanism: only covers the same browser (multi-tab single user). Does not flow across browsers or machines.
Pros: zero backend change; near-zero latency; robust.
Cons: ticket repro explicitly requires two users in separate browsers — BroadcastChannel does not help that case.
Verdict: free correctness win for multi-tab, but does not close PDEV-610 on its own.
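For the receive side, a small sketch of validating the message before acting on it. The message shape comes from the description above; the guard and handler names are illustrative assumptions.

```typescript
// Message shape as described for option 3.
interface ItemChangedMessage {
  type: "item-changed";
  entityId: string;
  rId: string;
}

// Runtime guard: channel payloads arrive untyped, so check before trusting them.
function isItemChangedMessage(data: unknown): data is ItemChangedMessage {
  if (typeof data !== "object" || data === null) return false;
  const candidate = data as { [key: string]: unknown };
  return (
    candidate["type"] === "item-changed" &&
    typeof candidate["entityId"] === "string" &&
    typeof candidate["rId"] === "string"
  );
}

// Ignore anything that is not a well-formed item-changed message.
function onChannelMessage(data: unknown, enqueueStaleRefresh: (eid: string) => void): void {
  if (isItemChangedMessage(data)) enqueueStaleRefresh(data.entityId);
}

// Browser wiring (not executed here; the channel name is illustrative):
//   const channel = new BroadcastChannel("item-changes");
//   channel.onmessage = (ev) => onChannelMessage(ev.data, enqueueStaleRefresh);
```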

4. Active polling of the items detail panel (and/or cache)

Where: arda-frontend-app only. While items are mounted (and tab is visible), call enqueueStaleRefresh every N seconds.
Mechanism: B’s FE periodically refetches; useFreshRead’s rId-diff trips the banner. Tunable interval; can apply exponential backoff if idle.
Pros: frontend-only; no new backend surface; latency bounded by the interval; no connection state to manage.
Cons: request volume scales with (#concurrent users × #open items / interval). Cost depends on whether the endpoint supports conditional GETs.
Verdict: simplest mechanism that actually solves the cross-user case. Strong default candidate.
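The tunable interval with idle backoff could look like the sketch below. The policy fields, the definition of an idle cycle (a polling cycle with no user activity), and the numbers in the usage lines are all assumptions for illustration.

```typescript
// Illustrative policy; none of these values are agreed or measured.
interface PollPolicy {
  baseMs: number; // interval while the user is active
  maxMs: number;  // ceiling for the backed-off interval
}

// Doubles the delay for each consecutive idle cycle, capped at maxMs.
function nextPollDelayMs(policy: PollPolicy, idleCycles: number): number {
  const doublings = Math.max(0, Math.floor(idleCycles));
  return Math.min(policy.baseMs * Math.pow(2, doublings), policy.maxMs);
}

// Poll only when the tab is visible and something is mounted.
function shouldPoll(tabVisible: boolean, mountedCount: number): boolean {
  return tabVisible && mountedCount > 0;
}

nextPollDelayMs({ baseMs: 30000, maxMs: 120000 }, 0); // 30000
nextPollDelayMs({ baseMs: 30000, maxMs: 120000 }, 1); // 60000
nextPollDelayMs({ baseMs: 30000, maxMs: 120000 }, 5); // 120000 (capped)
```

Resetting idleCycles to zero on any user interaction restores the base interval immediately.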

5. Conditional-GET polling with backend ETag / rId support

Where: mostly arda-frontend-app; operations may need to confirm/expose ETag headers the FE can echo as If-None-Match.
Mechanism: as 4, but no-change responses return 304 Not Modified instead of full payload.
Pros: option 4 simplicity, much lower cost; sets up reuse for other staleness surfaces (PDEV-588 etc.).
Cons: small backend audit to confirm caching headers; mild FE-BE coupling.
Verdict: option 4 done well. Right pick when this pattern will recur.
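A sketch of the conditional request cycle, assuming the backend exposes a GET that returns an ETag header and honors If-None-Match. Whether operations does so is exactly what the audit above needs to confirm. The fetch function is injected and its shape (FetchLike, ResponseLike) is a simplification for the example.

```typescript
// Minimal response/fetch shapes so the sketch runs without DOM typings.
interface ResponseLike {
  status: number;
  headers: { get(name: string): string | null };
  json(): Promise<unknown>;
}
type FetchLike = (url: string, init: { headers: { [name: string]: string } }) => Promise<ResponseLike>;

interface PollResult {
  changed: boolean;
  etag: string | null; // validator to echo on the next poll
  body: unknown;       // undefined when nothing changed
}

// Echo the last seen ETag, if any, so the server can answer 304.
function conditionalHeaders(etag: string | null): { [name: string]: string } {
  return etag === null ? {} : { "If-None-Match": etag };
}

async function pollIfChanged(fetchFn: FetchLike, url: string, etag: string | null): Promise<PollResult> {
  const res = await fetchFn(url, { headers: conditionalHeaders(etag) });
  if (res.status === 304) return { changed: false, etag, body: undefined };
  if (res.status < 200 || res.status >= 300) throw new Error("poll failed with status " + res.status);
  return { changed: true, etag: res.headers.get("ETag"), body: await res.json() };
}
```

In the browser, fetchFn could be (url, init) => fetch(url, init). A 304 carries no body, so the unchanged case costs headers only.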

6. Server-Sent Events from operations

Where: operations (new SSE endpoint per tenant or per entity scope) and arda-frontend-app (EventSource subscription wired into ItemCardsProvider).
Mechanism: backend publishes “item changed” events; FE subscribes once per session and forwards into the existing cache invalidation path.
Pros: push-based, sub-second latency, fits the unidirectional invalidation use case; lighter than WebSocket.
Cons: new long-lived HTTP connections → ALB/EKS/idle-timeout considerations; auth-on-stream; backend needs an internal event bus (Kotlin coroutines SharedFlow, Postgres LISTEN/NOTIFY, or similar) to fan writes out across subscribers; non-trivial multi-pod scaling.
Verdict: right answer when multiple surfaces will need real-time invalidation and polling cost becomes painful.
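On the frontend side, forwarding a pushed event into the existing invalidation path might look like this. The event payload ({ entityId }) and the endpoint path in the wiring comment are placeholders, since no event schema for operations exists yet.

```typescript
// Hypothetical event payload; the real schema is not defined yet.
interface ItemChangedEvent {
  entityId: string;
}

// Parse one SSE data payload and forward it into the existing invalidation path.
// Returns false for payloads that are not valid item-changed events.
function forwardSseData(rawData: string, enqueueStaleRefresh: (eid: string) => void): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawData);
  } catch {
    return false; // malformed JSON: drop rather than break the stream handler
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const entityId = (parsed as Partial<ItemChangedEvent>).entityId;
  if (typeof entityId !== "string") return false;
  enqueueStaleRefresh(entityId);
  return true;
}

// Browser wiring (not executed here; the URL is a placeholder):
//   const source = new EventSource("/hypothetical/item-events");
//   source.onmessage = (ev) => forwardSseData(ev.data, enqueueStaleRefresh);
```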

7. WebSockets

Where: operations + arda-frontend-app.
Mechanism: full bidirectional channel.
Pros: flexible; supports future real-time features (presence, live cursors, push notifications).
Cons: strictly more infrastructure than SSE for a strictly unidirectional invalidation use case.
Verdict: only justifiable if a separate product reason wants bidirectional real-time. Don’t introduce it just for this ticket.

8. Cross-service pub/sub fabric (SNS/SQS, Redis pub-sub, Kafka, …) + WS/SSE gateway

Where: infrastructure, operations, possibly a new gateway component, arda-frontend-app.
Mechanism: every mutation publishes a domain event; many subscribers (web gateway, other services, analytics) consume. Cross-user invalidation is just one consumer.
Pros: correct long-term shape if Arda heads toward event-driven, multi-consumer architecture.
Cons: disproportionate for a single banner; this would be a platform decision, not a bug fix.
Verdict: out of scope here; mention only as the asymptote.

Where this points

Self-contained items-page fix → options 2–5.
Reusable invalidation channel investment → option 6.
Reject for this ticket: 1 on its own (doesn’t fix it), 7 (overkill for unidirectional), 8 (platform decision).

Current-state architecture (relevant to option 4)

Confirmed from arda-frontend-app/src/app/items/ItemCardsContext.tsx:

ItemCardsProvider is mounted at the root layout (post-PDEV-597). One provider, one in-memory Map<entityId, {cards, fetchedAt}> per browser session.
enqueueStaleRefresh(eid) has a microtask coalescer (ItemCardsContext.tsx:650-675): every entityId enqueued in the same JS tick merges into a single batched refreshCardsForItems(eids) call — one HTTP request, eids batched in the payload.
refreshCardsForItems is plural-by-design; the network shape cardsForItems(eids) is already batch-friendly.

Caveat: today’s TTL refresh is not one-per-browser. It is useStaleCheck (ItemCardsContext.tsx:197-221), which runs per cell and schedules its own setTimeout at each cell’s own fetchedAt + TTL (default 30 s, env-tunable). Cells with staggered fetchedAt fire at staggered moments, so they typically do not land in the same microtask → multiple smaller batches per TTL window.

To get one-batched-request-per-browser-per-cycle, add a single provider-level setInterval that enqueues every currently-cached (or currently-mounted) eid each tick. The existing microtask coalescer turns that into one refreshCardsForItems(allEids) call. The per-cell useStaleCheck could remain as a safety net or be removed.
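To make the batching claim concrete, here is a stand-in for the coalescer plus a provider-level tick. It does not reproduce the code at ItemCardsContext.tsx:650-675; createCoalescer, pollTick, and the injected scheduler are assumptions that only model the behavior described above.

```typescript
// The scheduler decides when the batch is flushed (a microtask in the real provider).
type Scheduler = (flush: () => void) => void;

// Collects eids enqueued before the scheduled flush and emits them as one batch.
function createCoalescer(
  refreshCardsForItems: (eids: string[]) => void,
  schedule: Scheduler,
): (eid: string) => void {
  const pending = new Set<string>();
  let flushScheduled = false;
  return (eid: string): void => {
    pending.add(eid); // the Set deduplicates repeated eids
    if (flushScheduled) return;
    flushScheduled = true;
    schedule(() => {
      flushScheduled = false;
      const batch = Array.from(pending);
      pending.clear();
      refreshCardsForItems(batch);
    });
  };
}

// One provider-level tick: enqueue every mounted eid; the coalescer batches them.
function pollTick(mountedEids: readonly string[], enqueue: (eid: string) => void): void {
  mountedEids.forEach((eid) => enqueue(eid));
}

// Provider wiring (not executed here; getMountedEids and intervalMs are hypothetical):
//   const enqueue = createCoalescer(refreshCardsForItems, queueMicrotask);
//   const timer = setInterval(() => pollTick(getMountedEids(), enqueue), intervalMs);
```

Because every eid from one tick is enqueued synchronously, they all land before the flush runs, which is what yields a single request per cycle.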

Cost knobs to keep in mind:

Cache size vs. payload size. Cache is LRU-capped via itemCardsMapLru.ts; one batched POST sends every cached eid. Worth checking the LRU cap and whether cardsForItems has a server-side limit.
Visible vs. cached. It may be cheaper to poll only currently-mounted (useFreshRead-subscribed) items rather than every cached entry. Same coalescing, smaller payload, banner correctness preserved (banner only matters where a user is looking).

Combining 1 + 2 + 3 as a local invalidation bus

Proposal: instead of treating options 1, 2, 3 as three separate features, treat them as producers on a small in-app event bus, with the provider as the primary consumer. The bus uses BroadcastChannel as its transport, which gives same-browser cross-tab propagation for free.

producers                      ──► bus ──►   consumer
  focus/visibility                           ItemCardsProvider
  save success                                 → enqueueStaleRefresh(eid)
  bulk-action completion                       → microtask coalescer
  scan / state change                          → batched refreshCardsForItems(eids)
  manual Refresh click
  [future] SSE event handler

This separates “something might have changed” (many sources) from “go fetch and reconcile” (one owner). Other consumers (list cells, detail pane, future surfaces) can subscribe if they need to react locally without going through the provider.

Coverage map

Layer	Catches	Misses
1 (manual Refresh)	User-initiated escape hatch	Anything the user doesn’t know to escape
2 (focus / visibility)	“I came back from another tab / window”	Active multi-screen viewing
3 (BroadcastChannel)	Same browser, multi-tab same user	Different browsers, machines, incognito
Combined 1+2+3 + action-triggers	Same-browser staleness; post-action freshness for the acting user and their other tabs	The PDEV-610 primary repro — two users in separate browsers

The reported bug explicitly says “User A and User B both sign in… in separate browsers.” BroadcastChannel does not cross browser instances. So 1+2+3 alone do not close PDEV-610.

How it composes with option 4

1+2+3 do not replace option 4; they let option 4 be tuned much lower.

Without 1+2+3: option 4 carries everything, so the interval must be aggressive (5–10 s) to feel responsive. That dominates request volume.
With 1+2+3: the perceived-latency cases (“I just acted”, “I just came back”, “my other tab acted”) are caught locally at ≈0 ms. Polling only has to carry the genuinely-remote case (other browser, other user) within an SLA acceptable for that case, comfortably 30–60 s. Request volume scales inversely with the interval, so it drops roughly 3–12× depending on which ends of the two ranges are compared (6× for 5 s → 30 s or 10 s → 60 s).
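The size of the reduction follows directly from the two interval ranges. A quick worked calculation, assuming one batched request per browser per polling cycle (function names are illustrative):

```typescript
// Requests per hour for one browser issuing one batched request per cycle.
function requestsPerHour(intervalSeconds: number): number {
  return 3600 / intervalSeconds;
}

// Factor by which request volume drops when the interval is lengthened.
function reductionFactor(fromSeconds: number, toSeconds: number): number {
  return requestsPerHour(fromSeconds) / requestsPerHour(toSeconds);
}

reductionFactor(10, 30); // 3  (least favorable pairing of the two ranges)
reductionFactor(5, 30);  // 6
reductionFactor(10, 60); // 6
reductionFactor(5, 60);  // 12 (most favorable pairing)
```

The exact factor depends on which ends of the ranges are compared; the like-for-like pairings both give 6.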

Framing: 1+2+3 cover the same-browser layer; option 4 (or option 6 later) carries the cross-browser, cross-user layer. They occupy different parts of the staleness surface and reinforce each other rather than overlap.

Design risks to watch

Cycle / amplification. If the provider both consumes the bus and publishes when its own fetches complete, it can loop. Discipline: only signals that something may have changed publish (save, bulk action, manual refresh, focus-resume). Cache-update completions do not.
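Same-tab self-receive. A BroadcastChannel object does not receive its own messages: postMessage is delivered to other BroadcastChannel objects with the same name, never back to the one that posted. With a single provider-owned instance per tab, a save handler that only posts to the channel will not notify its own tab. Wrap publish in a helper that both calls the local consumer and posts to the channel; producers then have one API and don’t think about tab boundaries.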
Focus-resume scope. On focus return, which eids? Currently-mounted (anything with an active useFreshRead / useStaleCheck subscriber) is the natural answer — bounded and relevant.
Throttling. Bursts of user actions can produce bursts of messages; the provider’s existing microtask coalescer already deduplicates a tick’s worth, so this comes nearly free. Worth verifying under bulk-action flows that touch many eids at once.
Producer specificity. “Anything the user did” can trip refreshes for items an action did not affect. Each producer should publish the specific eids it knows about (save handler knows its own eid; bulk action knows its set; scan knows the scanned eid) rather than a blanket “refresh everything.”

Suggested shape (if we go this way)

Tiny useItemStaleSignal (or similar) exposing markItemStale(eid | eid[]) that does the call-locally + post-to-channel dance.
One BroadcastChannel('arda-item-stale') instance owned by the provider.
Producers (save handlers, bulk actions, scan, manual Refresh button) call markItemStale(...).
Focus / visibility handler in the provider calls markItemStale(currentlyMountedEids).
The provider’s channel listener simply calls enqueueStaleRefresh(...) — no new fetch path; the existing coalescer handles batching.
Option 4 (or 6) plugs in later as one more producer: a setInterval posts markItemStale(mountedEids) every N seconds; an SSE handler posts markItemStale(eid) on push. Consumer side never changes.

The architectural payoff: the bus is the seam where any future transport plugs in without rewiring producers.
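A minimal sketch of that seam, under stated assumptions: createStaleSignal, the StaleMessage shape, and the receive method are hypothetical names, and the channel is injected as a plain postToChannel function so the logic can be exercised without a browser.

```typescript
// Hypothetical wire format for the bus.
interface StaleMessage {
  type: "item-stale";
  eids: string[];
}

interface StaleSignal {
  markItemStale(eids: string | string[]): void; // producers call this
  receive(data: unknown): void;                 // wire to the channel's message event
}

function createStaleSignal(
  enqueueStaleRefresh: (eid: string) => void,
  postToChannel: (message: StaleMessage) => void,
): StaleSignal {
  const consume = (eids: string[]): void => eids.forEach((eid) => enqueueStaleRefresh(eid));
  return {
    markItemStale(eids) {
      const list = Array.isArray(eids) ? eids : [eids];
      if (list.length === 0) return;
      consume(list);                                      // this tab: the channel does not echo back
      postToChannel({ type: "item-stale", eids: list }); // other tabs in the same browser
    },
    receive(data) {
      if (typeof data !== "object" || data === null) return;
      const message = data as Partial<StaleMessage>;
      if (message.type !== "item-stale" || !Array.isArray(message.eids)) return;
      // Consume only; never re-post from here, which is what prevents loops.
      consume(message.eids.filter((eid) => typeof eid === "string"));
    },
  };
}

// Browser wiring (not executed here):
//   const channel = new BroadcastChannel("arda-item-stale");
//   const signal = createStaleSignal(enqueueStaleRefresh, (m) => channel.postMessage(m));
//   channel.onmessage = (ev) => signal.receive(ev.data);
```

A polling timer or an SSE handler would call markItemStale like any other producer; the consumer side stays as it is.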

Open discussion threads

Interval and scope for option 4 layered on top of the bus. What polling interval, applied to which eids (mounted vs. cached), with what backoff when idle. Whether to require ETag / If-None-Match (option 5) from the outset or as a follow-up.
PDEV-613 (bulk print trailing {}). Independent investigation; not yet started.