Skip to content

PDEV-442 — Backend log inspection

PDEV-442 — Backend log inspection (last 3 h)

Section titled “PDEV-442 — Backend log inspection (last 3 h)”

Window: 2026-05-13T03:23Z → 2026-05-13T06:23Z (off-peak UTC). Files in this directory:

  • pod-9whbd.log, pod-xn74c.log — kubectl stdout from both operations pods (last 3h).
  • endpoint-aggregate.json — per-endpoint SERVER-role timings parsed from pod logs.
  • apigw-integration-latency.json, apigw-latency.json — API Gateway CloudWatch metrics, 5-min buckets, last 3h.
  • apigw-kanban.json — API Gateway access logs filtered to kanban-card routes.
  • amplify-arda.json, amplify-lifecycle.json, amplify-CLIENT.json, amplify-role.json — Amplify Lambda log slices.
  • aurora-sample.json, aurora-duration.json — empty; Aurora postgres log group has no events in window.

Pod-level REST timings — what each pod sees from request received → response sent. CLIENT-role entries do not exist in pod stdout in this window; only SERVER-role markers are present, which is the same per-call timing.

EndpointMethodCountMinAvgp95MaxErrors (500)
/v1/kanban/kanban-card/detailsPOST11087 ms849 ms1,621 ms2,050 ms41 (37 %)
/v1/kanban/kanban-card/for-item/{id}GET54103 ms1,009 ms2,044 ms2,134 ms0

for-item is uniformly slow (avg ~1 s, p95 ~2 s) but all 200 OK. kanban-card/details fails 37 % of the time with HTTP 500 — and the pod emits no ERROR-level log line and no stack trace for those, only the SERVER timing record. The failure path is silent in stdout.

API Gateway (us-east-1, AlphaA001-prod-API)

Section titled “API Gateway (us-east-1, AlphaA001-prod-API)”

Last-3-h 5-min buckets show p95 1.1–1.7 s, max 2.6 s — consistent with the pod view. IntegrationLatency and total Latency are within ~5 ms of each other, so API Gateway itself is pure passthrough.

But the 24-h hourly view is much more interesting:

Hour (UTC)max msp95 msp99 ms
06:25–12:252.0–3.3 k0.2–1.6 k0.7–2.1 k
12:25 (peak begins)7.5 k5.2 k7.2 k
13:256.1 k3.0 k5.9 k
14:2513.7 k9.4 k12.7 k
15:256.0 k3.5 k5.5 k
16:2511.3 k6.7 k9.4 k
17:258.0 k5.5 k7.4 k
18:2511.9 k8.8 k11.0 k
19:258.5 k4.6 k7.6 k
20:2518.6 k2.0 k10.5 k
21:257.6 k5.2 k7.4 k
22:25–06:252.0–2.7 k1.0–1.5 k1.3–1.6 k

Operations IS genuinely slow during business hours (12:00–21:00 UTC, ~ 7:00–16:00 CT), with p99 7–12 s and the worst single request 18 s. The off-peak window I sampled in the previous in-session test masked this.

Conclusion at the API Gateway layer: API Gateway is a passthrough; the 13.5 s p95 Sentry showed for the BFF→operations call IS predominantly the pod’s own response time during business hours, not a network or proxy artifact.

Amplify SSR Lambda (us-east-2, /aws/amplify/duhexavnwh88g)

Section titled “Amplify SSR Lambda (us-east-2, /aws/amplify/duhexavnwh88g)”

REPORT lines from the BFF Lambda show typical warm execution 25–500 ms. Cold starts confirmed:

REPORT RequestId: 43d99b4d-220c-4fed-960e-35a56c76f36e Duration: 675.14 ms Billed Duration: 2857 ms Memory Size: 1024 MB Max Memory Used: 177 MB **Init Duration: 2181.28 ms**

A 2.18 s init followed by 675 ms execution → 2.86 s billed wall-clock. This confirms the Lambda cold-start hypothesis for the “5–8 s gap on cheap routes” when Sentry’s BFF p95 was much larger than upstream http.client p95.

The BFF logs are very verbose for every request:

  • Full bearer JWT in plaintext (Authorization: Bearer d3733bda-697e-…) at INFO level. Security finding — opens a credential-exposure path through CloudWatch.
  • Full request body, full request headers, and full user-context object are logged twice per call, including JWT.
  • Logs show User-Agent: PostmanRuntime/7.45.0 as a hard-coded value, which is harmless but odd.

The logging volume itself adds a few ms of CPU and bytes of egress per call — not the perf culprit, but worth cleaning up alongside the security fix.

No “CLIENT”-marker JSON found in Amplify logs

Section titled “No “CLIENT”-marker JSON found in Amplify logs”

Searched filter pattern CLIENT and role in Amplify Lambda logs over the last 3 h: no JSON entries with "role":"CLIENT". The Amplify code logs in a different format (ad-hoc strings, not the structured role-tagged JSON the operations pod uses).

Aurora database (RDS Performance Insights, 24 h)

Section titled “Aurora database (RDS Performance Insights, 24 h)”
  • Cluster: alpha001-prod-auroracluster (aurora-postgresql 16).
  • Instances: db.t3.medium (2 vCPU, 4 GiB RAM) writer + reader.
  • Performance Insights enabled, 465-day retention on both instances.
  • pg_stat_statements extension is NOT installed (extension check returned 0 rows). Slow-query log group has no events. The cluster has no in-database query observability — PI is the only source.

Top SQL by DB load (avg active sessions, 24 h)

Section titled “Top SQL by DB load (avg active sessions, 24 h)”
Avg active sessionsTokenized SQL (first 160 chars)
0.185SELECT kanban_card.id, kanban_card.effective_as_of, kanban_card.recorded_as_of, …
0.075SELECT item.id, item.effective_as_of, item.recorded_as_of, …
0.034SELECT COUNT(*) FROM (SELECT DISTINCT kanban_card.eid … WHERE tenant_id = $1 AND item_reference_entity_id = $2 …)
0.023–0.005various other SELECT item.id, … and SELECT kanban_card.id, … queries (15+ tokenized variants)

The two top queries are the dominant load — both wide selects against kanban_card and item with bitemporal effective_as_of / recorded_as_of filters.

AggregateValue
Total avg active sessions (db.load.avg)0.49
Peak avg active sessions2.14
Fraction of load in CPU class100 %
Lock / IO / Latch wait classes0
MetricRange
Writer CPUUtilization (avg)19–22 %
Writer CPUUtilization (max hourly)50–92 %
Writer CPUCreditBalanceflat at ~533 (no burst-credit drain)

Conclusion at the database layer: the database is not the bottleneck. Average load is 0.49 active sessions on a 2-vCPU instance (1.5 cores of headroom). 100 % of the load is CPU class — no lock or IO contention. The DB-side per-call cost is well under the 1–2 s per-call times the pod reports for the same endpoints.

The two most-loaded queries are wide bitemporal selects pulling every column on kanban_card and item. They are fast on a per-call basis; their load contribution comes from frequency, not duration.

The pod’s 1–2 s per-call response time for kanban-card/for-item and kanban-card/details is not explained by Aurora. The DB is fast. The work is happening in the pod itself — most likely:

  • Exposed ORM materialisation of the wide bitemporal rows.
  • The bitemporal versioning logic (effective_as_of / recorded_as_of filtering, “as-of” view) running in-app.
  • Coroutine scheduling under load on a small pod (worth confirming the pod’s CPU/memory limits and Fargate sizing).
  • Serialization to JSON of large row sets.

A profile of the operations JVM during a slow request would discriminate between these. The next investigation step is to capture an async-profiler flame graph (or a JFR snapshot) from one of the pods during a peak-load window.

/aws/rds/cluster/alpha001-prod-auroracluster/postgresql has zero events in the last 3 h (and the most recent event in the log streams is from 2026-05-11). Either log_min_duration_statement is unset/too high, or query logging is disabled. The slow-query log path is not available without enabling that parameter. Need to either (a) enable parameter group changes or (b) connect via bastion to query pg_stat_statements for live data.

Summary table — where the time actually goes

Section titled “Summary table — where the time actually goes”

Updated picture after this investigation, replacing the earlier “BFF amplifies upstream 10×” framing:

LayerObserved contributionEvidence
Browser → BFF Lambda<300 ms when warm; +2 s once per container on cold startAmplify REPORT lines (Init Duration 2.18 s)
BFF Lambda execution (warm)200–700 msAmplify REPORT Duration
BFF → API Gateway (cross-region us-east-2 → us-east-1)not directly measurable; bounded by Lambda Duration
API Gateway~5 ms overheadIntegrationLatencyLatency
API Gateway → operations pod (peak hours)p95 9 s, max 18 sAPI Gateway IntegrationLatency 24-h hourly view
Operations pod processing (off-peak)avg 1 s, p95 2 s, max 2.1 sPod SERVER-role aggregate (this window)
Operations pod failures (kanban-card/details)37 % HTTP 500, silent in stdoutPod SERVER-role aggregate
Aurora query pathnot yet measured — slow-query log emptyneeds pg_stat_statements or pg log enabled
  1. The dominant source of slowness is the operations pod under peak business-hour load, not the BFF. The pod alone reaches p99 = 7–12 s during 12:00–21:00 UTC.
  2. Lambda cold start adds ~2 s per cold container — explains the 5–8 s gap on otherwise-cheap routes, NOT the multi-tens-of-seconds tail.
  3. The kanban-card/details endpoint has a 37 % silent error rate. This is independent of slowness and is likely a significant UX defect on its own; needs an ERROR-log channel added in operations.
  4. The 158 s p95 BFF transaction figure in Sentry is now best explained by: pod peak-hour latency (up to 18 s) × possible retries by the BFF (need to verify in arda-frontend-app source) × Lambda timeout / aborted-transaction tail.
  • Aurora DB-side timing: postgres log group is empty. Two paths:
    1. Enable log_min_duration_statement = 500 (parameter group change, restart-free).
    2. Connect via the prod bastion and query pg_stat_statements for top slow queries on kanban_card and item tables.
  • Peak-hour pod log slice: re-run the same kubectl pull during a business-hours window (12:00–21:00 UTC) and aggregate. Off-peak data understates the problem materially.
  • kanban-card/details 500 root cause: needs an ERROR-log channel added or an exception-message capture upstream in the framework. Right now the pod is failing silently.