PDEV-442 — Backend log inspection

PDEV-442 — Backend log inspection (last 3 h)

Window: 2026-05-13T03:23Z → 2026-05-13T06:23Z (off-peak UTC). Files in this directory:

pod-9whbd.log, pod-xn74c.log — kubectl stdout from both operations pods (last 3h).
endpoint-aggregate.json — per-endpoint SERVER-role timings parsed from pod logs.
apigw-integration-latency.json, apigw-latency.json — API Gateway CloudWatch metrics, 5-min buckets, last 3h.
apigw-kanban.json — API Gateway access logs filtered to kanban-card routes.
amplify-arda.json, amplify-lifecycle.json, amplify-CLIENT.json, amplify-role.json — Amplify Lambda log slices.
aurora-sample.json, aurora-duration.json — empty; Aurora postgres log group has no events in window.

Operations pod (SERVER role, last 3 h)

Pod-level REST timings — what each pod sees from request received → response sent. CLIENT-role entries do not exist in pod stdout in this window; only SERVER-role markers are present, which is the same per-call timing.

Endpoint	Method	Count	Min	Avg	p95	Max	Errors (500)
`/v1/kanban/kanban-card/details`	POST	110	87 ms	849 ms	1,621 ms	2,050 ms	41 (37 %)
`/v1/kanban/kanban-card/for-item/{id}`	GET	54	103 ms	1,009 ms	2,044 ms	2,134 ms	0

for-item is uniformly slow (avg ~1 s, p95 ~2 s) but all 200 OK. kanban-card/details fails 37 % of the time with HTTP 500 — and the pod emits no ERROR-level log line and no stack trace for those, only the SERVER timing record. The failure path is silent in stdout.

API Gateway (us-east-1, AlphaA001-prod-API)

Last-3-h 5-min buckets show p95 1.1–1.7 s, max 2.6 s — consistent with the pod view. IntegrationLatency and total Latency are within ~5 ms of each other, so API Gateway itself is pure passthrough.

But the 24-h hourly view is much more interesting:

Hour (UTC)	max ms	p95 ms	p99 ms
06:25–12:25	2.0–3.3 k	0.2–1.6 k	0.7–2.1 k
12:25 (peak begins)	7.5 k	5.2 k	7.2 k
13:25	6.1 k	3.0 k	5.9 k
14:25	13.7 k	9.4 k	12.7 k
15:25	6.0 k	3.5 k	5.5 k
16:25	11.3 k	6.7 k	9.4 k
17:25	8.0 k	5.5 k	7.4 k
18:25	11.9 k	8.8 k	11.0 k
19:25	8.5 k	4.6 k	7.6 k
20:25	18.6 k	2.0 k	10.5 k
21:25	7.6 k	5.2 k	7.4 k
22:25–06:25	2.0–2.7 k	1.0–1.5 k	1.3–1.6 k

Operations IS genuinely slow during business hours (12:00–21:00 UTC, ~ 7:00–16:00 CT), with p99 7–12 s and the worst single request 18 s. The off-peak window I sampled in the previous in-session test masked this.

Conclusion at the API Gateway layer: API Gateway is a passthrough; the 13.5 s p95 Sentry showed for the BFF→operations call IS predominantly the pod’s own response time during business hours, not a network or proxy artifact.

Amplify SSR Lambda (us-east-2, /aws/amplify/duhexavnwh88g)

Per-invocation timings

REPORT lines from the BFF Lambda show typical warm execution 25–500 ms. Cold starts confirmed:

REPORT RequestId: 43d99b4d-220c-4fed-960e-35a56c76f36e Duration: 675.14 ms Billed Duration: 2857 ms Memory Size: 1024 MB Max Memory Used: 177 MB **Init Duration: 2181.28 ms**

A 2.18 s init followed by 675 ms execution → 2.86 s billed wall-clock. This confirms the Lambda cold-start hypothesis for the “5–8 s gap on cheap routes” when Sentry’s BFF p95 was much larger than upstream http.client p95.

Logging anti-patterns (worth flagging)

The BFF logs are very verbose for every request:

Full bearer JWT in plaintext (Authorization: Bearer d3733bda-697e-…) at INFO level. Security finding — opens a credential-exposure path through CloudWatch.
Full request body, full request headers, and full user-context object are logged twice per call, including JWT.
Logs show User-Agent: PostmanRuntime/7.45.0 as a hard-coded value, which is harmless but odd.

The logging volume itself adds a few ms of CPU and bytes of egress per call — not the perf culprit, but worth cleaning up alongside the security fix.

No “CLIENT”-marker JSON found in Amplify logs

Searched filter pattern CLIENT and role in Amplify Lambda logs over the last 3 h: no JSON entries with "role":"CLIENT". The Amplify code logs in a different format (ad-hoc strings, not the structured role-tagged JSON the operations pod uses).

Aurora database (RDS Performance Insights, 24 h)

Cluster shape

Cluster: alpha001-prod-auroracluster (aurora-postgresql 16).
Instances: db.t3.medium (2 vCPU, 4 GiB RAM) writer + reader.
Performance Insights enabled, 465-day retention on both instances.
pg_stat_statements extension is NOT installed (extension check returned 0 rows). Slow-query log group has no events. The cluster has no in-database query observability — PI is the only source.

Top SQL by DB load (avg active sessions, 24 h)

Avg active sessions	Tokenized SQL (first 160 chars)
0.185	`SELECT kanban_card.id, kanban_card.effective_as_of, kanban_card.recorded_as_of, …`
0.075	`SELECT item.id, item.effective_as_of, item.recorded_as_of, …`
0.034	`SELECT COUNT(*) FROM (SELECT DISTINCT kanban_card.eid … WHERE tenant_id = $1 AND item_reference_entity_id = $2 …)`
0.023–0.005	various other `SELECT item.id, …` and `SELECT kanban_card.id, …` queries (15+ tokenized variants)

The two top queries are the dominant load — both wide selects against kanban_card and item with bitemporal effective_as_of / recorded_as_of filters.

DB load and wait events (24 h)

Aggregate	Value
Total avg active sessions (db.load.avg)	0.49
Peak avg active sessions	2.14
Fraction of load in CPU class	100 %
Lock / IO / Latch wait classes	0

Writer CPU and credits (24 h hourly)

Metric	Range
Writer `CPUUtilization` (avg)	19–22 %
Writer `CPUUtilization` (max hourly)	50–92 %
Writer `CPUCreditBalance`	flat at ~533 (no burst-credit drain)

Conclusion at the database layer: the database is not the bottleneck. Average load is 0.49 active sessions on a 2-vCPU instance (1.5 cores of headroom). 100 % of the load is CPU class — no lock or IO contention. The DB-side per-call cost is well under the 1–2 s per-call times the pod reports for the same endpoints.

The two most-loaded queries are wide bitemporal selects pulling every column on kanban_card and item. They are fast on a per-call basis; their load contribution comes from frequency, not duration.

Implications for the slowness

The pod’s 1–2 s per-call response time for kanban-card/for-item and kanban-card/details is not explained by Aurora. The DB is fast. The work is happening in the pod itself — most likely:

Exposed ORM materialisation of the wide bitemporal rows.
The bitemporal versioning logic (effective_as_of / recorded_as_of filtering, “as-of” view) running in-app.
Coroutine scheduling under load on a small pod (worth confirming the pod’s CPU/memory limits and Fargate sizing).
Serialization to JSON of large row sets.

A profile of the operations JVM during a slow request would discriminate between these. The next investigation step is to capture an async-profiler flame graph (or a JFR snapshot) from one of the pods during a peak-load window.

Aurora postgres CloudWatch log group

/aws/rds/cluster/alpha001-prod-auroracluster/postgresql has zero events in the last 3 h (and the most recent event in the log streams is from 2026-05-11). Either log_min_duration_statement is unset/too high, or query logging is disabled. The slow-query log path is not available without enabling that parameter. Need to either (a) enable parameter group changes or (b) connect via bastion to query pg_stat_statements for live data.

Summary table — where the time actually goes

Updated picture after this investigation, replacing the earlier “BFF amplifies upstream 10×” framing:

Layer	Observed contribution	Evidence
Browser → BFF Lambda	<300 ms when warm; +2 s once per container on cold start	Amplify REPORT lines (Init Duration 2.18 s)
BFF Lambda execution (warm)	200–700 ms	Amplify REPORT Duration
BFF → API Gateway (cross-region us-east-2 → us-east-1)	not directly measurable; bounded by Lambda Duration	—
API Gateway	~5 ms overhead	`IntegrationLatency` ≈ `Latency`
API Gateway → operations pod (peak hours)	p95 9 s, max 18 s	API Gateway IntegrationLatency 24-h hourly view
Operations pod processing (off-peak)	avg 1 s, p95 2 s, max 2.1 s	Pod SERVER-role aggregate (this window)
Operations pod failures (kanban-card/details)	37 % HTTP 500, silent in stdout	Pod SERVER-role aggregate
Aurora query path	not yet measured — slow-query log empty	needs `pg_stat_statements` or pg log enabled

Hypotheses sharpened

The dominant source of slowness is the operations pod under peak business-hour load, not the BFF. The pod alone reaches p99 = 7–12 s during 12:00–21:00 UTC.
Lambda cold start adds ~2 s per cold container — explains the 5–8 s gap on otherwise-cheap routes, NOT the multi-tens-of-seconds tail.
The kanban-card/details endpoint has a 37 % silent error rate. This is independent of slowness and is likely a significant UX defect on its own; needs an ERROR-log channel added in operations.
The 158 s p95 BFF transaction figure in Sentry is now best explained by: pod peak-hour latency (up to 18 s) × possible retries by the BFF (need to verify in arda-frontend-app source) × Lambda timeout / aborted-transaction tail.

What’s still missing

Aurora DB-side timing: postgres log group is empty. Two paths:
1. Enable log_min_duration_statement = 500 (parameter group change, restart-free).
2. Connect via the prod bastion and query pg_stat_statements for top slow queries on kanban_card and item tables.
Peak-hour pod log slice: re-run the same kubectl pull during a business-hours window (12:00–21:00 UTC) and aggregate. Off-peak data understates the problem materially.
kanban-card/details 500 root cause: needs an ERROR-log channel added or an exception-message capture upstream in the framework. Right now the pod is failing silently.