PDEV-442 — Backend log inspection
PDEV-442 — Backend log inspection (last 3 h)
Section titled “PDEV-442 — Backend log inspection (last 3 h)”Window: 2026-05-13T03:23Z → 2026-05-13T06:23Z (off-peak UTC). Files in this directory:
pod-9whbd.log,pod-xn74c.log— kubectl stdout from both operations pods (last 3h).endpoint-aggregate.json— per-endpoint SERVER-role timings parsed from pod logs.apigw-integration-latency.json,apigw-latency.json— API Gateway CloudWatch metrics, 5-min buckets, last 3h.apigw-kanban.json— API Gateway access logs filtered to kanban-card routes.amplify-arda.json,amplify-lifecycle.json,amplify-CLIENT.json,amplify-role.json— Amplify Lambda log slices.aurora-sample.json,aurora-duration.json— empty; Aurora postgres log group has no events in window.
Operations pod (SERVER role, last 3 h)
Section titled “Operations pod (SERVER role, last 3 h)”Pod-level REST timings — what each pod sees from request received → response sent. CLIENT-role entries do not exist in pod stdout in this window; only SERVER-role markers are present, which is the same per-call timing.
| Endpoint | Method | Count | Min | Avg | p95 | Max | Errors (500) |
|---|---|---|---|---|---|---|---|
/v1/kanban/kanban-card/details | POST | 110 | 87 ms | 849 ms | 1,621 ms | 2,050 ms | 41 (37 %) |
/v1/kanban/kanban-card/for-item/{id} | GET | 54 | 103 ms | 1,009 ms | 2,044 ms | 2,134 ms | 0 |
for-item is uniformly slow (avg ~1 s, p95 ~2 s) but all 200 OK. kanban-card/details
fails 37 % of the time with HTTP 500 — and the pod emits no ERROR-level log
line and no stack trace for those, only the SERVER timing record. The failure
path is silent in stdout.
API Gateway (us-east-1, AlphaA001-prod-API)
Section titled “API Gateway (us-east-1, AlphaA001-prod-API)”Last-3-h 5-min buckets show p95 1.1–1.7 s, max 2.6 s — consistent with the pod
view. IntegrationLatency and total Latency are within ~5 ms of each other,
so API Gateway itself is pure passthrough.
But the 24-h hourly view is much more interesting:
| Hour (UTC) | max ms | p95 ms | p99 ms |
|---|---|---|---|
| 06:25–12:25 | 2.0–3.3 k | 0.2–1.6 k | 0.7–2.1 k |
| 12:25 (peak begins) | 7.5 k | 5.2 k | 7.2 k |
| 13:25 | 6.1 k | 3.0 k | 5.9 k |
| 14:25 | 13.7 k | 9.4 k | 12.7 k |
| 15:25 | 6.0 k | 3.5 k | 5.5 k |
| 16:25 | 11.3 k | 6.7 k | 9.4 k |
| 17:25 | 8.0 k | 5.5 k | 7.4 k |
| 18:25 | 11.9 k | 8.8 k | 11.0 k |
| 19:25 | 8.5 k | 4.6 k | 7.6 k |
| 20:25 | 18.6 k | 2.0 k | 10.5 k |
| 21:25 | 7.6 k | 5.2 k | 7.4 k |
| 22:25–06:25 | 2.0–2.7 k | 1.0–1.5 k | 1.3–1.6 k |
Operations IS genuinely slow during business hours (12:00–21:00 UTC, ~ 7:00–16:00 CT), with p99 7–12 s and the worst single request 18 s. The off-peak window I sampled in the previous in-session test masked this.
Conclusion at the API Gateway layer: API Gateway is a passthrough; the 13.5 s p95 Sentry showed for the BFF→operations call IS predominantly the pod’s own response time during business hours, not a network or proxy artifact.
Amplify SSR Lambda (us-east-2, /aws/amplify/duhexavnwh88g)
Section titled “Amplify SSR Lambda (us-east-2, /aws/amplify/duhexavnwh88g)”Per-invocation timings
Section titled “Per-invocation timings”REPORT lines from the BFF Lambda show typical warm execution 25–500 ms. Cold starts confirmed:
REPORT RequestId: 43d99b4d-220c-4fed-960e-35a56c76f36e Duration: 675.14 ms Billed Duration: 2857 ms Memory Size: 1024 MB Max Memory Used: 177 MB **Init Duration: 2181.28 ms**
A 2.18 s init followed by 675 ms execution → 2.86 s billed wall-clock. This
confirms the Lambda cold-start hypothesis for the “5–8 s gap on cheap routes”
when Sentry’s BFF p95 was much larger than upstream http.client p95.
Logging anti-patterns (worth flagging)
Section titled “Logging anti-patterns (worth flagging)”The BFF logs are very verbose for every request:
- Full bearer JWT in plaintext (
Authorization: Bearer d3733bda-697e-…) at INFO level. Security finding — opens a credential-exposure path through CloudWatch. - Full request body, full request headers, and full user-context object are logged twice per call, including JWT.
- Logs show
User-Agent: PostmanRuntime/7.45.0as a hard-coded value, which is harmless but odd.
The logging volume itself adds a few ms of CPU and bytes of egress per call — not the perf culprit, but worth cleaning up alongside the security fix.
No “CLIENT”-marker JSON found in Amplify logs
Section titled “No “CLIENT”-marker JSON found in Amplify logs”Searched filter pattern CLIENT and role in Amplify Lambda logs over the
last 3 h: no JSON entries with "role":"CLIENT". The Amplify code logs in a
different format (ad-hoc strings, not the structured role-tagged JSON the
operations pod uses).
Aurora database (RDS Performance Insights, 24 h)
Section titled “Aurora database (RDS Performance Insights, 24 h)”Cluster shape
Section titled “Cluster shape”- Cluster:
alpha001-prod-auroracluster(aurora-postgresql 16). - Instances:
db.t3.medium(2 vCPU, 4 GiB RAM) writer + reader. - Performance Insights enabled, 465-day retention on both instances.
pg_stat_statementsextension is NOT installed (extension check returned 0 rows). Slow-query log group has no events. The cluster has no in-database query observability — PI is the only source.
Top SQL by DB load (avg active sessions, 24 h)
Section titled “Top SQL by DB load (avg active sessions, 24 h)”| Avg active sessions | Tokenized SQL (first 160 chars) |
|---|---|
| 0.185 | SELECT kanban_card.id, kanban_card.effective_as_of, kanban_card.recorded_as_of, … |
| 0.075 | SELECT item.id, item.effective_as_of, item.recorded_as_of, … |
| 0.034 | SELECT COUNT(*) FROM (SELECT DISTINCT kanban_card.eid … WHERE tenant_id = $1 AND item_reference_entity_id = $2 …) |
| 0.023–0.005 | various other SELECT item.id, … and SELECT kanban_card.id, … queries (15+ tokenized variants) |
The two top queries are the dominant load — both wide selects against
kanban_card and item with bitemporal effective_as_of /
recorded_as_of filters.
DB load and wait events (24 h)
Section titled “DB load and wait events (24 h)”| Aggregate | Value |
|---|---|
| Total avg active sessions (db.load.avg) | 0.49 |
| Peak avg active sessions | 2.14 |
| Fraction of load in CPU class | 100 % |
| Lock / IO / Latch wait classes | 0 |
Writer CPU and credits (24 h hourly)
Section titled “Writer CPU and credits (24 h hourly)”| Metric | Range |
|---|---|
Writer CPUUtilization (avg) | 19–22 % |
Writer CPUUtilization (max hourly) | 50–92 % |
Writer CPUCreditBalance | flat at ~533 (no burst-credit drain) |
Conclusion at the database layer: the database is not the bottleneck. Average load is 0.49 active sessions on a 2-vCPU instance (1.5 cores of headroom). 100 % of the load is CPU class — no lock or IO contention. The DB-side per-call cost is well under the 1–2 s per-call times the pod reports for the same endpoints.
The two most-loaded queries are wide bitemporal selects pulling every
column on kanban_card and item. They are fast on a per-call basis;
their load contribution comes from frequency, not duration.
Implications for the slowness
Section titled “Implications for the slowness”The pod’s 1–2 s per-call response time for kanban-card/for-item and
kanban-card/details is not explained by Aurora. The DB is fast.
The work is happening in the pod itself — most likely:
- Exposed ORM materialisation of the wide bitemporal rows.
- The bitemporal versioning logic (
effective_as_of/recorded_as_offiltering, “as-of” view) running in-app. - Coroutine scheduling under load on a small pod (worth confirming the pod’s CPU/memory limits and Fargate sizing).
- Serialization to JSON of large row sets.
A profile of the operations JVM during a slow request would discriminate between these. The next investigation step is to capture an async-profiler flame graph (or a JFR snapshot) from one of the pods during a peak-load window.
Aurora postgres CloudWatch log group
Section titled “Aurora postgres CloudWatch log group”/aws/rds/cluster/alpha001-prod-auroracluster/postgresql has zero events
in the last 3 h (and the most recent event in the log streams is from
2026-05-11). Either log_min_duration_statement is unset/too high, or query
logging is disabled. The slow-query log path is not available without
enabling that parameter. Need to either (a) enable parameter group changes
or (b) connect via bastion to query pg_stat_statements for live data.
Summary table — where the time actually goes
Section titled “Summary table — where the time actually goes”Updated picture after this investigation, replacing the earlier “BFF amplifies upstream 10×” framing:
| Layer | Observed contribution | Evidence |
|---|---|---|
| Browser → BFF Lambda | <300 ms when warm; +2 s once per container on cold start | Amplify REPORT lines (Init Duration 2.18 s) |
| BFF Lambda execution (warm) | 200–700 ms | Amplify REPORT Duration |
| BFF → API Gateway (cross-region us-east-2 → us-east-1) | not directly measurable; bounded by Lambda Duration | — |
| API Gateway | ~5 ms overhead | IntegrationLatency ≈ Latency |
| API Gateway → operations pod (peak hours) | p95 9 s, max 18 s | API Gateway IntegrationLatency 24-h hourly view |
| Operations pod processing (off-peak) | avg 1 s, p95 2 s, max 2.1 s | Pod SERVER-role aggregate (this window) |
| Operations pod failures (kanban-card/details) | 37 % HTTP 500, silent in stdout | Pod SERVER-role aggregate |
| Aurora query path | not yet measured — slow-query log empty | needs pg_stat_statements or pg log enabled |
Hypotheses sharpened
Section titled “Hypotheses sharpened”- The dominant source of slowness is the operations pod under peak business-hour load, not the BFF. The pod alone reaches p99 = 7–12 s during 12:00–21:00 UTC.
- Lambda cold start adds ~2 s per cold container — explains the 5–8 s gap on otherwise-cheap routes, NOT the multi-tens-of-seconds tail.
- The kanban-card/details endpoint has a 37 % silent error rate. This is independent of slowness and is likely a significant UX defect on its own; needs an ERROR-log channel added in operations.
- The 158 s p95 BFF transaction figure in Sentry is now best explained by:
pod peak-hour latency (up to 18 s) ×
possible retries by the BFF (need to verify in
arda-frontend-appsource) × Lambda timeout / aborted-transaction tail.
What’s still missing
Section titled “What’s still missing”- Aurora DB-side timing: postgres log group is empty. Two paths:
- Enable
log_min_duration_statement = 500(parameter group change, restart-free). - Connect via the prod bastion and query
pg_stat_statementsfor top slow queries onkanban_cardanditemtables.
- Enable
- Peak-hour pod log slice: re-run the same kubectl pull during a business-hours window (12:00–21:00 UTC) and aggregate. Off-peak data understates the problem materially.
kanban-card/details500 root cause: needs an ERROR-log channel added or an exception-message capture upstream in the framework. Right now the pod is failing silently.
Copyright: © Arda Systems 2025-2026, All rights reserved