PDEV-442 — Operations pod capacity (EKS / Fargate)

Captured during the PDEV-442 investigation while looking for “signs of stress on the EKS nodes during high-latency periods”. Fargate runs each pod inside its own micro-VM, so the pod’s cgroup quota is the node budget — there’s no separate node layer to inspect.

Deployment operations in namespace prod-operations on EKS cluster Alpha001-eks-cluster. Two replicas on Fargate.

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 1
    memory: 2Gi

Total prod operations budget on paper: 2 vCPU / 4 GiB across both pods.

Inside both pods, cgroup v1 reports:

| Pod | cpu.cfs_quota_us | cpu.cfs_period_us | Effective vCPU |
| --- | --- | --- | --- |
| operations-78bc575484-9whbd | 50000 | 100000 | 0.5 |
| operations-78bc575484-xn74c | 50000 | 100000 | 0.5 |

Despite the deployment declaring limits.cpu = 1, Fargate’s cgroup quota is set to 0.5 vCPU per pod. Fargate appears to honor the request (500m), not the limit. The component-wide CPU budget for prod is therefore 1 vCPU across both pods.

Sanity check needed. Confirm against the Fargate task definition whether the task is being sized to the request or to the limit on this deployment. Either way, the cgroup is the source of truth and it caps each container at 0.5 vCPU.
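The effective-vCPU column is the ratio of the two cgroup values. A minimal sketch of that calculation follows; the function name `effective_vcpu` is illustrative, and the in-pod paths in the comment assume the usual cgroup v1 mount layout, which may differ on a given image.

```shell
#!/bin/sh
# Effective vCPU under CFS bandwidth control = cfs_quota_us / cfs_period_us.
effective_vcpu() {
  # $1 = quota in microseconds, $2 = period in microseconds
  awk -v q="$1" -v p="$2" 'BEGIN {
    if (q < 0) { print "unlimited"; exit }   # a quota of -1 means no CFS cap
    printf "%.2f\n", q / p
  }'
}

# In-pod usage (paths are an assumption about the cgroup v1 mount point):
#   effective_vcpu "$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)" \
#                  "$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)"
```

With the values captured above, `effective_vcpu 50000 100000` prints `0.50`. Re-running it after the next rollout is a quick way to confirm whether the quota followed the new request.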

Captured over each pod’s ~6.3-day uptime:

| Pod | nr_periods | nr_throttled | throttled_time | % periods throttled |
| --- | --- | --- | --- | --- |
| 9whbd | 1,193,454 | 523 | 35.35 s | 0.044 % |
| xn74c | 1,284,586 | 799 | 48.25 s | 0.062 % |

Throttling is occurring. The aggregate throttled-time is modest in absolute terms (35–48 s per pod over 6 days), but the events cluster during high-load periods (12:00–21:00 UTC). Even rare 60–100 ms throttle waits stack with each other under concurrent load and materialise as p99 tail latency.

nr_bursts = 0 on both pods: the cgroup is not configured with CPU burst, so the pod cannot use credits earned during idle periods to absorb spikes.
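The percentages in the throttling table can be re-derived from the pod's `cpu.stat` file. The sketch below assumes the cgroup v1 format, where `throttled_time` is reported in nanoseconds; the function name `throttle_summary` is illustrative.

```shell
#!/bin/sh
# Summarise CFS throttling from cgroup v1 cpu.stat content on stdin.
throttle_summary() {
  awk '
    $1 == "nr_periods"     { periods = $2 }
    $1 == "nr_throttled"   { throttled = $2 }
    $1 == "throttled_time" { ns = $2 }        # nanoseconds in cgroup v1
    END {
      if (periods == 0 || throttled == 0) { print "no throttling recorded"; exit }
      printf "pct_throttled=%.3f throttled_s=%.2f avg_wait_ms=%.1f\n",
             100 * throttled / periods, ns / 1e9, ns / 1e6 / throttled
    }
  '
}

# In-pod usage (path is an assumption about the cgroup v1 mount point):
#   throttle_summary < /sys/fs/cgroup/cpu/cpu.stat
```

Dividing throttled time by the number of throttled periods gives the average wait per throttled period. For the figures in the table that works out to roughly 68 ms (9whbd) and 60 ms (xn74c).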

| Pod | usage_in_bytes | max_usage_in_bytes | limit_in_bytes | memory.failcnt |
| --- | --- | --- | --- | --- |
| 9whbd | 441 MB | 442 MB | 2 GiB | 0 |
| xn74c | 440 MB | 445 MB | 2 GiB | 0 |

JVM VmRSS 432 MB, 47 threads, no swap, no pgmajfault pressure (26 total over 6 days). max_usage_in_bytes is essentially the current usage — memory has been steady, never near the limit. Memory is not the constraint.

  • Live loadavg: 0.02 0.03 0.00 — off-peak; no current pressure.
  • JVM peak virtual memory (VmPeak) 3.27 GB — virtual address space including mmap regions, not resident. Cgroup enforces against RSS.
  • No OutOfMemory, OOMKilled, GC pause messages or throttle markers in the pod stdout for the 3-h window (note: JVM default GC logging is off; explicit -Xlog:gc* is required to capture pauses).
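For repeat checks, the memory readings above can be pulled with two small helpers. This is a sketch only: `rss_mib` and `usage_pct` are illustrative names, and the sketch assumes the JVM is PID 1 in the container, as the Jib entrypoint suggests.

```shell
#!/bin/sh
# Resident set size in MiB, parsed from /proc/<pid>/status content on stdin.
rss_mib() {
  awk '$1 == "VmRSS:" { printf "%d\n", $2 / 1024 }'   # VmRSS is reported in kB
}

# Share of the cgroup memory limit currently in use, as a percentage.
usage_pct() {
  # $1 = usage_in_bytes, $2 = limit_in_bytes
  awk -v u="$1" -v l="$2" 'BEGIN { printf "%.1f\n", 100 * u / l }'
}

# Example (context, namespace and pod are placeholders):
#   kubectl --context <ctx> -n <ns> exec <pod> -- cat /proc/1/status | rss_mib
```

A reading near 20 % of the limit, as captured here, matches the conclusion that memory is not the constraint.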

How this reconciles with the measured latency

A single request that consumes 300 ms of CPU work needs ~600 ms wall-clock on 0.5 vCPU with no contention. Under the /items workload (12 concurrent kanban requests per page load against 2 pods × 0.5 vCPU = 1 vCPU of capacity) each request can degrade to 1–2 s. That matches the pod-level latency we measure (avg 0.85–1.0 s, p95 1.6–2.0 s). During business hours, with multiple users on /items simultaneously, the same arithmetic is consistent with the p99 = 7–12 s that API Gateway reports.
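The arithmetic can be sketched as two helpers. Both are simplifications that assume purely CPU-bound work and ideal sharing of the quota; `wall_ms` and `drain_ms` are illustrative names, and the per-request CPU cost is an input to measure, not a known value.

```shell
#!/bin/sh
# Wall-clock time for one request needing cpu_ms of CPU on a vcpu-sized quota.
wall_ms() {
  awk -v c="$1" -v v="$2" 'BEGIN { printf "%.0f\n", c / v }'
}

# Time to drain n simultaneous requests of cpu_ms each on total_vcpu of capacity.
drain_ms() {
  awk -v n="$1" -v c="$2" -v v="$3" 'BEGIN { printf "%.0f\n", n * c / v }'
}
```

`wall_ms 300 0.5` prints `600`, the uncontended case from the text. With a hypothetical 100 ms of CPU per request, `drain_ms 12 100 1` prints `1200`: the twelfth request of one page load would finish about 1.2 s after arrival. The JFR snapshot from Recommendation #4 is what would supply the real per-request figure.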

The DB is fast (PI: 0.49 avg active sessions, 100 % CPU class, 19–22 % writer CPU). The cost is in the pod’s in-process work — Exposed ORM materialisation of wide bitemporal rows, JSON serialisation, the bitemporal “as-of” filter — running on a CPU budget that is too small to absorb the page’s fan-out.

Diagnostic audit of how the operations JVM uses the container memory budget today. Motivates the memory and GC flags in Recommendation #3 and the prod sizing decisions in PDEV-488.

Audit of operations/build.gradle.kts shows the runtime JVM has zero memory-related flags set anywhere:

| Source | Value | Purpose |
| --- | --- | --- |
| application.applicationDefaultJvmArgs (line 26) | ["-Dio.ktor.development=$isDevelopment"] | config flag, no memory tuning |
| jib.container.jvmFlags (line 372) | ["-Darda.config.location=/app/conf:/app/secret"] | config flag, no memory tuning |
| tasks.test { jvmArgs("--add-opens=...") } (line 599) | test JVM only | not in the runtime container |

Corretto 21 with -XX:+UseContainerSupport (enabled by default since JDK 10) reads the cgroup memory limit and picks max heap = 25 % of the limit by default (MaxRAMPercentage=25.0). With current and proposed pod sizing:

| Env | Container memory limit | Default max heap (25 %) | Unused budget |
| --- | --- | --- | --- |
| Today (all envs) | 2 GiB | ~512 MiB | ~1 GiB |
| Prod after PDEV-488 | 4 GiB | ~1 GiB | ~2.5 GiB |

Moving prod to 4 GiB worsens the under-utilisation ratio unless we also adjust the heap percentage. The measured RSS of ~440 MB (§ Memory posture above) is consistent with the JVM using a small fraction of its 512 MiB ceiling — heap + non-heap + native combined — well within the current limit. So today’s “memory is comfortable” reading is correct, but it’s also telling us the JVM isn’t using the budget we’re paying for.
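The heap ceilings follow directly from the percentage. A minimal sketch of the calculation, with `max_heap_mib` as an illustrative name (the JVM aligns the real value to its heap region size, so the figure it reports may differ slightly):

```shell
#!/bin/sh
# Approximate max heap in MiB for a container limit (MiB) and MaxRAMPercentage.
max_heap_mib() {
  awk -v limit="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", limit * pct / 100 }'
}

# To see what the JVM itself computes, run inside the container:
#   java -XX:MaxRAMPercentage=75.0 -XX:+PrintFlagsFinal -version | grep MaxHeapSize
```

`max_heap_mib 2048 25` prints `512` (today), `max_heap_mib 4096 25` prints `1024` (prod after PDEV-488 with defaults), and `max_heap_mib 4096 75` prints `3072` (prod with the proposed percentage).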

All six flags below are intended to be shipped together. Five of them are in the compute.javaToolOptions values string; the sixth (-XX:HeapDumpPath) is path-dependent and is appended by templates/deployment.yaml using the application.artifactPrefix helper (see § Reusable Helm helpers under Evaluation of Changes).

| Flag | Value | Why |
| --- | --- | --- |
| -XX:MaxRAMPercentage | 75.0 | Let heap grow to ~75 % of the container limit. Leaves the remaining ~25 % for code cache, metaspace, thread stacks, direct buffers, and JVM-native memory, which is enough breathing room that JVM memory growth won’t trip the K8s OOM-killer. Standard for containerised Java. |
| -XX:InitialRAMPercentage | 50.0 | Start the heap at 50 % of the container limit (about two-thirds of the eventual max heap) instead of growing lazily. Reduces early-life GC churn and improves first-request latency after pod start. |
| -XX:+ExitOnOutOfMemoryError | (flag) | If heap OOM ever fires, terminate the JVM immediately so K8s restarts the pod cleanly. Without this the JVM can limp on for many seconds with degraded behaviour before the liveness probe kills it. |
| -XX:+HeapDumpOnOutOfMemoryError | (flag) | If OOM fires, write a heap dump for post-mortem. |
| -XX:HeapDumpPath | /tmp/{prefix}-heap.hprof | Path of the heap dump produced on OOM. See Notes below for how {prefix} is composed and why this flag is appended by the deployment template instead of declared in values. |
| -Xlog:gc*:stdout:time,uptime,level,tags | (target) | GC logging to container stdout, shipped to CloudWatch via the cluster’s existing Fluent Bit path. Closes the silent-GC blind spot. |

Notes:

  • {prefix} composition. -XX:HeapDumpPath is not a static flag; it embeds an environment-and-component identifier (e.g. prod-operations) so heap dumps from different pods don’t collide when pulled to a workstation. The identifier is computed at chart render time by the application.artifactPrefix helper (defined alongside the existing application.* helpers in _helpers.tpl). For this reason the flag is appended by templates/deployment.yaml, not declared in any values*.yaml. Per-env override is available via compute.artifactPrefix.
  • Heap-dump retrieval is operator-driven. Fargate ephemeral storage is ~20 GiB by default — plenty for one heap dump (heap caps at ~3 GiB on the prod sizing). The dump persists only until pod recycle; retrieve it before then with kubectl --context <ctx> -n <ns> exec <pod> -- sh -c 'cat /tmp/<prefix>-heap.hprof' > local.hprof.
  • GC logging targets stdout deliberately. The cluster already ships container stdout to CloudWatch via Fluent Bit; the alternative (writing to a file in the container) would lose data on pod restart and require a separate shipping pipeline. See Recommendation #3’s “Stdout is sufficient — no Fluent Bit change” subsection for the full reasoning.

Considered and not adopted:

  • -Xmx / -Xms as absolute values — using percentages keeps the flags portable across the per-env memory limits (0.5 vCPU/2 GiB dev through 2 vCPU/4 GiB prod) without per-env overrides.
  • Changing the GC. G1 is the default and is appropriate for this workload (mixed allocation, latency-sensitive). ZGC and Shenandoah reduce pause times further but have larger heap overhead and are worth considering only after a JFR snapshot shows GC pauses as the primary contributor to tail latency.
  • -XX:+UseStringDeduplication — would save heap on the duplicated Item-name / supplier strings in KanbanCardDetails payloads, but the savings are small relative to the heap budget and the better fix is OP4 (slim down the response shape) on the operations side (PDEV-490).
  • -XX:NativeMemoryTracking=summary — useful for one-off diagnosis of where non-heap memory goes, but adds steady overhead. Enable ad-hoc via a debug values file, don’t ship in prod by default.

In order of impact, lowest effort first:

  1. Align requests.cpu with limits.cpu. On Fargate, the request determines the task size. Setting request 500m / limit 1 yields 0.5 vCPU effective. Use request 2 / limit 2 (or higher) — and verify the cgroup quota changes accordingly after the next rollout.

  2. Add an HPA with a CPU target (e.g. 60 %). Today the deployment is pinned at 2 replicas regardless of load. The /items fan-out is a classic case for horizontal scaling.

  3. Enable JVM GC logging (-Xlog:gc*:stdout:time,uptime,level,tags) shipped to CloudWatch via the existing stdout → Fluent Bit path, so we can observe pauses during peak hours.

  4. Capture a JFR snapshot under peak load to find the dominant in-process cost (Exposed materialisation vs. JSON serialisation vs. bitemporal logic). async-profiler is not viable on Fargate (see Recommendation #4 for why); JFR is the Fargate-compatible replacement.

  5. Attach a JVM-agent-based Sentry instrumentation (Sentry’s OpenTelemetry Java agent) with no Kotlin source changes. Provides server spans, JDBC spans, distributed tracing across SPA → BFF → operations, and JVM runtime metrics. Wired through the same JAVA_TOOL_OPTIONS mechanism as #3 and #4; runtime configuration under a new oam.performance.sentry.* Helm values block. Gated off by default until the Sentry project + the single DSN are provisioned (PDEV-492).

  6. Reconsider the Fargate choice for the operations component. EC2-backed Karpenter-managed nodes would allow CPU burst and a denser scheduling profile; today the per-pod overhead of a Fargate micro-VM is paying for capacity that mostly sits idle off-peak.

    Status: explicitly deferred. Not in scope for PDEV-488 or any sibling sub-issue under PDEV-442. Revisit when system scale or sustained-load patterns make the cost-of-Fargate calculus material; no concrete plan at this point.

Tracked by PDEV-488. The Helm-chart changes that satisfy #1 and #2 are scoped to that ticket; this section summarises what they entail:

  • compute: block in values.yaml with per-env requests-equal-limits Fargate-valid sizes (dev / stage / demo 1 vCPU / 2 GiB; prod 2 vCPU / 4 GiB). All four envs share the same 2 GiB-per-vCPU memory ratio.
  • HPA template shipped with compute.hpa.enabled: true in all four envs day one. The EKS metrics-server addon (PDEV-491) is the runtime prerequisite for the HPA to actually scale; if it lags behind PDEV-488 the HPA stays in a <unknown>/60% state and the deployment holds at minReplicas — no chart-side fail.
  • compute.hpa.minReplicas as the single source of truth for replica count, whether the HPA is actively scaling or not.
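To reason about how the HPA would react to a given load, the core of the Kubernetes scaling rule (desired = ceil(current × currentUtilisation / target), clamped to min and max) can be sketched as below. The function name and the `max` value in the examples are hypothetical; the sketch leaves out the controller's default 10 % tolerance band and its stabilisation windows.

```shell
#!/bin/sh
# Desired replica count for a CPU-utilisation HPA target.
# Utilisation is a percentage of requests.cpu, averaged across pods.
hpa_desired_replicas() {
  current=$1 util=$2 target=$3 min=$4 max=$5
  # Integer ceiling of current * util / target.
  desired=$(( (current * util + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired=$min
  [ "$desired" -gt "$max" ] && desired=$max
  echo "$desired"
}
```

With two replicas averaging 90 % against a 60 % target, `hpa_desired_replicas 2 90 60 2 6` prints `3`. At 30 % it prints `2`, because the floor is minReplicas. Because utilisation is measured against requests.cpu, aligning requests with limits (Recommendation #1) also makes the HPA target easier to interpret.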

Recommendations #3, #4, and #5 require additional changes to the same Helm chart (and one runbook in the documentation site). All three are JVM-side changes that share a mechanism — a new compute.javaToolOptions value plumbed into the container as JAVA_TOOL_OPTIONS — but each delivers independent value.

#3, #4, and #5 each need to construct identifiers that combine the deployment partition, environment, and component name — for filenames, Sentry environment tags, and JFR recording names. To keep the values files pure YAML (no {{ }} substitutions inside values, no Helm tpl indirection at render time), the composition is done with two new helpers in src/main/helm/templates/_helpers.tpl, alongside the existing application.name, application.labels, etc. After implementation _helpers.tpl is the single source of truth; the rest of this document refers to the helpers by name only.

Helpers to add (proposed contents — implement once, in _helpers.tpl, then reference by include):

{{/*
Composes "{infrastructure}-{purpose}" (e.g. "alpha001-prod"). Used as
the default for the Sentry environment tag and any other
"{partition}-{env}" identifier.
Usage: {{ include "application.environment" . }}
*/}}
{{ define "application.environment" -}}
{{ printf "%s-%s" .Values.global.infrastructure .Values.global.purpose }}
{{- end }}
{{/*
On-disk artifact filename prefix for JFR exit recordings, heap dumps,
and similar files that may be pulled to a workstation for analysis.
Returns "{purpose}-{component}" (e.g. "prod-operations") so artifacts
collected from multiple components / environments are unambiguous.
Usage: {{ include "application.artifactPrefix" . }}
*/}}
{{ define "application.artifactPrefix" -}}
{{ printf "%s-%s" .Values.global.purpose (include "application.name" .) }}
{{- end }}

Returns at a glance:

| Helper | Returns | Example output | Used for |
| --- | --- | --- | --- |
| application.environment | {infrastructure}-{purpose} | alpha001-prod | Default for SENTRY_ENVIRONMENT (#5) |
| application.artifactPrefix | {purpose}-{component} | prod-operations | Default for heap-dump filename (#3), JFR name= and exit-filename (#4) |

The subsections below reference these helpers by name. Values files contain only literal strings and per-env overrides; substitution and composition happen exclusively inside templates/deployment.yaml.
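Runbook scripts that fetch these artifacts need to derive the same names outside Helm. A small shell mirror of the two helpers might look like the following; the function names are illustrative and it does not model the compute.artifactPrefix override.

```shell
#!/bin/sh
# Mirrors application.environment: "{infrastructure}-{purpose}".
app_environment() { printf '%s-%s\n' "$1" "$2"; }

# Mirrors application.artifactPrefix: "{purpose}-{component}".
artifact_prefix() { printf '%s-%s\n' "$1" "$2"; }

# Heap-dump path as composed by the deployment template.
heap_dump_path() { printf '/tmp/%s-heap.hprof\n' "$(artifact_prefix "$1" "$2")"; }
```

For example, `heap_dump_path prod operations` prints `/tmp/prod-operations-heap.hprof`, the path to pass to the retrieval command in the Notes above.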

Recommendation #3 — Enable JVM GC logging and ship to CloudWatch

What it does. Turns on the JVM’s structured GC log (-Xlog:gc*) so every garbage collection event emits a line we can correlate against request-latency spikes. Today the JVM is silent about GC; when the pod shows a tail-latency burst we can’t tell whether it was a long G1 pause, heap-allocation pressure, or upstream slowness.

Mechanism — use JAVA_TOOL_OPTIONS, not JAVA_OPTS. The operations image is built with Jib (gradle plugin, see build.gradle.kts:370-385), not a Dockerfile. There is no Dockerfile in the operations repo to tweak. Jib generates an exec-form ENTRYPOINT — an array, not a shell command:

ENTRYPOINT ["java",
"-Darda.config.location=/app/conf:/app/secret", // from jib.container.jvmFlags
"-Dio.ktor.development=false", // from application.applicationDefaultJvmArgs
"-cp", "@/app/jib-classpath-file",
"io.ktor.server.netty.EngineMain"]

Because there is no shell, the conventional JAVA_OPTS env var is not expanded — setting it in the K8s pod env would be silently ignored. The portable alternative the JVM itself reads at startup is JAVA_TOOL_OPTIONS (part of the OpenJDK launcher contract). Setting JAVA_TOOL_OPTIONS in the container env works with Jib’s exec-form entrypoint without any image or build-system change, and the JVM logs Picked up JAVA_TOOL_OPTIONS: ... to stderr on startup — a free verification line.

Where the change lives. Two pieces in the operations repo (no Dockerfile, no Jib config change needed).

  1. Plumb a javaToolOptions value through the Helm chart. Extend the compute: block in values.yaml, default empty:

    # values.yaml
    compute:
      resources: { ... }
      hpa: { ... }
      # Extra JVM startup options applied via the JAVA_TOOL_OPTIONS env
      # var. The operations image is Jib-built with an exec-form
      # ENTRYPOINT, so JAVA_OPTS (which requires a shell) is not read.
      # JAVA_TOOL_OPTIONS is honoured by the JVM directly.
      # Disabled by default; set in per-environment values files to
      # enable GC logging, JFR, etc.
      javaToolOptions: ""

    Per-env opt-in (start with stage and prod):

    # values-prod.yaml — append at end
    compute:
      # ...existing resources/hpa blocks...
      # Static JVM flags only. The path-dependent flag
      # -XX:HeapDumpPath is appended by the deployment template using
      # the `application.artifactPrefix` helper — renders as
      # `/tmp/prod-operations-heap.hprof` in prod, etc.
      javaToolOptions: >-
        -Xlog:gc*:stdout:time,uptime,level,tags
        -XX:InitialRAMPercentage=50.0
        -XX:MaxRAMPercentage=75.0
        -XX:+ExitOnOutOfMemoryError
        -XX:+HeapDumpOnOutOfMemoryError

    See the JVM Memory Evaluation section above for the audit that motivates this flag set and the rationale behind each flag.

  2. Inject JAVA_TOOL_OPTIONS into the container in templates/deployment.yaml. The container already has an env: block (it injects DB URLs and the like). Compose the final value from the static base flags (compute.javaToolOptions) plus a path-dependent heap-dump flag built from the application.artifactPrefix helper. Per-env override of the prefix is available via compute.artifactPrefix.

    This snippet is the #3-only baseline; Recommendations #4 and #5 incrementally extend it by appending more fragments to $opts. Apply this version if you ship #3 in isolation.

    env:
      # ...existing entries...
      # Emit JAVA_TOOL_OPTIONS only when base flags are configured. The
      # heap-dump flag is never empty, so the guard must test the base value.
      {{- if .Values.compute.javaToolOptions }}
      {{- $prefix := default (include "application.artifactPrefix" .) .Values.compute.artifactPrefix }}
      {{- $heapDumpFlag := printf "-XX:HeapDumpPath=/tmp/%s-heap.hprof" $prefix }}
      {{- $opts := printf "%s %s" (trim .Values.compute.javaToolOptions) $heapDumpFlag }}
      - name: JAVA_TOOL_OPTIONS
        value: {{ $opts | quote }}
      {{- end }}

    Values files stay pure YAML — no Go-template substitutions in values, no tpl indirection. The composition is local to the deployment template, where it belongs, and per-env overrides are plain strings.

    No Dockerfile to touch, no Jib jvmFlags change. The existing jib.container.jvmFlags = ["-Darda.config.location=..."] stays in the ENTRYPOINT (these are “always-on” baseline flags that don’t belong in per-env config). The JVM reads JAVA_TOOL_OPTIONS in addition to whatever is already on the command line, so both apply; where the same option appears in both, the command-line value wins.

  3. Stdout is sufficient — no Fluent Bit change. The cluster already ships container stdout to CloudWatch via Fluent Bit (log groups /Alpha001/eks-logs and /aws/eks/Alpha001-eks-cluster/fluent-bit-logs). Writing GC events to stdout (the colon-prefixed target in -Xlog:gc*:stdout:...) means they show up in the same CloudWatch log stream as the pod’s application logs. No new log group, no new IAM, no new shipping config.

Verification after deploy. Pick the target env and confirm the JVM picked up the options at startup (a single line emitted to stderr, which kubectl logs returns alongside stdout):

CONTEXT=Alpha001 # Alpha001 for prod/demo, Alpha002 for stage/dev
NS=prod-operations # ${ENV}-${COMPONENT}
kubectl --context $CONTEXT -n $NS logs \
-l app=operations --tail 500 | grep 'JAVA_TOOL_OPTIONS'

Expected line (after #3 ships in isolation — the path-dependent heap-dump flag is appended by the deployment template):

Picked up JAVA_TOOL_OPTIONS: -Xlog:gc*:stdout:time,uptime,level,tags -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/prod-operations-heap.hprof
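Comparing that line by eye is error-prone once #4 and #5 append more fragments. A small checker that prints any expected flag missing from the line could look like this; `missing_jvm_flags` is an illustrative name, not an existing tool.

```shell
#!/bin/sh
# Print every expected flag that does not appear as a whole word in the
# "Picked up JAVA_TOOL_OPTIONS: ..." line. Prints nothing when all are present.
missing_jvm_flags() {
  line=$1
  shift
  for flag in "$@"; do
    case " $line " in
      *" $flag "*) ;;                 # quoted, so '*' in -Xlog:gc* stays literal
      *) printf '%s\n' "$flag" ;;
    esac
  done
}

# Example:
#   LINE=$(kubectl --context "$CONTEXT" -n "$NS" logs -l app=operations --tail 500 \
#            | grep -m1 'JAVA_TOOL_OPTIONS')
#   missing_jvm_flags "$LINE" '-XX:MaxRAMPercentage=75.0' '-XX:+ExitOnOutOfMemoryError'
```

Empty output means every listed flag reached the JVM.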

Then confirm GC events are flowing:

kubectl --context $CONTEXT -n $NS logs \
-l app=operations --tail 200 | grep '\[gc'

Should show lines like [12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 256M->130M(4096M) 18.234ms.

In CloudWatch Logs Insights:

fields @timestamp, @message
| filter @message like /\[gc\]/
| parse @message /(?<dur>\d+\.\d+)ms\s*$/
| stats avg(dur), max(dur), pct(dur, 99) by bin(5m)

This shows the GC pause distribution over time, which can then be overlaid against the API Gateway latency spikes.
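For a quick look before opening CloudWatch, the same extraction can be done locally on `kubectl logs` output. This sketch assumes pause lines carry exactly the `[gc]` tag and end in a duration such as `18.234ms`, as in the sample above; `gc_pause_stats` is an illustrative name.

```shell
#!/bin/sh
# Count, average and maximum of GC pause durations from JVM unified-log lines on stdin.
gc_pause_stats() {
  awk '
    /\[gc\]/ && match($0, /[0-9]+\.[0-9]+ms[ \t]*$/) {
      ms = substr($0, RSTART, RLENGTH) + 0   # numeric prefix: "18.234ms" -> 18.234
      n++
      sum += ms
      if (ms > max) max = ms
    }
    END {
      if (n) printf "count=%d avg_ms=%.3f max_ms=%.3f\n", n, sum / n, max
      else   print "no GC pause lines found"
    }
  '
}

# Example:
#   kubectl --context "$CONTEXT" -n "$NS" logs -l app=operations --tail 5000 | gc_pause_stats
```

Unlike the Logs Insights query it gives no percentiles or time bins, so it only serves as a sanity check that pauses are being logged and are of plausible size.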

Risk. Low. GC logging at info level adds ~50 bytes per event and a few microseconds. With G1 collecting every few seconds, the overhead is in the noise. The MaxRAMPercentage bump is more impactful — it gives the JVM access to the memory it’s already allocated, which should reduce GC frequency, not increase it.

Effort and ship order. XS. Can ship in the same PR as #1 and #2.

Recommendation #4 — Capture a JFR snapshot under peak load

What it does. Produces a flame graph (CPU sampling) of the operations JVM during a peak-load window, so we can attribute the in-process cost to a specific source: Exposed ORM hydration of bitemporal rows, JSON serialisation, the as-of filter logic, lock contention on a coroutine, or something else entirely. Today we know the pod is CPU-bound but we don’t know what it’s spending CPU on.

Why JFR, not async-profiler, on Fargate. Important constraint: async-profiler does not work on EKS Fargate. It relies on perf_event_open and AsyncGetCallTrace, which need either privileged container mode or CAP_PERFMON. Fargate’s micro-VM model exposes neither. This is why the recommendation title doesn’t mention async-profiler — JFR is the only viable choice under the current deployment model. If we ever leave Fargate (Recommendation #6’s deferred consideration) async-profiler becomes available.

Java Flight Recorder (JFR) is the Fargate-compatible alternative:

  • Built into the JVM (free since OpenJDK 8u262 / 11+).
  • Low-overhead (~1-2 %) in the profile template; safe to run continuously in production.
  • Captures the same data we need: method-level CPU sampling, allocation hotspots, lock contention, GC events, IO.
  • Analysed in JDK Mission Control (JMC), IntelliJ’s built-in viewer, or rendered as flame graphs via jfr2flamegraph / jfrconv (the converter bundled with async-profiler, run on the workstation), or uploaded to Profiler.dev / Pyroscope.

Where the change lives. Two pieces in the same Helm chart we touch for #3, plus a short runbook in the documentation site at documentation/src/content/docs/process/sre/runbooks/.

  1. Enable continuous JFR recording via JAVA_TOOL_OPTIONS. Reuses the compute.javaToolOptions plumbing from #3. The recording name and on-exit filename both use the {purpose}-{component} prefix produced by application.artifactPrefix (e.g. prod-operations), so JFR artifacts from operations / accounts-component / multiple environments are self-identifying when collected side-by-side:

    # values-prod.yaml — same javaToolOptions as #3 (static base flags only).
    compute:
      javaToolOptions: >-
        -Xlog:gc*:stdout:time,uptime,level,tags
        -XX:InitialRAMPercentage=50.0
        -XX:MaxRAMPercentage=75.0
        -XX:+ExitOnOutOfMemoryError
        -XX:+HeapDumpOnOutOfMemoryError

    The path-dependent flags — -XX:HeapDumpPath (from #3), -XX:StartFlightRecording=name=...,filename=... (added by this recommendation), and the Sentry -javaagent: flag (#5) — are all appended by the deployment template using the application.artifactPrefix helper. At deploy time the prod render produces name=prod-operations, the demo render produces name=demo-operations, and so on. No per-env duplication of the identifier; no {{ ... }} substitution inside values files.

    Decoding the JFR flag ({prefix} is the output of application.artifactPrefix, e.g. prod-operations):

    • name={prefix} — names the recording so jcmd can refer to it and so multiple recordings collected from different partitions don’t collide in JMC / JFR analysis tools.
    • settings=profile — the higher-fidelity preset (vs. default), still under ~2 % overhead.
    • disk=true,maxsize=200M,maxage=2h — rolling 2-hour window on disk, capped at 200 MB. Old data discarded as new arrives.
    • dumponexit=true,filename=/tmp/{prefix}-exit.jfr — writes the buffered recording on graceful pod shutdown so we don’t lose data when the pod is recycled.
  2. Extend the JAVA_TOOL_OPTIONS composition in templates/deployment.yaml to append the JFR flag, adding one more printf fragment to $opts. This is an incremental extension of the #3 baseline snippet:

    env:
      # ...existing entries...
      # Emit JAVA_TOOL_OPTIONS only when base flags are configured. The
      # heap-dump and JFR fragments are never empty, so the guard must
      # test the base value.
      {{- if .Values.compute.javaToolOptions }}
      {{- $prefix := default (include "application.artifactPrefix" .) .Values.compute.artifactPrefix }}
      {{- $heapDumpFlag := printf "-XX:HeapDumpPath=/tmp/%s-heap.hprof" $prefix }}
      {{- $jfrFlag := printf "-XX:StartFlightRecording=name=%s,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/%s-exit.jfr" $prefix $prefix }}
      {{- $opts := printf "%s %s %s" (trim .Values.compute.javaToolOptions) $heapDumpFlag $jfrFlag }}
      - name: JAVA_TOOL_OPTIONS
        value: {{ $opts | quote }}
      {{- end }}

    This is the #3 + #4 composition; Recommendation #5 extends it one more step by appending a Sentry-agent fragment to $opts. If you ship #3 + #4 without #5, this is the deployment template snippet to use.

  3. No image change needed. JFR is in the JVM. No new dependencies, no Dockerfile change.

  4. Runbook at documentation/src/content/docs/process/sre/runbooks/jvm-profiling.md. The capture/export sequence requires kubectl and is worth documenting once. The page should follow the conventions of the existing process/sre/runbooks/ directory — frontmatter with title, description (40–300 chars), tags, domain: process, maturity, author — and is added in the documentation worktree alongside this project’s goal.md:

    # The recording is named `{purpose}-{component}` (e.g. prod-operations),
    # the output of the application.artifactPrefix Helm helper.
    # Substitute below for the cluster / partition you're targeting.
    ENV=prod
    COMPONENT=operations
    NS=${ENV}-${COMPONENT}
    REC=${ENV}-${COMPONENT}
    STAMP=$(date -u +%FT%TZ)
    # 1. Pick a pod under peak load (this takes the first pod listed)
    POD=$(kubectl --context Alpha001 -n "$NS" get pod \
      -l app="$COMPONENT" -o name | head -1)
    # 2. Dump a snapshot of the rolling recording without stopping it.
    #    The JVM is PID 1 in the container, hence `jcmd 1`.
    kubectl --context Alpha001 -n "$NS" exec "$POD" -- \
      jcmd 1 JFR.dump name="$REC" filename="/tmp/${REC}-peak-${STAMP}.jfr"
    # 3. Stream it out (kubectl cp is unreliable for large files; the
    #    cat pattern matches what management/aurora-data-dump uses).
    kubectl --context Alpha001 -n "$NS" exec "$POD" -- \
      sh -c "cat /tmp/${REC}-peak-${STAMP}.jfr" > "./${REC}-peak-${STAMP}.jfr"
    # 4. Analyse locally (file is self-identifying: prod-operations-peak-*.jfr)
    #    - JDK Mission Control: open the .jfr file
    #    - IntelliJ IDEA Ultimate: Run → Open Profiler Snapshot...
    #    - Flame graph: jfrconv --cpu <file>.jfr flame.html
    #      (jfrconv ships with async-profiler and runs on the workstation;
    #      its flags differ between versions, so check jfrconv --help)

    The runbook should also note:

    • The snapshot must be captured during peak hours (12:00–21:00 UTC) to be representative. Off-peak snapshots will not reveal the bottleneck.
    • A 5–10 minute snapshot is enough; longer adds noise without insight.
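    • After capture, delete the in-pod file: kubectl exec ... -- sh -c "rm /tmp/${REC}-peak-*.jfr". The sh -c wrapper is needed so the glob expands inside the pod rather than on the workstation.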
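Step 1 of the runbook takes the first pod listed, which is not necessarily the one under the most load. Where the metrics-server addon (PDEV-491) is available, a sketch like the following could pick the busiest pod from `kubectl top` output. The `busiest_pod` name is illustrative, and the parser assumes the usual three-column NAME / CPU / MEMORY rows.

```shell
#!/bin/sh
# Name of the pod with the highest CPU reading.
# stdin: rows from `kubectl top pod --no-headers` (NAME CPU(cores) MEMORY(bytes)).
busiest_pod() {
  awk '
    {
      cpu = $2
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu += 0 }   # millicores, e.g. 480m
      else            { cpu = cpu * 1000 }               # whole cores, e.g. 1
      if (cpu >= max) { max = cpu; name = $1 }
    }
    END { if (name != "") print name }
  '
}

# Example:
#   POD=pod/$(kubectl --context Alpha001 -n "$NS" top pod \
#               -l app="$COMPONENT" --no-headers | busiest_pod)
```

With only two replicas the difference is usually small, but it keeps the snapshot representative once the HPA scales the deployment out.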

Verification after deploy. Pick a target env (here: prod) and a running pod, then confirm JFR is recording. The variables below are the same ones used in the runbook bash block above:

CONTEXT=Alpha001 # kubectl context (Alpha001 for prod/demo, Alpha002 for stage/dev)
COMPONENT=operations
NS=prod-operations # ${ENV}-${COMPONENT}
POD=$(kubectl --context $CONTEXT -n $NS get pod \
-l app=$COMPONENT -o name | head -1)
kubectl --context $CONTEXT -n $NS exec $POD -- jcmd 1 JFR.check

Expected output (substituting prod-operations for the prefix when deployed to prod; stage-operations etc. for other envs):

Recording 1: name=prod-operations duration=0s (running)

Risk. Low for the recording itself. Continuous JFR at profile settings is what Datadog Continuous Profiler, Pyroscope, and AWS CodeGuru all use in production — well within their measured 1–2 % CPU overhead and a few MB of disk.

One operational note: disk=true writes to the pod’s ephemeral storage. With maxsize=200M, that’s small enough not to matter on Fargate (default ephemeral storage is 20 GiB). If a pod has nothing else writing to disk, this is invisible.

Effort and ship order. XS for the config change itself. The bulk of the work is using the snapshot once captured — interpreting the flame graph and turning that into concrete operations-side fixes (which would feed PDEV-490 OP4–OP7). But the change to enable it is the same XS-sized addition to JAVA_TOOL_OPTIONS and a small markdown file.

Worth shipping in the same PR as #1, #2, and #3 — they’re all “one PR, one Helm chart redesign” — but only after we’ve also raised the prod pod CPU limit. JFR recording on a CPU-starved pod will exhibit the cost of its own profiling more than it reveals the true workload.

Recommendation #5 — JVM-agent-based Sentry instrumentation (no Kotlin code)

What it does. Adds server-side observability for the operations component without touching Kotlin source. The mechanism is the Sentry OpenTelemetry Java agent (sentry-opentelemetry-agent-<ver>.jar), attached at JVM startup via -javaagent:. It auto-instruments via the OpenTelemetry Java instrumentation library and ships spans, errors, and JVM metrics to Sentry.

Six signals come for free, without a single import:

| Signal | Source |
| --- | --- |
| HTTP server spans for every Ktor route | OTel Ktor/Netty instrumentation |
| DB query spans (statement, rows, duration) | OTel JDBC instrumentation around HikariCP |
| Distributed trace context | Reads sentry-trace / W3C traceparent from inbound headers and propagates outbound |
| Uncaught exceptions → Sentry issues with stack traces | OTel error capture |
| JVM runtime metrics (heap, GC, threads) | OTel JVM metrics emitter |
| Release / environment tagging | Env vars at startup |

This closes two gaps from the investigation: the “37 % silent 500” path on kanban-card/details becomes visible to Sentry without an operations code change (PDEV-490 OP3’s contract fix can land separately on its own merits), and the BFF→operations correlation becomes a single distributed trace in Sentry instead of a hand timestamp-join.

Code-side instrumentation — adding custom spans, business-context tags on the kanban endpoints, breadcrumbs on the 500 == no cards path, etc. — is naturally PDEV-490 territory and not in scope here.

Where the change lives. Three pieces, all in the operations repo build / Helm chart. No Kotlin source changes.

  1. Bundle the agent JAR in the image via Jib’s extraDirectories. A small Gradle task downloads the agent JAR (pinned by version) into src/main/jib/agent/, and Jib copies it into the image at /app/agents/sentry-otel-agent.jar:

    // build.gradle.kts
    val sentryAgentVersion = "8.5.0" // pin to a specific release
    val sentryAgent = configurations.create("sentryAgent")
    dependencies {
        "sentryAgent"("io.sentry:sentry-opentelemetry-agent:$sentryAgentVersion")
    }
    val copySentryAgent by tasks.registering(Copy::class) {
        from(sentryAgent) {
            include("sentry-opentelemetry-agent-*.jar")
            rename { "sentry-otel-agent.jar" }
        }
        into(layout.projectDirectory.dir("src/main/jib/agent"))
    }
    tasks.named("jibDockerBuild") { dependsOn(copySentryAgent) }
    tasks.named("jib") { dependsOn(copySentryAgent) }
    jib {
        extraDirectories {
            paths {
                path {
                    setFrom("src/main/jib/agent")
                    into = "/app/agents"
                }
            }
        }
        // ...existing container / from / to blocks unchanged
    }

    The JAR is ~30 MB: noticeable, but small relative to the Corretto base image. The agent is always present in every image regardless of whether Sentry is enabled, which keeps the failure modes simple (see below).

  2. Extend the JAVA_TOOL_OPTIONS composition in templates/deployment.yaml to append the Sentry -javaagent: fragment, gated on oam.performance.sentry.enabled. This is one more printf fragment appended to $opts, building on #4’s composition. Gating means the small startup tax (agent loads + bytecode rewriting, ~200–500 ms) stays out of environments that don’t use Sentry:

    env:
      # ...existing entries...
      {{- $prefix := default (include "application.artifactPrefix" .) .Values.compute.artifactPrefix }}
      {{- $heapDumpFlag := printf "-XX:HeapDumpPath=/tmp/%s-heap.hprof" $prefix }}
      {{- $jfrFlag := printf "-XX:StartFlightRecording=name=%s,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/%s-exit.jfr" $prefix $prefix }}
      {{- $sentryAgentFlag := ternary "-javaagent:/app/agents/sentry-otel-agent.jar" "" .Values.oam.performance.sentry.enabled }}
      {{- $opts := trim (printf "%s %s %s %s" (default "" .Values.compute.javaToolOptions) $heapDumpFlag $jfrFlag $sentryAgentFlag) }}
      {{- if $opts }}
      - name: JAVA_TOOL_OPTIONS
        value: {{ $opts | quote }}
      {{- end }}

    This is the #3 + #4 + #5 composition: the final form of the JAVA_TOOL_OPTIONS injection when all three recommendations are in place. When oam.performance.sentry.enabled: false, the -javaagent: fragment is the empty string; the agent JAR sits in the image but is never loaded. When compute.javaToolOptions is empty, only the values-supplied flags drop out: $heapDumpFlag and $jfrFlag are built unconditionally by the template as written, so $opts is still non-empty after trim and the if $opts guard is purely defensive. Values files contain only literal strings and per-env overrides; no {{ ... }} substitutions inside values, no tpl indirection.

  3. Add an oam: top-level section to values.yaml, with a performance: sub-block that houses Sentry (and, in the future, any other performance-monitoring concern: JFR Sentry profiling, APM agents, distributed-tracing collectors, etc.). Keeping observability under oam.performance.* separates it from the pod’s compute: (sizing) concerns and reserves room under oam: for other operational aspects (log shipping, alerting hooks, etc.) that don’t belong in the compute: block. Chart defaults:

    # values.yaml (chart defaults)
    #
    # OAM: operational-aspect configuration. Houses observability,
    # log shipping, alerting hooks, and any concern that is about how
    # the running pod is observed and managed, as opposed to compute
    # capacity (which lives under `compute:`).
    oam:
      performance:
        sentry:
          # Enabled in all four envs day-one per the blanket telemetry
          # policy. The chart's fail-soft contract (secretKeyRef:
          # optional: true on the DSN) makes this safe even before the
          # infrastructure-scoped DSN secret is provisioned; see
          # § Failure modes for what happens when enabled but DSN
          # missing.
          enabled: true
          # Environment tag. When empty (default), the deployment template
          # uses the `application.environment` helper which composes
          # `{infrastructure}-{purpose}` (e.g. `alpha001-prod`). Override
          # here with a literal string only when a particular env needs
          # something other than the helper output.
          environment: ""
          # Sample rate for traces. Default is 0.1 (10 %) for the
          # production-facing envs; dev and stage override to 1.0 in
          # their per-env values files for full debugging visibility.
          tracesSampleRate: "0.1"

    The K8s secret name (be-sentry-dsn) and key (dsn) are fixed in the deployment template, not values-driven — they are produced by an ExternalSecret declared in templates/secrets.yaml (see § Provisioning pipeline). There is no per-env override for the secret name; deviation would require breaking the “one-DSN-per-Sentry-project” model.

    Per-env enablement. Blanket telemetry policy — all four envs ship with enabled: true day one. Differentiation is via tracesSampleRate only (full sampling in the debug-oriented envs, capped in the production-facing ones to keep CPU overhead small). The chart’s fail-soft contract (secretKeyRef: optional: true on the DSN; see § Failure modes) makes it safe to enable everywhere before the upstream AWS SM secret exists — the Sentry agent simply stays inert until the K8s secret materializes.

    # values-dev.yaml
    oam:
      performance:
        sentry:
          enabled: true
          tracesSampleRate: "1.0"

    # values-stage.yaml
    oam:
      performance:
        sentry:
          enabled: true
          tracesSampleRate: "1.0"

    # values-demo.yaml
    oam:
      performance:
        sentry:
          enabled: true
          tracesSampleRate: "0.1" # 10 %, production-facing demo env

    # values-prod.yaml
    oam:
      performance:
        sentry:
          enabled: true
          tracesSampleRate: "0.1" # 10 % to keep agent CPU overhead small

    The deployment template injects two env-var groups:

    • OTEL_SERVICE_NAME — always set, outside the Sentry gate. The component name is a property of the deployed service, not a Sentry-specific knob. Setting it unconditionally means any future OTel-aware tooling (Datadog APM, Honeycomb, self-hosted Tempo, etc.) automatically tags events correctly without requiring a new gate. Sentry consumes it as the service.name tag — the dimension every alert rule and dashboard widget scopes by in the single-project platform-be design (see sentry-configuration.md § Component differentiation via service.name).
    • SENTRY_* env vars — gated on oam.performance.sentry.enabled. Use the application.environment helper for the env-tag default and a fixed secretKeyRef to the be-sentry-dsn K8s secret produced by the ExternalSecret (see § Provisioning pipeline).

    # templates/deployment.yaml, alongside the existing env entries
    - name: OTEL_SERVICE_NAME
      value: {{ include "application.name" . | quote }}
    {{- if .Values.oam.performance.sentry.enabled }}
    {{- $sentryEnv := default (include "application.environment" .) .Values.oam.performance.sentry.environment }}
    - name: SENTRY_DSN
      valueFrom:
        secretKeyRef:
          name: be-sentry-dsn
          key: dsn
          optional: true # so pod still starts if upstream lags
    - name: SENTRY_ENVIRONMENT
      value: {{ $sentryEnv | quote }}
    - name: SENTRY_TRACES_SAMPLE_RATE
      value: {{ .Values.oam.performance.sentry.tracesSampleRate | quote }}
    - name: SENTRY_RELEASE
      value: {{ printf "%s@%s" (include "application.name" .) .Chart.AppVersion | quote }}
    {{- end }}

  4. Add the be-sentry-dsn ExternalSecret to templates/secrets.yaml, alongside the existing ExternalSecret resources. Plain-string projection — no JSON, no template indirection. The remoteRef.key is composed from global.infrastructure and points at the infrastructure-scoped AWS Secrets Manager secret provisioned by the infrastructure layer (see § Provisioning pipeline):

    # templates/secrets.yaml, inside the `featureFlag.hasExternalSecrets` block
    {{- if .Values.oam.performance.sentry.enabled }}
    ---
    apiVersion: external-secrets.io/v1
    kind: ExternalSecret
    metadata:
      name: be-sentry-dsn
      labels:
        {{- include "application.labels" . | nindent 4 }}
        app: {{ include "application.name" . | quote }}
    spec:
      refreshInterval: 1h
      secretStoreRef:
        name: {{ include "application.name" . | quote }}
      target:
        deletionPolicy: Delete
      data:
        - secretKey: dsn
          remoteRef:
            key: "{{ .Values.global.infrastructure }}-SentryDsn"
            version: "AWSCURRENT"
    {{- end }}

    The ExternalSecret is gated on oam.performance.sentry.enabled so it is not even synthesized in environments that are not yet using Sentry — keeps the namespace clean and avoids ESO log noise about a not-yet-existing source secret. (The featureFlag.hasExternalSecrets: false branch — used by helmInstallToLocal — does not need a vanilla-K8s-secret fallback for Sentry; local installs run with oam.performance.sentry.enabled: false.)
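
The JAVA_TOOL_OPTIONS composition from step 2 can be checked outside Helm with a small shell model. This is a sketch under assumptions, not the chart's code: compose_java_tool_options is a hypothetical helper, and the base flags and prefix passed in are example inputs. It mirrors the template's printf "%s %s %s %s" plus trim logic so the effect of the Sentry gate on the final string is easy to see.

```shell
# Hypothetical model of the template's $opts composition.
# Args: base flags (compute.javaToolOptions), artifact prefix, sentry enabled (true/false).
compose_java_tool_options() {
  base="$1"; prefix="$2"; sentry_enabled="$3"
  heap="-XX:HeapDumpPath=/tmp/${prefix}-heap.hprof"
  jfr="-XX:StartFlightRecording=name=${prefix},settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/${prefix}-exit.jfr"
  agent=""
  if [ "$sentry_enabled" = "true" ]; then
    agent="-javaagent:/app/agents/sentry-otel-agent.jar"
  fi
  opts="${base} ${heap} ${jfr} ${agent}"
  # Stand-in for Sprig's trim: strip leading and trailing spaces only.
  opts="${opts#"${opts%%[! ]*}"}"
  opts="${opts%"${opts##*[! ]}"}"
  printf '%s\n' "$opts"
}

# Same inputs, Sentry gate on and off.
compose_java_tool_options "-XX:MaxRAMPercentage=75.0" "prod-operations" true
compose_java_tool_options "-XX:MaxRAMPercentage=75.0" "prod-operations" false
```

The two printed lines differ only in the trailing -javaagent: fragment, which is the behaviour the gate is meant to provide.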

Failure modes: what happens if the Sentry DSN is not provisioned when the pod starts. This is the question that determines whether shipping with enabled: true defaults is safe, that is, what happens when an environment is switched on before its DSN secret exists. The behaviour falls into six cases, only one of which is fail-loud:

| Scenario | JVM start | Pod traffic | What you’ll see |
| --- | --- | --- | --- |
| oam.performance.sentry.enabled: false (an explicit override, e.g. local installs; not the chart default) | Normal | Normal | No -javaagent: flag rendered; agent JAR sits in the image unused; no Sentry-related log lines. Zero overhead. |
| oam.performance.sentry.enabled: true, DSN secret missing or empty | Normal | Normal | Agent loads, attempts init, reads empty SENTRY_DSN and disables itself. Logs (at INFO/WARN) something like "Sentry SDK disabled because no DSN was set". Spans/errors are dropped silently. JVM continues. |
| oam.performance.sentry.enabled: true, DSN malformed (not a valid Sentry URL) | Normal | Normal | Agent loads, logs an ERROR: Invalid DSN, then disables itself. Same outcome as missing DSN. |
| oam.performance.sentry.enabled: true, DSN valid but Sentry temporarily unreachable | Normal | Normal | Events queue in memory (default ~100 events), then drop oldest. The SDK respects Retry-After headers. JVM unaffected. |
| oam.performance.sentry.enabled: true, DSN revoked / project deleted | Normal | Normal | Sentry transport gets 401/403; SDK applies exponential backoff. JVM unaffected. |
| oam.performance.sentry.enabled: true, agent JAR missing from /app/agents/ | Pod fails to start with Error opening zip file or JAR manifest missing | n/a | This is the only fail-loud path. Mitigated by bundling the JAR at image build time (step 1), so the JAR’s existence is guaranteed at every pod start. |

Operational property: Sentry is fail-soft. The only path that crashes the pod is the agent JAR itself being missing, which the build-time bundling closes deterministically. Missing/invalid DSN just produces an inert agent — the pod starts and serves traffic normally with one diagnostic line in its log. This makes it safe to flip oam.performance.sentry.enabled: true in a values file before the DSN secret has been created: the worst case is “Sentry is silently not collecting yet.”

The DSN value flows through four layers from source to pod. Each layer is independently deployable; the chart’s fail-soft properties make lagging layers safe.

1Password (Arda-SystemsOAM)
    op://Arda-SystemsOAM/be-sentry-dsn/dsn    [source of record]
        ▼  (amm.sh at infra deploy: op read + ::add-mask:: + --parameters)
AWS Secrets Manager (per infrastructure, plain string)
    Alpha001-SentryDsn = "https://...@.../<project-id>"
    Alpha002-SentryDsn = "https://...@.../<project-id>"
        ▼  (ESO sync via the existing SecretStore + secretReader IRSA role)
K8s Secret (per operations namespace)
    be-sentry-dsn { dsn: <string> }
        ▼  (secretKeyRef in the deployment template)
Pod env
    SENTRY_DSN ← secretKeyRef(be-sentry-dsn, dsn)

Resource breakdown, per infrastructure (Alpha001 shown; Alpha002 mirrors):

| Layer | Resource | Cardinality in Alpha001 |
| --- | --- | --- |
| 1P | op://Arda-SystemsOAM/be-sentry-dsn/dsn | 1 total (workspace-wide source of record) |
| AWS SM | Alpha001-SentryDsn (plain SecretString) | 1 per infrastructure |
| K8s ExternalSecret | be-sentry-dsn | 1 per operations namespace (so 2 in Alpha001: prod-operations, demo-operations) |
| K8s Secret | be-sentry-dsn (key dsn) | 1 per operations namespace (materialized by ESO) |

The IAM piece is already in place: the {Infrastructure}-SecretsManagerReadRole (used by the secretReader IRSA ServiceAccount that ESO assumes) has a managed policy {Infrastructure}-ReadSecrets whose resource scope is arn:aws:secretsmanager:{region}:{account}:secret:{Infrastructure}-*. That wildcard already covers {Infrastructure}-SentryDsn alongside the partition-scoped keys. No IAM change is required.
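
The wildcard reasoning can be illustrated with a shell glob, which behaves like an IAM * for this simple prefix case. In the sketch below, arn_matches is a hypothetical helper, the region and account ID are placeholders, and the six-character suffix that Secrets Manager appends to secret ARNs is a made-up value.

```shell
# $1 = policy resource pattern, $2 = ARN to test.
# The unquoted $1 is deliberately treated as a glob pattern by `case`.
arn_matches() {
  case "$2" in
    $1) echo "covered" ;;
    *)  echo "not-covered" ;;
  esac
}

pattern="arn:aws:secretsmanager:us-east-1:111122223333:secret:Alpha001-*"
arn_matches "$pattern" "arn:aws:secretsmanager:us-east-1:111122223333:secret:Alpha001-SentryDsn-AbCdEf"
arn_matches "$pattern" "arn:aws:secretsmanager:us-east-1:111122223333:secret:Alpha002-SentryDsn-AbCdEf"
```

The first call prints covered and the second not-covered: the new secret falls under the existing prefix scope, while a secret from the other infrastructure does not.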

(Matches PDEV-492 and the infrastructure work scoped in infrastructure-improvements.md §4.)

  1. (done) Provision the Sentry project (platform-be) in the arda-systems org. Capture the single DSN (one DSN per project, used across all four envs by design — see sentry-configuration.md § Storage Convention).
  2. (done) Add the DSN to the workspace-wide Arda-SystemsOAM 1Password vault as be-sentry-dsn/dsn. Not the partition vaults — Sentry DSNs are common across envs by design, so they belong in the workspace vault.
  3. Infrastructure CDK: add an AWS::SecretsManager::Secret resource named {Infrastructure}-SentryDsn to the existing infrastructure CDK app (mounted by src/main/cdk/instances/{Infrastructure}/infra.ts). Use a cdk.CfnParameter (noEcho: true, minLength: 1) and pass it to the SM resource via cdk.SecretValue.cfnParameter(...) so the synthesized template never embeds the DSN plaintext. Export the secret ARN as {Infrastructure}-I-SentryDsnArn. The legacy src/main/cfn/partitionSecrets.cfn.yaml template is not the model — CDK is the preferred IaC technology in the infrastructure repo. Full design in infrastructure-improvements.md §4.
  4. amm.sh extension (infra step, before the existing infrastructure CDK deploy at lines 304-342): read op://Arda-SystemsOAM/be-sentry-dsn/dsn, mask via ::add-mask:: when running under GitHub Actions, append --parameters SentryDsn=$SENTRY_DSN to infrastructure_cdk_arguments, and let the existing npx cdk ... deploy call apply it. The -I- / -API- conventions are CFN-export-name markers only and never appear in AWS resource names; only the export ({Infrastructure}-I-SentryDsnArn) carries the -I-.
  5. Operations chart (this ticket): the ExternalSecret declared in step 4 of the previous subsection materializes the K8s secret be-sentry-dsn in each operations namespace.
  6. Ship the chart with oam.performance.sentry.enabled: true in all four envs (blanket telemetry policy). Per-env tracesSampleRate differs: 1.0 in dev and stage, 0.1 in demo and prod.

If steps 3–5 lag behind step 6, the pod still starts; the only effect is no Sentry data flow for that env until the SM secret + ExternalSecret reconcile. No outage, no rollback.
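
Step 4 can be sketched as a small shell function. This is an illustration under assumptions, not the actual amm.sh change: read_sentry_dsn and append_sentry_dsn_parameter are hypothetical names, and infrastructure_cdk_arguments is treated here as a plain space-separated string, which may not match how amm.sh holds it. The op read command and the ::add-mask:: workflow command are the standard 1Password CLI and GitHub Actions mechanisms named in the step.

```shell
# Hypothetical wrapper so the 1Password lookup can be stubbed in tests.
read_sentry_dsn() {
  op read "op://Arda-SystemsOAM/be-sentry-dsn/dsn"
}

append_sentry_dsn_parameter() {
  dsn="$(read_sentry_dsn)" || return 1
  if [ -z "$dsn" ]; then
    # Fail early; the CfnParameter's minLength: 1 would reject this anyway.
    echo "Sentry DSN is empty; refusing to deploy with a blank parameter" >&2
    return 1
  fi
  # Mask the value in workflow logs when running under GitHub Actions.
  if [ "${GITHUB_ACTIONS:-}" = "true" ]; then
    echo "::add-mask::${dsn}"
  fi
  # Assumption: the CDK arguments are accumulated as a single string.
  infrastructure_cdk_arguments="${infrastructure_cdk_arguments:-} --parameters SentryDsn=${dsn}"
}
```

Keeping the lookup in its own function means the step can be exercised locally with a stubbed DSN, without a 1Password session.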

Verification after deploy. Pick the target env (here: prod) and sample the logs. Substitute CONTEXT/NS for other envs.

CONTEXT=Alpha001      # Alpha001 for prod/demo, Alpha002 for stage/dev
NS=prod-operations    # ${ENV}-${COMPONENT}

# Agent loaded and initialised
kubectl --context "$CONTEXT" -n "$NS" logs \
  -l app=operations --tail 500 \
  | grep -E 'sentry|JAVA_TOOL_OPTIONS'

Expected output when oam.performance.sentry.enabled: true and a valid DSN is in the K8s secret. The JAVA_TOOL_OPTIONS line lists every flag composed by the deployment template — verbatim, in the order the template emits them:

Picked up JAVA_TOOL_OPTIONS: -Xlog:gc*:stdout:time,uptime,level,tags -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/prod-operations-heap.hprof -XX:StartFlightRecording=name=prod-operations,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/prod-operations-exit.jfr -javaagent:/app/agents/sentry-otel-agent.jar
[sentry] INFO Initializing Sentry, dsn=https://o<...>@sentry.io/<project>
[sentry] INFO OpenTelemetry instrumentation loaded

Expected output when enabled but the DSN secret is missing or empty:

Picked up JAVA_TOOL_OPTIONS: -Xlog:gc*:stdout:time,uptime,level,tags -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/prod-operations-heap.hprof -XX:StartFlightRecording=name=prod-operations,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/prod-operations-exit.jfr -javaagent:/app/agents/sentry-otel-agent.jar
[sentry] WARN Sentry SDK disabled because no DSN was set
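
The two expected outputs above can be told apart mechanically. The following is a minimal sketch, assuming the log wording shown above (the exact strings may differ between Sentry SDK versions); classify_sentry_state is a hypothetical helper name, and it reads log text on stdin.

```shell
# Reads pod log text on stdin and prints one of:
# agent-not-loaded, inert-no-dsn, active, unknown.
classify_sentry_state() {
  logs="$(cat)"
  if ! printf '%s\n' "$logs" | grep -qF -- '-javaagent:/app/agents/sentry-otel-agent.jar'; then
    echo "agent-not-loaded"
  elif printf '%s\n' "$logs" | grep -qF 'disabled because no DSN'; then
    echo "inert-no-dsn"
  elif printf '%s\n' "$logs" | grep -qF 'Initializing Sentry'; then
    echo "active"
  else
    echo "unknown"
  fi
}

# Demo on a shortened version of the "DSN missing" sample.
printf '%s\n' \
  'Picked up JAVA_TOOL_OPTIONS: -javaagent:/app/agents/sentry-otel-agent.jar' \
  '[sentry] WARN Sentry SDK disabled because no DSN was set' \
  | classify_sentry_state
```

The demo prints inert-no-dsn. In practice the input would be the kubectl logs command from the verification step, piped into classify_sentry_state instead of grep.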

Then in the Sentry web UI, the platform-be project should start receiving:

  • Performance > Transactions: spans for GET /v1/kanban/kanban-card/for-item/{itemEId} etc.
  • Issues: any uncaught exceptions thrown by the Ktor handlers.
  • A trace from the frontend should now extend through API Gateway into operations (visible by clicking the trace ID from a recent arda-frontend event).

Risk. Low. The agent’s runtime overhead at 10 % trace sampling in prod is 1–3 % CPU and ~10–30 MB additional heap residue, within the headroom the 2 vCPU / 4 GiB prod pod sizing provides. Dev and stage at 100 % sampling pay more overhead (5–8 % CPU) but on non-prod traffic that’s acceptable for the catch-misconfiguration benefit; demo at 10 % matches prod. The blanket-on policy is a deliberate choice to surface integration / misconfiguration problems early — we can dial back any env later via a one-line values change if a specific overhead surfaces.

Quota / billing. Operations transaction volume is significant (2,921 query-by-item calls per day at the time of the investigation; likely more across all routes). At 10 % prod + 10 % demo + 100 % dev + 100 % stage sampling, the bulk of traced volume comes from the production-facing envs at modest rates, well within typical Sentry plan limits, but worth a one-line confirmation with whoever owns Sentry billing before deploy.
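
As a rough check on the sampling arithmetic, the one volume figure given above can be run through the prod rate. This is only a sketch: traced_per_day is a hypothetical helper, the 30-day month is an assumption, and volumes for other routes and environments are not in this document, so they are left out.

```shell
# traced_per_day CALLS_PER_DAY SAMPLE_PERCENT -> sampled transactions per day
# (integer arithmetic, rounds down)
traced_per_day() {
  echo $(( $1 * $2 / 100 ))
}

daily="$(traced_per_day 2921 10)"   # query-by-item route in prod at 10 %
echo "traced per day:     ${daily}"
echo "traced per 30 days: $(( daily * 30 ))"
```

At 10 % the query-by-item route contributes roughly 292 sampled transactions per day, or under 9,000 over 30 days; the billing confirmation should use the full per-route volumes once they are known.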

Effort and ship order. Small. The wiring is the same shape as the GC-log / JFR plumbing in #3 and #4: a values change, a deployment-template extension, and a build-system addition (Gradle task + Jib extraDirectories). No Kotlin code.

Suggested order. Everything below ships in one PR (PDEV-488). What differs is when each piece activates per environment:

  1. In the PR: #1, #2 (sizing + HPA on in all four envs), #3 (GC logs + JVM memory), and the Helm wiring for #4 and #5. Values files at this point: all four envs have compute.javaToolOptions populated (so GC logs + JVM memory + JFR are active everywhere from merge); oam.performance.sentry.enabled is true in all four envs with per-env tracesSampleRate (1.0 for dev and stage, 0.1 for demo and prod).
  2. After the PR merges and deploys, all four envs activate GC logs + JVM memory + JFR + Sentry agent. The Sentry agent is fail-soft if the upstream AWS SM secret isn’t yet provisioned (secretKeyRef: optional: true keeps the pod healthy with an inert agent until the secret reconciles).
  3. No follow-up PR is required for telemetry activation. If a specific env later needs to be dialed back (e.g. quota concern, noise from an integration), flip its tracesSampleRate or enabled flag with a one-line values change.

Two related items that fall outside this recommendation but follow naturally:

  • The Sentry org / vault provisioning is tracked by PDEV-492 — one Sentry project (platform-be), one DSN (shared across all envs by design), one workspace-vault entry. Owners: whoever manages the Sentry org and Arda-SystemsOAM today.
  • PDEV-490 will benefit from this work landing first: once Sentry is visible on the operations side, OP3 (fix 500 == no cards) becomes much easier to triage because the silent stack traces become loud stack traces in Sentry.