PDEV-442 — Operations pod capacity (EKS / Fargate)
Captured during the PDEV-442 investigation while looking for “signs of stress on the EKS nodes during high-latency periods”. Fargate runs each pod inside its own micro-VM, so the pod’s cgroup quota is the node budget — there’s no separate node layer to inspect.
Configured resources
Deployment operations in namespace prod-operations on EKS cluster
Alpha001-eks-cluster. Two replicas on Fargate.

    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1
        memory: 2Gi

Total prod operations budget on paper: 2 vCPU / 4 GiB across both pods.
Effective vCPU on Fargate
Inside both pods, cgroup v1 reports:
| Pod | cpu.cfs_quota_us | cpu.cfs_period_us | Effective vCPU |
|---|---|---|---|
| operations-78bc575484-9whbd | 50000 | 100000 | 0.5 |
| operations-78bc575484-xn74c | 50000 | 100000 | 0.5 |
Despite the deployment declaring limits.cpu = 1, Fargate’s cgroup
quota is set to 0.5 vCPU per pod. Fargate appears to honor the
request (500m), not the limit. The component-wide CPU budget for
prod is therefore 1 vCPU across both pods.
Sanity check needed. Confirm against the Fargate task definition whether the task is being sized to the request or to the limit on this deployment. Either way, the cgroup is the source of truth and it caps each container at 0.5 vCPU.
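The "Effective vCPU" column is the ratio of the two cgroup v1 values. A minimal sketch of that arithmetic follows, assuming the quota and period have already been read from the pod; the function name effective_vcpu is illustrative and not part of any tooling used in this investigation.

```shell
#!/usr/bin/env bash
# Effective vCPU = cpu.cfs_quota_us / cpu.cfs_period_us (cgroup v1).
# A quota of -1 means "no CFS limit", so report that instead of dividing.
effective_vcpu() {
  local quota_us=$1 period_us=$2
  if [ "$quota_us" -lt 0 ]; then
    echo "unlimited"
    return 0
  fi
  awk -v q="$quota_us" -v p="$period_us" 'BEGIN { printf "%.2f\n", q / p }'
}

effective_vcpu 50000 100000   # values from the table above; prints 0.50
```

After the next rollout the same check can be repeated with the values read from the pod's cgroup v1 cpu controller directory (typically /sys/fs/cgroup/cpu, though the exact mount point is an assumption here). A result of 2.00 would confirm that the new request/limit pair took effect.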
CPU throttling history (cgroup cpu.stat)
Captured over each pod’s ~6.3-day uptime:
| Pod | nr_periods | nr_throttled | throttled_time | % periods throttled |
|---|---|---|---|---|
| 9whbd | 1,193,454 | 523 | 35.35 s | 0.044 % |
| xn74c | 1,284,586 | 799 | 48.25 s | 0.062 % |
Throttling is occurring. The aggregate throttled-time is modest in absolute terms (35–48 s per pod over 6 days), but the events cluster during high-load periods (12:00–21:00 UTC). Even rare 60–100 ms throttle waits stack with each other under concurrent load and materialise as p99 tail latency.
nr_bursts = 0 on both pods: the cgroup is not configured with CPU
burst, so the pod cannot use credits earned during idle periods to
absorb spikes.
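The percentages in the table, and the size of a typical throttle wait, follow directly from the three cpu.stat counters. The sketch below reproduces them; throttle_summary is a hypothetical helper name, and it takes throttled time in seconds as shown in the table (cgroup v1 reports throttled_time in nanoseconds, so convert first).

```shell
#!/usr/bin/env bash
# throttle_summary NR_PERIODS NR_THROTTLED THROTTLED_SECONDS
# Prints the share of enforcement periods that were throttled (%) and the
# mean wait per throttled period (ms).
throttle_summary() {
  awk -v periods="$1" -v throttled="$2" -v secs="$3" 'BEGIN {
    printf "%.3f%% throttled, %.1f ms mean wait\n",
      100 * throttled / periods, 1000 * secs / throttled
  }'
}

throttle_summary 1193454 523 35.35   # pod 9whbd
throttle_summary 1284586 799 48.25   # pod xn74c
```

The mean wait comes out at roughly 60 to 68 ms per throttled period, which is where the lower end of the 60–100 ms range quoted above comes from.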
Memory posture (cgroup v1, live snapshot)
| Pod | usage_in_bytes | max_usage_in_bytes | limit_in_bytes | memory.failcnt |
|---|---|---|---|---|
| 9whbd | 441 MB | 442 MB | 2 GiB | 0 |
| xn74c | 440 MB | 445 MB | 2 GiB | 0 |
JVM VmRSS 432 MB, 47 threads, no swap, no pgmajfault pressure (26
total over 6 days). max_usage_in_bytes is essentially the current
usage — memory has been steady, never near the limit. Memory is not
the constraint.
Load-related signals
- Live loadavg: 0.02 0.03 0.00 (off-peak; no current pressure).
- JVM peak virtual memory (VmPeak) 3.27 GB: virtual address space including mmap regions, not resident. Cgroup enforces against RSS.
- No OutOfMemory, OOMKilled, GC pause messages or throttle markers in the pod stdout for the 3-h window (note: JVM default GC logging is off; explicit -Xlog:gc* is required to capture pauses).
How this reconciles with the measured latency
A single request that consumes 300 ms of CPU work needs ~600 ms
wall-clock on 0.5 vCPU with no contention. Under the /items workload
(12 concurrent kanban requests per page load against 2 pods × 0.5 vCPU =
1 vCPU of capacity) each request can degrade to 1–2 s. That matches
the pod-level latency we measure (avg 0.85–1.0 s, p95 1.6–2.0 s).
During business hours, with multiple users on /items simultaneously,
the same queueing arithmetic is consistent with the p99 of 7–12 s that
API Gateway reports.
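The arithmetic behind these figures can be sketched as a simple processor-sharing model. This is an idealisation, not a measurement: the function names are illustrative, and the 100 ms per-request CPU cost in the second call is purely an assumed value, since the real per-request cost on /items has not been profiled yet.

```shell
#!/usr/bin/env bash
# wall_clock_ms CPU_MS VCPU
# Wall-clock time for CPU_MS of work on a fractional vCPU with no contention.
wall_clock_ms() {
  awk -v cpu="$1" -v vcpu="$2" 'BEGIN { printf "%.0f\n", cpu / vcpu }'
}

# drain_ms REQUESTS CPU_MS TOTAL_VCPU
# Time until REQUESTS concurrent requests have all finished when they share
# TOTAL_VCPU evenly (idealised processor sharing, no other overhead).
drain_ms() {
  awk -v n="$1" -v cpu="$2" -v vcpu="$3" 'BEGIN { printf "%.0f\n", n * cpu / vcpu }'
}

wall_clock_ms 300 0.5   # the single-request example: prints 600
drain_ms 12 100 1.0     # 12 requests at an ASSUMED 100 ms CPU each: prints 1200
```

The point of the model is the scaling: latency grows linearly with the page's fan-out and inversely with the CPU budget, so doubling the vCPU or halving the per-request CPU cost each halve the drain time.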
The DB is fast (PI: 0.49 avg active sessions, 100 % CPU class, 19–22 % writer CPU). The cost is in the pod’s in-process work — Exposed ORM materialisation of wide bitemporal rows, JSON serialisation, the bitemporal “as-of” filter — running on a CPU budget that is too small to absorb the page’s fan-out.
JVM Memory Evaluation
Diagnostic audit of how the operations JVM uses the container memory budget today. Motivates the memory and GC flags in Recommendation #3 and the prod sizing decisions in PDEV-488.
Audit of operations/build.gradle.kts shows the runtime JVM has
zero memory-related flags set anywhere:
| Source | Value | Purpose |
|---|---|---|
| application.applicationDefaultJvmArgs (line 26) | ["-Dio.ktor.development=$isDevelopment"] | config flag, no memory tuning |
| jib.container.jvmFlags (line 372) | ["-Darda.config.location=/app/conf:/app/secret"] | config flag, no memory tuning |
| tasks.test { jvmArgs("--add-opens=...") } (line 599) | test JVM only | not in the runtime container |
Corretto 21 with -XX:+UseContainerSupport (on by default since JDK 10)
reads the cgroup memory limit and picks max heap = 25 % of the limit
by default (MaxRAMPercentage=25.0). With current and proposed pod
sizing:
| Env | Container memory limit | Default max heap (25 %) | Unused budget |
|---|---|---|---|
| Today (all envs) | 2 GiB | ~512 MiB | ~1 GiB |
| Prod after PDEV-488 | 4 GiB | ~1 GiB | ~2.5 GiB |
Moving prod to 4 GiB worsens the under-utilisation ratio unless we also adjust the heap percentage. The measured RSS of ~440 MB (§ Memory posture above) is consistent with the JVM using a small fraction of its 512 MiB ceiling — heap + non-heap + native combined — well within the current limit. So today’s “memory is comfortable” reading is correct, but it’s also telling us the JVM isn’t using the budget we’re paying for.
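The heap ceilings in the table are just the container limit multiplied by MaxRAMPercentage. A small sketch of that calculation, under the assumption that the JVM applies the percentage to the cgroup limit without further rounding (in practice it aligns the result slightly); max_heap_mib is an illustrative name.

```shell
#!/usr/bin/env bash
# max_heap_mib LIMIT_MIB PERCENT
# Approximate heap ceiling the JVM derives from the container memory limit.
max_heap_mib() {
  awk -v limit="$1" -v pct="$2" 'BEGIN { printf "%d\n", limit * pct / 100 }'
}

max_heap_mib 2048 25   # today, default MaxRAMPercentage: prints 512
max_heap_mib 4096 25   # prod after PDEV-488, default: prints 1024
max_heap_mib 4096 75   # prod after PDEV-488 with MaxRAMPercentage=75.0: prints 3072
```

The last line is the ~3 GiB heap cap referred to later for the prod sizing.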
Recommended JVM memory management flags
All six flags below are intended to be shipped together. Five of them
are in the compute.javaToolOptions values string; the sixth
(-XX:HeapDumpPath) is path-dependent and is appended by
templates/deployment.yaml using the application.artifactPrefix
helper (see § Reusable Helm helpers under Evaluation of Changes).
| Flag | Value | Why |
|---|---|---|
| -XX:MaxRAMPercentage | 75.0 | Let heap grow to ~75 % of the container limit. Leaves the remaining ~25 % for code cache, metaspace, thread stacks, direct buffers, and JVM-native memory, which is enough breathing room that JVM memory growth won’t trip the K8s OOM-killer. Standard for containerised Java. |
| -XX:InitialRAMPercentage | 50.0 | Size the initial heap at 50 % of the container limit (two-thirds of the 75 % maximum) instead of growing it lazily from the small default. Reduces early-life GC churn and improves first-request latency after pod start. |
| -XX:+ExitOnOutOfMemoryError | (flag) | If heap OOM ever fires, terminate the JVM immediately so K8s restarts the pod cleanly. Without this the JVM can limp on for many seconds with degraded behaviour before the liveness probe kills it. |
| -XX:+HeapDumpOnOutOfMemoryError | (flag) | If OOM fires, write a heap dump for post-mortem. |
| -XX:HeapDumpPath | /tmp/{prefix}-heap.hprof | Path of the heap dump produced on OOM. See Notes below for how {prefix} is composed and why this flag is appended by the deployment template instead of declared in values. |
| -Xlog:gc*:stdout:time,uptime,level,tags | (target) | GC logging to container stdout, shipped to CloudWatch via the cluster’s existing Fluent Bit path. Closes the silent-GC blind spot. |
Notes:

- {prefix} composition. -XX:HeapDumpPath is not a static flag; it embeds an environment-and-component identifier (e.g. prod-operations) so heap dumps from different pods don’t collide when pulled to a workstation. The identifier is computed at chart render time by the application.artifactPrefix helper (defined alongside the existing application.* helpers in _helpers.tpl). For this reason the flag is appended by templates/deployment.yaml, not declared in any values*.yaml. Per-env override is available via compute.artifactPrefix.
- Heap-dump retrieval is operator-driven. Fargate ephemeral storage is ~20 GiB by default, which is plenty for one heap dump (heap caps at ~3 GiB on the prod sizing). The dump persists only until pod recycle; retrieve it before then with kubectl --context <ctx> -n <ns> exec <pod> -- sh -c 'cat /tmp/<prefix>-heap.hprof' > local.hprof.
- GC logging targets stdout deliberately. The cluster already ships container stdout to CloudWatch via Fluent Bit; the alternative (writing to a file in the container) would lose data on pod restart and require a separate shipping pipeline. See the “Stdout is sufficient” subsection of Recommendation #3 for the full reasoning.
Not recommended (intentional non-choices)
- -Xmx / -Xms as absolute values: using percentages keeps the flags portable across the per-env memory limits (0.5 vCPU/2 GiB dev through 2 vCPU/4 GiB prod) without per-env overrides.
- Changing the GC. G1 is the default and is appropriate for this workload (mixed allocation, latency-sensitive). ZGC and Shenandoah reduce pause times further but have larger heap overhead and are worth considering only after a JFR snapshot shows GC pauses as the primary contributor to tail latency.
- -XX:+UseStringDeduplication: would save heap on the duplicated Item-name / supplier strings in KanbanCardDetails payloads, but the savings are small relative to the heap budget and the better fix is OP4 (slim down the response shape) on the operations side (PDEV-490).
- -XX:NativeMemoryTracking=summary: useful for one-off diagnosis of where non-heap memory goes, but adds steady overhead. Enable ad-hoc via a debug values file, don’t ship in prod by default.
Recommendations
In order of impact, lowest effort first:

1. Align requests.cpu with limits.cpu. On Fargate, the request determines the task size. Setting request 500m / limit 1 yields 0.5 vCPU effective. Use request 2 / limit 2 (or higher), and verify the cgroup quota changes accordingly after the next rollout.
2. Add an HPA with a CPU target (e.g. 60 %). Today the deployment is pinned at 2 replicas regardless of load. The /items fan-out is a classic case for horizontal scaling.
3. Enable JVM GC logging (-Xlog:gc*:stdout:time,uptime,level,tags) shipped to CloudWatch via the existing stdout → Fluent Bit path, so we can observe pauses during peak hours.
4. Capture a JFR snapshot under peak load to find the dominant in-process cost (Exposed materialisation vs. JSON serialisation vs. bitemporal logic). async-profiler is not viable on Fargate (see Recommendation #4 for why); JFR is the Fargate-compatible replacement.
5. Attach a JVM-agent-based Sentry instrumentation (Sentry’s OpenTelemetry Java agent) with no Kotlin source changes. Provides server spans, JDBC spans, distributed tracing across SPA → BFF → operations, and JVM runtime metrics. Wired through the same JAVA_TOOL_OPTIONS mechanism as #3 and #4; runtime configuration under a new oam.performance.sentry.* Helm values block. Gated off by default until the Sentry project + the single DSN are provisioned (PDEV-492).
6. Reconsider the Fargate choice for the operations component. EC2-backed Karpenter-managed nodes would allow CPU burst and a denser scheduling profile; today the per-pod overhead of a Fargate micro-VM is paying for capacity that mostly sits idle off-peak.

   Status: explicitly deferred. Not in scope for PDEV-488 or any sibling sub-issue under PDEV-442. Revisit when system scale or sustained-load patterns make the cost-of-Fargate calculus material; no concrete plan at this point.
Evaluation of Changes
Recommendation #1 + #2 — in flight
Tracked by PDEV-488. The Helm-chart changes that satisfy #1 and #2 are scoped to that ticket; this section summarises what they entail:

- compute: block in values.yaml with per-env requests-equal-limits Fargate-valid sizes (dev / stage / demo 1 vCPU / 2 GiB; prod 2 vCPU / 4 GiB). All four envs share the same 2 GiB-per-vCPU memory ratio.
- HPA template shipped with compute.hpa.enabled: true in all four envs day one. The EKS metrics-server addon (PDEV-491) is the runtime prerequisite for the HPA to actually scale; if it lags behind PDEV-488 the HPA stays in a <unknown>/60% state and the deployment holds at minReplicas, with no chart-side failure.
- compute.hpa.minReplicas as the single source of truth for replica count, whether the HPA is actively scaling or not.
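For intuition on what the 60 % CPU target means once metrics-server is available, the standard Kubernetes HPA rule is desired = ceil(currentReplicas × currentUtilization / targetUtilization). The sketch below applies only that core rule; it deliberately leaves out the controller's tolerance band and the minReplicas / maxReplicas clamp, and the utilisation figures in the examples are hypothetical rather than measured.

```shell
#!/usr/bin/env bash
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT
# Core HPA scaling rule: ceil(current * currentUtilization / targetUtilization).
# Utilisation is measured relative to the pods' CPU requests.
desired_replicas() {
  awk -v r="$1" -v cur="$2" -v target="$3" 'BEGIN {
    x = r * cur / target
    d = int(x)
    if (d < x) d++      # round up
    print d
  }'
}

desired_replicas 2 90 60   # 2 pods at a hypothetical 90 % utilisation: prints 3
desired_replicas 2 45 60   # ceil(1.5): prints 2
```

Because utilisation is relative to the request, aligning requests with limits (Recommendation #1) also changes what "60 %" means in absolute CPU terms.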
Recommendations #3, #4, and #5 require additional changes to the same
Helm chart (and one runbook in the documentation site). All three are
JVM-side changes that share a mechanism — a new
compute.javaToolOptions value plumbed into the container as
JAVA_TOOL_OPTIONS — but each delivers independent value.
Reusable Helm helpers
#3, #4, and #5 each need to construct identifiers that combine the
deployment partition, environment, and component name — for filenames,
Sentry environment tags, and JFR recording names. To keep the values
files pure YAML (no {{ }} substitutions inside values, no Helm tpl
indirection at render time), the composition is done with two new
helpers in src/main/helm/templates/_helpers.tpl, alongside the
existing application.name, application.labels, etc. After
implementation _helpers.tpl is the single source of truth; the rest
of this document refers to the helpers by name only.
Helpers to add (proposed contents — implement once, in _helpers.tpl,
then reference by include):
    {{/*
    Composes "{infrastructure}-{purpose}" (e.g. "alpha001-prod"). Used as
    the default for the Sentry environment tag and any other
    "{partition}-{env}" identifier.

    Usage: {{ include "application.environment" . }}
    */}}
    {{ define "application.environment" -}}
    {{ printf "%s-%s" .Values.global.infrastructure .Values.global.purpose }}
    {{- end }}

    {{/*
    On-disk artifact filename prefix for JFR exit recordings, heap dumps,
    and similar files that may be pulled to a workstation for analysis.
    Returns "{purpose}-{component}" (e.g. "prod-operations") so artifacts
    collected from multiple components / environments are unambiguous.

    Usage: {{ include "application.artifactPrefix" . }}
    */}}
    {{ define "application.artifactPrefix" -}}
    {{ printf "%s-%s" .Values.global.purpose (include "application.name" .) }}
    {{- end }}

Returns at a glance:
| Helper | Returns | Example output | Used for |
|---|---|---|---|
| application.environment | {infrastructure}-{purpose} | alpha001-prod | Default for SENTRY_ENVIRONMENT (#5) |
| application.artifactPrefix | {purpose}-{component} | prod-operations | Default for heap-dump filename (#3), JFR name= and exit-filename (#4) |
The subsections below reference these helpers by name. Values files
contain only literal strings and per-env overrides; substitution and
composition happen exclusively inside templates/deployment.yaml.
Recommendation #3 — Enable JVM GC logging and ship to CloudWatch
What it does. Turns on the JVM’s structured GC log (-Xlog:gc*)
so every garbage collection event emits a line we can correlate
against request-latency spikes. Today the JVM is silent about GC; when
the pod shows a tail-latency burst we can’t tell whether it was a long
G1 pause, heap-allocation pressure, or upstream slowness.
Mechanism — use JAVA_TOOL_OPTIONS, not JAVA_OPTS. The
operations image is built with Jib (gradle plugin, see
build.gradle.kts:370-385), not a Dockerfile. There is no Dockerfile
in the operations repo to tweak. Jib generates an exec-form
ENTRYPOINT — an array, not a shell command:
    ENTRYPOINT [
      "java",
      "-Darda.config.location=/app/conf:/app/secret",   // from jib.container.jvmFlags
      "-Dio.ktor.development=false",                     // from application.applicationDefaultJvmArgs
      "-cp", "@/app/jib-classpath-file",
      "io.ktor.server.netty.EngineMain"
    ]

Because there is no shell, the conventional JAVA_OPTS env var is not
expanded — setting it in the K8s pod env would be silently ignored.
The portable alternative the JVM itself reads at startup is
JAVA_TOOL_OPTIONS (part of the OpenJDK launcher contract).
Setting JAVA_TOOL_OPTIONS in the container env works with Jib’s
exec-form entrypoint without any image or build-system change, and
the JVM logs Picked up JAVA_TOOL_OPTIONS: ... to stderr on startup
— a free verification line.
Where the change lives. Two pieces in the operations repo (no
Dockerfile, no Jib config change needed).
1. Plumb a javaToolOptions value through the Helm chart. Extend the compute: block in values.yaml, default empty:

       # values.yaml
       compute:
         resources: { ... }
         hpa: { ... }
         # Extra JVM startup options applied via the JAVA_TOOL_OPTIONS env
         # var. The operations image is Jib-built with an exec-form
         # ENTRYPOINT, so JAVA_OPTS (which requires a shell) is not read.
         # JAVA_TOOL_OPTIONS is honoured by the JVM directly.
         # Disabled by default; set in per-environment values files to
         # enable GC logging, JFR, etc.
         javaToolOptions: ""

   Per-env opt-in (start with stage and prod):

       # values-prod.yaml: append at end
       compute:
         # ...existing resources/hpa blocks...
         # Static JVM flags only. The path-dependent flag
         # -XX:HeapDumpPath is appended by the deployment template using
         # the `application.artifactPrefix` helper; it renders as
         # `/tmp/prod-operations-heap.hprof` in prod, etc.
         javaToolOptions: >-
           -Xlog:gc*:stdout:time,uptime,level,tags
           -XX:InitialRAMPercentage=50.0
           -XX:MaxRAMPercentage=75.0
           -XX:+ExitOnOutOfMemoryError
           -XX:+HeapDumpOnOutOfMemoryError

   See the JVM Memory Evaluation section above for the audit that motivates this flag set and the rationale behind each flag.

2. Inject JAVA_TOOL_OPTIONS into the container in templates/deployment.yaml. The container already has an env: block (it injects DB URLs and the like). Compose the final value from the static base flags (compute.javaToolOptions) plus a path-dependent heap-dump flag built from the application.artifactPrefix helper. Per-env override of the prefix is available via compute.artifactPrefix.

   This snippet is the #3-only baseline; Recommendations #4 and #5 incrementally extend it by appending more fragments to $opts. Apply this version if you ship #3 in isolation.

       env:
         # ...existing entries...
         {{- $prefix := default (include "application.artifactPrefix" .) .Values.compute.artifactPrefix }}
         {{- $heapDumpFlag := printf "-XX:HeapDumpPath=/tmp/%s-heap.hprof" $prefix }}
         {{- $opts := trim (printf "%s %s" (default "" .Values.compute.javaToolOptions) $heapDumpFlag) }}
         {{- if $opts }}
         - name: JAVA_TOOL_OPTIONS
           value: {{ $opts | quote }}
         {{- end }}

   Values files stay pure YAML: no Go-template substitutions in values, no tpl indirection. The composition is local to the deployment template, where it belongs, and per-env overrides are plain strings.

   No Dockerfile to touch, no Jib jvmFlags change. The existing jib.container.jvmFlags = ["-Darda.config.location=..."] stay in the ENTRYPOINT (these are “always-on” baseline flags that don’t belong in per-env config). The JVM reads JAVA_TOOL_OPTIONS in addition to whatever is already on the command line, so both apply.

3. Stdout is sufficient — no Fluent Bit change. The cluster already ships container stdout to CloudWatch via Fluent Bit (log groups /Alpha001/eks-logs and /aws/eks/Alpha001-eks-cluster/fluent-bit-logs). Writing GC events to stdout (the colon-prefixed target in -Xlog:gc*:stdout:...) means they show up in the same CloudWatch log stream as the pod’s application logs. No new log group, no new IAM, no new shipping config.
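To sanity-check what the #3 composition renders to without running Helm, the same string logic can be mimicked in shell. This is only a stand-in for the template, not a replacement for helm template: compose_java_tool_options is a hypothetical helper, and printf plus sed play the role of Helm's printf and trim.

```shell
#!/usr/bin/env bash
# compose_java_tool_options BASE_FLAGS PREFIX
# Mirrors the deployment template: "<base flags> <heap-dump flag>", with
# leading and trailing whitespace trimmed.
compose_java_tool_options() {
  local base=$1 prefix=$2
  local heap_dump_flag="-XX:HeapDumpPath=/tmp/${prefix}-heap.hprof"
  printf '%s %s' "$base" "$heap_dump_flag" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Prod-style base flags (shortened for readability).
compose_java_tool_options "-XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError" "prod-operations"
echo
# Empty base flags, as in the chart default.
compose_java_tool_options "" "dev-operations"
echo
```

With an empty base the result is still the heap-dump flag on its own, so in this baseline the if $opts guard never skips the env var; it only starts to matter if the heap-dump fragment is itself made conditional.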
Verification after deploy. Pick the target env and confirm the JVM picked up the options at startup (single line emitted to stderr, captured in pod stdout):

    CONTEXT=Alpha001       # Alpha001 for prod/demo, Alpha002 for stage/dev
    NS=prod-operations     # ${ENV}-${COMPONENT}

    kubectl --context $CONTEXT -n $NS logs \
      -l app=operations --tail 500 | grep 'JAVA_TOOL_OPTIONS'

Expected line (after #3 ships in isolation; the path-dependent heap-dump flag is appended by the deployment template):

    Picked up JAVA_TOOL_OPTIONS: -Xlog:gc*:stdout:time,uptime,level,tags -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/prod-operations-heap.hprof

Then confirm GC events are flowing:

    kubectl --context $CONTEXT -n $NS logs \
      -l app=operations --tail 200 | grep '\[gc'

Should show lines like
[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 256M->130M(4096M) 18.234ms.

In CloudWatch Logs Insights:

    fields @timestamp, @message
    | filter @message like /\[gc\]/
    | parse @message /(?<dur>\d+\.\d+)ms\s*$/
    | stats avg(dur), max(dur), pct(dur, 99) by bin(5m)

To see GC pause distribution over time and overlay against the API Gateway latency spikes.
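For a quick look from a terminal before opening CloudWatch, the same trailing-duration extraction that the parse clause performs can be approximated with sed. This is a rough local equivalent, assuming GC lines end in a duration of the form 18.234ms as in the sample above; gc_pause_ms is an illustrative name.

```shell
#!/usr/bin/env bash
# gc_pause_ms: reads GC log lines on stdin and prints the trailing
# "<number>ms" duration of each line that has one (without the unit).
gc_pause_ms() {
  sed -nE 's/.*[^0-9.]([0-9]+\.[0-9]+)ms[[:space:]]*$/\1/p'
}

echo '[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 256M->130M(4096M) 18.234ms' \
  | gc_pause_ms   # prints 18.234
```

Piping the kubectl logs ... | grep '\[gc' output from the verification step through gc_pause_ms | sort -n | tail -1 gives the longest pause in the sampled window.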
Risk. Low. GC logging at info level adds ~50 bytes per event and
a few microseconds. With G1 collecting every few seconds, the overhead
is in the noise. The MaxRAMPercentage bump is more impactful — it
gives the JVM access to the memory it’s already allocated, which
should reduce GC frequency, not increase it.
Effort and ship order. XS. Can ship in the same PR as #1 and #2.
Recommendation #4 — Capture a JFR snapshot under peak load
What it does. Produces a flame graph (CPU sampling) of the operations JVM during a peak-load window, so we can attribute the in-process cost to a specific source: Exposed ORM hydration of bitemporal rows, JSON serialisation, the as-of filter logic, lock contention on a coroutine, or something else entirely. Today we know the pod is CPU-bound but we don’t know what it’s spending CPU on.
Why JFR, not async-profiler, on Fargate. Important constraint:
async-profiler's perf_events-based CPU profiling does not work on EKS
Fargate. It relies on perf_event_open, which needs either privileged
container mode or CAP_PERFMON. Fargate’s micro-VM model exposes
neither. This is why the recommendation title doesn’t mention
async-profiler: JFR is the practical choice under the current
deployment model. If we ever leave Fargate (Recommendation #6’s
deferred consideration) async-profiler becomes available.
Java Flight Recorder (JFR) is the Fargate-compatible alternative:
- Built into the JVM (free since OpenJDK 8u262 / 11+).
- Low-overhead (~1-2 %) in the profile template; safe to run continuously in production.
- Captures the same data we need: method-level CPU sampling, allocation hotspots, lock contention, GC events, IO.
- Analysed in JDK Mission Control (JMC), IntelliJ’s built-in viewer, or rendered as flame graphs via jfr2flamegraph / jfrconv (Datadog), or uploaded to Profiler.dev / Pyroscope.
Where the change lives. Two pieces in the same Helm chart we
touch for #3, plus a short runbook in the documentation site at
documentation/src/content/docs/process/sre/runbooks/.
1. Enable continuous JFR recording via JAVA_TOOL_OPTIONS. Reuses the compute.javaToolOptions plumbing from #3. The recording name and on-exit filename both use the {purpose}-{component} prefix produced by application.artifactPrefix (e.g. prod-operations), so JFR artifacts from operations / accounts-component / multiple environments are self-identifying when collected side-by-side:

       # values-prod.yaml: same javaToolOptions as #3 (static base flags only).
       compute:
         javaToolOptions: >-
           -Xlog:gc*:stdout:time,uptime,level,tags
           -XX:InitialRAMPercentage=50.0
           -XX:MaxRAMPercentage=75.0
           -XX:+ExitOnOutOfMemoryError
           -XX:+HeapDumpOnOutOfMemoryError

   The path-dependent flags, namely -XX:HeapDumpPath (from #3), -XX:StartFlightRecording=name=...,filename=... (added by this recommendation), and the Sentry -javaagent: flag (#5), are all appended by the deployment template using the application.artifactPrefix helper. At deploy time the prod render produces name=prod-operations, the demo render produces name=demo-operations, and so on. No per-env duplication of the identifier; no {{ ... }} substitution inside values files.

   Decoding the JFR flag ({prefix} is the output of application.artifactPrefix, e.g. prod-operations):

   - name={prefix}: names the recording so jcmd can refer to it and so multiple recordings collected from different partitions don’t collide in JMC / JFR analysis tools.
   - settings=profile: the higher-fidelity preset (vs. default), still under ~2 % overhead.
   - disk=true,maxsize=200M,maxage=2h: rolling 2-hour window on disk, capped at 200 MB. Old data discarded as new arrives.
   - dumponexit=true,filename=/tmp/{prefix}-exit.jfr: writes the buffered recording on graceful pod shutdown so we don’t lose data when the pod is recycled.

2. Extend the JAVA_TOOL_OPTIONS composition in templates/deployment.yaml to append the JFR flag, adding one more printf fragment to $opts. This is an incremental extension of the #3 baseline snippet:

       env:
         # ...existing entries...
         {{- $prefix := default (include "application.artifactPrefix" .) .Values.compute.artifactPrefix }}
         {{- $heapDumpFlag := printf "-XX:HeapDumpPath=/tmp/%s-heap.hprof" $prefix }}
         {{- $jfrFlag := printf "-XX:StartFlightRecording=name=%s,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/%s-exit.jfr" $prefix $prefix }}
         {{- $opts := trim (printf "%s %s %s" (default "" .Values.compute.javaToolOptions) $heapDumpFlag $jfrFlag) }}
         {{- if $opts }}
         - name: JAVA_TOOL_OPTIONS
           value: {{ $opts | quote }}
         {{- end }}

   This is the #3 + #4 composition; Recommendation #5 extends it one more step by appending a Sentry-agent fragment to $opts. If you ship #3 + #4 without #5, this is the deployment template snippet to use.

3. No image change needed. JFR is in the JVM. No new dependencies, no Dockerfile change.

4. Runbook at documentation/src/content/docs/process/sre/runbooks/jvm-profiling.md. The capture/export sequence requires kubectl and is worth documenting once. The page should follow the conventions of the existing process/sre/runbooks/ directory: frontmatter with title, description (40–300 chars), tags, domain: process, maturity, author. It is added in the documentation worktree alongside this project’s goal.md:

       # The recording is named `{purpose}-{component}` (e.g. prod-operations),
       # the output of the application.artifactPrefix Helm helper.
       # Substitute below for the cluster / partition you're targeting.
       ENV=prod
       COMPONENT=operations
       NS=${ENV}-${COMPONENT}
       REC=${ENV}-${COMPONENT}
       STAMP=$(date -u +%FT%TZ)

       # 1. Pick a pod under peak load
       POD=$(kubectl --context Alpha001 -n $NS get pod \
         -l app=$COMPONENT -o name | head -1)

       # 2. Dump a snapshot of the rolling recording without stopping it.
       #    jcmd PID 1 is the JVM (PID 1 in the container).
       kubectl --context Alpha001 -n $NS exec $POD -- \
         jcmd 1 JFR.dump name=$REC filename=/tmp/${REC}-peak-${STAMP}.jfr

       # 3. Stream it out (kubectl cp is unreliable for large files; the
       #    cat pattern matches what management/aurora-data-dump uses).
       kubectl --context Alpha001 -n $NS exec $POD -- \
         sh -c "cat /tmp/${REC}-peak-${STAMP}.jfr" > ./${REC}-peak-${STAMP}.jfr

       # 4. Analyse locally (file is self-identifying: prod-operations-peak-*.jfr)
       #    - JDK Mission Control: open the .jfr file
       #    - IntelliJ IDEA Ultimate: Run → Open Profiler Snapshot...
       #    - Flame graph: jfrconv -t cpu --collapse <file>.jfr \
       #        | flamegraph.pl > flame.svg

   The runbook should also note:

   - The snapshot must be captured during peak hours (12:00–21:00 UTC) to be representative. Off-peak snapshots will not reveal the bottleneck.
   - A 5–10 minute snapshot is enough; longer adds noise without insight.
   - After capture, delete the in-pod file: kubectl exec ... rm /tmp/${REC}-peak-*.jfr.
Verification after deploy. Pick a target env (here: prod) and a running pod, then confirm JFR is recording. The variables below are the same ones used in the runbook bash block above:

    CONTEXT=Alpha001       # kubectl context (Alpha001 for prod/demo, Alpha002 for stage/dev)
    COMPONENT=operations
    NS=prod-operations     # ${ENV}-${COMPONENT}
    POD=$(kubectl --context $CONTEXT -n $NS get pod \
      -l app=$COMPONENT -o name | head -1)

    kubectl --context $CONTEXT -n $NS exec $POD -- jcmd 1 JFR.check

Expected output (substituting prod-operations for the prefix when
deployed to prod; stage-operations etc. for other envs):

    Recording 1: name=prod-operations duration=0s (running)

Risk. Low for the recording itself. JFR was designed for always-on
production recording, and commercial continuous profilers such as
Datadog's build on it for Java. The expected cost is the 1–2 % CPU
overhead quoted above plus at most 200 MB of disk (the maxsize cap).
One operational note: disk=true writes to the pod’s ephemeral
storage. With maxsize=200M, that’s small enough not to matter on
Fargate (default ephemeral storage is 20 GiB). If a pod has nothing
else writing to disk, this is invisible.
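As a rough budget check on ephemeral storage, the JFR repository cap can be put next to the heap dump from Recommendation #3. The sketch assumes a 20 GiB (20480 MiB) ephemeral volume and treats a heap dump as roughly the size of the ~3 GiB prod heap cap; both are approximations, and storage_share_pct is an illustrative name.

```shell
#!/usr/bin/env bash
# storage_share_pct USED_MIB TOTAL_MIB
# Share of ephemeral storage consumed, as a percentage with one decimal.
storage_share_pct() {
  awk -v used="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * used / total }'
}

storage_share_pct 200 20480               # JFR maxsize cap alone: prints 1.0
storage_share_pct $((200 + 3072)) 20480   # plus one ~3 GiB heap dump: prints 16.0
```

Peak snapshots written by JFR.dump add to this until they are deleted, which is one reason the runbook asks for cleanup after capture.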
Effort and ship order. XS for the config change itself. The bulk
of the work is using the snapshot once captured — interpreting the
flame graph and turning that into concrete operations-side fixes
(which would feed PDEV-490 OP4–OP7). But the change to enable it is
the same XS-sized addition to JAVA_TOOL_OPTIONS and a small
markdown file.
Worth shipping in the same PR as #1, #2, and #3 — they’re all “one PR, one Helm chart redesign” — but only after we’ve also raised the prod pod CPU limit. JFR recording on a CPU-starved pod will exhibit the cost of its own profiling more than it reveals the true workload.
Recommendation #5 — JVM-agent-based Sentry instrumentation (no Kotlin code)
What it does. Adds server-side observability for the operations
component without touching Kotlin source. The mechanism is the
Sentry OpenTelemetry Java agent (sentry-opentelemetry-agent-<ver>.jar),
attached at JVM startup via -javaagent:. It auto-instruments via the
OpenTelemetry Java instrumentation library and ships spans, errors,
and JVM metrics to Sentry.
Six signals come for free, without a single import:
| Signal | Source |
|---|---|
| HTTP server spans for every Ktor route | OTel Ktor/Netty instrumentation |
| DB query spans (statement, rows, duration) | OTel JDBC instrumentation around HikariCP |
| Distributed trace context | Reads sentry-trace / W3C traceparent from inbound headers and propagates outbound |
| Uncaught exceptions → Sentry issues with stack traces | OTel error capture |
| JVM runtime metrics (heap, GC, threads) | OTel JVM metrics emitter |
| Release / environment tagging | Env vars at startup |
This closes two gaps from the investigation: the “37 % silent 500” path
on kanban-card/details becomes visible to Sentry without an operations
code change (PDEV-490 OP3’s contract fix can land separately on its
own merits), and the BFF→operations correlation becomes a single
distributed trace in Sentry instead of a hand timestamp-join.
Code-side instrumentation — adding custom spans, business-context
tags on the kanban endpoints, breadcrumbs on the 500 == no cards
path, etc. — is naturally PDEV-490 territory and not in scope here.
Where the change lives. Three pieces, all in the operations repo build / Helm chart. No Kotlin source changes.
-
Bundle the agent JAR in the image via Jib’s
extraDirectories. A small Gradle task downloads the agent JAR (pinned by version) intosrc/main/jib/agent/, and Jib copies it into the image at/app/agents/sentry-otel-agent.jar:build.gradle.kts val sentryAgentVersion = "8.5.0" // pin to a specific releaseval sentryAgent = configurations.create("sentryAgent")dependencies {"sentryAgent"("io.sentry:sentry-opentelemetry-agent:$sentryAgentVersion")}val copySentryAgent by tasks.registering(Copy::class) {from(sentryAgent) {include("sentry-opentelemetry-agent-*.jar")rename { "sentry-otel-agent.jar" }}into(layout.projectDirectory.dir("src/main/jib/agent"))}tasks.named("jibDockerBuild") { dependsOn(copySentryAgent) }tasks.named("jib") { dependsOn(copySentryAgent) }jib {extraDirectories {paths { path { setFrom("src/main/jib/agent"); into = "/app/agents" } }}// ...existing container / from / to blocks unchanged}The JAR is ~30 MB — material but not significant relative to the Corretto base image. The agent is always present in every image regardless of whether Sentry is enabled, which keeps the failure modes simple (see below).
- Extend the `JAVA_TOOL_OPTIONS` composition in `templates/deployment.yaml` to append the Sentry `-javaagent:` fragment, gated on `oam.performance.sentry.enabled`. This is one more `printf` fragment appended to `$opts`, building on #4’s composition. Gating means the small startup tax (agent load plus bytecode rewriting, ~200–500 ms) stays out of environments that don’t use Sentry:

  ```yaml
  env:
    # ...existing entries...
    {{- $prefix := default (include "application.artifactPrefix" .) .Values.compute.artifactPrefix }}
    {{- $heapDumpFlag := printf "-XX:HeapDumpPath=/tmp/%s-heap.hprof" $prefix }}
    {{- $jfrFlag := printf "-XX:StartFlightRecording=name=%s,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/%s-exit.jfr" $prefix $prefix }}
    {{- $sentryAgentFlag := ternary "-javaagent:/app/agents/sentry-otel-agent.jar" "" .Values.oam.performance.sentry.enabled }}
    {{- $opts := trim (printf "%s %s %s %s" (default "" .Values.compute.javaToolOptions) $heapDumpFlag $jfrFlag $sentryAgentFlag) }}
    {{- if $opts }}
    - name: JAVA_TOOL_OPTIONS
      value: {{ $opts | quote }}
    {{- end }}
  ```

  This is the #3 + #4 + #5 composition: the final form of the `JAVA_TOOL_OPTIONS` injection when all three recommendations are in place. When `oam.performance.sentry.enabled: false`, the `-javaagent:` fragment is the empty string; the agent JAR sits in the image but is never loaded. When `compute.javaToolOptions` is empty, the intent is that the env var is omitted entirely, so JFR and heap dumps don’t activate either. Note that `$heapDumpFlag` and `$jfrFlag` as composed above are never empty, so `$opts` is only empty after `trim` if the composition is additionally guarded on `compute.javaToolOptions`. Values files contain only literal strings and per-env overrides: no `{{ ... }}` substitutions inside values, no `tpl` indirection.
- Add an `oam:` top-level section to `values.yaml`, with a `performance:` sub-block that houses Sentry (and, in the future, any other performance-monitoring concern: JFR Sentry profiling, APM agents, distributed-tracing collectors, etc.). Keeping observability under `oam.performance.*` separates it from the pod’s `compute:` (sizing) concerns and reserves room under `oam:` for other operational aspects (log shipping, alerting hooks, etc.) that don’t belong in `compute:`:

  ```yaml
  # values.yaml (chart defaults)
  #
  # OAM: operational-aspect configuration. Houses observability,
  # log shipping, alerting hooks, and any concern that is about how
  # the running pod is observed and managed, as opposed to compute
  # capacity (which lives under `compute:`).
  oam:
    performance:
      sentry:
        # Enabled in all four envs day-one per the blanket telemetry
        # policy. The chart's fail-soft contract (secretKeyRef:
        # optional: true on the DSN) makes this safe even before the
        # infrastructure-scoped DSN secret is provisioned; see
        # § Failure Modes for what happens when enabled but DSN
        # missing.
        enabled: true
        # Environment tag. When empty (default), the deployment template
        # uses the `application.environment` helper which composes
        # `{infrastructure}-{purpose}` (e.g. `alpha001-prod`). Override
        # here with a literal string only when a particular env needs
        # something other than the helper output.
        environment: ""
        # Sample rate for traces. Default is 0.1 (10 %) for the
        # production-facing envs; dev and stage override to 1.0 in
        # their per-env values files for full debugging visibility.
        tracesSampleRate: "0.1"
  ```

  The K8s secret name (`be-sentry-dsn`) and key (`dsn`) are fixed in the deployment template, not values-driven: they are produced by an `ExternalSecret` declared in `templates/secrets.yaml` (see § Provisioning pipeline). There is no per-env override for the secret name; deviation would require breaking the “one-DSN-per-Sentry-project” model.

  Per-env enablement. Blanket telemetry policy: all four envs ship with `enabled: true` day one. Differentiation is via `tracesSampleRate` only (full sampling in the debug-oriented envs, capped in the production-facing ones to keep CPU overhead small). The chart’s fail-soft contract (`secretKeyRef: optional: true` on the DSN; see § Failure modes) makes it safe to enable everywhere before the upstream AWS SM secret exists: the Sentry agent simply stays inert until the K8s secret materializes.

  ```yaml
  # values-dev.yaml
  oam:
    performance:
      sentry:
        enabled: true
        tracesSampleRate: "1.0"
  ```

  ```yaml
  # values-stage.yaml
  oam:
    performance:
      sentry:
        enabled: true
        tracesSampleRate: "1.0"
  ```

  ```yaml
  # values-demo.yaml
  oam:
    performance:
      sentry:
        enabled: true
        tracesSampleRate: "0.1" # 10 %, production-facing demo env
  ```

  ```yaml
  # values-prod.yaml
  oam:
    performance:
      sentry:
        enabled: true
        tracesSampleRate: "0.1" # 10 % to keep agent CPU overhead small
  ```

  The deployment template injects two env-var groups:

  - `OTEL_SERVICE_NAME`: always set, outside the Sentry gate. The component name is a property of the deployed service, not a Sentry-specific knob. Setting it unconditionally means any future OTel-aware tooling (Datadog APM, Honeycomb, self-hosted Tempo, etc.) automatically tags events correctly without requiring a new gate. Sentry consumes it as the `service.name` tag, the dimension every alert rule and dashboard widget scopes by in the single-project `platform-be` design (see `sentry-configuration.md` § Component differentiation via `service.name`).
  - `SENTRY_*` env vars: gated on `oam.performance.sentry.enabled`. They use the `application.environment` helper for the env-tag default and a fixed `secretKeyRef` to the `be-sentry-dsn` K8s secret produced by the ExternalSecret (see § Provisioning pipeline).

  ```yaml
  # templates/deployment.yaml, alongside the existing env entries
  - name: OTEL_SERVICE_NAME
    value: {{ include "application.name" . | quote }}
  {{- if .Values.oam.performance.sentry.enabled }}
  {{- $sentryEnv := default (include "application.environment" .) .Values.oam.performance.sentry.environment }}
  - name: SENTRY_DSN
    valueFrom:
      secretKeyRef:
        name: be-sentry-dsn
        key: dsn
        optional: true # so pod still starts if upstream lags
  - name: SENTRY_ENVIRONMENT
    value: {{ $sentryEnv | quote }}
  - name: SENTRY_TRACES_SAMPLE_RATE
    value: {{ .Values.oam.performance.sentry.tracesSampleRate | quote }}
  - name: SENTRY_RELEASE
    value: {{ printf "%s@%s" (include "application.name" .) .Chart.AppVersion | quote }}
  {{- end }}
  ```
- Add the `be-sentry-dsn` `ExternalSecret` to `templates/secrets.yaml`, alongside the existing `ExternalSecret` resources. Plain-string projection: no JSON, no template indirection. The `remoteRef.key` is composed from `global.infrastructure` and points at the infrastructure-scoped AWS Secrets Manager secret provisioned by the infrastructure layer (see § Provisioning pipeline):

  ```yaml
  # templates/secrets.yaml, inside the `featureFlag.hasExternalSecrets` block
  {{- if .Values.oam.performance.sentry.enabled }}
  ---
  apiVersion: external-secrets.io/v1
  kind: ExternalSecret
  metadata:
    name: be-sentry-dsn
    labels:
      {{- include "application.labels" . | nindent 4 }}
      app: {{ include "application.name" . | quote }}
  spec:
    refreshInterval: 1h
    secretStoreRef:
      name: {{ include "application.name" . | quote }}
    target:
      deletionPolicy: Delete
    data:
      - secretKey: dsn
        remoteRef:
          key: "{{ .Values.global.infrastructure }}-SentryDsn"
          version: "AWSCURRENT"
  {{- end }}
  ```

  The ExternalSecret is gated on `oam.performance.sentry.enabled` so it is not even synthesized in environments that are not yet using Sentry, which keeps the namespace clean and avoids ESO log noise about a not-yet-existing source secret. (The `featureFlag.hasExternalSecrets: false` branch, used by `helmInstallToLocal`, does not need a vanilla-K8s-secret fallback for Sentry; local installs run with `oam.performance.sentry.enabled: false`.)
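The `JAVA_TOOL_OPTIONS` composition described in the steps above can be modelled outside Helm to sanity-check what a given values combination renders. The following is a minimal Python sketch, not the chart's implementation: the function name `compose_java_tool_options` is hypothetical, and it only mirrors the `printf` / `ternary` / `trim` logic shown in the template.

```python
def compose_java_tool_options(base: str, prefix: str, sentry_enabled: bool) -> str:
    """Illustrative Python model of the Helm `$opts` composition (hypothetical helper)."""
    heap_dump_flag = f"-XX:HeapDumpPath=/tmp/{prefix}-heap.hprof"
    jfr_flag = (
        f"-XX:StartFlightRecording=name={prefix},settings=profile,disk=true,"
        f"maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/{prefix}-exit.jfr"
    )
    # Sprig's `ternary` returns its first argument when the condition is true.
    sentry_agent_flag = (
        "-javaagent:/app/agents/sentry-otel-agent.jar" if sentry_enabled else ""
    )
    # Mirrors `trim (printf "%s %s %s %s" ...)`: only leading and trailing
    # whitespace is removed, so the fragments stay space-separated.
    return " ".join([base, heap_dump_flag, jfr_flag, sentry_agent_flag]).strip()


# Example base flags; a real values file would carry its own string.
base = "-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError"
with_agent = compose_java_tool_options(base, "prod-operations", True)
without_agent = compose_java_tool_options(base, "prod-operations", False)
print(with_agent)
print(without_agent)
```

With Sentry enabled the string ends in the `-javaagent:` fragment; with it disabled the string ends at the JFR fragment, because the trailing space left by the empty fragment is trimmed.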
Failure modes: what happens if the Sentry DSN is not provisioned
when the pod starts. This is the question that determines whether
shipping with `enabled: true` in every environment is safe, that is,
what happens when an environment is switched on before its DSN secret
exists. The behaviour falls into six cases, only one of which is fail-loud:
| Scenario | JVM start | Pod traffic | What you’ll see |
|---|---|---|---|
| `oam.performance.sentry.enabled: false` (e.g. local installs) | Normal | Normal | No `-javaagent:` flag rendered; agent JAR sits in the image unused; no Sentry-related log lines. Zero overhead. |
| `oam.performance.sentry.enabled: true`, DSN secret missing or empty | Normal | Normal | Agent loads, attempts init, reads empty `SENTRY_DSN` and disables itself. Logs (at INFO/WARN) something like "Sentry SDK disabled because no DSN was set". Spans/errors are dropped silently. JVM continues. |
| `oam.performance.sentry.enabled: true`, DSN malformed (not a valid Sentry URL) | Normal | Normal | Agent loads, logs an `ERROR: Invalid DSN`, then disables itself. Same outcome as missing DSN. |
| `oam.performance.sentry.enabled: true`, DSN valid but Sentry temporarily unreachable | Normal | Normal | Events queue in memory (default ~100 events), then the oldest are dropped. The SDK respects `Retry-After` headers. JVM unaffected. |
| `oam.performance.sentry.enabled: true`, DSN revoked / project deleted | Normal | Normal | Sentry transport gets 401/403; SDK applies exponential backoff. JVM unaffected. |
| `oam.performance.sentry.enabled: true`, agent JAR missing from `/app/agents/` | Pod fails to start with `Error opening zip file or JAR manifest missing` | n/a | This is the only fail-loud path. Mitigated by bundling the JAR at image build time (step 1): the JAR’s existence is guaranteed at every pod start. |
Operational property: Sentry is fail-soft. The only path that
crashes the pod is the agent JAR itself being missing, which the
build-time bundling closes deterministically. Missing/invalid DSN
just produces an inert agent — the pod starts and serves traffic
normally with one diagnostic line in its log. This makes it safe to
flip oam.performance.sentry.enabled: true in a values file before the DSN
secret has been created: the worst case is “Sentry is silently
not collecting yet.”
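The missing and malformed DSN rows of the table can be caught before a deploy with a rough shape check. The sketch below is an assumption-laden pre-flight helper, not the Sentry SDK's own validation: the function name `classify_dsn` is hypothetical, the sample DSNs are made up, and it only checks for the generally documented DSN shape (scheme, public key before the `@`, host, trailing project ID).

```python
from urllib.parse import urlparse


def classify_dsn(dsn):
    """Rough pre-flight classification of a DSN string (hypothetical helper)."""
    if dsn is None or not dsn.strip():
        # Corresponds to the "DSN secret missing or empty" row.
        return "missing"
    parsed = urlparse(dsn.strip())
    # The project ID is taken to be the last path segment.
    project_id = parsed.path.rsplit("/", 1)[-1]
    well_formed = (
        parsed.scheme in ("http", "https")
        and bool(parsed.username)  # public key before the '@'
        and bool(parsed.hostname)
        and bool(project_id)
    )
    return "well-formed" if well_formed else "malformed"


# Made-up values for illustration only.
print(classify_dsn(""))
print(classify_dsn("not-a-url"))
print(classify_dsn("https://examplekey@o0.ingest.sentry.io/42"))
```

A "well-formed" result says nothing about whether the DSN is still valid on the Sentry side; the revoked and unreachable rows can only be observed at runtime.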
Provisioning pipeline
The DSN value flows through four layers from source to pod. Each layer is independently deployable; the chart’s fail-soft properties make lagging layers safe.
```text
1Password (Arda-SystemsOAM)
  op://Arda-SystemsOAM/be-sentry-dsn/dsn        [source of record]
        │
        ▼  (amm.sh at infra deploy: op read + ::add-mask:: + --parameters)
AWS Secrets Manager (per infrastructure, plain string)
  Alpha001-SentryDsn = "https://...@.../<project-id>"
  Alpha002-SentryDsn = "https://...@.../<project-id>"
        │
        ▼  (ESO sync via the existing SecretStore + secretReader IRSA role)
K8s Secret (per operations namespace)
  be-sentry-dsn { dsn: <string> }
        │
        ▼
Pod env
  SENTRY_DSN ← secretKeyRef(be-sentry-dsn, dsn)
```

Resource breakdown, per infrastructure (Alpha001 shown; Alpha002 mirrors):
| Layer | Resource | Cardinality in Alpha001 |
|---|---|---|
| 1P | `op://Arda-SystemsOAM/be-sentry-dsn/dsn` | 1 total (workspace-wide source of record) |
| AWS SM | `Alpha001-SentryDsn` (plain `SecretString`) | 1 per infrastructure |
| K8s `ExternalSecret` | `be-sentry-dsn` | 1 per operations namespace (so 2 in Alpha001: `prod-operations`, `demo-operations`) |
| K8s `Secret` | `be-sentry-dsn` (key `dsn`) | 1 per operations namespace (materialized by ESO) |
The IAM piece is already in place: the
{Infrastructure}-SecretsManagerReadRole (used by the secretReader
IRSA ServiceAccount that ESO assumes) has a managed policy
{Infrastructure}-ReadSecrets whose resource scope is
arn:aws:secretsmanager:{region}:{account}:secret:{Infra}-*. That
wildcard already covers {Infra}-SentryDsn alongside the
partition-scoped keys. No IAM change required.
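The wildcard claim can be checked mechanically. The sketch below is a simplified approximation of IAM resource matching using Python's `fnmatchcase`; the region, account ID and the six-character suffix are hypothetical placeholders (Secrets Manager appends a random suffix to secret ARNs, which the trailing `*` also absorbs).

```python
from fnmatch import fnmatchcase


def policy_covers(resource_pattern: str, secret_arn: str) -> bool:
    """Approximate IAM wildcard matching (hypothetical helper).

    fnmatchcase also interprets '[...]' character classes, which IAM does
    not; that difference is irrelevant for the ARNs used here.
    """
    return fnmatchcase(secret_arn, resource_pattern)


# Hypothetical region and account; only the name prefix comes from the document.
pattern = "arn:aws:secretsmanager:us-east-1:123456789012:secret:Alpha001-*"
sentry_arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:Alpha001-SentryDsn-AbCdEf"
other_infra_arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:Alpha002-SentryDsn-AbCdEf"

print(policy_covers(pattern, sentry_arn))       # True
print(policy_covers(pattern, other_infra_arn))  # False
```

The second call illustrates that each infrastructure's read role stays scoped to its own `{Infra}-` prefix.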
Recommended operational sequence
(Matches PDEV-492 and the infrastructure work scoped in `infrastructure-improvements.md` §4.)

1. (done) Provision the Sentry project (`platform-be`) in the `arda-systems` org. Capture the single DSN (one DSN per project, used across all four envs by design; see `sentry-configuration.md` § Storage Convention).
2. (done) Add the DSN to the workspace-wide `Arda-SystemsOAM` 1Password vault as `be-sentry-dsn/dsn`. Not the partition vaults: Sentry DSNs are common across envs by design, so they belong in the workspace vault.
3. Infrastructure CDK: add an `AWS::SecretsManager::Secret` resource named `{Infrastructure}-SentryDsn` to the existing infrastructure CDK app (mounted by `src/main/cdk/instances/{Infrastructure}/infra.ts`). Use a `cdk.CfnParameter` (`noEcho: true`, `minLength: 1`) and pass it to the SM resource via `cdk.SecretValue.cfnParameter(...)` so the synthesized template never embeds the DSN plaintext. Export the secret ARN as `{Infrastructure}-I-SentryDsnArn`. The legacy `src/main/cfn/partitionSecrets.cfn.yaml` template is not the model; CDK is the preferred IaC technology in the `infrastructure` repo. Full design in `infrastructure-improvements.md` §4.
4. `amm.sh` extension (infra step, before the existing infrastructure CDK deploy at lines 304-342): read `op://Arda-SystemsOAM/be-sentry-dsn/dsn`, mask via `::add-mask::` when running under GitHub Actions, append `--parameters SentryDsn=$SENTRY_DSN` to `infrastructure_cdk_arguments`, and let the existing `npx cdk ... deploy` call apply it. The `-I-` / `-API-` conventions are CFN-export-name markers only and never appear in AWS resource names; only the export (`{Infrastructure}-I-SentryDsnArn`) carries the `-I-`.
5. Operations chart (this ticket): the `ExternalSecret` declared in step 4 of the previous subsection materializes the K8s secret `be-sentry-dsn` in each operations namespace.
6. Ship the chart with `oam.performance.sentry.enabled: true` in all four envs (blanket telemetry policy). Per-env `tracesSampleRate` differs: 1.0 in dev and stage, 0.1 in demo and prod.
If steps 3–5 lag behind step 6, the pod still starts; the only effect is no Sentry data flow for that env until the SM secret + ExternalSecret reconcile. No outage, no rollback.
Verification after deploy. Pick the target env (here: prod) and
sample the logs. Substitute `CONTEXT`/`NS` for other envs.

```bash
CONTEXT=Alpha001      # Alpha001 for prod/demo, Alpha002 for stage/dev
NS=prod-operations    # ${ENV}-${COMPONENT}

# Agent loaded and initialised
kubectl --context "$CONTEXT" -n "$NS" logs \
  -l app=operations --tail 500 \
  | grep -E 'sentry|JAVA_TOOL_OPTIONS'
```

Expected output when `oam.performance.sentry.enabled: true` and a
valid DSN is in the K8s secret. The `JAVA_TOOL_OPTIONS` line lists
every flag composed by the deployment template, verbatim, in the
order the template emits them:

```text
Picked up JAVA_TOOL_OPTIONS: -Xlog:gc*:stdout:time,uptime,level,tags -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/prod-operations-heap.hprof -XX:StartFlightRecording=name=prod-operations,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/prod-operations-exit.jfr -javaagent:/app/agents/sentry-otel-agent.jar
[sentry] INFO Initializing Sentry, dsn=https://o<...>@sentry.io/<project>
[sentry] INFO OpenTelemetry instrumentation loaded
```

Expected output when enabled but the DSN secret is missing or empty:

```text
Picked up JAVA_TOOL_OPTIONS: -Xlog:gc*:stdout:time,uptime,level,tags -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/prod-operations-heap.hprof -XX:StartFlightRecording=name=prod-operations,settings=profile,disk=true,maxsize=200M,maxage=2h,dumponexit=true,filename=/tmp/prod-operations-exit.jfr -javaagent:/app/agents/sentry-otel-agent.jar
[sentry] WARN Sentry SDK disabled because no DSN was set
```

Then in the Sentry web UI, the `platform-be` project should
start receiving:
- Performance > Transactions: spans for `GET /v1/kanban/kanban-card/for-item/{itemEId}` etc.
- Issues: any uncaught exceptions thrown by the Ktor handlers.
- A trace from the frontend should now extend through API Gateway
  into operations (visible by clicking the trace ID from a recent
  `arda-frontend` event).
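The log check above lends itself to a small script when several environments need verifying. The sketch below is an illustration only: `classify_sentry_state` is a hypothetical helper, and the substrings it looks for are taken from the expected outputs shown in this section, so they are assumptions that may differ between agent versions.

```python
AGENT_FLAG = "-javaagent:/app/agents/sentry-otel-agent.jar"


def classify_sentry_state(log_text: str) -> str:
    """Classify pod log output into a Sentry agent state (hypothetical helper)."""
    lines = log_text.splitlines()
    agent_attached = any(
        "JAVA_TOOL_OPTIONS" in line and AGENT_FLAG in line for line in lines
    )
    if not agent_attached:
        return "agent-not-attached"
    if any("Sentry SDK disabled because no DSN was set" in line for line in lines):
        return "inert-no-dsn"
    if any("Initializing Sentry" in line for line in lines):
        return "active"
    return "unknown"


# Abbreviated sample logs; real output carries the full flag list.
active_log = (
    "Picked up JAVA_TOOL_OPTIONS: -XX:+ExitOnOutOfMemoryError " + AGENT_FLAG + "\n"
    "[sentry] INFO Initializing Sentry, dsn=https://o<...>@sentry.io/<project>\n"
)
inert_log = (
    "Picked up JAVA_TOOL_OPTIONS: -XX:+ExitOnOutOfMemoryError " + AGENT_FLAG + "\n"
    "[sentry] WARN Sentry SDK disabled because no DSN was set\n"
)
print(classify_sentry_state(active_log))
print(classify_sentry_state(inert_log))
```

Feeding it the output of the `kubectl ... logs` command distinguishes a healthy agent from the inert, fail-soft state without reading the logs by eye.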
Risk. Low. The agent’s runtime overhead at 10 % trace sampling in prod is 1–3 % CPU and ~10–30 MB of additional resident heap, within the headroom the 2 vCPU / 4 GiB prod pod sizing provides. Dev and stage at 100 % sampling pay more overhead (5–8 % CPU), but on non-prod traffic that’s acceptable for the catch-misconfiguration benefit; demo at 10 % matches prod. The blanket-on policy is a deliberate choice to surface integration / misconfiguration problems early; any env can be dialed back later via a one-line values change if its overhead becomes a problem.
Quota / billing. Operations transaction volume is significant
(2,921 query-by-item calls per day at the time of the investigation;
likely more across all routes). At 10 % prod + 10 % demo + 100 % dev
+ 100 % stage sampling, the bulk of traced volume comes from the production-facing envs at modest rates, well within typical Sentry plan limits, but worth a one-line confirmation with whoever owns Sentry billing before deploy.
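As a rough worked figure for the billing conversation, the sampling arithmetic can be written down directly. This is a back-of-the-envelope sketch: the only measured input is the 2,921 query-by-item calls per day quoted above, volumes for other routes and environments are unknown, and `traced_per_day` is a hypothetical helper name.

```python
def traced_per_day(calls_per_day: float, sample_rate: float) -> float:
    """Expected number of traced transactions per day at a given sample rate."""
    return calls_per_day * sample_rate


# The one route and environment with a measured volume (prod, query-by-item).
prod_query_by_item = traced_per_day(2921, 0.1)
print(round(prod_query_by_item))  # 292
```

So the known prod route contributes roughly 292 traced transactions per day at the 10 % rate; the same function can be reused once per-route and per-env volumes are available.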
Effort and ship order. Small. The wiring is the same shape as
the GC-log / JFR plumbing in #3 and #4: a values change, a
deployment-template extension, and a build-system addition (Gradle
task + Jib extraDirectories). No Kotlin code.
Suggested order. Everything below ships in one PR (PDEV-488). What differs is when each piece activates per environment:
- In the PR: #1, #2 (sizing + HPA on in all four envs), #3 (GC
  logs + JVM memory), and the Helm wiring for #4 and #5. Values
  files at this point: all four envs have `compute.javaToolOptions`
  populated (so GC logs + JVM memory + JFR are active everywhere from
  merge); `oam.performance.sentry.enabled` is `true` in all four envs
  with per-env `tracesSampleRate` (1.0 for dev and stage, 0.1 for demo
  and prod).
- After the PR merges and deploys, all four envs activate GC
  logs + JVM memory + JFR + Sentry agent. The Sentry agent is
  fail-soft if the upstream AWS SM secret isn’t yet provisioned
  (`secretKeyRef: optional: true` keeps the pod healthy with an inert
  agent until the secret reconciles).
- No follow-up PR is required for telemetry activation. If a
  specific env later needs to be dialed back (e.g. quota concern,
  noise from an integration), flip its `tracesSampleRate` or `enabled`
  flag with a one-line values change.
Two related items that fall outside this recommendation but follow naturally:
- The Sentry org / vault provisioning is tracked by PDEV-492:
  one Sentry project (`platform-be`), one DSN (shared across all envs
  by design), one workspace-vault entry. Owners: whoever manages the
  Sentry org and `Arda-SystemsOAM` today.
- PDEV-490 will benefit from this work landing first: once Sentry is
  visible on the operations side, OP3 (fix `500 == no cards`) becomes
  much easier to triage because the silent stack traces become loud
  stack traces in Sentry.
Copyright: © Arda Systems 2025-2026, All rights reserved