Runbook: JVM Profiling — JFR and Heap Dumps
Author: Miguel Pinilla
Date: 2026-05-15
Last Verified: 2026-05-15
Environments: dev | stage | demo | prod
Linear: PDEV-488
Purpose
This runbook documents how to inspect and retrieve JVM diagnostic artifacts from Arda back-end pods on EKS:
- Continuous JFR recording: every pod runs a rolling JFR recording (settings=profile,maxsize=200M,maxage=2h) writing to ephemeral pod storage. On graceful shutdown the rolling window is dumped to /tmp/{prefix}-exit.jfr.
- On-demand JFR dump: capture the current rolling window without stopping the pod, then copy it out.
- Heap dumps: captured automatically on OutOfMemoryError to /tmp/{prefix}-heap.hprof, or on demand via jcmd.
The {prefix} token is defined in Conventions used below and resolves
to {env}-{component} (e.g., prod-operations).
The setup is provisioned by the operations Helm chart via
compute.javaToolOptions and is enabled in all four environments from
day 1 (decision DR-002 in PDEV-488). Other back-end components inherit
the same shape once their charts adopt the javaToolOptions plumbing.
Personas
| Persona | Who | Background |
|---|---|---|
| sre | On-call SRE / DevOps engineer | kubectl access to the target EKS context (Alpha001 for demo/prod, Alpha002 for dev/stage); local JDK ≥ 21 with jcmd and JDK Mission Control installed. |
Prerequisites
- The target pod is running (not CrashLoopBackOff). For dumps from a crashed pod, see Retrieving artifacts from a terminated pod below.
- Your kubectl context points at the cluster hosting the pod:

      # demo / prod live on Alpha001
      kubectl config use-context Alpha001
      # dev / stage live on Alpha002
      kubectl config use-context Alpha002

- The namespace follows {env}-{component} (e.g., prod-operations).
- The JVM banner shows the agent / recording flags. Confirm with:

      kubectl -n {namespace} logs -l app={component} --tail 500 \
        | grep 'JAVA_TOOL_OPTIONS'

  Expect -XX:StartFlightRecording=… and (if Sentry is active) -javaagent:/app/agents/sentry-otel-agent.jar in the printed banner.
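The banner check can be turned into a pass/fail test instead of eyeballing grep output. The following is a minimal sketch: the function name `has_jfr_flag` is hypothetical, and it assumes the flags appear on the same log line as the text JAVA_TOOL_OPTIONS (the JVM normally prints them as "Picked up JAVA_TOOL_OPTIONS: …").

```shell
# Hypothetical helper (not part of the chart): reads log text on stdin and
# succeeds only if a JAVA_TOOL_OPTIONS banner line carries the JFR start flag.
has_jfr_flag() {
  # First grep keeps banner lines; second checks for the flag.
  # -e is needed because the pattern itself starts with a dash.
  grep 'JAVA_TOOL_OPTIONS' | grep -q -e '-XX:StartFlightRecording='
}

# Example use (placeholders as in the rest of this runbook):
#   kubectl -n {namespace} logs -l app={component} --tail 500 \
#     | has_jfr_flag && echo "JFR flag present"
```

If the banner has scrolled out of the last 500 lines, the check fails even on a correctly configured pod, so a failure is a prompt to widen --tail rather than proof of misconfiguration.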
Conventions used below
- {namespace}: Kubernetes namespace, e.g., prod-operations.
- {component}: the Helm app label, e.g., operations.
- {prefix}: the artifact prefix, {env}-{component} (e.g., prod-operations). Computed by the chart’s application.artifactPrefix helper and used in both the JFR filename and -XX:HeapDumpPath.
- POD: a single pod name. Pick one with:

      POD=$(kubectl -n {namespace} get pod -l app={component} \
        -o jsonpath='{.items[0].metadata.name}')

The application runs as PID 1 inside the container, so all jcmd
invocations target 1.
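The conventions above can be gathered into two small shell helpers so the placeholders are computed rather than typed. This is a minimal sketch: the function names `arda_context` and `arda_prefix` are hypothetical, and the environment-to-cluster mapping is copied from the Prerequisites list.

```shell
# Hypothetical helpers (names are illustrative, not part of the chart).

# Map an environment to the kubectl context that hosts it:
# demo/prod -> Alpha001, dev/stage -> Alpha002.
arda_context() {
  case "$1" in
    demo|prod) echo "Alpha001" ;;
    dev|stage) echo "Alpha002" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Build the {env}-{component} string used for both {namespace} and {prefix}.
arda_prefix() {
  # Validate the environment first so typos fail early.
  arda_context "$1" >/dev/null || return 1
  echo "$1-$2"
}

# Example use:
#   kubectl config use-context "$(arda_context prod)"
#   NS=$(arda_prefix prod operations)   # prod-operations
```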
Section 1: Verify JFR is recording
Check the recording status:

    kubectl -n {namespace} exec $POD -- jcmd 1 JFR.check

Expected output (recording name matches {prefix}):

    Recording 1: name={prefix} duration=0s (running)

If no recording is listed:
- The JVM was not started with the expected JAVA_TOOL_OPTIONS. Re-check the banner from the Prerequisites step.
- The chart’s compute.javaToolOptions may be empty for this env. Inspect with helm get values {release} -n {namespace}.
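For scripted health checks, the JFR.check output can be tested rather than read. The sketch below assumes the output format shown in the expected output above (a line containing name={prefix} followed by a space, ending in "(running)"); the function name `jfr_is_running` is hypothetical.

```shell
# Hypothetical helper: reads `jcmd 1 JFR.check` output on stdin and succeeds
# only if a recording with exactly the given name is listed as running.
jfr_is_running() {
  # The trailing space after the name avoids matching longer names that
  # merely start with the same prefix.
  grep -F "name=$1 " | grep -q -F '(running)'
}

# Example use:
#   kubectl -n {namespace} exec $POD -- jcmd 1 JFR.check \
#     | jfr_is_running {prefix} || echo "recording missing" >&2
```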
Handling JFR recordings and heap dumps
Treat retrieved artifacts as production data. JFR recordings can include request URLs, header values, thread-local context and environment variables; heap dumps additionally include every live object: request bodies, decrypted secrets, access tokens, customer PII, and any in-flight cryptographic material. The same access rules that apply to production database extracts apply here.
Operator obligations:
- Storage: keep the file on an Arda-managed laptop encrypted at rest (FileVault / LUKS). Do not upload to personal cloud storage, shared Google Drive folders, or third-party analysis sites.
- Sharing: if you need to share the artifact for analysis, use a 1Password Document in the relevant Arda-{Env}OAM vault, or op://Arda-SystemsOAM/... for cross-env investigations. Never attach a .jfr or .hprof to a Slack DM or a Linear comment.
- Retention: delete the local copy as soon as the investigation closes, and within 30 days at the latest. When the file is shared via 1Password, set an expiry on the Document.
- Incident write-ups: quote conclusions, not raw frames. If a stack trace or class histogram excerpt is needed in a post-mortem, redact request bodies and any field that could contain a secret.
Section 2: On-demand JFR dump
Use this when you want a snapshot of the current rolling window, e.g., to investigate a live performance regression without restarting the pod.

    # 1. Dump the current rolling window to a file inside the pod
    kubectl -n {namespace} exec $POD -- jcmd 1 JFR.dump \
      name={prefix} filename=/tmp/{prefix}-ondemand.jfr

    # 2. Copy the file out
    kubectl -n {namespace} cp \
      $POD:/tmp/{prefix}-ondemand.jfr \
      ./{prefix}-ondemand-$(date +%Y%m%d-%H%M%S).jfr

    # 3. (Optional) tidy up inside the pod
    kubectl -n {namespace} exec $POD -- rm /tmp/{prefix}-ondemand.jfr

Notes:
- JFR.dump is non-destructive: the rolling recording keeps running.
- The dump captures the configured maxage=2h / maxsize=200M window available at the moment of the call. Older data has already been discarded by the rolling buffer.
- Files land in pod-local /tmp (Fargate ephemeral storage). Copy them out before the pod terminates.
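The three steps above can be wrapped in one function so that the in-pod file is removed only after a successful copy. This is a sketch, not an official tool: the function name `jfr_dump_and_copy` and its argument order are hypothetical, and it reuses only the kubectl and jcmd invocations already shown in this section.

```shell
# Hypothetical wrapper around the dump / copy / tidy-up steps.
# Arguments: namespace, pod name, artifact prefix.
# The body runs in a subshell, so `exit` only leaves the function.
jfr_dump_and_copy() (
  ns=$1; pod=$2; prefix=$3
  remote="/tmp/${prefix}-ondemand.jfr"
  local_file="./${prefix}-ondemand-$(date +%Y%m%d-%H%M%S).jfr"

  # 1. Dump the rolling window inside the pod; stop if jcmd fails.
  kubectl -n "$ns" exec "$pod" -- jcmd 1 JFR.dump \
    "name=${prefix}" "filename=${remote}" || exit 1

  # 2. Copy it out; keep the in-pod file if the copy fails.
  kubectl -n "$ns" cp "${pod}:${remote}" "$local_file" || exit 1

  # 3. Tidy up inside the pod only after a successful copy.
  kubectl -n "$ns" exec "$pod" -- rm "$remote"

  # Print the local path so callers can pass it to jfr or JMC.
  echo "$local_file"
)

# Example use:
#   jfr_dump_and_copy prod-operations "$POD" prod-operations
```

Stopping before the cleanup step on a failed copy means a retry can reuse the file that is already in the pod.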
Section 3: Retrieve the exit JFR after a graceful shutdown
The recording is configured with dumponexit=true, so a graceful pod
shutdown writes the rolling window to /tmp/{prefix}-exit.jfr before the
JVM exits. To capture it:

    # Inspect the file inside the pod just before / during termination
    kubectl -n {namespace} exec $POD -- ls -lh /tmp/{prefix}-exit.jfr

    # Copy it out while the pod is still in Terminating state
    kubectl -n {namespace} cp \
      $POD:/tmp/{prefix}-exit.jfr \
      ./{prefix}-exit-$(date +%Y%m%d-%H%M%S).jfr

Race condition: Fargate reaps the pod’s ephemeral storage as soon as the container exits. If you need the exit JFR, you must either:
- Be quick: run kubectl cp during the Terminating window, or
- Use the on-demand flow in Section 2 before triggering shutdown.
For a hard crash (OutOfMemoryError with -XX:+ExitOnOutOfMemoryError,
or a SIGKILL), the exit dump is not written and the rolling JFR is lost.
Fall back to the heap dump (Section 4), which the JVM writes only in the
OutOfMemoryError case.
Section 4: Heap dumps
4.1 Automatic OOM heap dump: current limitation
The chart enables -XX:+HeapDumpOnOutOfMemoryError and
-XX:HeapDumpPath=/tmp/{prefix}-heap.hprof. On any OutOfMemoryError
the JVM writes a heap dump and then exits (-XX:+ExitOnOutOfMemoryError).
The auto-dump is, in practice, unrecoverable today. The Deployment uses restartPolicy: Always, so when the JVM exits the container restarts in place inside the same pod. The pod never enters Terminating; it may pass through CrashLoopBackOff while the new container starts on a fresh writable layer. /tmp from the dead container, including {prefix}-heap.hprof, is no longer reachable via kubectl cp or kubectl exec once the restart has happened.
That makes -XX:+HeapDumpOnOutOfMemoryError useful only as a belt-and-braces
signal that the OOM occurred: the JVM logs the dump path before
exiting, which confirms the failure mode for the operator reading
CloudWatch. The bytes themselves are not retrievable.
Three options when an OOM has just fired in production:
- (Preferred) On-demand pre-OOM dump. If you can see the heap filling up (Sentry alerts on memory pressure, latency spikes, or the JVM’s pre-OOM GC churn), take an on-demand dump via Section 4.2 before the OOM trips. This is the only path that reliably captures the offending object graph from the production replica.
- (Fallback) Reproduce in dev or stage. Replay the offending request (Sentry breadcrumbs / structured logs identify it) against a dev pod with a tighter memory limit, then capture there via Section 4.2 or the auto-OOM dump (in dev the artifact still lives only inside the dying container, so chain it with a kubectl exec tail -f or pre-arm a kubectl cp loop).
- (Future, not in scope of PDEV-488) Provision a PersistentVolume or sidecar at /tmp so HeapDumpPath survives container restart. This is a chart change, not an operator workaround; track it on the SRE backlog if production OOMs become recurrent.
See Retrieving artifacts from a terminated pod below for the broader recovery story.
4.2 On-demand heap dump
For live investigations (e.g., suspected leak that hasn’t yet OOM’d):

    # 1. Trigger the dump (large pods may take 10–60s)
    kubectl -n {namespace} exec $POD -- jcmd 1 GC.heap_dump \
      /tmp/{prefix}-heap-ondemand.hprof

    # 2. Copy it out
    kubectl -n {namespace} cp \
      $POD:/tmp/{prefix}-heap-ondemand.hprof \
      ./{prefix}-heap-$(date +%Y%m%d-%H%M%S).hprof

    # 3. Clean up
    kubectl -n {namespace} exec $POD -- rm /tmp/{prefix}-heap-ondemand.hprof

GC.heap_dump triggers a full GC and writes a .hprof file the size of
the live heap; for a 2 GiB prod pod expect roughly the same on disk.
Confirm /tmp has headroom (Fargate ephemeral is 20 GiB by default) and
that the pod is healthy enough to absorb the stop-the-world pause before
running on a busy prod replica.
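The headroom check can be done with simple arithmetic before triggering the dump. The sketch below makes several assumptions: the function names are hypothetical, the container image ships a `df` binary, and the 10% safety margin is an arbitrary illustration rather than a documented threshold.

```shell
# Hypothetical helpers for a pre-dump headroom check.

# Read `df -Pk /tmp` output on stdin and print the available space in KiB
# (column 4 of the data row in the POSIX output format).
tmp_avail_kib() {
  awk 'NR == 2 { print $4 }'
}

# Succeed if avail_kib covers heap_kib plus a 10% margin.
# Arguments: avail_kib, heap_kib. The margin is an illustrative choice.
heap_dump_fits() {
  [ "$1" -ge $(( $2 + $2 / 10 )) ]
}

# Example use, for a heap of about 2 GiB (2 * 1024 * 1024 KiB):
#   avail=$(kubectl -n {namespace} exec $POD -- df -Pk /tmp | tmp_avail_kib)
#   heap_dump_fits "$avail" $((2 * 1024 * 1024)) || echo "not enough room" >&2
```

This only covers disk space; whether the replica can absorb the stop-the-world pause remains a judgment call for the operator.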
Section 5: Open the artifacts locally
- JDK Mission Control (JMC): an OpenJDK project, free. Download from https://www.oracle.com/java/technologies/jdk-mission-control.html or brew install --cask jdk-mission-control. Open the .jfr file with File → Open File. The Automated Analysis page surfaces the usual suspects (allocation hotspots, lock contention, GC pressure).
- jfr CLI (bundled with the JDK) for quick triage:

      jfr summary {prefix}-exit-*.jfr
      jfr print --events CPULoad,GarbageCollection {prefix}-exit-*.jfr | less

Heap dumps
- Eclipse MAT (Memory Analyzer Tool): best free option for large heaps. Download from https://eclipse.dev/mat/downloads.php or brew install --cask memoryanalyzer on macOS (verify with brew search; the homebrew-cask token has shifted historically).
- VisualVM: handles smaller heaps comfortably. It has not been bundled with the JDK since JDK 9, so download it separately from https://visualvm.github.io/.
- For huge dumps, prefer MAT and increase its workbench heap (edit MemoryAnalyzer.ini → -Xmx8g or higher) before opening.
Section 6: Retrieving artifacts from a terminated pod
If the pod has already exited and Fargate reaped /tmp, the on-pod
artifacts are gone. Options in order of preference:
- Reproduce in stage / dev with the same input or load shape, then capture from the live pod via Section 2 or 4.2.
- Inspect CloudWatch logs and Sentry traces for the period leading up to the crash. The Sentry OpenTelemetry agent (decision DR-002) surfaces the request that triggered the OOM in nearly every case.
- Lower the memory limit in a dev replica and provoke the OOM deliberately. Because restartPolicy: Always means the container restarts in place (no Terminating state), the auto-OOM file is still unreachable post-restart, so pre-arm a watcher instead:

      # In a separate terminal, before triggering the OOM:
      while true; do
        kubectl -n {namespace} cp \
          $POD:/tmp/{prefix}-heap.hprof \
          ./{prefix}-heap-$(date +%Y%m%d-%H%M%S).hprof 2>/dev/null \
          && break
        sleep 1
      done

  The kubectl cp call succeeds on the first cycle in which the dump file exists and the container has not yet fully exited. A copy taken while the JVM is still writing may be truncated, so confirm the file opens in MAT. If the copy never succeeds, the JVM exited too quickly: fall back to an on-demand dump (Section 4.2) just before tripping the OOM, or provision a persistent /tmp mount.
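The watcher loop in the last option runs forever if the dump never appears. A bounded variant is sketched below; the function name `retry_until` and its argument order are hypothetical, and the retry count and delay are left to the caller.

```shell
# Hypothetical helper: run a command until it succeeds, at most max_tries
# times, sleeping `delay` seconds between attempts.
# Arguments: max_tries, delay, then the command and its arguments.
# The body runs in a subshell, so `exit` only leaves the function.
retry_until() (
  max_tries=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    "$@" && exit 0
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    [ "$i" -lt "$max_tries" ] && sleep "$delay"
  done
  exit 1
)

# Example use: try for roughly two minutes, then give up.
#   retry_until 120 1 kubectl -n {namespace} cp \
#     $POD:/tmp/{prefix}-heap.hprof ./{prefix}-heap.hprof
```

Unlike the inline loop, the local filename here is fixed when the command line is built, so add the timestamp once before calling if one is needed.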
Other useful operations
Flight Recorder and jcmd expose much more than day-to-day capture
needs. Reach for these when the dump itself doesn’t tell
the whole story:
| Operation | Command (run inside the pod) | Why / Reference |
|---|---|---|
| List available diagnostic commands | jcmd 1 help | Built-in catalogue of jcmd subcommands (recordings, GC, threads and more). |
| Thread dump | jcmd 1 Thread.print | Classic deadlock / hung-thread triage. Cheaper than a heap dump. |
| Class histogram | jcmd 1 GC.class_histogram | Quick “what’s eating the heap” view without dumping. |
| JIT compilation queue | jcmd 1 Compiler.queue | Useful when latency p99 spikes correlate with deploys. |
| Native memory tracking | jcmd 1 VM.native_memory summary | Requires -XX:NativeMemoryTracking=summary at JVM start (not enabled by default in our charts). |
| JFR custom event recording | jcmd 1 JFR.start name=ad-hoc settings=default duration=60s filename=/tmp/ad-hoc.jfr | When the default profile template hides what you’re chasing. |
| Query an .hprof with OQL (run locally, not in the pod) | Eclipse MAT OQL Console | https://help.eclipse.org/latest/topic/org.eclipse.mat.ui.help/concepts/oqlsyntax.html |
Deeper references:
- JDK Flight Recorder docs: https://docs.oracle.com/en/java/javase/21/jfapi/index.html
- jcmd reference: https://docs.oracle.com/en/java/javase/21/docs/specs/man/jcmd.html
- JDK Mission Control user guide: https://docs.oracle.com/en/java/java-components/jdk-mission-control/9/user-guide/
- Eclipse MAT documentation: https://eclipse.dev/mat/documentation/
- PDEV-488 pod-capacity analysis: the originating analysis lives in roadmap/in-progress/pod-capacity/pod_capacity.md (moves to completed/ at project close). Sections § Rec #3 and § Rec #4 cover the chosen JVM tuning and JFR settings respectively.
Copyright: © Arda Systems 2025-2026, All rights reserved