Runbook: JVM Profiling — JFR and Heap Dumps

Author: Miguel Pinilla
Date: 2026-05-15
Last Verified: 2026-05-15
Environments: dev | stage | demo | prod
Linear: PDEV-488

Purpose

This runbook documents how to inspect and retrieve JVM diagnostic artifacts from Arda back-end pods on EKS:

Continuous JFR recording — every pod runs a rolling JFR recording (settings=profile, maxsize=200M, maxage=2h) writing to ephemeral pod storage. On graceful shutdown the rolling window is dumped to /tmp/{prefix}-exit.jfr.
On-demand JFR dump — capture the current rolling window without stopping the pod, then copy it out.
Heap dumps — captured automatically on OutOfMemoryError to /tmp/{prefix}-heap.hprof, or on demand via jcmd.

The {prefix} token is defined in Conventions used below and resolves to {env}-{component} (e.g., prod-operations).

The setup is provisioned by the operations Helm chart via compute.javaToolOptions and is enabled in all four environments from day 1 (decision DR-002 in PDEV-488). Other back-end components inherit the same shape once their charts adopt the javaToolOptions plumbing.

Personas

Persona	Who	Background
`sre`	On-call SRE / DevOps engineer	`kubectl` access to the target EKS context (`Alpha001` for demo/prod, `Alpha002` for dev/stage); local JDK ≥ 21 with `jcmd` and JDK Mission Control installed.

Prerequisites

The target pod is running (not CrashLoopBackOff). For dumps from a crashed pod, see Retrieving artifacts from a terminated pod below.

Your kubectl context points at the cluster hosting the pod:

# demo / prod live on Alpha001
kubectl config use-context Alpha001
# dev / stage live on Alpha002
kubectl config use-context Alpha002

The namespace follows {env}-{component} (e.g., prod-operations).
The JVM banner shows the agent / recording flags. Confirm with:
```
kubectl -n {namespace} logs -l app={component} --tail 500 \
  | grep 'JAVA_TOOL_OPTIONS'
```
Expect -XX:StartFlightRecording=… and (if Sentry is active) -javaagent:/app/agents/sentry-otel-agent.jar in the printed banner.
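The environment-to-context mapping above can be captured in a small helper so the wrong cluster is harder to pick by accident. This is a minimal sketch: the function name kube_context_for_env is illustrative and not part of any Arda tooling; only the mapping itself comes from this runbook.

```shell
# Map an environment name to the kubectl context that hosts it.
# Mapping taken from this runbook: demo/prod -> Alpha001, dev/stage -> Alpha002.
# kube_context_for_env is a hypothetical helper name.
kube_context_for_env() {
  case "$1" in
    demo|prod) echo "Alpha001" ;;
    dev|stage) echo "Alpha002" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Usage (switches the current kubectl context):
#   kubectl config use-context "$(kube_context_for_env prod)"
```

An unknown environment returns a non-zero status, so the command substitution in the usage line yields an empty context name and kubectl refuses to switch.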

Conventions used below

{namespace} — Kubernetes namespace, e.g., prod-operations.
{component} — the Helm app label, e.g., operations.
{prefix} — the artifact prefix, {env}-{component} (e.g., prod-operations). Computed by the chart’s application.artifactPrefix helper and used in both the JFR filename and -XX:HeapDumpPath.

POD — a single pod name. Pick one with:

POD=$(kubectl -n {namespace} get pod -l app={component} \
  -o jsonpath='{.items[0].metadata.name}')

The application runs as PID 1 inside the container, so all jcmd invocations target 1.
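To make the token substitution concrete, the sketch below prints the in-pod artifact paths that this runbook refers to for one environment and component. The helper name artifact_paths is hypothetical; the path shapes are the ones used in the sections of this runbook.

```shell
# Print the in-pod artifact paths used in this runbook for one env/component.
# artifact_paths is a hypothetical helper; {prefix} resolves to {env}-{component}.
# The body runs in a subshell so its variables do not leak into the caller.
artifact_paths() (
  env_name=$1
  component=$2
  prefix="${env_name}-${component}"
  printf '%s\n' \
    "/tmp/${prefix}-exit.jfr" \
    "/tmp/${prefix}-ondemand.jfr" \
    "/tmp/${prefix}-heap.hprof" \
    "/tmp/${prefix}-heap-ondemand.hprof"
)

# Usage:
#   artifact_paths prod operations
```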

Section 1: Verify JFR is recording

kubectl -n {namespace} exec $POD -- jcmd 1 JFR.check

Expected output (recording name matches {prefix}):

Recording 1: name={prefix} duration=0s (running)

If no recording is listed:

The JVM was not started with the expected JAVA_TOOL_OPTIONS. Re-check the banner from the Prerequisites step.
The chart’s compute.javaToolOptions may be empty for this env. Inspect with helm get values {release} -n {namespace}.
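For scripted checks, the JFR.check output can be tested rather than read by eye. The sketch below assumes the output line has the shape shown in the expected output above (Recording N: name=... (running)); jfr_recording_running is a hypothetical helper name.

```shell
# Succeeds if JFR.check output on stdin lists a running recording named $1.
# jfr_recording_running is a hypothetical helper. The space after the name
# prevents a prefix such as "prod-ops" from matching "prod-ops-extra".
jfr_recording_running() {
  grep -Eq "^Recording [0-9]+: name=$1 .*\(running\)"
}

# Usage:
#   kubectl -n {namespace} exec "$POD" -- jcmd 1 JFR.check \
#     | jfr_recording_running {prefix} || echo "no running recording"
```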

Handling JFR recordings and heap dumps

Treat retrieved artifacts as production data. JFR recordings can include request URLs, header values, thread-local context and environment variables; heap dumps additionally include every live object — request bodies, decrypted secrets, access tokens, customer PII, and any in-flight cryptographic material. The same access rules that apply to production database extracts apply here.

Operator obligations:

Storage — keep the file on an Arda-managed laptop encrypted at rest (FileVault / LUKS). Do not upload to personal cloud storage, shared Google Drive folders, or third-party analysis sites.

Sharing — if you need to share the artifact for analysis, use a 1Password Document in the relevant Arda-{Env}OAM vault, or op://Arda-SystemsOAM/... for cross-env investigations. Never attach a .jfr or .hprof to a Slack DM or a Linear comment.

Retention — delete the local copy as soon as the investigation closes (and at most within 30 days). When the file is shared via 1Password, set an expiry on the Document.

Incident write-ups — quote conclusions, not raw frames. If a stack trace or class histogram excerpt is needed in a post-mortem, redact request bodies and any field that could contain a secret.
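The retention rule can be checked mechanically on the operator's laptop. The sketch below only lists candidate files and leaves deletion to the operator; stale_artifacts is a hypothetical helper name, and it assumes file modification time is an acceptable proxy for when the artifact was retrieved.

```shell
# List .jfr and .hprof files under a directory whose modification time is
# more than 30 days old. Prints paths only; it never deletes anything.
# stale_artifacts is a hypothetical helper; the directory defaults to ".".
stale_artifacts() {
  find "${1:-.}" -type f \( -name '*.jfr' -o -name '*.hprof' \) -mtime +30
}

# Usage:
#   stale_artifacts ~/investigations
```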

Section 2: On-demand JFR dump

Use this when you want a snapshot of the current rolling window — e.g., to investigate a live performance regression without restarting the pod.

# 1. Dump the current rolling window to a file inside the pod
kubectl -n {namespace} exec $POD -- jcmd 1 JFR.dump \
  name={prefix} filename=/tmp/{prefix}-ondemand.jfr

# 2. Copy the file out
kubectl -n {namespace} cp \
  $POD:/tmp/{prefix}-ondemand.jfr \
  ./{prefix}-ondemand-$(date +%Y%m%d-%H%M%S).jfr

# 3. (Optional) tidy up inside the pod
kubectl -n {namespace} exec $POD -- rm /tmp/{prefix}-ondemand.jfr

Notes:

JFR.dump is non-destructive: the rolling recording keeps running.
The dump contains whatever the rolling buffer holds at the moment of the call, bounded by maxage=2h and maxsize=200M. Older data has already been discarded.
Files land in pod-local /tmp (Fargate ephemeral storage). Copy them out before the pod terminates.
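Before deleting the file inside the pod, it is worth confirming that the local copy is a JFR file at all. The sketch below checks only the leading magic bytes (JFR chunks begin with "FLR" followed by a zero byte), so it catches empty files and copies that contain error text, not truncation further into the file. looks_like_jfr is a hypothetical helper name.

```shell
# Succeeds if the file starts with the JFR chunk magic "FLR".
# Only the first three bytes are compared, because the fourth byte is NUL and
# cannot be held in a shell variable. This is a cheap first check, not a
# full validation of the recording.
looks_like_jfr() {
  [ "$(dd if="$1" bs=1 count=3 2>/dev/null)" = "FLR" ]
}

# Usage:
#   looks_like_jfr ./{prefix}-ondemand-20260515-101500.jfr || echo "copy looks wrong"
```

For a deeper check, jfr summary on the copied file parses the whole recording.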

Section 3: Retrieve the exit JFR after a graceful shutdown

The recording is configured with dumponexit=true, so a graceful pod shutdown writes the rolling window to /tmp/{prefix}-exit.jfr before the JVM exits. To capture it:

# Inspect the file inside the pod just before / during termination
kubectl -n {namespace} exec $POD -- ls -lh /tmp/{prefix}-exit.jfr

# Copy it out while the pod is still in Terminating state
kubectl -n {namespace} cp \
  $POD:/tmp/{prefix}-exit.jfr \
  ./{prefix}-exit-$(date +%Y%m%d-%H%M%S).jfr

Race condition: Fargate reaps the pod’s ephemeral storage as soon as the container exits. If you need the exit JFR, you must either:

Be quick — kubectl cp during the Terminating window, or
Use the on-demand flow in Section 2 before triggering shutdown.

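For a hard crash (OutOfMemoryError with -XX:+ExitOnOutOfMemoryError, or a SIGKILL), the exit dump is not written and the rolling JFR is lost. On OutOfMemoryError the JVM still writes the automatic heap dump, but retrieving it is subject to the limitation described in Section 4.1.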

Section 4: Heap dumps

4.1 Automatic OOM heap dump — current limitation

The chart enables -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=/tmp/{prefix}-heap.hprof. On any OutOfMemoryError the JVM writes a heap dump and then exits (-XX:+ExitOnOutOfMemoryError).

The auto-dump is, in practice, unrecoverable today. The Deployment uses restartPolicy: Always, so when the JVM exits the container restarts in place inside the same pod. The pod never enters Terminating; it transitions through CrashLoopBackOff while the new container starts on a fresh writable layer. /tmp from the dead container — including {prefix}-heap.hprof — is no longer reachable via kubectl cp or kubectl exec once the restart has happened.

That makes -XX:+HeapDumpOnOutOfMemoryError useful only as a belt-and-braces signal that the OOM occurred: the JVM logs the dump path before exiting, which confirms the failure mode for the operator reading CloudWatch. The bytes themselves are not retrievable.

Three options when an OOM has just fired in production:

(Preferred) On-demand pre-OOM dump. If you can see the heap filling up — Sentry alerts on memory pressure, latency spikes, or the JVM’s pre-OOM GC churn — take an on-demand dump via Section 4.2 before the OOM trips. This is the only path that reliably captures the offending object graph from the production replica.
(Fallback) Reproduce in dev or stage. Replay the offending request (Sentry breadcrumbs / structured logs identify it) against a dev pod with a tighter memory limit, then capture there via Section 4.2 or the auto-OOM dump (in dev the artifact still lives only inside the dying container, so chain it with a kubectl exec tail -f or pre-arm a kubectl cp loop).
(Future, not in scope of PDEV-488) Provision a PersistentVolume or sidecar at /tmp so HeapDumpPath survives container restart. This is a chart change, not an operator workaround — track it on the SRE backlog if production OOMs become recurrent.

See Retrieving artifacts from a terminated pod below for the broader recovery story.

4.2 On-demand heap dump

For live investigations (e.g., suspected leak that hasn’t yet OOM’d):

# 1. Trigger the dump (large pods may take 10–60s)
kubectl -n {namespace} exec $POD -- jcmd 1 GC.heap_dump \
  /tmp/{prefix}-heap-ondemand.hprof

# 2. Copy it out
kubectl -n {namespace} cp \
  $POD:/tmp/{prefix}-heap-ondemand.hprof \
  ./{prefix}-heap-$(date +%Y%m%d-%H%M%S).hprof

# 3. Clean up
kubectl -n {namespace} exec $POD -- rm /tmp/{prefix}-heap-ondemand.hprof

GC.heap_dump triggers a full GC and writes a .hprof file the size of the live heap — for a 2 GiB prod pod expect roughly the same on disk. Confirm /tmp has headroom (Fargate ephemeral is 20 GiB by default) and that the pod is healthy enough to absorb the stop-the-world pause before running on a busy prod replica.
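The headroom check can be scripted against df output from the pod. The sketch below assumes the container image ships df and that the operator supplies the expected dump size in MiB; tmp_has_headroom is a hypothetical helper name.

```shell
# Reads `df -Pk <dir>` output on stdin and succeeds if the available space
# (4th column of the data row, in KiB) is at least $1 MiB.
# tmp_has_headroom is a hypothetical helper; $1 is the operator's estimate
# of the live heap size. The body runs in a subshell to keep variables local.
tmp_has_headroom() (
  needed_mib=$1
  avail_kib=$(awk 'NR == 2 { print $4 }')
  [ -n "$avail_kib" ] && [ "$avail_kib" -ge $((needed_mib * 1024)) ]
)

# Usage (2048 MiB for a 2 GiB heap):
#   kubectl -n {namespace} exec "$POD" -- df -Pk /tmp | tmp_has_headroom 2048 \
#     || echo "not enough space in /tmp"
```

The -P flag keeps each filesystem on a single line, which is what makes the fixed column position safe to parse.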

Section 5: Open the artifacts locally

JFR

JDK Mission Control (JMC) — an OpenJDK project, free. Download from https://www.oracle.com/java/technologies/jdk-mission-control.html or brew install --cask jdk-mission-control. Open the .jfr file with File → Open File. The Automated Analysis page surfaces the usual suspects (allocation hotspots, lock contention, GC pressure).

jfr CLI (bundled with the JDK) for quick triage:

jfr summary {prefix}-exit-*.jfr
jfr print --events CPULoad,GarbageCollection {prefix}-exit-*.jfr | less

Heap dumps

Eclipse MAT (Memory Analyzer Tool) — best free option for large heaps. Download from https://eclipse.dev/mat/downloads.php or brew install --cask memoryanalyzer on macOS (verify with brew search — the homebrew-cask token has shifted historically).
VisualVM — handles smaller heaps comfortably. It has not shipped with the Oracle JDK since JDK 9, so install it separately from https://visualvm.github.io/.
For huge dumps, prefer MAT and increase its workbench heap (edit MemoryAnalyzer.ini → -Xmx8g or higher) before opening.
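Loading a multi-gigabyte file into MAT only to find that the copy was bad is slow, so a header check first can save time. HPROF files begin with the ASCII string "JAVA PROFILE" followed by a version number. The sketch below tests only that header and does not detect truncation later in the file; looks_like_hprof is a hypothetical helper name.

```shell
# Succeeds if the file starts with the HPROF header text "JAVA PROFILE".
# This is a cheap sanity check on the copy, not a full validation of the dump.
looks_like_hprof() {
  [ "$(dd if="$1" bs=1 count=12 2>/dev/null)" = "JAVA PROFILE" ]
}

# Usage:
#   looks_like_hprof ./{prefix}-heap-20260515-101500.hprof || echo "copy looks wrong"
```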

Section 6: Retrieving artifacts from a terminated pod

If the pod has already exited and Fargate reaped /tmp, the on-pod artifacts are gone. Options in order of preference:

Reproduce in stage / dev with the same input or load shape, then capture from the live pod via Section 2 or 4.2.
Inspect CloudWatch logs and Sentry traces for the period leading up to the crash. The Sentry OpenTelemetry agent (decision DR-002) surfaces the request that triggered the OOM in nearly every case.
Lower the memory limit in a dev replica and provoke the OOM deliberately. Because restartPolicy: Always means the container restarts in place (no Terminating state), the auto-OOM file is still unreachable post-restart — pre-arm a watcher instead:
```
# In a separate terminal, before triggering the OOM:
while true; do
  kubectl -n {namespace} cp \
    $POD:/tmp/{prefix}-heap.hprof \
    ./{prefix}-heap-$(date +%Y%m%d-%H%M%S).hprof 2>/dev/null \
    && break
  sleep 1
done
```
The kubectl cp call succeeds once the dump file exists and the container is still running. Because the copy can start before the JVM has finished writing, confirm that the copied file opens in MAT and repeat the exercise if it is truncated. If the copy never succeeds, the JVM exited too quickly: fall back to an on-demand dump (Section 4.2) just before tripping the OOM, or provision a persistent /tmp mount.
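The watcher loop above runs until it is interrupted. A bounded variant is sketched below as a generic retry helper. It assumes, as the loop above does, that kubectl cp exits non-zero while the source file is missing; the helper name retry and its argument order are illustrative.

```shell
# Run a command until it succeeds, at most $1 times, sleeping $2 seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
# retry is a hypothetical helper; the retry_* variables are global to stay
# compatible with plain POSIX sh.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while ! "$@"; do
    if [ "$retry_n" -ge "$retry_max" ]; then
      return 1
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}

# Usage (give up after about two minutes):
#   retry 120 1 kubectl -n {namespace} cp \
#     "$POD:/tmp/{prefix}-heap.hprof" ./{prefix}-heap.hprof
```

A fixed local filename is used in the usage line so that repeated attempts overwrite one file instead of leaving a trail of partial copies.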

Other useful operations

The flight-recorder and jcmd surface much more than what’s needed for day-to-day capture. Reach for these when the dump itself doesn’t tell the whole story:

Operation	Command (run inside the pod)	Why / Reference
List available diagnostic commands	`jcmd 1 help`	Built-in catalogue of `jcmd` subcommands; `jcmd 1 help <command>` prints the options of one command.
Thread dump	`jcmd 1 Thread.print`	Classic deadlock / hung-thread triage. Cheaper than a heap dump.
Class histogram	`jcmd 1 GC.class_histogram`	Quick “what’s eating the heap” view without dumping.
JIT compilation queue	`jcmd 1 Compiler.queue`	Useful when latency p99 spikes correlate with deploys.
Native memory tracking	`jcmd 1 VM.native_memory summary`	Requires `-XX:NativeMemoryTracking=summary` at JVM start (not enabled by default in our charts).
JFR custom event recording	`jcmd 1 JFR.start name=ad-hoc settings=default duration=60s filename=/tmp/ad-hoc.jfr`	When the default `profile` template hides what you’re chasing.
Query a `.hprof` with OQL	Eclipse MAT OQL Console (runs locally, not in the pod)	https://help.eclipse.org/latest/topic/org.eclipse.mat.ui.help/concepts/oqlsyntax.html

Deeper references:

JDK Flight Recorder docs — https://docs.oracle.com/en/java/javase/21/jfapi/index.html
jcmd reference — https://docs.oracle.com/en/java/javase/21/docs/specs/man/jcmd.html
JDK Mission Control user guide — https://docs.oracle.com/en/java/java-components/jdk-mission-control/9/user-guide/
Eclipse MAT documentation — https://eclipse.dev/mat/documentation/
PDEV-488 pod-capacity analysis — the originating analysis lives in roadmap/in-progress/pod-capacity/pod_capacity.md (moves to completed/ at project close). Sections § Rec #3 and § Rec #4 cover the chosen JVM tuning and JFR settings respectively.