Runbook: Incident Response
Author: Miguel Pinilla Date: 2026-05-20 Last Verified: 2026-05-20 Environments: dev | stage | demo | prod
Purpose
Provide a repeatable procedure that takes a responder from “something is broken in production” to “service is restored, root cause is understood, follow-up work is filed” with minimum elapsed time and maximum certainty.
This runbook is generic — it applies to any production-impacting issue affecting an Arda component. The Worked example section walks through the 2026-05-20 operations 3.0.0 outage using each phase below.
When to Apply
Apply this runbook whenever any of the following is true:
- A deployed service is failing user-visible requests, or is in a state where it would fail them if traffic arrived.
- A scheduled job is not completing successfully.
- Observability surfaces (Sentry, CloudWatch alarms, Kubernetes events) report elevated error rates.
- A deploy has resulted in `CrashLoopBackOff`, failing health checks, or rolled-back releases.
Do not apply this runbook for routine debugging during development. The discipline below (record-keeping, observability-first, scope locking) is calibrated to high-stakes situations and is overhead for ordinary work.
Roles
For most Arda incidents one or two engineers fulfill all roles simultaneously. The roles are listed separately because the responsibilities are distinct and easy to drop when concurrent.
| Role | Responsibility |
|---|---|
| Incident Commander | Owns the incident end-to-end. Decides scope, sequencing, and when to declare resolution. Maintains the incident record. Communicates status. |
| Responder | Executes diagnostic and remediation steps. Reports findings to the Commander. |
| Scribe | Captures timeline, findings, decisions, and links to artifacts in real time. For one-person incidents, the Commander scribes as they go. |
| Communicator | Posts user-facing status (Slack #incidents, customer-facing status page if applicable). |
Phases
The state diagram below shows the six phases of incident response (Detect & Assess, Contain & Stabilize, Diagnose, Remediate, Verify, and Post-Incident) and the transitions between them, including the Verify-to-Diagnose loop when verification surfaces a residual symptom.
Most phases are not strictly sequential — diagnosis may continue while a containment patch holds the service stable, and verification may surface new symptoms that loop back to diagnosis. The point of naming them is to make state transitions explicit and prevent skipping (especially Verify and Post-Incident).
Phase 1 — Detect & Assess
The goal of this phase is to confirm there is an incident, scope its blast radius, and decide who needs to know.
Steps:
- Capture the trigger. What alert, user report, or observation prompted this? Record the original signal verbatim (paste the alert, the Slack message, the dashboard URL).
- Confirm impact. Distinguish “an error appeared in logs” from “users are affected.” Check:
  - Affected namespaces, regions, and clusters.
  - Affected request paths or scheduled jobs.
  - User-visible error rates (Sentry, ingress logs).
  - Whether redundancy is degraded (running replicas vs. desired).
- Set severity. Use the team’s standard severity levels. As a default heuristic: full outage of a critical path is severity 1, partial degradation or non-critical path is severity 2, latent risk discovered (e.g., a single crashing replica with others healthy) is severity 3.
- Open the incident record. Even for a self-resolved blip, capture a short Linear or Markdown record with timestamp, trigger, severity, and a placeholder for findings. Without a record, nothing learned in the next hour is recoverable.
- Communicate. If severity 1 or 2, post a starting message to `#incidents` (or equivalent) with what you know. Do not wait until you have a diagnosis.
- Note detection gaps. If the trigger was a user report rather than an automated alert, record the time between the first observability event and the user report. That gap is itself a process-improvement item to file in Phase 6.
Phase 2 — Contain & Stabilize
The goal of this phase is to stop the bleeding. Restore the service to a working state, even if the root cause is not yet understood. Containment is not “fix”; it is “buy time to fix safely.”
Common containment moves, in rough order of reversibility (safer first):
- Reroute traffic away from the affected component (feature flag, ingress rule, DNS).
- Roll back the most recent deploy. This is the highest-leverage move when the incident correlates with a recent release.
- Scale up healthy replicas to absorb load while degraded replicas churn.
- Restart the failing pods or processes (cheap experiment — sometimes a transient state clears).
- Apply a hotfix to live config (Helm value, ConfigMap, ExternalSecret). Reversible but riskier than rollback.
- Restore from backup — last-resort, expensive, requires confidence that the backup is uncorrupted.
Critical principles during containment:
- Single-replica trap. If you have one healthy replica left and you do anything that might kill it, you are one event away from full outage. Add capacity before experimenting.
- Drift accountability. Every manual change (`kubectl patch`, `kubectl scale`, in-cluster secret edits) drifts from the declarative source of truth. Record what you did, plan how it will be reverted, and revert it as soon as the proper fix is in.
- Helm rollback re-renders templates. `helm rollback` is not a Deployment image swap: it re-evaluates every template (Deployment, K8s Secret, ConfigMap, ExternalSecret) from the target chart revision. Resources whose content depends on chart helpers can change in ways the operator did not intend. Confirm what re-rendered before assuming containment is safe.
- Containment is not understanding. Once the bleeding stops, transition to Diagnose deliberately. Do not declare resolution.
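One way to confirm what a Helm rollback will re-render is to diff the rendered manifests of the current and target revisions before running it. The sketch below is a minimal example under stated assumptions: `helm` and `diff` are on the PATH, the function name `rerender_diff` is hypothetical, and the revision numbers in the usage comment are placeholders. `helm get manifest` prints rendered Secret data (base64-encoded), so treat the output as sensitive.

```sh
# Diff the rendered manifests of two revisions of a Helm release.
# Exit status follows diff: 0 = identical, 1 = differences, >1 = error.
rerender_diff() {
  release=$1
  ns=$2
  from_rev=$3   # revision currently deployed
  to_rev=$4     # revision the rollback would restore
  tmp=$(mktemp -d) || return 2
  status=0
  {
    helm get manifest "$release" -n "$ns" --revision "$from_rev" > "$tmp/from.yaml" &&
    helm get manifest "$release" -n "$ns" --revision "$to_rev" > "$tmp/to.yaml" &&
    diff -u "$tmp/from.yaml" "$tmp/to.yaml"
  } || status=$?
  rm -rf "$tmp"
  return "$status"
}

# Usage (revision numbers are placeholders; find real ones with `helm history`):
#   rerender_diff operations prod-operations 12 11
```

Any Secret or ConfigMap that appears in the diff will change on rollback, even if the operator only intended to swap the image.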
Phase 3 — Diagnose
The goal of this phase is to identify the actual root cause with evidence.
Diagnostic principles
These are the rules that keep diagnosis fast and honest. Each principle is stated as a discipline because each is easy to violate under pressure.
1. Observability first, source-reading last. Before reading source, before extracting JARs and decompiling bytecode, check the observability surfaces:
   - Sentry or other error reporters for the actual stack trace.
   - CloudWatch or pod logs for breadcrumb context.
   - Kubernetes events for scheduling, secrets, and probe state.
   - Metrics dashboards for the time-correlated symptoms.

   An exception with a stack trace from Sentry answers in seconds what source reading takes minutes to confirm and what bytecode analysis takes hours. Reach for the deepest tools only when the surface-level tools genuinely have no signal.
2. Verify before publishing. When you form a hypothesis that names a specific function, file, dependency, or configuration value, verify it empirically before recording it in any artifact a reader might mistake for a diagnosis:
   - Inspect the image or classpath if a missing dependency is hypothesized.
   - Query the database if a schema or row state is hypothesized.
   - Read the actual deployed config if a config value is hypothesized.

   Inference from log timing alone is a hypothesis, not a diagnosis. Writing an unverified hypothesis into a Linear ticket’s Root Cause section or into a Slack message phrased as a conclusion will mislead the next reader and waste responder cycles on the wrong fix. Label hypothesis-stage content as such until evidence converts it.
3. Discriminate competing hypotheses with a single decisive test. When two or more hypotheses survive initial inspection, identify the cheapest test that distinguishes them. Avoid investigations that confirm one without disproving the other: they leave the door open for the untested hypothesis to be the real cause.
4. Time-bound rabbit holes. Set a budget (15 to 30 minutes is typical) before starting any deep dive. When the budget expires, stop and reassess. If the dive is productive, extend deliberately. If not, return to the surface and try a different angle.
5. Stay focused on the question. When a digression reveals something interesting but tangential, capture it as a follow-up item and return to the main thread. The incident is not the time to fix every adjacent issue.
6. Record findings as you go. Each diagnostic step has an outcome: confirmed, refuted, or inconclusive. Write it down with the evidence (event ID, query result, log timestamp). The post-incident write-up becomes easy if the trail is already recorded.
Diagnostic procedure
- Establish the failure signature. What does the failure look like exactly? Error message, exit code, point in startup, time of first occurrence, environments affected. A precise signature is the unit of work the diagnosis confirms or refutes.
- Establish the asymmetry. If the failure affects some environments, pods, or requests but not others, what is different? This is often the largest single source of evidence. The asymmetry between “fails in stage” and “succeeds in dev” is more informative than ten log files from stage.
- Form hypotheses, then test them. Each hypothesis should be a specific claim about cause that can be verified or refuted. Avoid vague hypotheses (“something with the DB”) — sharpen until the test is obvious.
- Confirm with multiple independent signals. A single signal can mislead. Cross-check: stack trace plus DB row state plus deploy timeline plus environment differences. Convergence is confidence.
Phase 4 — Remediate
The goal of this phase is to apply the actual fix, separate from the containment patch.
Steps:
- Choose the smallest fix that solves the problem. Resist the urge to package improvements with the fix.
- Identify the release path. Code fix → patch release → deploy. Map dependencies up-stack (e.g., common library bump first, then service bump).
- Decide containment-vs-fix ordering. Two patterns are common:
  - Roll forward. Apply the fix in a new release that supersedes the broken one. Best when the fix is small and testable, and when database schema changes accompany the broken release (rolling back across a schema migration is brittle).
  - Roll back, then forward. Revert to the prior known-good release, then deploy a fixed version later. Required when the fix needs design or coordination time.
- Test the fix on a non-production environment before production. Skip only when production is on fire and time-to-restore beats verification fidelity.
- Deploy. Apply the fix using the team’s standard deployment path. Do not invent ad-hoc deploy mechanisms during an incident — they add unfamiliar failure modes.
Cross-repo release path
When the fix spans repositories (a library bug consumed by a service, or a chart change consumed by an infrastructure pipeline), the merge order matters. The downstream PR’s CI cannot pass until the artifact built from the upstream tag publishes to its registry (Maven, npm, GitHub Packages, container registry). The pattern is:
- Land the upstream PR. Tag the release. Wait for the artifact to publish.
- Re-trigger CI on the already-open downstream PR. CI now resolves the new artifact version. Merge.
- Deploy the downstream release.
The downstream PR can be opened in parallel with the upstream PR (with the bumped version pinned in the PR before the artifact exists) so the merge gate is a CI re-trigger rather than a separate PR-creation step. This shortens elapsed time but requires discipline: do not merge the downstream PR before the upstream tag publishes, even when CI is green from a stale cached resolution.
Composite-build testing during the fix
Use Gradle composite builds locally to test a still-uncommitted library fix from the consumer’s perspective:

```sh
./gradlew test --include-build /Users/jmp/code/arda/projects/<project>-worktrees/common-module
```

Run from the consumer’s worktree. The `--include-build` flag is a CLI argument, not a permanent `settings.gradle.kts` entry; see the kotlin-coding skill’s “Composite Build Safety” section for the rule and the reasoning. Do not commit the composite-build wiring to the repo.
Phase 5 — Verify
The goal of this phase is to confirm the service is fully restored and the symptoms are gone.
Verification is not “deploy succeeded.” It is “the original symptoms no longer occur and would not occur for a reasonable time horizon.” Cover four independent signal classes:
- Pod state. All replicas `Running 1/1`, restart count zero, image matches the new release.
- Database schema state. Target Flyway migrations applied (`success=t`) on affected module DBs. This is a positive check, not just absence-of-error.
- User-path UX. End-to-end browser-level verification using a headless Playwright script. The script logs in as a real test account (credentials injected via `op read 'op://Private/<env-creds>/{username,password}'` so they never enter the conversation transcript), navigates to the affected route, asserts expected DOM content, and asserts the absence of failure markers (e.g., the literal string `SSRM 500`). A screenshot is captured to `scratch/<env>-<feature>.png` for the post-incident record.
- Telemetry. Sentry shows the expected span shapes for the new release (e.g., wrapper-driven `pg_catalog.aurora_replica_status()` spans for the AWS Advanced JDBC Wrapper’s `failover2` and `initialConnection` plugins, plus application `http.server` transactions). Sentry fatal-event sweep returns empty under `environment:<env> release:<component>@<new-version> lastSeen:-15m level:[error,fatal]`.
Then:
- Revert drift. Any manual `kubectl patch`, ad-hoc scaling, or temporary config changes from containment must be undone before the incident is closed.
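The pod-state check above can be scripted so it yields a pass or fail rather than relying on eyeballing `kubectl get pods`. The sketch below is a minimal example that assumes single-container pods; the function names `list_pods` and `flag_unhealthy_pods` are hypothetical, and the image reference in the usage comment is a placeholder.

```sh
# Emit one tab-separated line per pod: name, ready, restartCount, image.
# Assumes each pod has a single container (index 0).
list_pods() {
  kubectl --context "$1" -n "$2" get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].ready}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
}

# Read list_pods output on stdin and print every pod that is not ready,
# has restarted, or runs an image other than the expected one.
# Exits non-zero if any pod is flagged.
flag_unhealthy_pods() {
  expected_image=$1
  awk -F '\t' -v img="$expected_image" '
    $2 != "true" || $3 != "0" || $4 != img { print $1; bad = 1 }
    END { exit bad }
  '
}

# Usage (context, namespace, and image are placeholders):
#   list_pods <ctx> <ns> | flag_unhealthy_pods <registry>/operations:<new-version>
```

Splitting listing from evaluation keeps the evaluation testable against canned input, and the same filter can be reused across environments.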
Phase 6 — Post-Incident
The goal of this phase is to capture what was learned so the same incident does not recur, and so the team’s response gets better.
Steps:
- Write the post-incident summary. Standard structure:
  - Timeline. UTC timestamps for trigger, detection, containment, diagnosis, fix, verification. Use the running scribe record as the source.
  - Impact. What users or systems were affected, for how long, how degraded.
  - Root cause. The actual cause, with evidence links.
  - Detection. How it was detected. Did the alert fire correctly? Was it noticed via a user complaint instead?
  - Response. What worked, what didn’t. Honest assessment.
  - Action items. With owners and target dates. File each as a tracked ticket.
- File follow-up tickets. Common categories:
  - Direct fix follow-ups (test coverage for the bug that escaped, observability gap that delayed detection).
  - Latent issues uncovered during diagnosis (often the most valuable).
  - Process improvements (tooling, runbook updates, alert tuning).
- Run a blameless review. For severity 1 or 2 incidents, hold a short review meeting within a week. Focus on systemic factors, not individual mistakes.
- Update this runbook. If anything in the response surprised you — a missing tool, an ambiguous instruction, a gap in coverage — fix it here.
- Close the incident record. Mark resolved with links to the fix PR, the post-incident summary, and the follow-up tickets.
Diagnostic Discipline — Anti-Patterns
These are the failure modes that cost time during the worked example below. Recognize them in yourself and stop.
| Anti-pattern | What it looks like | Cost |
|---|---|---|
| Premature publication | Recording a confident diagnosis based on inference in a Linear ticket’s Root Cause section or in a Slack message before verifying with empirical evidence. The hypothesis-stage label is missing; readers take it as the diagnosis. | Team commits to wrong fix path; real cause persists. |
| Source-reading before observability | Reaching for unzip, vendor -sources.jar lookup, source spelunking, or javap before checking Sentry, logs, dashboards. | Minutes-to-hours of analysis to discover what an exception in Sentry says in one line. |
| Hypothesis monogamy | Pursuing one hypothesis exclusively without testing competitors. | Confirmation bias; mis-attributed cause. |
| Tangent drift | Following an interesting side observation away from the main thread. | Lost focus; incident extends; original question unanswered. |
| Forgetting the scribe | Diagnosing in conversation without writing findings down. | Post-incident write-up takes three times longer; findings reconstructed from memory are less reliable. |
| Skipping verify | Declaring resolution based on “deploy succeeded” without exercising the original symptom. | Incident reopens; trust degrades. |
| Drift left behind | Manual changes from containment not reverted before closure. | Surprise drift discovered weeks later, often during the next incident. |
Tooling Reference
What to use when, ordered roughly from cheapest to most expensive. The “When to escalate” column says when to stop using the primary tool and reach for a deeper one.
| Question | Primary tool | Secondary | When to escalate |
|---|---|---|---|
| Is the service up? | kubectl get pods -n <ns> | Helm release status | Move to logs if anything not Running 1/1. |
| What is the error? | Sentry issue search (MCP: mcp__claude_ai_Sentry__search_issues) | Pod logs via kubectl logs --previous | Move to source-reading only if Sentry has no event and logs have no stack trace. |
| What is in a specific Sentry event or trace? | MCP: mcp__claude_ai_Sentry__get_sentry_resource (by URL or ID) | Sentry UI | Note: looking up a trace ID may return zero spans because production runs at a fractional tracesSampleRate; only errors override sampling. Zero spans on a trace is normal, not a smoking gun. |
| What spans are being produced? | MCP: mcp__claude_ai_Sentry__search_events (dataset=spans) | Sentry UI Performance tab | Use to confirm wrapper or framework-level instrumentation is alive for a given release. |
| What changed recently? | helm history <release> -n <ns> | git log on the relevant repo, GitHub PR list | Caveat: helm rollback re-renders all templates from the rolled-back chart revision — it is not a Deployment image swap. K8s Secrets backed by template helpers will revert too. This is the mechanism behind some rollback-time races. |
| What is in the live config? | Read the K8s Secret or ConfigMap. Use a fenced block below to avoid markdown-table pipe-escaping issues. | Render the Helm chart locally and diff | Inspect deploy artifacts (image contents) if config is ambiguous. |
| What is the DB state? | Bastion pod plus psql, via management/aurora-data-dump/bastion-discover.sh | Direct RDS console (read-only metrics) | Caveat: until Arda-cards/management#912 merges, the bash URL parser does not handle jdbc:aws-wrapper:postgresql://... and will mis-extract aws-wrapper:postgresql: as the port. Workaround: read the K8s Secret directly (see fenced block below) and pass host or port to a bastion psql pod manually. |
| What did the image actually ship? | docker pull <image> plus docker run --entrypoint /bin/sh ... -c 'ls /app/libs' | unzip -l on specific JARs | Decompile with javap -v only when the source code’s behaviour is genuinely uncertain. |
| What does a third-party library actually do? | unzip -p ~/.gradle/caches/modules-2/files-2.1/<group>/<artifact>/<version>/.../<artifact>-<version>-sources.jar <relative/path/to/file>.java | Vendor docs (docs.aws.amazon.com, flywaydb.org, etc.) | javap -v only when no -sources.jar is published. |
| Why is the wrapper or driver or framework misbehaving? | Vendor docs first | GitHub issues for the project | Source-read or decompile last. |
Live-config read pattern referenced in the table:
```sh
kubectl --context <ctx> -n <ns> get secret <name> -o json \
  | jq -r '.data."secrets.properties"' \
  | base64 -d
```

Worked Example: Operations 3.0.0 Rollout Outage (2026-05-20)
This is a real incident captured in PDEV-561, where two compounding bugs in common-module 8.4.0 surfaced as the operations 3.0.0 rollout outage. Bug A (the JdbcUri parser rejecting jdbc:aws-wrapper: URLs) was the visible crash. Bug B (the auroraInitialConnection typo in the wrapper plugin list) was latent behind Bug A and would have re-crashed every pod the moment Bug A’s fix let the URL through.
The example illustrates each phase, the diagnostic discipline above, and the anti-patterns to avoid.
Operations chart 3.0.0 was deployed to all four cloud environments on 2026-05-20:
- `dev-operations` (Alpha002): chart applied ~07:57 UTC; first Sentry fatal event 07:58:25 UTC.
- `demo-operations` (Alpha001): chart applied ~08:06 UTC.
- `stage-operations` (Alpha002): chart applied ~08:02 UTC; first Sentry fatal event 14:16:08 UTC after stage-specific timing.
- `prod-operations` (Alpha001): chart applied ~08:09 UTC.
The release bundled a common-module bump (8.3.0 → 8.4.0), two new Flyway migrations (V007 on the kanban DB, V015 on the item DB), and an opt-in to the AWS Advanced JDBC Wrapper via a jdbc:aws-wrapper: URL prefix activated by the featureFlag.hasExternalSecrets gate (true in cloud environments, false on local docker-desktop).
Phase 1 — Detect & Assess (~14:00 UTC)
Trigger: a user report of items page failure on live.app.arda.cards (“SSRM query failed: 500”). Frontend Sentry corroborated with ARDA-FRONTEND-1D (8 distinct prod users affected). The responder pivoted to backend Sentry, which surfaced PLATFORM-BE-4: 674 fatal events since 07:58 UTC across all four cloud environments, tagged release=operations@3.0.0, mechanism=UncaughtExceptionHandler, level=fatal. kubectl get pods confirmed CrashLoopBackOff across the affected namespaces.
Detection gap: the user report came roughly six hours after the first fatal event. No automated alert was configured for the relevant Sentry signal. Filed as follow-up — the absent alert is itself a process-improvement item.
Impact: pods in CrashLoopBackOff in all four cloud environments. Prod had one pre-existing healthy 2.25.1 pod that predated the 3.0.0 deploy and continued serving, so user-visible impact was intermittent rather than total. Severity 2 (partial degradation with redundancy compromised everywhere).
Phase 2 — Contain & Stabilize (~14:21 UTC)
`helm rollback operations <prev_revision> -n prod-operations` returned prod to operations-2.25.1 (the pre-3.0.0 release). The same rollback was repeated for stage, demo, and dev.
The rollback re-rendered all templates from the 2.25.1 chart revision — including the K8s operations Secret, which reverted to plain jdbc:postgresql://... URLs because the 2.25.1 chart’s jdbcUrlFor helper does not apply the wrapper prefix. This re-render is the mechanism behind a side effect documented in Sub-diagnosis B below.
Drift incurred: none material. A rolling-update surge briefly raised pod count to three; the HPA’s minReplicas = 2 reconciled on its own once the rollout completed. The Deployment’s spec.replicas and HPA settings were not modified.
Phase 3 — Diagnose
Two distinct failures were diagnosed: the original 3.0.0 crash (the proximate cause of the outage), and a post-rollback symptom on prod and stage (a side effect of the rollback itself).
Sub-diagnosis A — the 3.0.0 crash
Failure signature: Sentry PLATFORM-BE-4 captured the fatal stack from the first event onward: AppError$ArgumentValidation: str is Invalid: jdbc:aws-wrapper:postgresql://...:5432/<env>-operations.business_affiliate_db needs :// before user or host at cards.arda.common.lib.util.network.JdbcUri.<init>(JdbcUri.kt:19). Same signature in all four cloud environments, each tagged with its own environment and business_affiliate_db URL. The pod died during the first lazy access of cfg.dataSource from businessAffiliateModule.kt:50 (businessAffiliates is the first module in Main.kt’s init order to touch its lazy DataSource), well before HikariDataSource was constructed.
The investigation could have started — and on subsequent environments did start — by reading that Sentry event.
Anti-pattern committed: an earlier draft of PDEV-561’s Root Cause section recorded a different hypothesis (NoClassDefFoundError from a missing io.sentry:sentry-*.jar) based on log-timing inference from stdout. That hypothesis was not labeled as hypothesis-stage; a reader would have taken it as the diagnosis. The hypothesis was refuted on first contact with Sentry and the ticket was rewritten end-to-end. This is premature publication — see Diagnostic principles, principle 2. The framing that hypothesis was based on (“no stack trace on stdout”) was itself wrong: Sentry had the full Kotlin stack from event 1; the stdout logback warning was a real hygiene issue but not the blocker.
Test 1 — refuting the missing-jar hypothesis. Three independent signals each disprove it; any one would suffice. In ascending order of cost:
1. Sentry was receiving events from the failing pod (PLATFORM-BE-4 with `mechanism=UncaughtExceptionHandler`). If the Sentry SDK could not load, the `UncaughtExceptionHandler` registered by `Component.configureServer` could not capture, and no fatal event would have reached Sentry at all. The existence of the issue alone refutes the hypothesis and is free given that observability was already the entry point (see Diagnostic principles, principle 1).
2. Read `common-module/lib/build.gradle.kts:63-67`: `implementation(libs.sentry)` and `implementation(libs.sentry.logback)` are declared as transitive dependencies that every consumer (including operations) pulls in. Seconds to confirm.
3. Image inspection, the most authoritative confirmation but the slowest of the three:

   ```sh
   docker pull ghcr.io/arda-cards/containers/operations:3.0.0
   docker run --rm --entrypoint /bin/sh ghcr.io/arda-cards/containers/operations:3.0.0 \
     -c 'ls /app/libs | grep -iE "sentry|aws-advanced"'
   ```

   Result: `sentry-8.41.0.jar` was present (transitively pulled by common-module 8.4.0); `aws-advanced-jdbc-wrapper-4.0.1.jar` was present.
The first signal alone is sufficient — and is free — so reaching for the image pull first would itself be a mild “source-reading before observability” misstep on a smaller scale. The recipe above is recorded because it is the most rigorous confirmation; it is not the first check to run.
Test 2 — read common-module 8.4.0 source for URL parsing in JdbcUri.kt. URI(str) parses jdbc:aws-wrapper:postgresql://host:5432/db as scheme=jdbc, schemeSpecificPart=aws-wrapper:postgresql://host:5432/db. The single-level peel produces an inner URI with path=null (no :// immediately after the inner scheme), so the validation throws. JDBC URLs are not strict RFC 3986 hierarchical URIs — they are a JDBC convention layered on top of URI syntax, and the AWS wrapper’s jdbc:aws-wrapper:<inner-driver-url> form is a stacked-scheme variant of that convention. The parser needed to peel recursively. This is Bug A.
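The recursive peel that Test 2 calls for can be illustrated in a few lines. The sketch below is not the common-module fix (that lives in Kotlin in `JdbcUri`); it is a hypothetical shell function, `peel_jdbc_url`, that strips stacked `scheme:` prefixes until it reaches the `//` hierarchical part, and rejects input that never gets there.

```sh
# Peel stacked "scheme:" prefixes from a JDBC-style URL until the remainder
# starts with "//". Prints the schemes (space-separated) on line 1 and the
# hierarchical remainder on line 2. Returns 1 if no "//" part is reached.
peel_jdbc_url() {
  rest=$1
  schemes=""
  while :; do
    case $rest in
      //*) break ;;
      *:*)
        head=${rest%%:*}
        case $head in
          ""|*[!A-Za-z0-9+.-]*) return 1 ;;  # not a valid scheme token
        esac
        schemes="${schemes:+$schemes }$head"
        rest=${rest#*:}
        ;;
      *) return 1 ;;
    esac
  done
  [ -n "$schemes" ] || return 1
  printf '%s\n%s\n' "$schemes" "$rest"
}

# Usage:
#   peel_jdbc_url 'jdbc:aws-wrapper:postgresql://host:5432/db'
```

A single-level peel stops after `jdbc` and then finds `aws-wrapper:postgresql://...` where it expected `//`, which is the shape of Bug A. Looping until `//` handles plain, single-wrapper, and double-stacked URLs with the same code path.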
Test 3 — wrapper plugin code registry. The fix for Bug A would let the URL through and immediately expose whatever followed. To check for a latent second bug, the responder located the aws-advanced-jdbc-wrapper-4.0.1-sources.jar in the local Gradle cache, used unzip -p on software/amazon/jdbc/ConnectionPluginChainBuilder.java, and read the literal pluginFactoriesByCode map. The string auroraInitialConnection was not a registered key; the correct code (initialConnection) was mapped to AuroraInitialConnectionStrategyPluginFactory — the class name had been mistaken for the registry key. This is Bug B. Source reading from the vendor’s published -sources.jar was the right move for Test 3 — but only after observability had narrowed the suspect to the wrapper plugin chain. Reaching for the vendor source before Sentry would have been the anti-pattern.
Root cause (Bug A): JdbcUri.<init> rejected jdbc:aws-wrapper:postgresql://... URLs at parse time. The exception bubbled out of module init, killed the main thread, JVM exited 1.
Root cause (Bug B, latent): wrapperPlugins = "auroraInitialConnection,..." in DataSource.kt:65 would have thrown SQLException("ConnectionPluginManager.unknownPluginCode") on first connection had Bug A not killed the pod first.
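The Bug B check (every token in the plugin list must be a registered code) can be sketched as a small validator. This is a hypothetical shell illustration, not the `DataSourceWrapperPluginsTest` regression test; the registered set passed in the usage comment is a two-entry sample, and the authoritative list is the `pluginFactoriesByCode` map in the wrapper version actually shipped.

```sh
# Print each token of a comma-separated plugin list that is not in the
# space-separated set of registered codes. Returns 1 if any token is unknown.
# Tokens are compared verbatim; surrounding whitespace is not trimmed.
unknown_plugin_codes() {
  plugins=$1
  registered=" $2 "
  status=0
  old_ifs=$IFS
  IFS=,
  for code in $plugins; do
    case $registered in
      *" $code "*) ;;                           # known code
      *) printf '%s\n' "$code"; status=1 ;;     # unknown code
    esac
  done
  IFS=$old_ifs
  return "$status"
}

# Usage (the registered set here is a sample, not the full registry):
#   unknown_plugin_codes 'auroraInitialConnection,failover2' 'initialConnection failover2'
```

Checking the configured string against the registry at build or deploy time surfaces a typo like this before the first connection attempt does.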
Sub-diagnosis B — post-rollback Flyway validation failure on prod and stage
Failure signature: after the 2.25.1 rollback, pods on prod and stage entered CrashLoopBackOff on a different exception. Sentry PLATFORM-BE-6 captured 104 events under release=operations@2.25.1, environments Alpha001-prod and Alpha002-stage only; zero events from demo or dev. The exception was AppError$Infrastructure: Database could not be migrated: Validate failed: Migrations have failed validation. Detected applied migration not resolved locally: 015., with the stack pointing to cards.arda.operations.reference.item.ModuleKt.itemModule(Module.kt:103) → DataSource.newDb(DataSource.kt:49) → DbMigration.migrate-IoAF18A(DbMigration.kt:39) → Flyway.migrate(Flyway.java:186). The failure was in the item module, not businessAffiliate.
Asymmetry: prod and stage failed; demo and dev succeeded. This was the diagnostic engine. Confirmed by querying reference_item_flyway_history and resources_kanban_flyway_history across all four environments — V015 and V007 rows were present in prod and stage (installed during the 3.0.0 outage window) and absent in demo and dev.
Root cause: a Helm-rollback race. The mechanism is:
1. 3.0.0 was deployed. Its Helm template `jdbcUrlFor` rendered the K8s `operations` Secret with `jdbc:aws-wrapper:postgresql://...` URLs.
2. Pods crashed on Bug A.
3. `helm rollback` to 2.25.1 was invoked. Helm re-rendered all templates from the rolled-back chart revision. The 2.25.1 chart does not have the wrapper-prefix `jdbcUrlFor` helper, so the K8s Secret reverted to plain `jdbc:postgresql://...` URLs.
4. During the brief window between the Secret re-render and the Deployment’s ReplicaSet scaling down, a still-running 3.0.0 pod restarted. It read the now-plain URLs from its mounted Secret volume, parsed them cleanly (Bug A no longer triggered), reached the `item` module init, and ran Flyway. V015 and V007 applied successfully (`installed_on` 2026-05-20T14:23:00Z and 14:23:04Z on prod, 14:43:23Z and 14:43:29Z on stage; `success=t` on every row).
5. The Deployment update completed; the 3.0.0 ReplicaSet scaled down; the 2.25.1 ReplicaSet scaled up.
6. The 2.25.1 image does not contain `V015__item_bitemporal_indexes.sql` (it was introduced in 3.0.0). Flyway validation on every 2.25.1 pod restart now sees an applied migration with no local file and throws.
Dev and demo did not hit the race window — their flyway_schema_history was unchanged from pre-3.0.0, so Flyway validation passed and 2.25.1 came up cleanly.
The shape here is general: any release that ships a Flyway migration is hard to roll back via helm rollback alone, because the rolled-back image’s resources directory will fail validation against migrations that were already applied. The Arda practice of keeping database migrations backwards-compatible mitigates this — the supported recovery path is to roll forward, not roll back across a migration. That is what was done for this incident.
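Before rolling back across a release that shipped migrations, it can help to check whether the target image would fail Flyway validation. The sketch below compares the versions already applied in the database against the versions the rollback image ships. The function name `unresolved_migrations` and both input files are hypothetical; how each list is produced (for example a `psql` query on the module's Flyway history table, and a listing of the image's migration directory) is left as an assumption in the comments.

```sh
# $1: file of applied migration versions, one per line
#     (e.g. from: psql -At -c "SELECT version FROM <history_table> WHERE success")
# $2: file of versions shipped in the rollback target image, one per line
# Prints applied versions the image cannot resolve.
# Returns 0 if rollback-safe, 1 if any version is unresolved, 2 on error.
unresolved_migrations() {
  rc=0
  missing=$(grep -Fxv -f "$2" "$1") || rc=$?
  case $rc in
    0) printf '%s\n' "$missing"; return 1 ;;  # grep found unmatched lines
    1) return 0 ;;                            # every applied version is shipped
    *) return 2 ;;                            # grep itself failed
  esac
}

# Usage (file names are placeholders):
#   unresolved_migrations applied.txt shipped.txt
```

For this incident the check would have printed `015` for the item DB on prod and stage after the race, and nothing on dev and demo.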
Phase 4 — Remediate
The release path was two coordinated PRs:
- Arda-cards/common-module#173: common-module 8.4.1. Carries the `JdbcUri` parser fix (Bug A: rewrites the parser to peel layered schemes recursively, with a regression test stack covering plain, single-wrapper, double-stacked-wrapper, and malformed wrapper URLs) and the `wrapperPlugins` typo fix (Bug B: `auroraInitialConnection` → `initialConnection`, plus a reflective regression test `DataSourceWrapperPluginsTest` asserting every token in the wrapper plugin list resolves in `ConnectionPluginChainBuilder.pluginFactoriesByCode`). References PDEV-561.
- Arda-cards/operations#174: operations 3.0.1. One-line `gradle/libs.versions.toml` bump from `arda-common-version = "8.4.0"` to `"8.4.1"`. Closes PDEV-561.
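To illustrate what "peeling layered schemes recursively" can look like, here is a minimal Python sketch. It is not the Kotlin `JdbcUri` implementation from common-module. It assumes the only wrapper token is `aws-wrapper` and the only base scheme is `postgresql`; the names `peel`, `ParsedJdbcUrl`, `WRAPPER_SCHEMES`, and the sample hosts are hypothetical.

```python
from typing import List, NamedTuple

WRAPPER_SCHEMES = {"aws-wrapper"}  # assumption: the set of recognised wrapper tokens
BASE_SCHEME = "postgresql"         # assumption: the only supported driver scheme


class ParsedJdbcUrl(NamedTuple):
    wrappers: List[str]  # wrapper layers in the order they were peeled
    remainder: str       # everything after "postgresql:", e.g. "//host:5432/db"


def peel(url: str) -> ParsedJdbcUrl:
    """Strip the 'jdbc:' prefix, then peel wrapper layers until the base scheme."""
    prefix = "jdbc:"
    if not url.startswith(prefix):
        raise ValueError(f"not a JDBC URL: {url!r}")
    return _peel_layers(url[len(prefix):], [])


def _peel_layers(rest: str, wrappers: List[str]) -> ParsedJdbcUrl:
    scheme, sep, tail = rest.partition(":")
    if not sep:
        raise ValueError("no base scheme found after the wrapper layers")
    if scheme == BASE_SCHEME:
        if not tail.startswith("//"):
            raise ValueError("base scheme must be followed by '//'")
        return ParsedJdbcUrl(wrappers, tail)
    if scheme in WRAPPER_SCHEMES:
        # Recurse on what follows the wrapper token, recording the layer.
        return _peel_layers(tail, wrappers + [scheme])
    raise ValueError(f"unknown scheme layer: {scheme!r}")


if __name__ == "__main__":
    for candidate in (
        "jdbc:postgresql://db.example:5432/item",
        "jdbc:aws-wrapper:postgresql://db.example:5432/item",
        "jdbc:aws-wrapper:aws-wrapper:postgresql://db.example:5432/item",
    ):
        print(peel(candidate))
```

The four URL shapes named in the regression suite map onto this sketch as zero, one, or two recorded wrapper layers, plus a `ValueError` for a wrapper token that is not followed by a base scheme.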
Cross-repo merge gate: operations#174 had been opened in parallel with common-module#173 with the bumped pin already in place, but its CI failed until common-module v8.4.1 was published. Once it was published, re-triggering CI on operations#174 resolved 8.4.1 from GitHub Packages, CI went green, and the PR merged.
Composite-build testing during the fix: the consumer-side test suite was run against the still-uncommitted common-module fix from the operations worktree using `./gradlew test --include-build /Users/jmp/code/arda/projects/operations-hot-fix-worktrees/common-module`. The `--include-build` setting was intentionally not committed to `settings.gradle.kts` per the kotlin-coding skill’s “Composite Build Safety” rule.
3.0.1 was then deployed to dev → stage → demo → prod sequentially, verifying each environment per Phase 5 before promoting.
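The promote-only-after-verify sequencing can be expressed as a small gate. This is a hedged Python sketch of the control flow only: the `deploy` and `verify` callables are hypothetical stand-ins for whatever deploy tooling and Phase 5 checks are actually used.

```python
from typing import Callable, List, Sequence


def promote(
    environments: Sequence[str],
    deploy: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Deploy in order and stop at the first environment that fails verification.

    Returns the environments that were deployed AND verified.
    """
    verified: List[str] = []
    for env in environments:
        deploy(env)
        if not verify(env):
            break  # do not promote past an unverified environment
        verified.append(env)
    return verified


if __name__ == "__main__":
    order = ["dev", "stage", "demo", "prod"]
    done = promote(order, deploy=lambda env: print("deploying", env), verify=lambda env: True)
    print("verified:", done)
```

If the returned list is shorter than the input, the first missing environment is where the rollout halted.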
Phase 5 — Verify
After 3.0.1 deployment on each environment, all four signal classes were exercised:
- Pod state: all replicas `Running 1/1`, restart count zero, image matched `operations:3.0.1`.
- Database schema state: positive Flyway-history queries confirmed V015 (item) and V007 (kanban) applied with `success=t` on each affected module DB. Orphan-index check on `pg_index.indisvalid` for `%bitemporal%` indexes returned `t` for all rows on each environment.
- User-path UX: Playwright headless Chromium script with 1Password-injected Cognito credentials logged in, navigated to `/items`, asserted AG Grid rows were visible and the literal string `SSRM 500` was absent from the rendered DOM. Captured a screenshot per environment to `scratch/<env>-items.png`. Dev / stage / prod returned 3 / 5 / 6 rows respectively; demo returned 0 rows (the demo tenant is empty), cross-checked via Sentry confirming the items query had executed cleanly.
- Telemetry: Sentry span query (`mcp__claude_ai_Sentry__search_events` with `dataset=spans` and `environment:<env>`) confirmed wrapper-driven topology probes (`pg_catalog.aurora_replica_status()`, `pg_catalog.pg_is_in_recovery()`) were emitting under the new release, and `http.server` transactions were instrumented end-to-end. Fatal-event Sentry sweep (`environment:<env> release:operations@3.0.1 lastSeen:-15m level:[error,fatal]`) returned empty on every environment.
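The pod-state check in the list above can be automated against the JSON that `kubectl get pods -o json` emits. The sketch below is illustrative rather than the script used in the incident. It assumes the image reported in the pod status ends with the expected `name:tag` (registries usually prepend a host and path), and the function name `pod_problems` is hypothetical.

```python
import json
from typing import List


def pod_problems(pod_list_json: str, expected_image_suffix: str) -> List[str]:
    """Return human-readable problems; an empty list means the pod-state check passed."""
    pods = json.loads(pod_list_json).get("items", [])
    if not pods:
        return ["no pods found"]  # zero pods must not count as healthy
    problems: List[str] = []
    for pod in pods:
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase != "Running":
            problems.append(f"{name}: phase is {phase}")
        for container in status.get("containerStatuses", []):
            if not container.get("ready", False):
                problems.append(f"{name}: container not ready")
            restarts = container.get("restartCount", 0)
            if restarts != 0:
                problems.append(f"{name}: restart count is {restarts}")
            image = container.get("image", "")
            if not image.endswith(expected_image_suffix):
                problems.append(f"{name}: unexpected image {image}")
    return problems


if __name__ == "__main__":
    # Hypothetical sample shaped like `kubectl get pods -o json` output.
    sample = json.dumps({"items": [{
        "metadata": {"name": "operations-abc"},
        "status": {"phase": "Running", "containerStatuses": [
            {"ready": True, "restartCount": 0, "image": "registry.example/operations:3.0.1"},
        ]},
    }]})
    print(pod_problems(sample, "operations:3.0.1"))
```

Returning a list of problems instead of a boolean keeps the output usable as evidence in the incident record.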
Drift revert: none material to revert (no manual changes were applied during containment; the rolling-update surge reconciled on its own).
Phase 6 — Post-Incident
Filed tickets:
- PDEV-561 — wrapper opt-in startup crash. Two compounding bugs (A: JdbcUri parser; B: wrapperPlugins typo). Closed by operations 3.0.1.
- PDEV-554 — original Sentry auto-filed “JdbcUri rejects wrapper URLs” issue. Subsumed by PDEV-561.
- PDEV-558 — Sentry auto-filed “Database could not be migrated” issue from Sub-diagnosis B. Resolution: rolled forward to 3.0.1.
Process improvements captured:
- Sibling fix in `management/aurora-data-dump/bastion-discover.sh` (Arda-cards/management#912). The bash URL parser had the same hardcoded `jdbc:postgresql://` strip as the broken `JdbcUri.kt`. With operations 3.0.1 rendering wrapper-prefixed secrets, the bastion tooling stopped working until this fix landed. Rewritten with the same recursive scheme-peel pattern.
- Common-module read-only transaction audit (PDEV-562 / PDEV-563 / PDEV-564 under parent PDEV-442). Tracks annotating `inTransaction(db, readOnly = true)` everywhere it applies, so the wrapper’s `readWriteSplitting` plugin (now active) can actually route read traffic to Aurora reader instances.
- Empty-path URL parity (PDEV-567). The bash parser was tightened during review of management#912 to reject URLs with no `/db` path segment. The Kotlin peer in `common-module` has the same gap.
- Missing alert rule. The user report came ~6 hours after the first Sentry fatal event. A Sentry alert rule on `level=fatal mechanism=UncaughtExceptionHandler` events tagged to the operations project would have surfaced the outage within minutes. Filed as follow-up.
- CI coverage gap. Local `make localInstall` never exercises the `jdbc:aws-wrapper:` URL path (the `featureFlag.hasExternalSecrets` gate is off on docker-desktop). A CI integration test that boots the operations image against an Aurora-compatible test target would have caught Bug A and Bug B before deploy. Filed as follow-up.
- Logback `STDOUT` appender unreferenced. The `STDOUT` appender in operations’ `logback.xml` is defined but not attached to ROOT. Uncaught exception stack traces from `main` therefore never reach stdout; they only reach Sentry. This was a real hygiene issue that complicated the early diagnostic dive (the stdout-only signature looked more confusing than it should have). The fix did not ship in 3.0.1. Filed as an open follow-up.
- Diagnostic-discipline lessons. The “premature publication” and “source-reading before observability” anti-patterns documented above were both committed during this incident’s first hour. They are codified here so the next responder recognizes them.
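The empty-path check tracked under PDEV-567 is small enough to sketch. The example below is in Python for illustration and is neither the bash parser nor its Kotlin peer. It assumes the `jdbc:` prefix and any wrapper layers have already been removed, so the input looks like `postgresql://host:5432/db`; the function name `database_name` and the sample hosts are hypothetical.

```python
from urllib.parse import urlsplit


def database_name(inner_url: str) -> str:
    """Return the database name, rejecting URLs with no host or no /db path segment."""
    parts = urlsplit(inner_url)
    if not parts.hostname:
        raise ValueError(f"URL has no host: {inner_url!r}")
    # "/item" -> "item"; "" and "/" both collapse to an empty name.
    name = parts.path.strip("/")
    if not name:
        raise ValueError(f"URL has no /db path segment: {inner_url!r}")
    return name


if __name__ == "__main__":
    print(database_name("postgresql://db.example:5432/item"))
    try:
        database_name("postgresql://db.example:5432")
    except ValueError as err:
        print("rejected:", err)
```

Failing fast here turns a silent connection to a default database into an explicit configuration error.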
Templates
- Operational Runbook (for new operational runbooks).
See Also
- SRE Overview (section landing page).
- Runbook: JVM Profiling (sibling runbook for performance investigation).
- Runbook: Amazon Creators API Onboarding (sibling runbook for partition-vault provisioning).
- Bastion database access scripts live in the `management` repo at `aurora-data-dump/`. Credential extraction and the psql-via-ephemeral-pod pattern are used in Phase 5’s DB-state verification step.
Copyright: © Arda Systems 2025-2026, All rights reserved