Operations Sentry — Learnings
These lessons are worth carrying into the PDEV-533 (accounts-component) Sentry adoption and any future component-level instrumentation.
Sentry JVM SDK 8.x has no per-request session emission
The Sentry JVM SDK at 8.41.0 does not emit per-request sessions for Ktor server handlers, despite the published Release Health docs implying server frameworks emit them out of the box. enableAutoSessionTracking=true only produces one session per JVM lifecycle. The supersession in dt-005/dt-004 documents the empirical discovery during dev verification: zero sessions on the Sentry Release Health tab while requests were flowing normally.
Mitigation now in place: Main.kt installs a custom SentryRequestSession Ktor application plugin that calls Sentry.startSession() on onCall and Sentry.endSession() on ResponseSent, guarded by Sentry.isEnabled(). The plugin is idempotent (pluginOrNull check) so common-module’s auto-install does not collide. Until the official sentry-ktor server plugin is published, every consumer needs this snippet.
Knowledge-base entry: needs to land in operations/knowledge-base/sentry-ktor-session-plugin.md.
The bundled OTel Java Agent will hammer localhost:4318 by default
The sentry-otel-agent.jar shipped under /app/agents/ is a full OpenTelemetry Java Agent with the OTLP HTTP exporter wired in. With no OTLP collector running, it logs an error every ~10 seconds attempting POST http://localhost:4318/v1/traces. The Sentry-native exporter is a separate pipeline and is unaffected.
Mitigation: set OTEL_TRACES_EXPORTER=none, OTEL_METRICS_EXPORTER=none, OTEL_LOGS_EXPORTER=none in the pod env. The agent still loads and instruments, but the upstream OTLP exporter is silenced. Captured in helm/templates/deployment.yaml.
ESO key templating: use .Values.global.purpose, not .Release.Namespace
The first dev deploy failed silently because the ExternalSecret used .Release.Namespace (Alpha002-dev-operations) to build the Secrets Manager key, while the underlying secret is partition-scoped (Alpha002-dev-SentryScrubSalt). ESO’s reconcile loop deleted the existing Kubernetes secret when the lookup returned 404, and the pod started without SENTRY_SCRUB_SALT.
Rule: partition-scoped infra (DBs, partition secrets, S3 buckets) → .Values.global.purpose. Component-scoped infra (per-namespace TLS certs, in-cluster service tokens) → .Release.Namespace. The two are not interchangeable. Documented in the rewritten how-to.
JSON-shaped Secrets Manager values need property: in ExternalSecret
SentryScrubSalt is stored in Secrets Manager as a JSON document {"salt": "..."} (the CDK pattern across Arda partition secrets). ESO will project the whole JSON blob into the Kubernetes secret data field if no property: is supplied, so SENTRY_SCRUB_SALT would end up holding the literal string {"salt":"..."} instead of the salt itself, breaking the SDK’s salt parsing.
Rule: every ExternalSecret pointing at a partition CDK secret needs remoteRef.property: <field> matching the JSON key in the stored value. Apply this consistently when adding new partition secrets.
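The effect of a missing property: can be illustrated with a toy TypeScript model. This is not ESO code: projectSecret is a hypothetical helper that only mimics the behaviour described above (whole blob without a property, one field with it), and the salt value is made up.

```typescript
// Toy model only: mimics the projection behaviour described in the text.
// projectSecret and the sample value are hypothetical, not ESO internals.
function projectSecret(storedValue: string, property?: string): string {
  if (property === undefined) {
    // No property: the whole stored string is passed through unchanged.
    return storedValue;
  }
  const parsed: unknown = JSON.parse(storedValue);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("stored value is not a JSON object");
  }
  const field = (parsed as Record<string, unknown>)[property];
  if (typeof field !== "string") {
    throw new Error(`property "${property}" is missing or not a string`);
  }
  return field;
}

// Same shape as the stored partition secret, with a placeholder salt.
const stored = JSON.stringify({ salt: "example-salt" });

const withoutProperty = projectSecret(stored);       // the raw JSON document
const withProperty = projectSecret(stored, "salt");  // just the salt field
```

In this model, only the second call yields a value the consumer can use directly as a salt; the first hands it a JSON document.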
CFN export naming conventions are load-bearing
The CDK pattern across Arda partitions uses three distinct prefixes:
- <partition>-API-<name>: cross-repo consumption (e.g. Helm charts in operations looking up partition resources).
- <partition>-I-<name>: cross-stack consumption within the same CDK app.
- <partition>-<name> (bare): marker-prefixed private; not for cross-stack use.
SentryScrubSaltArn ended up needing both -API- (for the operations chart’s SecretStore block) and -I- (for the CDK app’s compose-time stack wiring). Picking one in PR #459 surfaced the convention split clearly. PDEV-531’s pre-existing naming work made this discoverable.
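As a sketch of how the convention could be encoded, the TypeScript helper below builds an export name from a scope. The helper, the scope labels, and the use of Alpha002-dev as the partition are assumptions for illustration; the resulting strings are not confirmed export names from the CDK app.

```typescript
// Hypothetical helper: encodes the three prefixes from the convention.
type ExportScope = "cross-repo" | "cross-stack" | "private";

function exportName(partition: string, name: string, scope: ExportScope): string {
  switch (scope) {
    case "cross-repo":
      return `${partition}-API-${name}`; // consumed from other repos
    case "cross-stack":
      return `${partition}-I-${name}`;   // consumed by other stacks in the same app
    case "private":
      return `${partition}-${name}`;     // bare form, not for cross-stack use
  }
}

// A value with two kinds of consumer gets two exports, one per scope.
// "Alpha002-dev" is assumed to be the partition, based on the names in this note.
const saltArnExports = [
  exportName("Alpha002-dev", "SentryScrubSaltArn", "cross-repo"),
  exportName("Alpha002-dev", "SentryScrubSaltArn", "cross-stack"),
];
```

Making the scope an explicit argument forces the author of a new export to decide who the consumer is, which is the decision that PR #459 surfaced.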
Frontend env-var aliases: short forms in Amplify, long forms in code
Amplify deploys the FE app with NEXT_PUBLIC_DEPLOY_ENV set to DEV, STAGING, PROD (the abbreviated forms). Code that branches on the long env names (development rather than dev, staging rather than stage, production rather than prod) silently no-ops in the deployed app.
Verification: empirical — a preview deploy of PR #845 produced trace 3d1df3c9... with no BE-side spans because the BFF dropped trace propagation. Mapping both forms in BACKEND_HOSTS was the fix (commits 03a19aba, 62c350ba). Alias-parity unit tests now guard against regression.
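A minimal sketch of alias handling, assuming a normalise-then-look-up design rather than the actual BACKEND_HOSTS layout (which maps both forms directly). The host names, ENV_ALIASES, canonicalEnv, and backendHostFor are all hypothetical.

```typescript
// Illustrative only: host names and helper names are placeholders, not the
// real BACKEND_HOSTS contents.
type CanonicalEnv = "dev" | "stage" | "prod";

// Every spelling that may show up in NEXT_PUBLIC_DEPLOY_ENV, lower-cased.
const ENV_ALIASES = new Map<string, CanonicalEnv>([
  ["dev", "dev"],
  ["development", "dev"],
  ["stage", "stage"],
  ["staging", "stage"],
  ["prod", "prod"],
  ["production", "prod"],
]);

const BACKEND_HOSTS: Record<CanonicalEnv, string> = {
  dev: "api.dev.example.invalid",
  stage: "api.stage.example.invalid",
  prod: "api.example.invalid",
};

function canonicalEnv(raw: string | undefined): CanonicalEnv | undefined {
  if (raw === undefined) return undefined;
  return ENV_ALIASES.get(raw.trim().toLowerCase());
}

function backendHostFor(raw: string | undefined): string {
  const env = canonicalEnv(raw);
  if (env === undefined) {
    // Fail loudly instead of silently skipping trace propagation.
    throw new Error(`Unrecognised deploy env: ${String(raw)}`);
  }
  return BACKEND_HOSTS[env];
}
```

An alias-parity test in this style asserts that each short form and its long form resolve to the same host, and that an unknown value throws rather than falling through.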
*/ inside a JSDoc snippet closes the comment
The first cut of trace-propagation-targets.ts had a JSDoc fragment containing [/.*/] as an example. The */ sub-string closed the JSDoc block prematurely, breaking the docstring and shifting the next function out of TypeScript’s documentation index. Fixed by replacing the regex literal with prose (“permissive match-all regex”).
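Rule of thumb: avoid regex literals containing */ inside JSDoc comments. Either rewrite as prose or spell the pattern so the sequence never appears, for example new RegExp(".*"). A code fence inside the JSDoc does not help on its own, because the comment still ends at the first */.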
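One way to keep a regex example in a docstring without tripping the comment terminator is to write it with the RegExp constructor. The function below is a hypothetical stand-in, not the real contents of trace-propagation-targets.ts.

```typescript
/**
 * Returns a permissive target list that matches every outgoing URL.
 *
 * Example value: [new RegExp(".*")]
 *
 * The constructor form is used in this comment on purpose: the literal form
 * would place a star directly before a slash and end the comment early.
 */
function permissiveTargets(): RegExp[] {
  // Outside a block comment the literal form would be safe, but using the
  // same spelling in code and docs keeps them easy to compare.
  return [new RegExp(".*")];
}
```

The docstring stays attached to permissiveTargets because the only star-slash pair in the comment is the one that closes it.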
Fire-and-forget coroutine scope must outlive the request
CSV upload kicks off a background job from a request handler. The original implementation used the request’s CoroutineScope, which the Ktor pipeline cancels on response. The job died mid-import.
Pattern: spin up an independent SupervisorJob()-rooted scope for any work that must outlive the originating request. Nesting matters: runCatching { runSuspendingBoundary(...) { processor.process(...) } }.flatten().onFailure { trkr.update(...) } — the boundary inside, the catching outside, so Sentry sees the throwable before the tracker records the terminal status. Documented in operations/src/main/kotlin/…/CsvUploadService.kt.
Reportability is a property of the error type, not the capture site
The original draft of BoundaryCapture had the API/HTTP layer deciding what to send to Sentry (e.g. “send everything except 4xx”). That conflated transport-level concerns with semantic ones: a validation failure on a background batch is still a 4xx-equivalent, but there’s no HTTP response to drive the decision.
Refactor: AppError.reportable(): List<Throwable> returns the list of throwables a given error case wants to send. Internal.* and Generic return themselves; Invocation.* returns empty. The capture site iterates the result. Now every consumer — boundary, Logback appender, last-resort handler — uses the same policy without copy-pasting the classification.
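The shape of this policy can be modelled in a few lines. The real AppError is Kotlin; the TypeScript below is only a simplified model of the idea, with error cases reduced to a kind and a message, and capture standing in for any capture site. None of these names are the actual implementation.

```typescript
// Simplified model of the policy; the real AppError is Kotlin and its cases
// carry more than a message. Case names follow the prose.
type AppError =
  | { kind: "Internal"; message: string }
  | { kind: "Generic"; message: string }
  | { kind: "Invocation"; message: string };

// The error type decides what is reportable.
function reportable(err: AppError): AppError[] {
  switch (err.kind) {
    case "Internal":
    case "Generic":
      return [err]; // report the error itself
    case "Invocation":
      return [];    // caller-side problem: nothing to report
  }
}

// Every capture site does the same thing: iterate and send.
function capture(err: AppError, send: (e: AppError) => void): void {
  for (const item of reportable(err)) send(item);
}

const sent: AppError[] = [];
capture({ kind: "Internal", message: "db unavailable" }, (e) => sent.push(e));
capture({ kind: "Invocation", message: "invalid CSV row" }, (e) => sent.push(e));
```

In this model the capture site has no branching of its own, so a background batch and an HTTP handler get identical treatment for the same error.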
Documentation cross-link rot is the long-tail cost of moves
Moving the project from roadmap/in-progress/operations-sentry/ to roadmap/completed/ will break any inbound .md link that hard-codes the in-progress path. The remark-resolve-md-links plugin catches these at build time, so make pr-checks will fail loudly, but only for links within the documentation repo. Workbook references (workbooks/notebooks/operations-sentry/) are local-only and decoupled, but external references in PR bodies, knowledge-base files in other repos, and Linear comments will silently 404 over time.
Habit to keep: when a project closes, grep across all repos for roadmap/in-progress/<project-name> and replace with the new path. Cheap during closure; expensive six months later when on-call follows a broken link.
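The replacement step can be sketched as a small pure function. It assumes the project keeps its directory name under roadmap/completed/; the function name and the sample link are hypothetical.

```typescript
// Hypothetical helper; the directory layout comes from the text, the function
// name and sample link do not.
function rewriteClosedProjectLinks(text: string, project: string): string {
  const from = `roadmap/in-progress/${project}`;
  const to = `roadmap/completed/${project}`;
  // split/join performs a literal replace, so the project name needs no
  // regex escaping.
  return text.split(from).join(to);
}

const before = "See [learnings](../roadmap/in-progress/operations-sentry/learnings.md).";
const after = rewriteClosedProjectLinks(before, "operations-sentry");
```

Because the match is a plain prefix, a project whose name extends another (for example a hypothetical operations-sentry-v2) would also be rewritten, so the result is worth reviewing before committing.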
Copyright: © Arda Systems 2025-2026, All rights reserved