Sentry Observability

Sentry is Arda’s primary observability surface for the running platform. It carries error capture, performance traces, and release-health sessions for both the frontend (arda-frontend-app) and the backend (operations today; future Kotlin components inherit the same wiring via common-module). This page describes what runs where, what each piece is responsible for, and how the pieces compose.

For the implementer’s how-to (dependencies, Helm values, wiring recipes), see Sentry Integration. For the design rationale behind each decision on this page, see the operations-sentry decision log under roadmap/completed/operations-sentry/decision-log.md.

Two halves, one model

The platform’s Sentry surface has two independently deployed halves that share the same data model (events, transactions, traces, sessions) and the same Sentry tenant:

Half	Sentry project	What it captures
Frontend (`arda-frontend-app`, Next.js + React)	`arda-frontend`	Browser-side errors and performance; SSR-side errors and performance; Edge-runtime errors; user replays where enabled.
Backend (`operations`, Kotlin/Ktor on EKS Fargate; `accounts-component` on adoption)	`platform-be`	Server-side errors, request transactions, JDBC and outbound HTTP spans, per-request release-health sessions, infrastructure-callback errors via Logback forwarding.

The two projects are linked through the trace_id that propagates browser → BFF (same-origin) → backend (cross-origin via Next.js outbound), so a frontend transaction and the backend transactions it triggers appear in the same Sentry trace. The release tags are intentionally independent — frontend uses the Next.js SDK default (Vercel/Amplify commit SHA); backend uses {appName}@{Chart.AppVersion} — because the two halves have independent deploy cadences and release-health metrics for each are tracked on their own surface.

The single-Hub composition on the backend

The backend instrumentation is the more involved half because it composes two Sentry-aware components that both touch the same in-process state:

The Sentry OpenTelemetry Java agent (io.sentry:sentry-opentelemetry-agent, bundled into the operations container image as /app/agents/sentry-otel-agent.jar and attached via -javaagent:) auto-instruments Ktor, Exposed/JDBC, and outbound HTTP clients. It produces OpenTelemetry-shaped spans and ships them to Sentry via Sentry’s own span exporter (not OTLP — see the configuration note below).
The in-process Sentry SDK (io.sentry:sentry, plus io.sentry:sentry-logback) is initialised by common-module’s SentryInit.init() early in Component.build(...). It configures beforeSend / beforeSendTransaction for PII scrubbing, attaches the JVM uncaught-exception integration, and registers the Sentry Logback appender that consuming components opt into via their own logback.xml.

Both components operate on the same Sentry Hub per JVM. The OTel agent’s bundled SDK initialises the Hub first at JVM startup; the in-process Sentry.init { ... } then reconfigures the existing Hub with the operations-specific options. There is one Sentry SDK, one Hub, and one stream of events leaving the pod: the agent and the SDK init cooperate, and neither replaces the other.

Why both, not one or the other

The agent alone would produce spans but cannot run beforeSend-style PII scrubbing or apply AppError-aware filtering — those are SDK callbacks tied to the event-emit path. The in-process SDK alone would lack auto-instrumentation for Ktor’s request boundary and the JDBC layer; we would have to write and maintain those wrappers ourselves. The composition gives us auto-instrumentation from the agent plus event-level control from the SDK, with no code duplication.

Why not OTLP-only

OTLP carries traces, metrics, and logs but not Sentry’s event model — Issues, exceptions, release-health sessions, breadcrumbs, and beforeSend are all Sentry-SDK protocol. An OTLP-only architecture would lose error capture as Issues (no grouping, no triage workflow, no fingerprinting), would lose per-request sessions, and would push PII scrubbing into a collector with a strictly weaker attribute-only data model. The Sentry-native transport for these surfaces is non-negotiable.

The bundled OTel agent’s upstream OpenTelemetry pipeline has OTLP exporters that, if left on their defaults, attempt to send to http://localhost:4318 (no collector listens there). The operations Helm chart sets OTEL_TRACES_EXPORTER=none, OTEL_METRICS_EXPORTER=none, and OTEL_LOGS_EXPORTER=none to short-circuit that pipeline at the agent’s boot; Sentry’s own span exporter inside the same agent continues to ship spans over the Sentry-native transport.

Capture-path topology

Five concurrent capture paths exist in steady state on the backend (P1 boundary, P2 Logback, P3a runBoundary, P3b global CoroutineExceptionHandler, P4 JVM uncaught). They are deliberately non-exclusive: an unhandled error reaching the StatusPages handler is captured by the boundary path AND logged by app.log.warn(...) — which the Logback appender then re-captures. The duplication is accepted; Sentry’s default fingerprinting groups the two events into a single Issue.

[Figure: PlantUML diagram of the capture-path topology]

The five paths in order:

P1, boundary capture at StatusPages — the primary path. Every unhandled Throwable reaching the StatusPages handler is fed through the bridging Throwable.reportable() extension and the resulting list of throwables is captured one-by-one with the boundary=http tag and the request route name. The original app.log.warn(...) log line is preserved alongside, which path P2 then re-captures (intentional duplication).
P2, Logback SentryAppender — the backstop for code paths that log and continue without throwing to a boundary: defensive log.error(...) calls, third-party library logs, infrastructure callbacks. It does not apply the reportable() filter — a deliberate log.error(...) on an Invocation.* still surfaces because someone judged it noteworthy enough to log. INFO and above flow as breadcrumbs; ERROR (or any event carrying a Throwable) becomes a captured event.
P3a, runBoundary/runSuspendingBoundary — wraps every out-of-request entry point (scheduled jobs, fire-and-forget coroutines launched from request handlers, queue consumers when introduced). Same policy as P1: reportable() filters, boundary=batch tag identifies the job by label.
P3b, global CoroutineExceptionHandler — catches coroutine-side exceptions the thread-level handler does not naturally see (root-scope launches that escape every other boundary).
P4, JVM uncaught handler — Sentry’s SDK installs UncaughtExceptionHandlerIntegration by default. By the time it fires, every other capture path has missed. The policy is to capture everything regardless of classification, tagged via=uncaught-handler so it is distinguishable in Sentry search.
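The out-of-request boundary (P3a) can be sketched as a small wrapper. This is a minimal illustration, not the common-module implementation: the `Capture` and `Classifier` types, the `job` tag name, and the `Result` return shape are assumptions made so the sketch runs without the Sentry SDK. In the real code the classifier would be `Throwable.reportable()` and the capture function would call into Sentry.

```kotlin
// Hypothetical stand-ins: the real signatures live in common-module and may differ.
typealias Capture = (error: Throwable, tags: Map<String, String>) -> Unit

// Illustrative classification hook; the source describes this as Throwable.reportable().
typealias Classifier = (Throwable) -> List<Throwable>

/**
 * Runs [block] as an out-of-request entry point. Any Throwable is classified,
 * each reportable result is captured with boundary=batch plus the job label,
 * and the original failure is handed back to the caller as a Result.
 */
fun <T> runBoundary(
    label: String,
    classify: Classifier,
    capture: Capture,
    block: () -> T,
): Result<T> =
    try {
        Result.success(block())
    } catch (t: Throwable) {
        // "job" is an assumed tag name; the source only says the job is identified by label.
        val tags = mapOf("boundary" to "batch", "job" to label)
        classify(t).forEach { capture(it, tags) }
        Result.failure(t)
    }
```

A suspending variant (`runSuspendingBoundary` in the source) would presumably need to rethrow `CancellationException` rather than capture it; that detail is not covered here.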

Error filtering: the `AppError.reportable()` policy

Reporting is driven by Arda’s AppError semantics through a reportable(): List<Throwable> method on the sealed type. Capture sites ask the throwable what to do and iterate the result; they never inspect HTTP status or the call site.

`AppError` branch	`reportable()` returns	Lands as Sentry Issue?
`Internal.*` (e.g. `Implementation`, `Infrastructure`, `ExternalService`, `InternalService`)	`listOf(this)`	Yes
`Generic`	`listOf(this)`	Yes
`Invocation.*` (validation, auth, not-found, conflict, etc.)	`emptyList()`	No
`Composite(causes = [...])`	`causes.flatMap { it.reportable() }`	One Issue per reportable cause; each tagged `wrapped_in_composite: <composite.message>`
Non-`AppError` `Throwable`	`listOf(this)` (via bridging extension)	Yes

The rule reflects intent. Internal.* exists for the bugs we own; Invocation.* exists for caller-driven inputs the BFF and FE already surface. A composite carrying multiple causes emits each reportable cause independently so triage stays on individual root causes rather than the wrapper.
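The table above can be read as a single polymorphic method. The following is a simplified sketch under stated assumptions: the real `AppError` hierarchy has more subtypes and fields, and the flat `Internal` / `Invocation` classes here stand in for whole families. Only the `reportable()` return values are taken from the table.

```kotlin
// Simplified, hypothetical mirror of the hierarchy; the real AppError has
// more subtypes and fields than are shown here.
sealed class AppError(message: String) : Exception(message) {
    abstract fun reportable(): List<Throwable>

    // Bugs we own: always reported.
    class Internal(message: String) : AppError(message) {
        override fun reportable(): List<Throwable> = listOf(this)
    }

    class Generic(message: String) : AppError(message) {
        override fun reportable(): List<Throwable> = listOf(this)
    }

    // Caller-driven failures (validation, auth, not-found, ...): never reported.
    class Invocation(message: String) : AppError(message) {
        override fun reportable(): List<Throwable> = emptyList()
    }

    // Each reportable cause is emitted on its own; the wrapper itself is not.
    class Composite(message: String, val causes: List<AppError>) : AppError(message) {
        override fun reportable(): List<Throwable> = causes.flatMap { it.reportable() }
    }
}

// Bridging extension: anything that is not an AppError is always reportable.
// For AppError instances the member function wins over this extension.
fun Throwable.reportable(): List<Throwable> =
    if (this is AppError) this.reportable() else listOf(this)
```

With this shape a capture site only iterates the returned list, so adding a new subtype changes the policy in one place rather than at every boundary.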

Fingerprinting

Fingerprinting is applied at the capture site, not in beforeSend. The default fingerprint formula derives from the AppError subtype’s FQCN plus a discriminator from the subtype’s own fields (serviceName for ExternalService / InternalService / InternalTimeout; operationName for NotImplemented). Non-AppError throwables fingerprint by concrete class plus the first non-framework stack frame. The result: untuned components still produce sensible Issue grouping; tuned components can override the formula per-capture-site without going through beforeSend.
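As a rough illustration of the default formula, the sketch below derives a fingerprint from the class name plus either a discriminator field or the first non-framework stack frame. The `Discriminated` interface and the list of framework package prefixes are assumptions introduced for this example; the source does not specify how the discriminator is exposed or which packages count as framework code.

```kotlin
// Hypothetical marker for errors that carry a discriminator field
// (serviceName, operationName, ...). Not the real AppError API.
interface Discriminated {
    val discriminator: String
}

// Assumed list of "framework" package prefixes; the real list is not given in the source.
val frameworkPrefixes = listOf("java.", "jdk.", "sun.", "kotlin.", "kotlinx.", "io.ktor.")

fun fingerprintOf(t: Throwable): List<String> {
    val className = t.javaClass.name
    // Discriminated errors group by class plus their own discriminator.
    if (t is Discriminated) return listOf(className, t.discriminator)
    // Everything else groups by class plus the first frame outside framework packages.
    val frame = t.stackTrace.firstOrNull { f ->
        frameworkPrefixes.none { prefix -> f.className.startsWith(prefix) }
    }
    return listOfNotNull(className, frame?.let { "${it.className}.${it.methodName}" })
}
```

A capture site that wants different grouping would compute its own list instead of calling this default.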

Performance and tracing

The Sentry OTel Java agent auto-emits transactions and spans:

Ktor route transactions — named by the route template (e.g. POST /v1/kanban/(authenticate auth)/kanban-card/details), tagged with the HTTP method, status, and call ID where available.
Exposed / JDBC spans — every DB call appears as a span.op:db child of the surrounding transaction. Statement values are redacted to ? before the event leaves the pod (see the PII section below).
Outbound HTTP client spans — calls from operations to Documint or other downstream services appear as span.op:http.client spans, with sentry-trace and baggage headers propagated outbound so the downstream service’s spans join the same trace if it is also Sentry-instrumented.
JVM-level spans — GC and other JVM diagnostics are emitted by the agent’s built-in instrumentation.

Per-environment trace sample rate is set in operations’ Helm values (oam.performance.sentry.tracesSampleRate): dev and stage at 1.0 (full sampling — diagnostic targets), demo and prod at 0.2 (matches the frontend’s prod sample rate so FE-initiated traces survive on the BE side). Local builds are off by default.

Trace propagation FE↔BE

The chain SPA (browser) → BFF (Next.js API) → API (operations) → DB (Exposed/JDBC) is end-to-end traced. The browser’s Sentry SDK attaches sentry-trace and baggage headers on every outbound HTTP request whose URL matches the configured tracePropagationTargets. The Next.js BFF forwards those headers on its server-side outbound calls. The Sentry OTel Java agent on the operations side reads them on the incoming request and continues the trace, propagating downstream to the Exposed spans.
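For orientation, the `sentry-trace` header has the shape `<trace_id>-<span_id>` with an optional `-<sampled>` suffix, where the trace ID is 32 hex characters, the span ID is 16, and the flag is `0` or `1`. The agent parses this itself; the parser below is only a sketch of the header's shape, and `SentryTrace` and `parseSentryTrace` are hypothetical names.

```kotlin
// Hypothetical value type for the three parts of a sentry-trace header.
data class SentryTrace(val traceId: String, val parentSpanId: String, val sampled: Boolean?)

// 32 hex chars, 16 hex chars, then an optional sampling flag.
val sentryTracePattern = Regex("([0-9a-f]{32})-([0-9a-f]{16})(?:-([01]))?")

fun parseSentryTrace(header: String): SentryTrace? {
    val match = sentryTracePattern.matchEntire(header.trim().lowercase()) ?: return null
    // An unmatched optional group destructures to the empty string.
    val (traceId, spanId, flag) = match.destructured
    val sampled = when (flag) {
        "1" -> true
        "0" -> false
        else -> null // sampling decision deferred to the receiver
    }
    return SentryTrace(traceId, spanId, sampled)
}
```

Continuing the trace means the backend transaction reuses `traceId` and records `parentSpanId` as its parent, which is what lets frontend and backend transactions appear under one trace.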

The frontend’s tracePropagationTargets is configured explicitly per-environment in src/lib/sentry/trace-propagation-targets.ts and wired into all three Sentry init paths (instrumentation-client.ts, sentry.server.config.ts, sentry.edge.config.ts). Each environment’s allow-list contains Sentry’s same-origin defaults plus the env-specific backend host (api.arda.cards, stage.alpha002.io.arda.cards, dev.alpha002.io.arda.cards).

Release health: per-request sessions

The backend emits one Sentry session per HTTP request when oam.performance.sentry.sessions.enabled is true and Sentry is otherwise initialised. With per-request sessions in place, the “Crash-free request rate per release” metric on the Sentry Release Health surface is meaningful for platform-be for the first time.

The mechanism is intentionally not the Sentry JVM SDK’s enableAutoSessionTracking flag, which on the Java SDK 8.x emits at most one session per JVM lifecycle (not per request). Sentry publishes no Ktor server plugin — io.sentry:sentry-ktor-client is for the HTTP client only. The documented mechanism on a non-Spring Java server is manual Sentry.startSession() / Sentry.endSession() calls at the request boundary. Operations achieves this by installing a small Ktor application plugin (SentryRequestSession) that starts and ends a session around each call. The plugin is guarded by Sentry.isEnabled() so it no-ops when SENTRY_DSN is absent, matching the fail-soft posture in SentryInit.
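The start/end bracketing and the fail-soft guard can be sketched without Ktor or the SDK on the classpath. `SessionApi` and `withRequestSession` are hypothetical names for this illustration. In the actual plugin the three methods would delegate to `Sentry.isEnabled()`, `Sentry.startSession()`, and `Sentry.endSession()`, and the handler would be a suspending Ktor call rather than a plain lambda.

```kotlin
// Hypothetical seam over the SDK so the sketch runs standalone.
interface SessionApi {
    fun isEnabled(): Boolean
    fun startSession()
    fun endSession()
}

/** Brackets one request with a session; passes straight through when the SDK is disabled. */
fun <T> withRequestSession(api: SessionApi, handle: () -> T): T {
    if (!api.isEnabled()) return handle() // no DSN: behave as if the plugin were absent
    api.startSession()
    try {
        return handle()
    } finally {
        // Runs on both success and failure, so every started session is closed.
        api.endSession()
    }
}
```

The `finally` block is the important part: a handler that throws still ends its session, so a failing request cannot leave a session open.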

The session implicitly inherits the trace sample rate: there is no separate sessionSampleRate knob on the JVM SDK. A session is emitted if and only if its enclosing request was sampled into Sentry.

Release tags follow {appName}@{Chart.AppVersion} (e.g. operations@2.25.1) so each Helm release shows as a distinct release on the Sentry Release Health tab. The chart’s app version is the source of truth.

PII handling and payload scrubbing

All PII scrubbing runs in-process in beforeSend and beforeSendTransaction callbacks registered by common-module’s SentryInit.init(). Sentry-side data scrubbers stay enabled as defence in depth, but the code-side policy is authoritative — by the time an event leaves the pod, it carries no plaintext PII for any of the following surfaces.

Opaque user identification

The Sentry SDK’s sendDefaultPii is false. The user identification surface is exactly one field: event.user.id, computed as HMAC-SHA-256(salt, JWT-subject-claim) truncated to 16 hex characters. No user.email, no user.username, no user.ip_address, no other user fields. The salt is per partition (Alpha001-prod, Alpha001-demo, Alpha002-dev, Alpha002-stage each have their own) so cross-purpose user correlation is deliberately broken — prod and demo users cannot be linked, nor can dev and stage users. Cross-component correlation works only within the same purpose.

The salt is materialised at pod startup by the External Secrets Operator from AWS Secrets Manager {Infrastructure}-{purpose}-SentryScrubSalt, provisioned by the PartitionSecrets CDK stack in infrastructure. The K8s Secret name is be-sentry-scrub-salt; the pod env var is SENTRY_SCRUB_SALT; the env var’s secretKeyRef is marked optional: true so a missing salt does not block pod startup. Until the salt is present, PiiScrubber falls back to a deterministic placeholder.
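The opaque ID computation described above amounts to a keyed hash followed by truncation. The sketch below assumes the salt is used as the HMAC key and the JWT subject as the message, encoded as UTF-8; `opaqueUserId` is a hypothetical name, and the real `OpaqueId` class may differ in encoding or error handling.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

/**
 * HMAC-SHA-256 over the JWT subject, keyed by the per-partition salt,
 * rendered as lowercase hex and truncated to 16 characters.
 * The salt must be non-empty: SecretKeySpec rejects an empty key.
 */
fun opaqueUserId(salt: ByteArray, jwtSubject: String): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(salt, "HmacSHA256"))
    val digest = mac.doFinal(jwtSubject.toByteArray(Charsets.UTF_8))
    return digest.joinToString("") { "%02x".format(it) }.take(16)
}
```

Because the salt differs per partition, the same subject hashes to unrelated IDs in prod and demo, which is what breaks cross-purpose correlation.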

Headers, body, and DB statement

Surface	Policy
Request headers	Deny-by-default allow-list. `X-Request-Id` and `X-Forwarded-For` pass through unchanged. `X-Tenant-Id` is replaced with an HMAC-hashed `tenant_hash` tag and stripped from the headers map. Every other header (`Authorization`, `Cookie`, `Set-Cookie`, custom auth headers) is removed.
Request body	Captured-with-redaction. A regex pass masks JWT-like substrings, AWS access keys, Sentry DSN-like substrings, and a small set of other well-known credential shapes.
`AppError.context` lambda output	Scrubbed by the same redaction pass before it lands as event extras.
`db.statement` span attributes	Single-quoted string literals and numeric literals replaced with `?`. In practice this is a defensive backstop — Exposed parameterises every statement; this pass activates only on raw-SQL escape hatches.

The four sites are independent — each has its own scrubber in common-module/lib/src/main/kotlin/cards/arda/common/lib/runtime/observability/ (PiiScrubber, OpaqueId, HeadersAllowList, Redactor, DbStatementRedactor). They run in a fixed order during beforeSend / beforeSendTransaction. The arrangement is testable in unit isolation; the project ships a unit test suite covering each scrubber.
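Two of the policies in the table are easy to illustrate in isolation. The sketch below is an approximation, not the code in common-module: the function names, the injected `hash` parameter, and the exact regular expressions are assumptions, and the numeric pattern is deliberately simple.

```kotlin
// Allow-list from the policy table, compared case-insensitively.
val passThroughHeaders = setOf("x-request-id", "x-forwarded-for")

// Hypothetical result type: surviving headers plus tags derived from removed ones.
data class ScrubbedHeaders(val headers: Map<String, String>, val tags: Map<String, String>)

/** Deny by default: keep allow-listed headers, turn X-Tenant-Id into a hashed tag, drop the rest. */
fun scrubHeaders(headers: Map<String, String>, hash: (String) -> String): ScrubbedHeaders {
    val kept = headers.filterKeys { it.lowercase() in passThroughHeaders }
    val tenant = headers.entries
        .firstOrNull { it.key.equals("X-Tenant-Id", ignoreCase = true) }
        ?.value
    val tags = if (tenant != null) mapOf("tenant_hash" to hash(tenant)) else emptyMap()
    return ScrubbedHeaders(kept, tags)
}

// Single-quoted literals (allowing '' escapes), then standalone numeric literals.
val stringLiteral = Regex("""'(?:[^']|'')*'""")
val numericLiteral = Regex("""\b\d+(?:\.\d+)?\b""")

/** Replaces literal values in a SQL statement with ? placeholders. */
fun redactDbStatement(sql: String): String =
    sql.replace(stringLiteral, "?").replace(numericLiteral, "?")
```

String literals are replaced first so that digits inside a quoted value are never seen by the numeric pass.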

Logback forwarding

The io.sentry:sentry-logback appender attaches to the root logger in each consuming component’s logback.xml. It forwards ERROR-level events and any event carrying a Throwable (regardless of level) to Sentry as captured events. INFO and above ride along as breadcrumbs on the next captured event. The appender does not apply the AppError.reportable() filter — log forwarding is the second-chance net that catches everything someone bothered to log loudly. Duplication with the boundary path (P1) is accepted; Sentry’s grouping handles it.
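The routing rule in the paragraph above reduces to a two-step decision. The sketch below encodes that rule as stated on this page; the `Level` and `Routing` enums and the `route` function are hypothetical and do not correspond to appender classes.

```kotlin
// Hypothetical enums mirroring Logback's level ordering and the three outcomes.
enum class Level { TRACE, DEBUG, INFO, WARN, ERROR }
enum class Routing { EVENT, BREADCRUMB, DROPPED }

/** Applies the forwarding rule: ERROR or any Throwable is an event, INFO and above is a breadcrumb. */
fun route(level: Level, hasThrowable: Boolean): Routing = when {
    level >= Level.ERROR || hasThrowable -> Routing.EVENT
    level >= Level.INFO -> Routing.BREADCRUMB
    else -> Routing.DROPPED
}
```

Note that the AppError classification plays no part in this decision, which is why a logged `Invocation.*` error with a Throwable attached still reaches Sentry through this path.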

The appender attachment is per-component, not in common-module itself, because the logback.xml is a consuming-component artifact. common-module ships the sentry-logback dependency; each component opts in by adding the <appender name="SENTRY"> block and an <appender-ref ref="SENTRY"/> on the root logger. When SENTRY_DSN is empty (local builds, tests), the appender no-ops cleanly.

Configuration surface

The full set of knobs and the layer that owns each:

Knob	Where	What it does
`oam.performance.sentry.enabled`	`operations` Helm values	Master switch. When false, the OTel `-javaagent:` is not attached, no Sentry env vars are emitted, no ExternalSecrets are reconciled.
`oam.performance.sentry.environment`	`operations` Helm values	Overrides the `SENTRY_ENVIRONMENT` env var. Defaults to `application.environment` (`{Infrastructure}-{purpose}`).
`oam.performance.sentry.tracesSampleRate`	`operations` Helm values	Sets `SENTRY_TRACES_SAMPLE_RATE`. Per-env: local off, dev/stage `1.0`, demo/prod `0.2`.
`oam.performance.sentry.sessions.enabled`	`operations` Helm values	When true, sets `SENTRY_ENABLE_AUTO_SESSION_TRACKING=true` and `SENTRY_AUTO_SESSION_TRACKING=true` (the dual env vars cover both the SDK’s canonical name and the legacy name `common-module` reads today). The Ktor `SentryRequestSession` plugin starts/ends sessions per request when this is enabled.
`SENTRY_DSN`	ESO ⟶ `be-sentry-dsn` K8s Secret	Sentry DSN for the `platform-be` project. Provisioned at the Infrastructure layer by `InfrastructureSecretsStack`. `optional: true` on the `secretKeyRef`.
`SENTRY_SCRUB_SALT`	ESO ⟶ `be-sentry-scrub-salt` K8s Secret	HMAC salt for opaque user IDs and `tenant_hash`. Provisioned at the partition layer by `PartitionSecrets`. `optional: true`. Per-partition.
`OTEL_*_EXPORTER=none`	`operations` Helm template	Hard-coded constants. Disable the upstream OpenTelemetry pipeline’s OTLP exporters so the agent ships exclusively via Sentry’s transport.

The Helm values file for each environment documents the rationale for its specific sample rate and toggles; the values.yaml schema is the canonical reference.

Out-of-scope today

For completeness, the following are things the platform’s Sentry surface does not do today, even though the data model supports them:

EU-region Sentry instance with encrypted-at-rest event content. Deferred until customer geography requires it.
Per-AppError-subtype structured context that replaces the lambda-based context: LazyMessage with explicit safe-vs-sensitive fields. The lambda output is scrubbed wholesale instead.
AppError-creation-site breadcrumbs. The creation-site capture alternative was rejected in favour of boundary capture; the breadcrumb-on-creation softer variant is a possible later refinement.
Sentry-side alerting (alert routes, SLO-based notification on regression metrics, slow-route alerts). Deferred to a future alerting / observability project.
Sentry Logs ingest on the backend (the distinct stream the frontend enables via enableLogs: true). Possible follow-up.
Outbound trace propagation hardening from operations via sentry-ktor-client. The OTel agent’s auto-instrumentation already produces outbound spans with trace headers; the explicit plugin is a richness improvement, not a correctness fix.

Reference

Sentry Integration how-to — the implementer’s recipe: dependencies, Helm values, wiring snippets, runBoundary adoption, Logback appender XML, PII test recipes.
Service Monitoring — the reactive monitoring surface that runs alongside Sentry (CloudWatch alarms on the request path, NLB, DNS).
AppError reference — the sealed error hierarchy whose reportable() policy drives Sentry capture.
Email Module Runbook — an example of module-specific Sentry tags (the email.drift.* family) that rely on the environment dimension above rather than duplicating the partition as a context field.