Operations Sentry — Project Goal

Lifecycle: Completed

Bring the operations component into a comprehensive, end-to-end Sentry observability model alongside the already-instrumented arda-frontend-app, so that the platform’s backend produces first-class error, performance, and release-health signals that join the frontend signals already in Sentry.

The unifying purpose: when a user reports something is broken, the on-call engineer can stay inside Sentry to follow the full distributed trace from browser to backend, see the backend stack trace alongside the frontend stack trace, and diagnose without leaving the tool. When something quietly degrades — slower routes, higher error rates per release — Sentry surfaces the regression before users complain.

The bulk of the implementation lives in common-module. Every Kotlin/Ktor component that consumes common-module (operations today; accounts-component and future services) inherits the observability behaviour through standard dependency consumption. This project executes the design and lands the adoption in operations; adoption in accounts-component is tracked separately under PDEV-533 and is out of scope here.

The platform must offer the following capabilities at the end of this project.

  • Every unhandled Throwable reaching a request boundary in operations produces a Sentry Issue under the platform-be project, with stack trace, request context, and the trace ID that joins the frontend’s view of the same request.
  • Errors are filtered by semantic category: AppError.Internal.* (bug-worthy) and AppError.Generic reach Sentry; AppError.Invocation.* (caller-driven — validation, auth, not-found, conflict) are dropped. Non-AppError throwables that reach a boundary are captured as-is.
  • The classification decision is encapsulated in the AppError type itself (reportable(): List<Throwable> method), not in the API/HTTP layer; the Sentry capture site simply iterates the result.
  • Background paths that do not reach a request boundary (defensive log.error(...) calls, library logs, infrastructure callbacks) also reach Sentry via a Logback appender, with the understanding that duplication with boundary capture is accepted.
  • A JVM-level last-resort handler captures any throwable that escapes every other capture path before the thread dies, so observability survives even when the boundary instrumentation has a gap.
  • Ktor route transactions, Exposed/JDBC spans, outbound HTTP-client spans, and JVM-level spans are emitted by the existing Sentry OpenTelemetry Java agent and reach the platform-be Sentry project at the configured sampling rate.
  • Frontend-initiated traces (arda-frontend → backend) remain continuous on the backend side: a request that the frontend sampled into Sentry also appears in the backend’s view of the same trace.
  • Backend trace sample rate matches the frontend’s in production (0.2), bumped from today’s 0.1, so the trace-continuity ratio is consistent across environments.
  • The backend emits Sentry sessions in request mode in all non-local environments. The “Crash-free request rate per release” metric becomes meaningful on platform-be for the first time.
  • Backend release tags follow the existing {appName}@{Chart.AppVersion} scheme; frontend release tags follow the Next.js SDK’s default. The two tag schemes are intentionally independent — backend release health and frontend release health are tracked side by side, reflecting their independent deploy cadences.
  • A single, idempotent SDK initialization in common-module is invoked as part of the standard component bootstrap. Every Kotlin component that consumes common-module inherits Sentry behaviour with no per-service plumbing.
  • Sentry behaviour is gated entirely by Helm values: oam.performance.sentry.enabled and the per-env tracesSampleRate and sessions.* values. When the DSN is absent (local runs, tests), the SDK init succeeds silently and no events are emitted.
  • A new how-to under current-system/oam/ (or adjacent) describes the agent + SDK coexistence, the AppError.reportable() policy, boundary capture, Logback forwarding, session config, and PII scrubbing — superseding the stale sentry-integration.md.
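
The reportable() policy described in the capabilities above can be sketched as follows. This is a minimal illustration under stated assumptions, not the common-module code: SketchError is a simplified stand-in for AppError (the real hierarchy has Internal.* and Invocation.* sub-branches), and reportableOrSelf is a hypothetical name for the bridging extension on Throwable.

```kotlin
// Simplified stand-in for the sealed AppError hierarchy (assumption: flattened subtypes).
sealed class SketchError(message: String) : RuntimeException(message) {

    // Default policy: the error reports itself.
    open fun reportable(): List<Throwable> = listOf(this)

    // Bug-worthy errors inherit the default.
    class Internal(message: String) : SketchError(message)

    class Generic(message: String) : SketchError(message)

    // Caller-driven errors (validation, auth, not-found, conflict) are dropped.
    class Invocation(message: String) : SketchError(message) {
        override fun reportable(): List<Throwable> = emptyList()
    }

    // A composite recurses through its causes and keeps only what they report.
    class Composite(message: String, val causes: List<Throwable>) : SketchError(message) {
        override fun reportable(): List<Throwable> = causes.flatMap { it.reportableOrSelf() }
    }
}

// Hypothetical bridging extension: non-AppError throwables are reported as-is.
fun Throwable.reportableOrSelf(): List<Throwable> =
    if (this is SketchError) reportable() else listOf(this)
```

With this shape, a Composite carrying one Internal and one Invocation cause yields a single-element list, and the capture site only has to iterate whatever reportable() returns.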

Each constraint is anchored in a Design Topic (DT-XXX) decided during the project’s exploration phase. The full text of each decision lives in the project workbook under workbooks/notebooks/operations-sentry/decisions/.

  • The Sentry OpenTelemetry Java agent (io.sentry:sentry-opentelemetry-agent, version pinned in operations/gradle/libs.versions.toml) remains as-wired today; the explicit SDK initialization is added on top of the agent, not as a replacement (DT-001).
  • The SDK init must be idempotent and safe when the agent has already loaded the bundled SDK; option settings re-apply to the existing Hub rather than triggering a re-initialization.
  • The SDK initialization lives in common-module/lib/src/main/kotlin/cards/arda/common/lib/runtime/observability/ (final package name decided at implementation). It is invoked by Component.build(...) so every consumer inherits it without per-component code (DT-002).
  • AppError’s reportable() method lives in the same file as the sealed AppError hierarchy (common-module/.../lang/errors/AppError.kt) so the policy is co-located with the type that owns the semantic claim (DT-003).
  • The boundary-capture helpers (the shared low-level invocation and the per-transport wrappers) live alongside the SDK init in common-module (DT-006).
  • Reporting is driven by AppError semantics through a reportable(): List<Throwable> method on the sealed type — not by HTTP status, not by inspection at the call site. Internal.* and Generic report; Invocation.* drop; Composite recurses through its causes; non-AppError Throwable reports (DT-003).
  • A composite’s reported causes are emitted as independent Sentry events, each tagged wrapped_in_composite: <composite.message> so they remain correlatable in Sentry search.
  • Fingerprinting lives at the capture site, with reasonable defaults derived from the AppError subtype (FQCN + serviceName / operationName discriminators where present) so untuned components still produce sensible grouping.
  • Per-fingerprint rate limiting is implemented via the SDK’s sampleRate; not by Sentry-side ingest quota policy.
  • Sessions emit one per HTTP request; application-mode (one session per JVM lifecycle) is rejected because it duplicates Kubernetes pod-health signals already in CloudWatch (DT-004).
  • Session sampling inherits from tracesSampleRate automatically (one session per request, emitted iff the trace is sampled): dev/stage 1.0, demo/prod 0.2, local off. The JVM SDK does not expose a separate sessionSampleRate knob.
  • Demo and prod tracesSampleRate is bumped from today’s 0.1 to 0.2 to align with the frontend’s prod rate, restoring consistent FE↔BE trace continuity (DT-004).
  • Per-request sessions are emitted by a small Ktor application plugin (SentryRequestSession) installed by each consuming component — the documented manual approach for the JVM SDK 8.x. The Sentry JVM SDK’s isEnableAutoSessionTracking alone emits at most one session per JVM lifecycle, and Sentry publishes no sentry-ktor server plugin (only sentry-ktor-client for HTTP client instrumentation). The plugin is Sentry.isEnabled()-guarded so it no-ops when SENTRY_DSN is absent, and pluginOrNull-guarded at the install site so a future lift into common-module’s Component.build(...) (tracked under PDEV-490) does not require coordinated per-app removal.
  • All scrubbing runs in beforeSend and beforeSendTransaction in common-module’s SDK init; Sentry-side data scrubbers are kept enabled as defense-in-depth but the code-side policy is authoritative (DT-005).
  • user.id is HMAC-SHA-256(salt, JWT-subject-claim) truncated to 16 hex chars. No user.email, no user.username, no user.ip_address. sendDefaultPii = false.
  • HTTP request bodies are captured-with-redaction; a regex pass masks JWT, AWS access-key, and DSN-like substrings.
  • HTTP headers are deny-by-default. X-Request-Id and X-Forwarded-For pass through unchanged. X-Tenant-Id is replaced with an HMAC-hashed tenant_hash tag and stripped from the headers map. Every other header (including Authorization, Cookie, Set-Cookie) is removed.
  • AppError.context lambda outputs are scrubbed by the same redaction pass before they land as event extras.
  • DB statement attributes get a regex literal-redaction pass (level (b) in the thread): single-quoted string literals and numeric literals are replaced with ?. In practice this is zero-cost because Exposed parameterises every statement; it activates only on raw-SQL escape hatches.
  • The scrub salt is delivered per-partition via ESO, mirroring the existing SENTRY_DSN delivery pattern: AWS Secrets Manager {Infrastructure}-{purpose}-SentryScrubSalt (e.g. Alpha001-prod-SentryScrubSalt) → K8s Secret be-sentry-scrub-salt → pod env var SENTRY_SCRUB_SALT. The salt is per-purpose, not per-Infrastructure: prod and demo (both in Alpha001) have different salts, as do dev and stage (both in Alpha002). Cross-purpose correlation is deliberately broken; cross-component correlation works only within the same purpose. The salt is materialised by a new PartitionSecrets CDK stack in the infrastructure repository (Deliverable 7a).
  • Capture happens at each module entry boundary via a shared common-module helper plus thin per-transport wrappers. Ktor is the only transport instrumented in this project; gRPC and async-worker wrappers are not added until those transports exist (DT-006).
  • Background batch jobs in operations/src/main/kotlin/cards/arda/operations/system/batch/ adopt runBoundary("<job-label>") { … } as the standard boundary wrapper.
  • JVM-level last-resort capture uses the Sentry SDK’s UncaughtExceptionHandlerIntegration, unfiltered by reportable() and tagged via: uncaught-handler so it is distinguishable in Sentry from boundary captures.
  • A global CoroutineExceptionHandler is installed alongside the SDK init to catch coroutine-side escapes the thread-level handler does not naturally see.
  • The Sentry Logback appender (io.sentry:sentry-logback) is installed in common-module and configured to forward log events at level ERROR and any log event carrying an exception (regardless of level) (DT-008).
  • The appender does not apply the reportable() filter on the log-side path. Duplication with boundary capture is accepted; Sentry’s default fingerprinting is expected to group duplicate events into a single Issue.
  • The appender attachment is conditional on the SDK actually initialising (DSN present); when the DSN is empty, no appender is attached.
  • The frontend’s three Sentry init paths (instrumentation-client.ts, sentry.server.config.ts, sentry.edge.config.ts) gain explicit env-aware tracePropagationTargets configuration (DT-007). This is implemented in arda-frontend-app as an independent PR; it is a prerequisite for hardened trace propagation but does not block the backend waves.
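
The opaque-identifier and header constraints above can be illustrated with a short sketch. It is not the PiiScrubber or OpaqueId implementation: the function names, the ScrubbedHeaders type, UTF-8 input encoding, lowercase hex output, and truncating the tenant hash to 16 characters (the source specifies truncation only for user.id) are all assumptions made for illustration.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// HMAC-SHA-256 keyed by the scrub salt, hex-encoded and truncated to 16 characters.
// Assumption: the salt is non-empty (SecretKeySpec rejects an empty key).
fun opaqueId(salt: String, value: String): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(salt.toByteArray(Charsets.UTF_8), "HmacSHA256"))
    return mac.doFinal(value.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }
        .take(16)
}

// Hypothetical result type: surviving headers plus the tags derived from removed ones.
data class ScrubbedHeaders(val headers: Map<String, String>, val tags: Map<String, String>)

// Deny-by-default: only these headers pass through unchanged.
private val passThrough = setOf("x-request-id", "x-forwarded-for")

fun scrubHeaders(salt: String, headers: Map<String, String>): ScrubbedHeaders {
    val kept = headers.filterKeys { it.lowercase() in passThrough }
    // X-Tenant-Id never survives in the header map; it becomes a hashed tag instead.
    val tenant = headers.entries
        .firstOrNull { it.key.equals("X-Tenant-Id", ignoreCase = true) }
        ?.value
    val tags = if (tenant != null) mapOf("tenant_hash" to opaqueId(salt, tenant)) else emptyMap()
    return ScrubbedHeaders(kept, tags)
}
```

Because the salt differs per partition, the same subject or tenant hashes to a different value in each partition, which is what keeps cross-purpose correlation broken.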

Each acceptance criterion is a black-box statement at the system boundary; verification artefacts produced during planning will decompose the criteria into mechanical checks.

  1. SDK init lives in common-module and is consumed transparently. A new Kotlin component built on common-module with oam.performance.sentry.enabled: true and a valid DSN emits Sentry events at the configured sample rate with no per-component code.
  2. AppError.Internal.* reach Sentry; AppError.Invocation.* do not. A deliberate AppError.Internal.Implementation raised in any Ktor route produces a Sentry Issue tagged boundary: http with the request route. An AppError.Invocation.NotAuthorized raised on the same route produces no Sentry event.
  3. Composite recursion emits independent events. A Composite carrying one Internal.Infrastructure and one Invocation.GeneralValidation produces exactly one Sentry event (the Infrastructure cause), tagged wrapped_in_composite: <composite.message>.
  4. Sessions are emitted per request, sampled in tandem with traces. The Release Health tab for the deployed release shows a non-zero session count proportional to tracesSampleRate for the environment (sessions emit iff their enclosing trace is sampled — there is no separate session sample rate on the JVM SDK).
  5. Trace continuity is preserved end-to-end. A request initiated by arda-frontend that is sampled into Sentry on the frontend side appears in the same Sentry trace on the platform-be side, joining the FE and BE spans under one trace ID.
  6. PII scrubbing is verified. A deliberate error triggered with a request body containing a fake JWT, a fake email, and an X-Tenant-Id header produces a Sentry event where: user.id is the opaque hash, no plaintext email or JWT remains in any field, the X-Tenant-Id header is absent from request.headers, a tenant_hash tag carries the hashed value, and db.statement shows the parameterised form (no literal values).
  7. Last-resort capture works. A deliberate Thread.start { throw RuntimeException("…") } outside any boundary handler produces a Sentry event tagged via: uncaught-handler before the thread dies.
  8. Logback forwarding works. A log.error(...) call from a code path that catches and continues (does not throw to a boundary) produces a Sentry event. The same exception caught at a request boundary and reaching the StatusPages handler produces an event that Sentry groups into the same Issue (or a parallel Issue acceptable for triage).
  9. Frontend tracePropagationTargets are explicit in all three Sentry init paths in arda-frontend-app and the env-specific backend host is included in the allow-list for each environment.
  10. Operational off-switches work. Setting oam.performance.sentry.enabled: false in Helm disables all backend Sentry behaviour; setting SENTRY_AUTO_SESSION_TRACKING=false disables sessions while leaving exception capture intact; an empty SENTRY_DSN results in fail-soft no-op behaviour (pod starts, no events emitted).
  11. Documentation is current. Two pages are in place: an architectural reference at current-system/oam/sentry-observability.md (what it is, how the platform uses it) and an implementer how-to at process/craft/operations-and-monitoring/sentry-integration.md (rewritten from the previous stale content) covering dependencies, wiring, Helm values, runBoundary adoption, Logback appender, and PII-scrubbing test recipes.
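
Criterion 6 lends itself to a mechanical check of the redaction pass. The sketch below shows one way such a pass might look; the regular expressions, the [REDACTED] mask, and the function names are assumptions chosen for illustration, not the patterns used in common-module.

```kotlin
// Illustrative patterns only: JWT-like tokens, AWS access-key IDs, and DSN-like URLs.
private val secretPatterns = listOf(
    Regex("""eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"""),
    Regex("""\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"""),
    Regex("""https?://[0-9a-fA-F]+@[^\s/]+/\d+""")
)

// Masks every match of every pattern in turn.
fun redactSecrets(text: String, mask: String = "[REDACTED]"): String =
    secretPatterns.fold(text) { acc, pattern -> pattern.replace(acc, mask) }

// Single-quoted string literals (with '' escapes) and standalone numeric literals.
private val sqlStringLiteral = Regex("""'(?:[^']|'')*'""")
private val sqlNumericLiteral = Regex("""\b\d+(?:\.\d+)?\b""")

// Replaces literals with ?; statements that are already parameterised pass through unchanged.
fun redactSqlLiterals(statement: String): String =
    sqlNumericLiteral.replace(sqlStringLiteral.replace(statement, "?"), "?")
```

A test recipe could feed a fake JWT through redactSecrets and a raw-SQL statement through redactSqlLiterals, then assert that no literal value survives in the output.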

Out of scope for this project:

  • accounts-component adoption. Tracked separately under PDEV-533. The common-module work in this project enables that adoption; the actual accounts-component PR is filed under PDEV-533 on its own timeline.
  • Sentry-side alerting (alert routes, SLO-based notification on regression metrics, slow-route alerts on the routes already visible today). Deferred to a future alerting / observability project.
  • EU-region Sentry instance and encrypted-at-rest event content. Deferred until customer geography requires it.
  • Data-subject-deletion runbook. Sentry supports the operation via API; not a code-side concern for this project.
  • Per-AppError-subtype structured context (replacing the lambda-based context: LazyMessage with explicit safe-vs-sensitive fields). Possible future refinement; today the lambda output is scrubbed wholesale.
  • AppError-creation-site breadcrumbs. The “creation-site reporting” alternative was rejected in favour of boundary capture; the breadcrumb-on-creation softer variant is noted as a possible later refinement but not implemented here.
  • Pre-Sentry.init errors. Code that runs before SDK init bypasses capture. Acceptable today (no consumers wired pre-init); noted in DT-006.
  • Backend-side tracePropagationTargets. The OTel agent handles propagation in both directions already; no allow-list of the frontend’s shape is needed.
  • Sentry Logs ingest on the backend (the distinct stream the frontend enables via enableLogs: true). Possible follow-up; not in DT-008.

| # | Deliverable | Repository | Description |
|---|---|---|---|
| 1 | Sentry SDK initialisation and observability primitives | common-module | New runtime/observability/ package containing SentryInit, BoundaryCapture, PiiScrubber, OpaqueId, CoroutineExceptionHandlerFactory, AppErrorContextScrubber. New dependencies (io.sentry:sentry, io.sentry:sentry-logback). Hooked into Component.build(...). |
| 2 | AppError.reportable() policy | common-module | Method added to the sealed AppError hierarchy; Internal.* and Generic inherit the default, Invocation.* overrides to empty list, Composite flat-maps over causes. Bridging extension for non-AppError Throwable. |
| 3 | Sentry Logback dependency (common-module) and per-component XML wiring | common-module + each consuming component | common-module adds the io.sentry:sentry-logback runtime dependency only. Each consuming component (operations in this project; accounts-component via PDEV-533) adds the <appender name="SENTRY"> block and <appender-ref ref="SENTRY"/> on the root logger to its own logback.xml. Filters: ERROR-level events and any event carrying an exception; INFO and above as breadcrumbs. DSN-empty fail-soft handled by SentryAppender itself. |
| 4 | Helm chart + logback adoption in operations | operations | oam.performance.sentry.sessions.{enabled} sub-object (single field; the JVM SDK has no mode or sampleRate); per-env values updated; demo/prod tracesSampleRate bumped from 0.1 to 0.2; deployment template emits the new env vars; be-sentry-scrub-salt ExternalSecret declared. src/main/resources/logback.xml updated with the Sentry appender block and root <appender-ref> per Deliverable #3. |
| 5 | runBoundary adoption in operations | operations | Every entry point under src/main/kotlin/cards/arda/operations/system/batch/ wraps its work in runBoundary("<label>") { ... }. arda-common-version bumped to the new release. |
| 6 | New Sentry documentation: architectural reference | documentation | New page at current-system/oam/sentry-observability.md: what Sentry is in the Arda platform, how the agent + SDK coexist, the boundary-capture topology, session/release-health mechanics, and the FE/BE release-tag divergence. Audience: anyone reading the platform’s runtime documentation. |
| 6b | New Sentry documentation: implementer how-to | documentation | Rewrite of process/craft/operations-and-monitoring/sentry-integration.md as the step-by-step implementer how-to: dependencies, SDK init wiring, Helm values, runBoundary adoption, Logback appender, PII-scrubbing test recipes. References the architectural page for design intent. The previous stale content is superseded by this rewrite. |
| 7a | New CDK stack PartitionSecrets | infrastructure | New stack file at src/main/cdk/stacks/purpose/partition-secrets.ts following the partition-authn.ts / partition-bulk-stores.ts shape. Creates an AWS Secrets Manager Secret per partition holding the Sentry scrub salt. Default content is generated by AWS via generateSecretString (random on first deploy, preserved thereafter); an optional sentryScrubSaltOverride lets a known value be supplied from source. The stack publishes the secret ARN as a CFN export ({publishingPrefix}-API-SentryScrubSaltArn, following the -API- cross-repo export convention established by PDEV-531 so stackTypes.publish() emits a raw-ARN output that ESO can read). |
| 7b | Partition-scoped configuration | infrastructure | Extend PartitionInfo / Partition in src/main/cdk/platforms.ts with sentryScrubSaltOverride?: string. ENVIRONMENTS entries default to omitting it (random salt generated by AWS); any partition can opt in to a fixed value. SandboxKyle002 is deprecated and not configured. |
| 7c | Wire PartitionSecrets into the partition app | infrastructure | src/main/cdk/apps/Al1x/partition.ts instantiates PartitionSecrets inside buildPartition() alongside the existing per-partition stacks (authn, bulk-stores, storage, compute, dns, ingress, image-storage). Stack id: ${partitionPrefix}-PartitionSecrets. The new stack adds no addDependency to existing stacks. |
| 7d | Per-partition AWS Secrets Manager entries | infrastructure (deploy output) | After per-partition deploys, the following AWS Secrets Manager items exist: Alpha001-prod-SentryScrubSalt, Alpha001-demo-SentryScrubSalt, Alpha002-dev-SentryScrubSalt, Alpha002-stage-SentryScrubSalt. Each holds a JSON {"salt": "..."} with a 64-character random string (or the override value). The salt is partition-scoped, deliberately broken across purposes (prod vs demo) and across Infrastructures (Alpha001 vs Alpha002); cross-component correlation works only within the same purpose. No 1Password mirroring required: the salt is not credential-grade. |
| 8 | Frontend tracePropagationTargets | arda-frontend-app (parallel) | Explicit env-aware tracePropagationTargets configuration in src/instrumentation-client.ts, sentry.server.config.ts, sentry.edge.config.ts. Single PR; no backend dependency. |
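
The idempotent, fail-soft initialisation named in Deliverable 1 might be shaped roughly as below. This is a sketch under assumptions: ObservabilityInit, InitOutcome, and the configure callback are hypothetical names standing in for SentryInit and the real SDK call, and the sketch treats a repeated call as a no-op rather than re-applying options to an existing Hub.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical outcome type, useful for logging and for tests.
enum class InitOutcome { INITIALISED, ALREADY_INITIALISED, DISABLED }

// `configure` stands in for the real SDK configuration step (an assumption of this sketch).
class ObservabilityInit(private val configure: (dsn: String) -> Unit) {
    private val done = AtomicBoolean(false)

    fun init(enabled: Boolean, dsn: String?): InitOutcome {
        if (!enabled) return InitOutcome.DISABLED
        // Fail-soft: a missing or blank DSN means no events, not a startup failure.
        val effectiveDsn = dsn?.takeIf { it.isNotBlank() } ?: return InitOutcome.DISABLED
        // Only the first caller runs the configuration; later calls are no-ops.
        if (!done.compareAndSet(false, true)) return InitOutcome.ALREADY_INITIALISED
        configure(effectiveDsn)
        return InitOutcome.INITIALISED
    }
}
```

A component bootstrap could call init once per build with the Helm-driven flag and the SENTRY_DSN value, and a second call from another code path would not re-run the configuration.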

Eight Design Topics were resolved during the project’s exploration phase. Each has a dedicated decision record in the project workbook (workbooks/notebooks/operations-sentry/decisions/dt-NNN-<slug>.md).

| # | Topic | Project bearing |
|---|---|---|
| DT-001 | Use both Sentry SDK and OpenTelemetry agent | The bundled OTel agent stays; an explicit SDK init is added on top for exception capture, filtering, log forwarding, and session tracking. Both cooperate on one Hub. |
| DT-002 | SDK init in common-module | A single init module in common-module is invoked as part of the standard component bootstrap. Every consuming Kotlin service inherits Sentry behaviour transparently. |
| DT-003 | Error filtering policy | Encapsulated in AppError.reportable(): List<Throwable>. Internal.* and Generic report; Invocation.* drops; Composite recurses through causes. |
| DT-004 | Session-based release health | Per-request sessions in non-local environments via a small Ktor application plugin (SentryRequestSession) installed by each consuming component, the documented manual approach for the JVM SDK 8.x where no Sentry-published Ktor server plugin exists. Sessions are sampled in tandem with traces (no separate session sample rate on the JVM SDK). Demo/prod tracesSampleRate bumped from 0.1 to 0.2. |
| DT-005 | PII handling and payload scrubbing | beforeSend in common-module. Opaque user IDs via HMAC + per-partition salt via ESO. Capture-with-redaction for bodies. Narrow headers allow-list. DB-statement literal redaction. |
| DT-006 | Boundary capture and JVM-level last-resort reporting | Shared common-module helper plus per-transport wrappers. SDK’s UncaughtExceptionHandlerIntegration as the JVM safety net, unfiltered, tagged via: uncaught-handler. Global CoroutineExceptionHandler. |
| DT-007 | Frontend tracePropagationTargets | Explicit env-aware configuration in all three frontend Sentry init paths. Single-PR change in arda-frontend-app. |
| DT-008 | Backend Sentry Logback appender | Install sentry-logback in common-module. Forward ERROR-level and exception-bearing log events. No reportable() filter on the log path. Duplication accepted. |
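
The boundary-capture idea from DT-006 can be sketched without the Sentry SDK on the classpath. Everything here is illustrative: CaptureSink is a hypothetical seam in place of the real capture call, the sink and reportable parameters are additions so the sketch is self-contained (the documented call shape is runBoundary("<label>") { ... }), and rethrowing after capture is an assumption the source does not state.

```kotlin
// Hypothetical capture seam; a real implementation would forward to the Sentry SDK.
fun interface CaptureSink {
    fun capture(error: Throwable, tags: Map<String, String>)
}

// Runs `block` inside a boundary: anything that escapes is filtered through the
// reporting policy, captured with a boundary tag, and then rethrown to the caller.
fun <T> runBoundary(
    label: String,
    sink: CaptureSink,
    reportable: (Throwable) -> List<Throwable> = { listOf(it) },
    block: () -> T
): T =
    try {
        block()
    } catch (t: Throwable) {
        reportable(t).forEach { sink.capture(it, mapOf("boundary" to label)) }
        throw t
    }
```

A batch job would wrap its body in runBoundary with its job label, while a per-transport wrapper such as the Ktor one could pass "http" as the label so events carry the boundary: http tag described in the acceptance criteria. A production version would likely need a suspend variant for coroutine code.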

| Term | Description |
|---|---|
| Sentry DSN | The “data source name”: a URL credential that identifies a Sentry project and authenticates the SDK to send events to it. Delivered to the pod via ESO from AWS Secrets Manager, declared optional: true so a missing DSN does not block pod startup. |
| Sentry SDK | The in-process Sentry client (io.sentry:sentry) that captures events, applies beforeSend, manages the Hub, and ships events to the Sentry server. Configured by the SDK init in common-module. |
| Sentry OpenTelemetry agent | The bundled -javaagent: (io.sentry:sentry-opentelemetry-agent) that auto-instruments Ktor, JDBC, HTTP clients, and the JVM, producing spans and transactions without code changes. Reads the same env vars as the SDK and shares its Hub. |
| Hub | Sentry SDK abstraction representing the current SDK state, including the active scope, breadcrumbs, and tags. Hub is per-thread by default; both the agent and the in-process SDK init operate on the same Hub once both have run. |
| AppError | Arda’s sealed Kotlin error hierarchy (cards.arda.common.lib.lang.errors.AppError), with Internal, Invocation, Composite, and Generic branches. Owns the reportable(): List<Throwable> method that drives Sentry capture. |
| reportable() | Method on AppError (and a bridging extension on Throwable) that returns the list of throwables to report to Sentry for a given error. Empty for caller-driven errors, single-element for bug-worthy errors, multi-element for composites that carry multiple reportable causes. |
| beforeSend / beforeSendTransaction | Sentry SDK hooks that run on every event/transaction just before it is sent. The project uses them for PII scrubbing: opaque user IDs, header allow-list, body redaction, AppError context scrubbing, DB statement redaction. |
| Session | Sentry concept that represents a unit of usage (here, one HTTP request). Sessions are tagged with the release and environment; aggregated they produce the “Crash-free request rate per release” metric. |
| Release Health | Sentry product surface that visualises session-based metrics (crash-free rate, regression detection per release, adoption) built on session events. |
| Fingerprint | Sentry’s grouping key for events into Issues. Default fingerprinting uses the stack-trace top frame; this project customises it at the capture site using the AppError subtype + service/operation discriminators where present. |
| Trace propagation | The mechanism by which a distributed trace identifier (sentry-trace / W3C traceparent headers) is forwarded across service boundaries, so spans recorded in different processes are joined into one trace. The Sentry OTel agent on the backend accepts these headers natively. |
| tenant_hash | An opaque, HMAC-derived tag attached to Sentry events that allows correlation of events belonging to the same tenant without revealing the tenant identity in plaintext. Per-partition salts isolate cross-partition correlation deliberately. |
| ESO | External Secrets Operator: the Kubernetes operator that synchronises secret material from AWS Secrets Manager into Kubernetes secrets accessible to pods. Used for SENTRY_DSN today; extended for SENTRY_SCRUB_SALT by this project. |
| Partition | A unit within an Arda Infrastructure (Alpha001 or Alpha002) dedicated to a specific purpose (prod, stage, dev, demo). Each partition has its own AWS-side resources and its own per-partition Sentry scrub salt. |
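
The Fingerprint entry can be made concrete with a small sketch of a capture-site default. The function name and parameter shape are assumptions; the source only states that the default derives from the AppError subtype's FQCN plus serviceName / operationName discriminators where present.

```kotlin
// Hypothetical default: fully qualified class name first, then any discriminators
// that are actually present. Absent discriminators are simply omitted.
fun defaultFingerprint(
    error: Throwable,
    serviceName: String? = null,
    operationName: String? = null
): List<String> =
    listOfNotNull(error.javaClass.name, serviceName, operationName)
```

Under this sketch, two errors of the same subtype raised by different services land in different Issues, while a component that supplies no discriminators still groups by subtype.
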
  • Analysis — current-state survey of FE + BE Sentry today, the diagnosis gap, and constraints from the existing codebase.
  • Requirements — capability-level requirements (R-NNN) decomposed from the goal’s outcomes and acceptance criteria.
  • Specification — architectural overview, capture-path topology, common-module module structure, per-repo changes, multi-repo PR sequencing.
  • Verification — prerequisite checklist, unit-test plan, smoke-test procedures, per-AC verification with Sentry MCP queries.
  • workbooks/notebooks/operations-sentry/conclusions.md — the closing artefact of the exploration phase. Summarises all eight Design Topics and lists the implementation changes by repository. Internal workbook artefact; not published to the Starlight site.
  • workbooks/notebooks/operations-sentry/decisions/dt-001-sdk-and-agent.md through …/dt-008-be-log-forwarding.md — the eight Design Topic decision records. Internal workbook artefacts.
  • current-system/oam/sentry-observability.md (forthcoming) — the new architectural reference page. Will be the canonical published reference once the documentation deliverable lands.
  • current-system/architecture/ — existing design patterns the implementation conforms to.

Copyright: (c) Arda Systems 2025-2026, All rights reserved