Operations Sentry — Analysis

Synthesise the current state of Sentry observability across the Kotlin/Ktor backend and the Next.js frontend, identify the gap to the goal stated in goal.md, and surface the constraints from the existing codebase that the implementation must respect.

The exploration phase ran in the workbooks repository under workbooks/notebooks/operations-sentry/; the current-state survey and the DT-004 assessment are the underlying source material. This document is the published, project-scoped synthesis.

Backend (operations) — what is wired today

The Sentry OpenTelemetry Java agent is bundled into the Jib image at build time. Specifically:

  • operations/gradle/libs.versions.toml pins sentry-otel-agent-version = "8.41.0" and exposes the dependency as libs.sentry.otel.agent = io.sentry:sentry-opentelemetry-agent.
  • operations/build.gradle.kts defines a dedicated sentryAgent Gradle configuration. The copySentryAgent task downloads the agent jar from Maven Central and renames it to sentry-otel-agent.jar. The Jib plugin bundles it at /app/agents/sentry-otel-agent.jar inside the container image.

No application-level Sentry SDK is wired — io.sentry:sentry and io.sentry:sentry-logback are not declared as dependencies, and there is no Sentry.init { … } call anywhere in operations/src/main/kotlin/ or common-module/lib/src/main/kotlin/.

The Helm chart in operations/src/main/helm/ already exposes a Sentry-aware configuration surface:

  • values.yaml defines oam.performance.sentry.{enabled,environment,tracesSampleRate}. Default enabled: false.
  • templates/deployment.yaml appends -javaagent:/app/agents/sentry-otel-agent.jar to JAVA_TOOL_OPTIONS when sentry.enabled is true, and emits four env vars: SENTRY_DSN (from be-sentry-dsn K8s Secret, optional: true → fail-soft posture), SENTRY_ENVIRONMENT, SENTRY_RELEASE set to {appName}@{Chart.AppVersion}, and SENTRY_TRACES_SAMPLE_RATE.
  • templates/secrets.yaml declares the ExternalSecret that materialises be-sentry-dsn from {Infrastructure}-SentryDsn in AWS Secrets Manager (e.g. Alpha001-SentryDsn).
  • Per-environment values: dev and stage run at tracesSampleRate: 1.0 (full sampling to exercise the pipeline end-to-end); demo and prod at 0.1 (bound quota). Local is off.

The Ktor application is bootstrapped from operations/src/main/kotlin/cards/arda/operations/runtime/Main.kt, which calls into common-module’s Component.build(...). All cross-cutting Ktor plugin installs live in common-module/lib/src/main/kotlin/cards/arda/common/lib/component/Component.kt:200-247. The relevant installs are:

  • ServerCallId — read or generate X-Request-Id, propagate to MDC and the response.
  • CallLogging at INFO level with callIdMdc(CallFieldNames.callId).
  • StatusPages with a single exception<Throwable> handler that unwraps the cause when the wrapper message equals the cause’s, maps to a response via toErrorResponse(), logs app.log.warn("Error: …", path, toProcess), and responds with the mapped HTTP status.
  • ContentNegotiation, ServerMDCPropagationPlugin, ServerPerfMonitoringPlugin, Authentication.

The StatusPages handler is the single canonical point that every unhandled exception reaches today. Logback in operations/src/main/resources/logback.xml has only two console appenders (STDOUT, PERF_STDOUT); the root logger is at debug and is routed to PERF_STDOUT. There is no Sentry Logback appender.

The application’s errors flow through a sealed AppError hierarchy defined in common-module/lib/src/main/kotlin/cards/arda/common/lib/lang/errors/AppError.kt:

  • AppError.Internal.*: Implementation, Infrastructure, ExternalService, InternalService, InternalTimeout, IncompatibleState, NotImplemented. All bug-worthy or operational signals.
  • AppError.Invocation.*: GeneralValidation, ArgumentValidation, ContextValidation, NullArgument, NotFound, Duplicate, plus Authorization.{NotAuthenticated,NotAuthorized}. All caller-driven.
  • AppError.Composite: wraps multiple underlying AppErrors.
  • AppError.Generic: catch-all when no other type fits.

The HTTP-status mapping in common-module/lib/src/main/kotlin/cards/arda/common/lib/api/rest/types/HttpResponses.kt (Responses.appErrorResponse) translates each AppError to an HTTP response. As of today every Internal.* maps to a 5XX status and every Invocation.* maps to a 4XX status. The Sentry filtering policy in this project is intentionally decoupled from that mapping — see specification.md.
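
To make the decoupling concrete, the following is a minimal sketch of a classification that looks only at the error's family and never at its HTTP status. It is written in TypeScript purely for illustration; the real hierarchy is the Kotlin sealed class described above, and the method it models is the reportable() named in the constraints later in this document. The policy shown (Invocation errors are not reported, Composite reports whatever its members report, everything else reports itself) is an assumption, because the authoritative policy lives in specification.md. The type name AppErrorModel and its field names are hypothetical.

```typescript
// Hypothetical TypeScript model of the Kotlin AppError families (illustration only).
type AppErrorModel =
  | { family: "Internal"; type: string; message: string }
  | { family: "Invocation"; type: string; message: string }
  | { family: "Composite"; errors: AppErrorModel[] }
  | { family: "Generic"; message: string };

// Assumed policy, NOT taken from specification.md:
//  - Invocation (caller-driven) errors are not sent to Sentry.
//  - Composite reports whatever its members report, recursively.
//  - Everything else (Internal, Generic) reports itself by default.
function reportable(error: AppErrorModel): AppErrorModel[] {
  if (error.family === "Invocation") {
    return [];
  }
  if (error.family === "Composite") {
    return error.errors.reduce<AppErrorModel[]>(
      (acc, member) => acc.concat(reportable(member)),
      [],
    );
  }
  return [error];
}

// The HTTP status is deliberately not an input, so a change to the status
// mapping cannot change what is reported.
const mixed: AppErrorModel = {
  family: "Composite",
  errors: [
    { family: "Invocation", type: "NotFound", message: "item missing" },
    { family: "Internal", type: "Infrastructure", message: "db unavailable" },
  ],
};
const toReport = reportable(mixed); // only the Infrastructure member remains
```

Under this assumed policy a Composite that mixes caller-driven and operational failures yields only the operational ones, regardless of which HTTP status the response mapping picks for the composite as a whole.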

operations/src/main/kotlin/cards/arda/operations/system/batch/ contains existing batch infrastructure:

  • business/Job.kt — domain model for a tracked job.
  • service/JobService.kt — service interface with suspend operations on a JobTracker.
  • persistence/{JobPersistence,JobUniverse}.kt — bitemporal-backed persistence.
  • csvupload/api/rest/server/{CsvUploadRoutes.kt,Model.kt} — REST routes that initiate CSV-upload jobs.

The REST routes go through the Ktor StatusPages handler today; their errors are covered by the boundary capture once it lands. Whether long-running job execution proceeds inside the request coroutine (covered by the request boundary) or on a separate coroutine scope launched in the background (not covered) requires a code-level audit during implementation. The runBoundary("<job-label>") { … } helper from DT-006 is intended to wrap any work the audit finds outside the request boundary.

Frontend (arda-frontend-app) — what is wired today

The frontend is the more mature consumer of Sentry. The arda-frontend-app package depends on @sentry/nextjs ^10.43.0 and wires three init paths (Next.js convention):

  • src/instrumentation-client.ts: browser context. Enables replayIntegration(). tracesSampleRate = 1.0 in dev/stage, 0.2 in prod. replaysSessionSampleRate = 1.0 in dev/stage, 0.1 in prod; replaysOnErrorSampleRate = 1.0 always. enableLogs: true. sendDefaultPii: false. Exports onRouterTransitionStart = Sentry.captureRouterTransitionStart for SPA route-change spans.
  • sentry.server.config.ts: Next.js server runtime. Same trace rate. enableLogs: true. No replay.
  • sentry.edge.config.ts: Next.js Edge runtime. Mirror of the server config.

The build wiring in next.config.ts calls withSentryConfig(...) with org: "arda-systems", project: "arda-frontend", tunnelRoute: "/monitoring", widenClientFileUpload: true. Source maps upload in CI. src/instrumentation.ts dispatches to the right runtime init and exports onRequestError = Sentry.captureRequestError.

The DSN is delivered via NEXT_PUBLIC_SENTRY_DSN env var, read identically in all three init paths.

tracePropagationTargets is not set explicitly in any of the three configs. The SDK default applies (localhost + same-origin paths in the browser; broader server default). This is the gap DT-007 addresses.

The live arda-systems organisation, queried via the Sentry MCP, has two projects: arda-frontend and platform-be.

  • platform-be: 1 Issue total. N+1 Query (auto-detected by Sentry’s performance detector), 912 events, culprit POST /v1/kanban/(authenticate auth)/kanban-card/details. No application exceptions are captured today because no SDK init exists; the OTel agent emits spans but not Issues from caught exceptions.
  • arda-frontend: 28 Issues. Mix of app-code errors (TypeError: filteringProperties is not iterable, ReferenceError: filteredItems is not defined), HTTP performance detections (N+1 API Call, 2 304 events on /items), and — most telling for end-to-end diagnosis — Issues like Error: Update item failed: 500 and Error: SSRM query failed: 500. These are backend 500s observed by the frontend; the corresponding backend exceptions are missing from Sentry.

| Project | http span class | Span count | Unique traces |
| --- | --- | --- | --- |
| platform-be | http.server | 38 819 | 2 796 |
| arda-frontend | http.client | 11 222 | 105 |

Slowest backend routes by p95 latency over the last seven days (a sample):

  • POST /v1/kanban/.../kanban-card/details — p95 ≈ 2.08 s.
  • GET /v1/kanban/.../kanban-card/for-item/{item-eid} — p95 ≈ 2.79 s.
  • POST /v1/kanban/.../kanban-card/details/{status} — p95 ≈ 0.19 s.

A targeted Sentry MCP query confirms that trace IDs propagate between projects. Three concrete examples from the last hour at the time of analysis:

| Trace ID | arda-frontend transactions | platform-be transactions |
| --- | --- | --- |
| a5ba309360fbdf0158558f3ab16b3513 | POST /api/arda/kanban/kanban-card/query-details-by-item (100); POST /api/arda/kanban/kanban-card (60); POST /api/arda/kanban/kanban-card/[eId]/event/fulfill (60) | POST /v1/kanban/.../kanban-card/details (100); POST /v1/kanban/.../kanban-card (60); POST /v1/kanban/.../kanban-card/{card-eid}/event/receive (60) |
| aed801215d6849c981d3bd16543225fc | GET /api/arda/kanban/kanban-card/query-by-item (96); POST /api/arda/kanban/kanban-card/query-details-by-item (35) | GET /v1/kanban/.../kanban-card/for-item/{item-eid} (149); POST /v1/kanban/.../kanban-card/details (51) |
| 10f97a7b4683415b8a697df52b700de7 | POST /api/arda/kanban/kanban-card/query-details-by-item (30) | POST /v1/kanban/.../kanban-card/details (30); GET /v1/kanban/.../kanban-card/for-item/{item-eid} (30) |

The cardinality mismatch in the 24-hour totals (FE: 105 unique traces vs. BE: 2 796) reflects two real properties, not broken propagation:

  • Per-page trace lifetime on the frontend. One Next.js trace spans many fetch calls inside one page session: 11 222 http.client spans across 105 traces is roughly 107 spans per trace, so ~105 user sessions × ~100 API calls each is plausible for a kanban-heavy UI.
  • Non-FE traffic on the backend. Some backend traces originate outside the BFF — mobile clients, scripted callers, health probes, internal jobs. Those traces never had a FE-side root.

Implication: the protocol layer of end-to-end correlation already works today. The remaining gaps are about what each side captures at the endpoints of those traces, not whether traces stitch together.

| Concern | Frontend (arda-frontend) | Backend (platform-be) | End-to-end |
| --- | --- | --- | --- |
| Exception capture | JS exceptions, async errors, request errors | Only Sentry-auto-detected perf issues; no app exceptions | Backend stack traces missing for FE-observed 500s |
| Trace / spans | Page transactions, fetch spans, route transitions | Ktor/Netty/Exposed spans via OTel agent | Trace IDs propagate FE↔BE |
| Logs | enableLogs: true (sent to Sentry) | stdout/CloudWatch only; no Sentry forwarding | Asymmetric |
| Session replay | Browser replays (10 % prod, 100 % on error) | n/a | OK where applicable |
| Release health (sessions) | Implicit via replay + Next.js session | No Sentry sessions emitted | No backend crash-free metric per release |
| Source maps / releases | Uploaded in CI by withSentryConfig | Release tag set but no SDK to use it for source-mapped stack traces | FE/BE release schemes diverge |
| Sample rate (prod) | 0.2 | 0.1 | Misaligned: half of FE-sampled traces drop on BE |
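
The "half" in the sample-rate row follows from simple arithmetic if both sides are assumed to make a consistent, trace-level decision, so that any trace kept at the lower rate is also kept at the higher one. That assumption and the function name below are introduced here for illustration only. How the agent actually treats an upstream sampling decision should be confirmed before relying on the number.

```typescript
// Fraction of frontend-sampled traces that the backend also keeps, assuming
// consistent sampling: when the backend rate is lower, the backend keeps a
// subset of what the frontend keeps (an assumption, see the paragraph above).
function backendRetention(frontendRate: number, backendRate: number): number {
  if (frontendRate <= 0) {
    return 0; // nothing is sampled on the frontend, so nothing can be retained
  }
  return Math.min(frontendRate, backendRate) / frontendRate;
}

const prodToday = backendRetention(0.2, 0.1);   // prod rates from the table
const prodAligned = backendRetention(0.2, 0.2); // if the backend rate matched the frontend
```

With today's prod rates the ratio is 0.1 / 0.2, which is where the one-half figure comes from; with matching rates no frontend-sampled trace is lost under this model.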

The end-to-end picture today breaks down at exactly the moment when it matters most: when a user reports something is broken. A representative scenario:

  1. User submits an item update; backend returns 500.
  2. Frontend captures Error: Update item failed: 500 with full client stack, replay, request URL. → Investigator opens the FE Issue in Sentry.
  3. FE Issue has a trace ID; the trace view shows the FE span calling the BFF, the BFF span calling the backend, and a backend span ending in status=error.
  4. The backend span is annotated by the OTel agent (route, duration, attributes) but has no stack trace because no SDK has captured the exception. The investigator now leaves Sentry and goes to CloudWatch logs to find the backend stack.

Closing step 4 is the highest-leverage change in the project. It is what DT-003 and DT-006 deliver.

documentation/src/content/docs/process/craft/operations-and-monitoring/sentry-integration.md (current location in the documentation site) describes an SDK-only approach with a SentryMonitor.init(...) call in operations/.../runtime/Main.kt. That contradicts the agent-based approach already implemented in build.gradle.kts and the Helm chart. The document is misleading at face value and is superseded by this project’s deliverables (a new architectural page and a rewritten how-to; see goal.md Deliverables #6 and #6b).

The implementation must respect these properties of the code as it stands today:

  1. The OTel agent must continue to attach. Removing the agent loses the auto-instrumentation that produces the spans Sentry already captures. DT-001 explicitly keeps both the agent and the in-process SDK.
  2. Component.build(...) is the only bootstrap path. Every component (operations, accounts-component, future services) goes through it; adding the SDK init here means every component inherits the behaviour with no per-service plumbing.
  3. StatusPages is the single Ktor boundary today. Modifying its installation in common-module/.../component/Component.kt:200-247 replaces today’s app.log.warn(...) with the reportable()-driven capture; no per-route changes are needed.
  4. AppError is sealed. Adding the reportable(): List<Throwable> method on the base, with Invocation and Composite overrides, is exhaustive by construction: the sealed hierarchy fixes the set of subtypes at compile time, and any subtype added later inherits the base default unless it overrides it.
  5. Helm value paths are stable. Extending oam.performance.sentry.{...} with a sessions sub-object and a scrubSalt-bearing env var keeps the existing enabled, environment, tracesSampleRate keys intact.
  6. ESO + AWS Secrets Manager is the established secret-delivery path. The new SENTRY_SCRUB_SALT secret mirrors the existing SENTRY_DSN pattern, but per partition (purpose) rather than per Infrastructure: {Infrastructure}-{purpose}-SentryScrubSalt (e.g. Alpha001-prod-SentryScrubSalt) → ESO → be-sentry-scrub-salt K8s Secret in the component’s namespace → pod env var, optional: true for fail-soft startup. The CDK-side stack (PartitionSecrets) is created alongside the existing per-purpose stacks (partition-authn, partition-bulk-stores, …) in apps/Al1x/partition.ts.
  7. logback.xml root is at debug. Lowering the root to info is out of scope; the Sentry appender filters at ERROR regardless of root level, so the existing logging surface stays unchanged.
  8. common-module is consumed by composite build. Operations builds with --include-build ../common-module; any change in common-module is immediately visible to operations during local development. PR sequencing for cross-repo deployments still applies via the release-lifecycle skill.
  9. The arda-common-version in operations/gradle/libs.versions.toml is the version pin. A bump here is the operations-side adoption of the new common-module work.
  10. accounts-component consumes the same common-module. Any signature, default, or option in common-module’s SDK init must be compatible with accounts-component’s eventual adoption (PDEV-533). Defaults should be safe for both consumers.
  1. The frontend’s three Sentry init paths share env-variable conventions. NEXT_PUBLIC_DEPLOY_ENV drives the env name in all three configs. The backend-host discovery for tracePropagationTargets should reuse whatever the existing API client uses to know which backend to call — not introduce a parallel mechanism.
  2. The tunnelRoute: "/monitoring" in next.config.ts must remain in any allowlist applied to tracePropagationTargets, since Sentry’s outbound traffic in the browser is routed through it.
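
The two frontend constraints above can be sketched as a small helper that builds an explicit allowlist. This is a minimal sketch under stated assumptions, not the project's implementation: backendOrigin is a hypothetical parameter standing in for whatever the existing API client already uses, the /api/ prefix is inferred from the /api/arda/... transaction names observed in Sentry, and the function names are invented for illustration. The isTraced helper only mirrors the option's matching rule (strings as substrings, regular expressions via test) so the list can be checked. How the SDK resolves relative URLs before matching should be verified against its documentation.

```typescript
type TraceTarget = string | RegExp;

// Escape a literal so it can be embedded in a RegExp without special meaning.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds an explicit allowlist. `backendOrigin` is a hypothetical input: it
// should come from whatever the existing API client already uses.
function buildTracePropagationTargets(
  backendOrigin: string | undefined,
  tunnelRoute: string = "/monitoring",
): TraceTarget[] {
  const list: TraceTarget[] = [
    /^\/api\//, // same-origin BFF routes (prefix assumed from observed transactions)
    new RegExp(`^${escapeRegExp(tunnelRoute)}(?:[/?]|$)`), // tunnel route from next.config.ts
  ];
  if (backendOrigin) {
    // Anchor on the full origin so look-alike hosts do not match.
    list.push(new RegExp(`^${escapeRegExp(backendOrigin)}(?:/|$)`));
  }
  return list;
}

// Mirrors the matching rule for checking the list: strings match as
// substrings, RegExps via test().
function isTraced(url: string, list: TraceTarget[]): boolean {
  return list.some((target) =>
    typeof target === "string" ? url.includes(target) : target.test(url),
  );
}

// Placeholder origin, not a real host.
const exampleTargets = buildTracePropagationTargets("https://backend.example.test");
```

In each of the three init files the resulting array would be passed as the tracePropagationTargets option of Sentry.init, so the browser, server, and Edge runtimes share one definition.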

Against the goal, today’s state has these material gaps:

  • No backend exception capture. Closed by the SDK init + AppError.reportable() + boundary wrapper + Logback appender + JVM uncaught handler.
  • No backend session emission. Closed by the SDK session config.
  • Backend tracesSampleRate misaligned with frontend in prod (0.1 vs 0.2). Closed by the Helm rate bump.
  • Frontend tracePropagationTargets implicit. Closed by explicit env-aware configuration.
  • No PII-aware scrubbing. Closed by beforeSend in the SDK init.
  • Stale how-to documentation. Closed by the new architectural page + rewritten how-to.
  • Partition-level Sentry scrub salt not yet provisioned in AWS Secrets Manager. A new PartitionSecrets CDK stack in infrastructure (stacks/purpose/partition-secrets.ts, wired in apps/Al1x/partition.ts) creates {Infrastructure}-{purpose}-SentryScrubSalt per partition via Secret.generateSecretString, with an optional override from platforms.ts’s PartitionInfo.sentryScrubSaltOverride. The salt is per-purpose (shared across every Kotlin component within the same purpose); cross-purpose correlation is deliberately broken. The salt is not credential-grade: AWS Secrets Manager is used here only for ESO compatibility, not formal credential discipline. No 1Password mirroring is required.
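
The naming convention in the last gap can be sketched as a pair of pure functions. Only the name pattern, the example Alpha001-prod-SentryScrubSalt, and the existence of an optional override come from this document; the function and type names are hypothetical, the "stage" purpose and the override value are placeholders, and the real stack would create the secret through the CDK Secret construct rather than return a plan object.

```typescript
// Hypothetical description of what the stack would create for one partition.
type ScrubSaltPlan =
  | { secretName: string; source: "generated" }
  | { secretName: string; source: "override"; value: string };

// Builds the per-partition secret name: {Infrastructure}-{purpose}-SentryScrubSalt.
function scrubSaltSecretName(infrastructure: string, purpose: string): string {
  return `${infrastructure}-${purpose}-SentryScrubSalt`;
}

// Chooses between a generated salt and an explicit override (modelled on
// PartitionInfo.sentryScrubSaltOverride; the exact shape is assumed).
function planScrubSalt(
  infrastructure: string,
  purpose: string,
  override?: string,
): ScrubSaltPlan {
  const secretName = scrubSaltSecretName(infrastructure, purpose);
  return override === undefined
    ? { secretName, source: "generated" }
    : { secretName, source: "override", value: override };
}

const prodPlan = planScrubSalt("Alpha001", "prod");
const stagePlan = planScrubSalt("Alpha001", "stage", "placeholder-salt");
```

Because the purpose is part of the name, two purposes on the same Infrastructure never share a salt, which matches the intent that cross-purpose correlation stays broken.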

The constraints above are why the implementation is concentrated in common-module rather than operations. Every gap above touches common-module’s central wiring; operations only inherits and configures.