Operations Sentry — Analysis
Purpose
Synthesise the current state of Sentry observability across the Kotlin/Ktor backend and the Next.js frontend, identify the gap to the goal stated in goal.md, and surface the constraints from the existing codebase that the implementation must respect.
The exploration phase ran in the workbooks repository under workbooks/notebooks/operations-sentry/; the current-state survey and the DT-004 assessment are the underlying source material. This document is the published, project-scoped synthesis.
Backend (operations) — what is wired today
Build and image
The Sentry OpenTelemetry Java agent is bundled into the Jib image at build time. Specifically:
- operations/gradle/libs.versions.toml pins sentry-otel-agent-version = "8.41.0" and exposes the dependency as libs.sentry.otel.agent = io.sentry:sentry-opentelemetry-agent.
- operations/build.gradle.kts defines a dedicated sentryAgent Gradle configuration. The copySentryAgent task downloads the agent jar from Maven Central and renames it to sentry-otel-agent.jar. The Jib plugin bundles it at /app/agents/sentry-otel-agent.jar inside the container image.
No application-level Sentry SDK is wired — io.sentry:sentry and io.sentry:sentry-logback are not declared as dependencies, and there is no Sentry.init { … } call anywhere in operations/src/main/kotlin/ or common-module/lib/src/main/kotlin/.
Helm chart
The Helm chart in operations/src/main/helm/ already exposes a Sentry-aware configuration surface:
- values.yaml defines oam.performance.sentry.{enabled,environment,tracesSampleRate}. Default enabled: false.
- templates/deployment.yaml appends -javaagent:/app/agents/sentry-otel-agent.jar to JAVA_TOOL_OPTIONS when sentry.enabled is true, and emits four env vars: SENTRY_DSN (from the be-sentry-dsn K8s Secret, optional: true → fail-soft posture), SENTRY_ENVIRONMENT, SENTRY_RELEASE set to {appName}@{Chart.AppVersion}, and SENTRY_TRACES_SAMPLE_RATE.
- templates/secrets.yaml declares the ExternalSecret that materialises be-sentry-dsn from {Infrastructure}-SentryDsn in AWS Secrets Manager (e.g. Alpha001-SentryDsn).
- Per-environment values: dev and stage run at tracesSampleRate: 1.0 (full sampling to exercise the pipeline end-to-end); demo and prod at 0.1 (to bound quota usage). Local is off.
Application code
The Ktor application is bootstrapped from operations/src/main/kotlin/cards/arda/operations/runtime/Main.kt, which calls into common-module’s Component.build(...). All cross-cutting Ktor plugin installs live in common-module/lib/src/main/kotlin/cards/arda/common/lib/component/Component.kt:200-247. The relevant installs are:
- ServerCallId: read or generate X-Request-Id, propagate to MDC and the response.
- CallLogging at INFO level with callIdMdc(CallFieldNames.callId).
- StatusPages with a single exception<Throwable> handler that unwraps the cause when the wrapper message equals the cause’s, maps to a response via toErrorResponse(), logs app.log.warn("Error: …", path, toProcess), and responds with the mapped HTTP status.
- ContentNegotiation, ServerMDCPropagationPlugin, ServerPerfMonitoringPlugin, Authentication.
The StatusPages handler is the single canonical point that every unhandled exception reaches today. Logback in operations/src/main/resources/logback.xml has only two console appenders (STDOUT, PERF_STDOUT); the root logger is at debug and routed to PERF_STDOUT. There is no Sentry Logback appender.
Error type hierarchy
The application’s errors flow through a sealed AppError hierarchy defined in common-module/lib/src/main/kotlin/cards/arda/common/lib/lang/errors/AppError.kt:
- AppError.Internal.*: Implementation, Infrastructure, ExternalService, InternalService, InternalTimeout, IncompatibleState, NotImplemented. All bug-worthy or operational signals.
- AppError.Invocation.*: GeneralValidation, ArgumentValidation, ContextValidation, NullArgument, NotFound, Duplicate, plus Authorization.{NotAuthenticated,NotAuthorized}. All caller-driven.
- AppError.Composite: wraps multiple underlying AppErrors.
- AppError.Generic: catch-all when no other type fits.
The HTTP-status mapping in common-module/lib/src/main/kotlin/cards/arda/common/lib/api/rest/types/HttpResponses.kt (Responses.appErrorResponse) translates each AppError to an HTTP response. As of today every Internal.* maps to a 5XX status and every Invocation.* maps to a 4XX status. The Sentry filtering policy in this project is intentionally decoupled from that mapping — see specification.md.
Batch infrastructure
operations/src/main/kotlin/cards/arda/operations/system/batch/ contains existing batch infrastructure:
- business/Job.kt: domain model for a tracked job.
- service/JobService.kt: service interface with suspend operations on a JobTracker.
- persistence/{JobPersistence,JobUniverse}.kt: bitemporal-backed persistence.
- csvupload/api/rest/server/{CsvUploadRoutes.kt,Model.kt}: REST routes that initiate CSV-upload jobs.
The REST routes go through the Ktor StatusPages handler today; their errors are covered by the boundary capture once it lands. Whether long-running job execution proceeds inside the request coroutine (covered by the request boundary) or on a separate coroutine scope launched in the background (not covered) requires a code-level audit during implementation. The runBoundary("<job-label>") { … } helper from DT-006 is intended to wrap any work the audit finds outside the request boundary.
Frontend (arda-frontend-app) — what is wired today
The frontend is the more mature consumer of Sentry. The arda-frontend-app package depends on @sentry/nextjs ^10.43.0 and wires three init paths (Next.js convention):
- src/instrumentation-client.ts: browser context. Enables replayIntegration(). tracesSampleRate=1.0 in dev/stage, 0.2 in prod. replaysSessionSampleRate=1.0/0.1; replaysOnErrorSampleRate=1.0 always. enableLogs: true. sendDefaultPii: false. Exports onRouterTransitionStart = Sentry.captureRouterTransitionStart for SPA route-change spans.
- sentry.server.config.ts: Next.js server runtime. Same trace rate. enableLogs: true. No replay.
- sentry.edge.config.ts: Next.js Edge runtime. Mirror of the server config.
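
The browser-side rates listed above can be read as a pure function of the deploy environment. The following is a minimal sketch, not the repository's code: the helper name, the BrowserSampleRates type, and the DeployEnv union are hypothetical, and the sketch assumes the "1.0/0.1" pair for replaysSessionSampleRate means non-prod/prod. Environments other than dev, stage, and prod are not covered because the text does not state their frontend rates.

```typescript
// Hypothetical helper: names are illustrative, not taken from the repository.
type DeployEnv = "dev" | "stage" | "prod";

interface BrowserSampleRates {
  tracesSampleRate: number;
  replaysSessionSampleRate: number;
  replaysOnErrorSampleRate: number;
}

// Maps a deploy environment to the rates described for the browser init path.
function browserSampleRates(env: DeployEnv): BrowserSampleRates {
  const isProd = env === "prod";
  return {
    tracesSampleRate: isProd ? 0.2 : 1.0,
    // Assumption: "1.0/0.1" is read as non-prod/prod.
    replaysSessionSampleRate: isProd ? 0.1 : 1.0,
    // Replays for sessions that hit an error are always kept.
    replaysOnErrorSampleRate: 1.0,
  };
}
```

Keeping the mapping in one function would let the server and edge configs reuse the trace rate while ignoring the replay fields, which only apply in the browser.
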
The build wiring in next.config.ts calls withSentryConfig(...) with org: "arda-systems", project: "arda-frontend", tunnelRoute: "/monitoring", widenClientFileUpload: true. Source maps upload in CI. src/instrumentation.ts dispatches to the right runtime init and exports onRequestError = Sentry.captureRequestError.
The DSN is delivered via NEXT_PUBLIC_SENTRY_DSN env var, read identically in all three init paths.
tracePropagationTargets is not set explicitly in any of the three configs. The SDK default applies (localhost + same-origin paths in the browser; broader server default). This is the gap DT-007 addresses.
What Sentry actually captures today
Queried via the Sentry MCP against the live arda-systems organisation, two projects: arda-frontend and platform-be.
Issues — last 30 days
- platform-be: 1 Issue total. N+1 Query (auto-detected by Sentry’s performance detector), 912 events, culprit POST /v1/kanban/(authenticate auth)/kanban-card/details. No application exceptions are captured today because no SDK init exists; the OTel agent emits spans but not Issues from caught exceptions.
- arda-frontend: 28 Issues. A mix of app-code errors (TypeError: filteringProperties is not iterable, ReferenceError: filteredItems is not defined), HTTP performance detections (N+1 API Call, 2 304 events on /items), and, most telling for end-to-end diagnosis, Issues like Error: Update item failed: 500 and Error: SSRM query failed: 500. These are backend 500s observed by the frontend; the corresponding backend exceptions are missing from Sentry.
Spans — last 24 hours
| Project | http span class | Span count | Unique traces |
|---|---|---|---|
| platform-be | http.server | 38 819 | 2 796 |
| arda-frontend | http.client | 11 222 | 105 |
Slowest backend routes by p95 latency over the last seven days (a sample):
- POST /v1/kanban/.../kanban-card/details: p95 ≈ 2.08 s.
- GET /v1/kanban/.../kanban-card/for-item/{item-eid}: p95 ≈ 2.79 s.
- POST /v1/kanban/.../kanban-card/details/{status}: p95 ≈ 0.19 s.
Distributed tracing — already working
A targeted Sentry MCP query confirms that trace IDs propagate between projects. Three concrete examples from the last hour at the time of analysis:
| Trace ID | arda-frontend transactions | platform-be transactions |
|---|---|---|
| a5ba309360fbdf0158558f3ab16b3513 | POST /api/arda/kanban/kanban-card/query-details-by-item (100); POST /api/arda/kanban/kanban-card (60); POST /api/arda/kanban/kanban-card/[eId]/event/fulfill (60) | POST /v1/kanban/.../kanban-card/details (100); POST /v1/kanban/.../kanban-card (60); POST /v1/kanban/.../kanban-card/{card-eid}/event/receive (60) |
| aed801215d6849c981d3bd16543225fc | GET /api/arda/kanban/kanban-card/query-by-item (96); POST /api/arda/kanban/kanban-card/query-details-by-item (35) | GET /v1/kanban/.../kanban-card/for-item/{item-eid} (149); POST /v1/kanban/.../kanban-card/details (51) |
| 10f97a7b4683415b8a697df52b700de7 | POST /api/arda/kanban/kanban-card/query-details-by-item (30) | POST /v1/kanban/.../kanban-card/details (30); GET /v1/kanban/.../kanban-card/for-item/{item-eid} (30) |
The cardinality mismatch in the 24-hour totals (FE: 105 unique traces vs. BE: 2 796) reflects two real properties, not broken propagation:
- Per-page trace lifetime on the frontend. One Next.js trace spans many fetch calls inside one page session; ~105 user sessions × ~100 API calls each is plausible for a kanban-heavy UI.
- Non-FE traffic on the backend. Some backend traces originate outside the BFF: mobile clients, scripted callers, health probes, internal jobs. Those traces never had a FE-side root.
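
The "~100 API calls each" estimate can be sanity-checked against the 24-hour span table above. This is plain arithmetic on the published totals; the SpanTotals type and the spansPerTrace helper are illustrative names only.

```typescript
// Figures copied from the 24-hour span table (span count, unique traces).
interface SpanTotals {
  spans: number;
  traces: number;
}

const frontend: SpanTotals = { spans: 11222, traces: 105 };
const backend: SpanTotals = { spans: 38819, traces: 2796 };

// Average number of HTTP spans recorded per unique trace.
function spansPerTrace(totals: SpanTotals): number {
  return totals.spans / totals.traces;
}

console.log(spansPerTrace(frontend).toFixed(1)); // "106.9" http.client spans per FE trace
console.log(spansPerTrace(backend).toFixed(1)); // "13.9" http.server spans per BE trace
```

Roughly 107 client spans per frontend trace is in line with the ~100 calls per page session quoted above.
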
Implication: the protocol layer of end-to-end correlation already works today. The remaining gaps are about what each side captures at the endpoints of those traces, not whether traces stitch together.
End-to-end picture today
| Concern | Frontend (arda-frontend) | Backend (platform-be) | End-to-end |
|---|---|---|---|
| Exception capture | JS exceptions, async errors, request errors | Only Sentry-auto-detected perf issues; no app exceptions | Backend stack traces missing for FE-observed 500s |
| Trace / spans | Page transactions, fetch spans, route transitions | Ktor/Netty/Exposed spans via OTel agent | Trace IDs propagate FE↔BE |
| Logs | enableLogs: true (sent to Sentry) | stdout/CloudWatch only; no Sentry forwarding | Asymmetric |
| Session replay | Browser replays (10 % prod, 100 % on error) | n/a | OK where applicable |
| Release health (sessions) | Implicit via replay + Next.js session | No Sentry sessions emitted | No backend crash-free metric per release |
| Source maps / releases | Uploaded in CI by withSentryConfig | Release tag set but no SDK to use it for source-mapped stack traces | FE/BE release schemes diverge |
| Sample rate (prod) | 0.2 | 0.1 | Misaligned — half of FE-sampled traces drop on BE |
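
The "half" figure in the sample-rate row can be reproduced under one assumption that the table does not state: both sides derive their keep/drop decision deterministically from the same trace ID against their own rate, so the lower-rate side keeps a subset of what the higher-rate side keeps. Other sampler configurations (for example, a backend that simply honours the incoming sampling decision, or one that samples independently) would give a different share. The helper name below is hypothetical.

```typescript
// Share of frontend-sampled traces that the backend also keeps, ASSUMING both
// sides make a consistent, trace-ID-based decision against their own rate.
function retainedShare(feRate: number, beRate: number): number {
  if (feRate <= 0) return 0; // nothing sampled on the frontend, nothing to retain
  return Math.min(feRate, beRate) / feRate;
}

console.log(retainedShare(0.2, 0.1)); // 0.5: today's prod rates
console.log(retainedShare(0.2, 0.2)); // 1: aligned rates
```

Under that assumption, aligning the backend rate with the frontend rate is what brings the retained share to 1.
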
The diagnosis gap
The end-to-end picture today breaks down at exactly the moment when it matters most: when a user reports something is broken. A representative scenario:
1. User submits an item update; backend returns 500.
2. Frontend captures Error: Update item failed: 500 with full client stack, replay, request URL. → Investigator opens the FE Issue in Sentry.
3. FE Issue has a trace ID; the trace view shows the FE span calling the BFF, the BFF span calling the backend, and a backend span ending in status=error.
4. The backend span is annotated by the OTel agent (route, duration, attributes) but has no stack trace because no SDK has captured the exception. The investigator now leaves Sentry and goes to CloudWatch logs to find the backend stack.
Closing step 4 is the highest-leverage change in the project. It is what DT-003 and DT-006 deliver.
Stale guidance
documentation/src/content/docs/process/craft/operations-and-monitoring/sentry-integration.md (current location in the documentation site) describes an SDK-only approach with a SentryMonitor.init(...) call in operations/.../runtime/Main.kt. That contradicts the agent-based approach already implemented in build.gradle.kts and the Helm chart. The document is misleading at face value and is superseded by this project’s deliverables (a new architectural page and a rewritten how-to; see goal.md Deliverables #6 and #6b).
Constraints from the existing codebase
The implementation must respect these properties of the code as it stands today:
- The OTel agent must continue to attach. Removing the agent loses the auto-instrumentation that produces the spans Sentry already captures. DT-001 explicitly keeps both the agent and the in-process SDK.
- Component.build(...) is the only bootstrap path. Every component (operations, accounts-component, future services) goes through it; adding the SDK init here means every component inherits the behaviour with no per-service plumbing.
- StatusPages is the single Ktor boundary today. Modifying its installation in common-module/.../component/Component.kt:200-247 replaces today’s app.log.warn(...) with the reportable()-driven capture; no per-route changes are needed.
- AppError is sealed. Adding the reportable(): List<Throwable> method on the base with Invocation and Composite overrides is exhaustive by construction. Because the hierarchy is sealed, the set of subtypes is closed at compile time, and any new subtype added later inherits the default behaviour from the base.
- Helm value paths are stable. Extending oam.performance.sentry.{...} with a sessions sub-object and a scrubSalt-bearing env var keeps the existing enabled, environment, tracesSampleRate keys intact.
- ESO + AWS Secrets Manager is the established secret-delivery path. The new SENTRY_SCRUB_SALT secret mirrors the existing SENTRY_DSN pattern, but per partition (purpose) rather than per Infrastructure: {Infrastructure}-{purpose}-SentryScrubSalt (e.g. Alpha001-prod-SentryScrubSalt) → ESO → be-sentry-scrub-salt K8s Secret in the component’s namespace → pod env var, optional: true for fail-soft startup. The CDK-side stack (PartitionSecrets) is created alongside the existing per-purpose stacks (partition-authn, partition-bulk-stores, …) in apps/Al1x/partition.ts.
- logback.xml root is at debug. Lowering the root to info is out of scope; the Sentry appender filters at ERROR regardless of root level, so the existing logging surface stays unchanged.
- common-module is consumed by composite build. Operations builds with --include-build ../common-module; any change in common-module is immediately visible to operations during local development. PR sequencing for cross-repo deployments still applies via the release-lifecycle skill.
- The arda-common-version in operations/gradle/libs.versions.toml is the version pin. A bump here is the operations-side adoption of the new common-module work.
- accounts-component consumes the same common-module. Any signature, default, or option in common-module’s SDK init must be compatible with accounts-component’s eventual adoption (PDEV-533). Defaults should be safe for both consumers.
Frontend constraints
- The frontend’s three Sentry init paths share env-variable conventions. NEXT_PUBLIC_DEPLOY_ENV drives the env name in all three configs. The backend-host discovery for tracePropagationTargets should reuse whatever the existing API client uses to know which backend to call, rather than introduce a parallel mechanism.
- The tunnelRoute: "/monitoring" in next.config.ts must remain in any allowlist applied to tracePropagationTargets, since Sentry’s outbound traffic in the browser is routed through it.
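
The two constraints above can be sketched as a small builder for an explicit allowlist. This is a sketch under stated assumptions, not the project's implementation: the function name and its backendOrigin parameter are hypothetical, and the localhost and same-origin entries are an assumed rendering of the SDK default described earlier. The only Sentry-specific fact relied on is that tracePropagationTargets accepts a list of strings and regular expressions.

```typescript
// Hypothetical helper; the function name and backendOrigin parameter are illustrative.
type PropagationTarget = string | RegExp;

// Mirrors tunnelRoute in next.config.ts.
const TUNNEL_ROUTE = "/monitoring";

function buildTracePropagationTargets(backendOrigin?: string): PropagationTarget[] {
  const targets: PropagationTarget[] = [
    "localhost",
    // Same-origin relative paths ("/items"), but not protocol-relative URLs ("//host").
    /^\/(?!\/)/,
    // Already covered by the pattern above; listed explicitly so it survives
    // any later tightening of the allowlist.
    TUNNEL_ROUTE,
  ];
  // Assumption: the caller passes the same origin the existing API client uses.
  if (backendOrigin) {
    targets.push(backendOrigin);
  }
  return targets;
}
```

The returned list would be passed as the tracePropagationTargets option to Sentry.init in each of the three init paths, with the backend origin taken from whatever source the existing API client already uses.
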
Summary of the gap
Against the goal, today’s state has these material gaps:
- No backend exception capture. Closed by the SDK init + AppError.reportable() + boundary wrapper + Logback appender + JVM uncaught handler.
- No backend session emission. Closed by the SDK session config.
- Backend tracesSampleRate misaligned with frontend in prod (0.1 vs 0.2). Closed by the Helm rate bump.
- Frontend tracePropagationTargets implicit. Closed by explicit env-aware configuration.
- No PII-aware scrubbing. Closed by beforeSend in the SDK init.
- Stale how-to documentation. Closed by the new architectural page + rewritten how-to.
- Partition-level Sentry scrub salt not yet provisioned in AWS Secrets Manager. A new PartitionSecrets CDK stack in infrastructure (stacks/purpose/partition-secrets.ts, wired in apps/Al1x/partition.ts) creates {Infrastructure}-{purpose}-SentryScrubSalt per partition via Secret.generateSecretString, with an optional override from platforms.ts’s PartitionInfo.sentryScrubSaltOverride. The salt is per-purpose (shared across every Kotlin component within the same purpose); cross-purpose correlation is deliberately broken. The salt is not credential-grade: AWS Secrets Manager is used here only for ESO compatibility, not formal credential discipline. No 1Password mirroring is required.
The constraints above are why the implementation is concentrated in common-module rather than operations. Every gap above touches common-module’s central wiring; operations only inherits and configures.
Copyright: © Arda Systems 2025-2026, All rights reserved