Operations Sentry — Decision Log

Records the decisions taken during the project’s exploration phase. Each entry captures what was decided and how it bears on this project’s implementation. Read this file to understand the design; read the corresponding workbook entry (workbooks/notebooks/operations-sentry/decisions/<dt-id>-<slug>.md) only when you need the why — the alternatives considered, the journey, and the full rationale.

Each decision keeps a stable identifier (DT-NNN) shared with the workbook so cross-references between the two are unambiguous.

DT-001 — Use both the Sentry SDK and the OpenTelemetry agent

Should Sentry observability rely on the bundled Sentry OpenTelemetry Java agent (already wired), an explicit in-process SDK initialisation, or both?

Use both. The OpenTelemetry agent stays as-wired today — it provides zero-code APM (Ktor / Netty / Exposed auto-instrumentation), release and environment tagging, and JVM-level attach via -javaagent:. An explicit in-process Sentry SDK initialisation is added on top, enabling exception capture, beforeSend filtering, log forwarding, and session-based release health that the agent alone does not deliver.

The agent loads first at JVM start and initialises the bundled SDK with a default Hub. The in-process Sentry.init { … } reconfigures the same Hub — both cooperate, neither replaces the other.

  • The Jib image continues to bundle sentry-opentelemetry-agent at /app/agents/sentry-otel-agent.jar; no build-side change to the agent path.
  • common-module gains dependencies on io.sentry:sentry and io.sentry:sentry-logback, version aligned with the agent’s version pin (8.41.0 today).
  • The SDK init must be idempotent and safe when the agent has already loaded the Hub; option re-application must not trigger a re-initialisation.
  • Both paths read the same env-var contract (SENTRY_DSN, SENTRY_ENVIRONMENT, SENTRY_RELEASE, SENTRY_TRACES_SAMPLE_RATE); no new secret plumbing is needed for these.

See workbooks/notebooks/operations-sentry/decisions/dt-001-sdk-and-agent.md for alternatives considered (agent-only, SDK-only) and the rationale for the both-on-one-Hub choice.

DT-002 — SDK initialisation lives in common-module

Where should the explicit Sentry SDK initialisation live — inside operations (and each future Kotlin component), or in the shared common-module library?

In common-module. A single SDK-init module under common-module/lib/src/main/kotlin/cards/arda/common/lib/runtime/observability/ is invoked as part of the standard component bootstrap (Component.build(...)). Every Kotlin/Ktor component consuming common-module (operations today, accounts-component on adoption, future services) inherits the Sentry behaviour transparently with no per-service code.

  • The new package contains SentryInit, BoundaryCapture, PiiScrubber, OpaqueId, Fingerprinting, LogbackBridge, CoroutineExceptionHandlerFactory, and supporting helpers. Final naming is at the implementer’s discretion; the package location is fixed.
  • Component.kt’s build(...) calls SentryInit.init() at the start; the Ktor StatusPages handler is replaced with the reportable()-driven wrapper.
  • Init is fail-soft on missing SENTRY_DSN: pod starts, no events emitted, no exceptions thrown. Same posture as the existing Helm optional: true secretKeyRef.
  • arda-common-version in each consuming component is bumped to the new release that ships the observability module. That is the only operations-side dependency change.
  • Tests run with an empty DSN; no separate Sentry test transport is required.

See workbooks/notebooks/operations-sentry/decisions/dt-002-sdk-init-in-common-module.md for the analysis of placement choices and how this aligns with the existing Ktor-plugin-installation pattern in Component.kt:200-247.

DT-003 — Error filtering policy via AppError.reportable()

Which errors reach Sentry as Issues, which are dropped, and where is the classification expressed?

The classification is encapsulated on the AppError type itself, via a reportable(): List<Throwable> method on the sealed class. The Sentry capture site asks the throwable what to do and iterates the result; it never inspects HTTP status or other transport-layer concerns.

The rule:

  • AppError.Internal.* and AppError.Generic inherit the default listOf(this) → captured as a Sentry Issue.
  • AppError.Invocation.* overrides to emptyList() → dropped.
  • AppError.Composite overrides to causes.flatMap { it.reportable() } → recursion through the cause chain, one Sentry event per reportable leaf.
  • Non-AppError Throwable (via a bridging extension) returns listOf(this) → captured.

  • The method lives in common-module/.../lang/errors/AppError.kt alongside the sealed hierarchy.
  • The Ktor StatusPages handler runs throwable.reportable().forEach { Sentry.captureException(it) } before the existing HTTP response is emitted. The original app.log.warn(...) line stays (it is consumed by the Logback appender per DT-008).
  • Events from a Composite carry a Sentry tag wrapped_in_composite: <composite.message> so the underlying events remain correlatable in Sentry search.
  • Fingerprinting is applied at the capture site, not in beforeSend. Default fingerprint: AppError subtype FQCN plus a discriminator from the subtype’s own fields (serviceName for ExternalService / InternalService / InternalTimeout; operationName for NotImplemented). Non-AppError throwables fingerprint by concrete class + first non-framework stack frame.
  • Per-fingerprint sampling is controlled at the SDK level (sampleRate), not by Sentry-side ingest quotas.
  • The HTTP-status mapping in common-module/.../api/rest/types/HttpResponses.kt is parallel to this policy but explicitly decoupled in code: a future change to HTTP framing does not reroute Sentry behaviour.
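
The classification rule can be sketched as a small sealed hierarchy. This is a minimal illustration rather than the actual AppError.kt: the constructor shapes, the NotFound subtype, the placement of ExternalService under Internal, and the assumption that Composite.causes is a List<Throwable> are all hypothetical and chosen only to make the example compile on its own.

```kotlin
// Illustrative sketch only; the real hierarchy lives in common-module.
sealed class AppError(message: String) : Exception(message) {
    // Default policy: the error itself becomes one Sentry Issue.
    open fun reportable(): List<Throwable> = listOf(this)

    class Generic(message: String) : AppError(message)

    // Internal errors inherit the default and are captured.
    sealed class Internal(message: String) : AppError(message) {
        class ExternalService(val serviceName: String, message: String) : Internal(message)
    }

    // Invocation errors (caller mistakes) are dropped.
    sealed class Invocation(message: String) : AppError(message) {
        override fun reportable(): List<Throwable> = emptyList()

        // Hypothetical subtype, used only for the example.
        class NotFound(message: String) : Invocation(message)
    }

    // A composite reports each reportable leaf of its causes.
    class Composite(message: String, val causes: List<Throwable>) : AppError(message) {
        override fun reportable(): List<Throwable> = causes.flatMap { it.reportable() }
    }
}

// Bridging extension: non-AppError throwables are always captured.
// The cast makes the member function win for AppError instances.
fun Throwable.reportable(): List<Throwable> {
    val appError = this as? AppError
    return appError?.reportable() ?: listOf(this)
}
```

With this shape, a capture site only needs throwable.reportable().forEach { ... } and never looks at HTTP status. A Composite holding one Internal error, one Invocation error, and one plain IllegalStateException yields two events, one per reportable leaf.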

See workbooks/notebooks/operations-sentry/decisions/dt-003-error-filtering.md for the analysis of options (HTTP-status-anchored vs AppError-semantic vs creation-site capture) and why encapsulation on the data type wins.

DT-004 — Session-based release health, sampling inherited from traces

Should the operations component emit Sentry sessions for release-health metrics? If yes, at what granularity?

Yes: enable Sentry session tracking in non-local environments. The Sentry JVM SDK 8.41.0 has a single auto-session knob (isEnableAutoSessionTracking). It exposes no SessionMode enum and no sessionSampleRate field; those concepts exist in Sentry’s browser / mobile SDKs but not on the JVM. The flag emits at most one session per JVM lifecycle (release-mode), not per request. Nothing else fills that gap: the bundled Sentry OTel Java agent does not contribute its own session emitter, and Sentry publishes no sentry-ktor server plugin (only sentry-ktor-client for HTTP client instrumentation). The documented mechanism for per-request sessions on a non-Spring Java server is therefore manual: Sentry.startSession() / Sentry.endSession() calls at each request boundary.

The project ships a small Ktor application plugin (SentryRequestSession) installed by each consuming component (currently in operations/.../Main.kt) that wraps every request with start/end calls, guarded by Sentry.isEnabled() so it no-ops when the DSN is absent. This plugin should eventually move into common-module’s Component.build(...) so future consumers inherit it without per-app wiring (tracked separately under PDEV-490). Sessions are sampled in tandem with traces (a session is emitted if and only if its enclosing trace is sampled); there is no separate session sample rate to set.

Per-environment trace sampling (the bump that delivers the FE-aligned rate):

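| Environment | tracesSampleRate | Effective session sampling |
|-------------|------------------|----------------------------|
| dev         | 1.0              | 1.0                        |
| stage       | 1.0              | 1.0                        |
| demo        | 0.2 (from 0.1)   | 0.2                        |
| prod        | 0.2 (from 0.1)   | 0.2                        |
| local       | off              | off                        |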

The demo/prod bump from 0.1 to 0.2 aligns with the frontend’s existing prod sample rate so frontend-initiated traces consistently survive on the backend side.

  • The Helm chart gains oam.performance.sentry.sessions.{ enabled } — a single-key sub-object, no mode or sampleRate field.
  • The deployment template emits both SENTRY_ENABLE_AUTO_SESSION_TRACKING (the SDK-canonical env var read by the agent at boot) and SENTRY_AUTO_SESSION_TRACKING (the legacy name common-module 8.3.0’s SentryInit reads) when sessions.enabled is true. The dual emission is a temporary workaround until common-module’s SentryInit is patched to read the canonical name (tracked under PDEV-538); after that, the legacy entry can be removed.
  • SentryInit reads SENTRY_AUTO_SESSION_TRACKING and sets options.isEnableAutoSessionTracking accordingly (defaults true). No sessionMode or sessionSampleRate calls — those properties do not exist on SentryOptions in this SDK version.
  • Each consuming component installs the SentryRequestSession Ktor application plugin in its Application configurer (or wherever the Ktor pipeline is set up) via install(SentryRequestSession) guarded by pluginOrNull(SentryRequestSession) == null. The guard makes the install idempotent so the future common-module move does not require a coordinated removal in each consumer.
  • The backend release tag stays {appName}@{Chart.AppVersion} (set by the existing deployment template). The frontend keeps the Next.js SDK’s release tag. The two schemes are independent and the divergence is intentional.
  • User-tagging on sessions is deferred to DT-005; sessions are untagged until that decision settles.
  • A small empirical verification (a single dev deploy + observation of the Release Health tab) confirms session emission post-implementation. Not a prerequisite.
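
The dual env-var emission leaves open how the patched SentryInit resolves the flag during the transition. One possibility, sketched here as an assumption and not as the PDEV-538 design, is to prefer the canonical name, fall back to the legacy name, and default to true. The function name resolveAutoSessionTracking and the lenient parsing (anything other than a literal true/false keeps the default) are hypothetical.

```kotlin
// Env var names taken from the deployment template described above.
const val CANONICAL_VAR = "SENTRY_ENABLE_AUTO_SESSION_TRACKING"
const val LEGACY_VAR = "SENTRY_AUTO_SESSION_TRACKING"

// Hypothetical helper: canonical name wins, legacy name is the fallback,
// and a missing or unparseable value keeps the documented default (true).
fun resolveAutoSessionTracking(env: Map<String, String> = System.getenv()): Boolean {
    val raw = env[CANONICAL_VAR] ?: env[LEGACY_VAR] ?: return true
    return raw.trim().lowercase().toBooleanStrictOrNull() ?: true
}
```

The result would feed options.isEnableAutoSessionTracking. Taking the environment as a parameter keeps the lookup testable without touching the process environment.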

The workbook’s DT-004 entry assumed Sentry’s session API had a mode toggle and a separate sampleRate. That framing was wrong on the JVM SDK — confirmed by reading sentry-java/sentry/src/main/java/io/sentry/SentryOptions.java at the 8.41.0 tag (no SessionMode enum exists in the JVM SDK source; only enableAutoSessionTracking plus timing knobs sessionTrackingIntervalMillis and sessionFlushTimeoutMillis). This decision-log entry carries the corrected design; the workbook stays as the historical record of the exploration (not edited per the project’s no-touch-workbook rule).

See workbooks/notebooks/operations-sentry/decisions/dt-004-session-tracking.md and the companion …/threads/dt-004-session-tracking/assessment.md for the end-to-end analysis that drove the sample-rate alignment and the framing of release health as a backend-side capability complementary to frontend session replay.

DT-005 — PII handling and payload scrubbing in beforeSend

What data is permitted in Sentry payloads from the backend, and what scrubbing must run before any event leaves the pod?

Scrubbing runs in beforeSend / beforeSendTransaction callbacks registered in SentryInit in common-module. The policy:

  • User identification — opaque. user.id is HMAC-SHA-256(salt, JWT-subject-claim) truncated to 16 hex chars. user.email, user.username, user.ip_address are unset. sendDefaultPii = false.
  • HTTP request body — capture-with-redaction. Bodies remain captured; a regex pass masks JWT-shaped substrings, AWS access-key patterns, and Sentry-DSN-like URLs with ***.
  • HTTP headers — deny by default, narrow allow-list. X-Request-Id and X-Forwarded-For pass through unchanged. X-Tenant-Id is removed from the headers and replaced with a Sentry tag tenant_hash whose value is HMAC-SHA-256(salt, tenant-id) truncated. Every other header (including Authorization, Cookie, Set-Cookie) is stripped.
  • The values produced by AppError.context lambdas are scrubbed by the same body-redaction regex set before they land as event extras.
  • DB statement attributes (db.statement) get literal redaction. A regex pass replaces single-quoted string literals with '?' and numeric literals adjacent to operators with ?. Exposed parameterises every statement today, so the redaction is effectively a safety net for any raw-SQL escape hatch.
  • Defensive backstop. A final regex pass over the serialised event JSON catches JWT, AWS access-key, and DSN-like patterns as a last resort.
  • Salt scope — per partition (purpose). The salt is {Infrastructure}-{purpose}-SentryScrubSalt in AWS Secrets Manager (e.g. Alpha001-prod-SentryScrubSalt). Same purpose → same salt across components (cross-component correlation works); different purposes deliberately uncorrelatable (privacy boundary). The salt is materialised by the new PartitionSecrets CDK stack in infrastructure (see DT-005 implications below); 1Password mirroring is not required — the salt is not credential-grade.
  • Compliance baseline — US-first, with GDPR-aligned defaults adopted where they impose no diagnostic cost (already the case for everything above).

  • A new file under common-module/.../runtime/observability/ (likely PiiScrubber.kt) implements the two beforeSend callbacks, an OpaqueId.opaqueId(plaintext) helper, and a HeadersAllowList filter. Sentry org-side data scrubbers stay enabled as defence in depth; the code-side policy is authoritative.
  • The Helm chart sources SENTRY_SCRUB_SALT from a K8s Secret be-sentry-scrub-salt (materialised by ESO from {Infrastructure}-{purpose}-SentryScrubSalt), with optional: true for fail-soft startup.
  • Empty salt at runtime falls back to a deterministic placeholder ("no-salt:<id>") — exception capture still works, but opaque-ID values are effectively plaintext. Acceptable for local / test contexts; not desired in deployed environments.
  • The salt and the per-partition CDK stack are owned by the infrastructure repository in this project; see the PartitionSecrets cross-cutting design topic below for the stack design and naming.
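
The opaque-ID and header rules can be sketched with the JDK's HMAC support alone. This is an illustration under stated assumptions, not the PiiScrubber.kt implementation: the salt is passed as a parameter here (the real helper presumably reads SENTRY_SCRUB_SALT), header names are matched case-insensitively, the 16-character truncation is reused for tenant_hash even though the policy only says "truncated", and ScrubbedHeaders and scrubHeaders are hypothetical names.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// HMAC-SHA-256 keyed with the salt, hex-encoded, truncated to 16 chars.
// An empty salt falls back to the documented "no-salt:<id>" placeholder.
fun opaqueId(plaintext: String, salt: String): String {
    if (salt.isEmpty()) return "no-salt:$plaintext"
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(salt.toByteArray(Charsets.UTF_8), "HmacSHA256"))
    val digest = mac.doFinal(plaintext.toByteArray(Charsets.UTF_8))
    return digest.joinToString("") { "%02x".format(it) }.take(16)
}

// Hypothetical result type: surviving headers plus tags to attach.
data class ScrubbedHeaders(val headers: Map<String, String>, val tags: Map<String, String>)

// Deny by default; only these two headers pass through unchanged.
val ALLOWED_HEADERS = setOf("x-request-id", "x-forwarded-for")

fun scrubHeaders(headers: Map<String, String>, salt: String): ScrubbedHeaders {
    val kept = headers.filterKeys { it.lowercase() in ALLOWED_HEADERS }
    // X-Tenant-Id never survives as a header; it becomes a hashed tag.
    val tenantId = headers.entries
        .firstOrNull { it.key.equals("X-Tenant-Id", ignoreCase = true) }
        ?.value
    val tags = if (tenantId != null) mapOf("tenant_hash" to opaqueId(tenantId, salt)) else emptyMap()
    return ScrubbedHeaders(kept, tags)
}
```

Because the salt is shared per purpose, the same subject or tenant hashes to the same value in every component of that purpose, which is what makes cross-component correlation work while keeping purposes uncorrelatable.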

See workbooks/notebooks/operations-sentry/decisions/dt-005-pii-scrubbing.md for the threat-model analysis that drove the opaque-ID HMAC choice, the headers allow-list rationale, and the DB-statement redaction policy. Note the refinement banner at the top of that file: the ceremony around the salt (originally a 1Password-mirrored formal credential) was dialed down once the threat model clarified — the salt is partition-scoped and not credential-grade.
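
The body-redaction and DB-statement passes from this decision might look roughly like the sketch below. The regular expressions are illustrative assumptions and not the project's actual pattern set: "JWT-shaped" is approximated as three base64url segments starting with eyJ, the AWS access-key pattern covers only the AKIA and ASIA prefixes, and the operator set for numeric literals is limited to =, <, >, and !.

```kotlin
// Assumed patterns; the real set lives in common-module's scrubber.
val JWT_SHAPED = Regex("""eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+""")
val AWS_ACCESS_KEY = Regex("""\b(?:AKIA|ASIA)[A-Z0-9]{16}\b""")
val DSN_LIKE = Regex("""https?://[0-9a-fA-F]+(?::[0-9a-fA-F]+)?@[A-Za-z0-9.-]+/\d+""")

// Masks every match of the three patterns with ***.
fun redactSecrets(text: String): String =
    listOf(JWT_SHAPED, AWS_ACCESS_KEY, DSN_LIKE)
        .fold(text) { acc, pattern -> pattern.replace(acc, "***") }

// Single-quoted SQL string literals, allowing '' as an escaped quote.
val SQL_STRING_LITERAL = Regex("""'(?:[^']|'')*'""")
// Numeric literals that directly follow a comparison operator.
val SQL_NUMERIC_AFTER_OPERATOR = Regex("""([=<>!]+\s*)\d+(?:\.\d+)?\b""")

fun redactSqlLiterals(statement: String): String {
    val withoutStrings = SQL_STRING_LITERAL.replace(statement, "'?'")
    return SQL_NUMERIC_AFTER_OPERATOR.replace(withoutStrings) { match ->
        match.groupValues[1] + "?"
    }
}
```

The same redactSecrets function could serve both the request-body pass and the final backstop over the serialised event JSON, since both target the same three pattern families.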

DT-006 — Boundary capture topology and JVM-level last-resort reporting

Where does the reportable()-driven Sentry capture call run, and what JVM-level safety net catches throwables that escape every boundary?

Capture happens at each module entry boundary, via a shared low-level helper in common-module plus thin per-transport wrappers:

  • Shared helper: Throwable.captureViaReportable(boundary, vararg tags) iterates reportable(), attaches the boundary tag and any extras (route, job label, etc.), applies the fingerprint, and calls Sentry.captureException per leaf.
  • Ktor (StatusPages) wrapper is the only transport-specific instrumentation today; it tags boundary: http and route: <path>. Installed by Component.kt automatically.
  • Batch / synchronous boundaries use runBoundary(label) { … } and runSuspendingBoundary(label) { … } (suspending variant for coroutine-launched work). Tag boundary: batch, job: <label>. An operations-side audit during implementation determines which entry points in system/batch/ adopt them.
  • Coroutine escapes are caught by a global CoroutineExceptionHandler factory installed alongside the SDK init. Tag boundary: coroutine.
  • JVM-level last resort. Sentry’s UncaughtExceptionHandlerIntegration is enabled (default in 8.x; made explicit). It captures unfiltered — any throwable that escapes every boundary is reported regardless of reportable() classification, on the principle that an escape is itself bug-worthy. Events from this path carry via: uncaught-handler so they are distinguishable.
  • The Logback Sentry appender (covered by DT-008) is not treated as a boundary; it forwards on a different filter (log-level + exception presence) without reportable() filtering.
  • Ktor’s existing StatusPages exception<Throwable> handler in Component.kt:230 gains the capture call before the existing log.warn(...) and HTTP response emission. The log line is preserved (it powers the Logback appender path from DT-008).
  • Background-job audit in operations/src/main/kotlin/cards/arda/operations/system/batch/: every entry point that launches work outside the request coroutine wraps the work in runBoundary or runSuspendingBoundary. If the audit finds no out-of-request entry points, the helper ships from common-module and operations adopts no calls in this project — the helper is available for future use.
  • Pre-init errors (anything raised before Sentry.init returns) bypass capture. Acceptable today; documented in the how-to.
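
The shared helper and the batch wrapper can be sketched without the Sentry SDK by injecting the two collaborators. This is a simplified illustration, not the common-module code: the decision describes an extension function Throwable.captureViaReportable(boundary, vararg tags), whereas the hypothetical BoundaryCapture class below takes the reportable() classification and the capture call as constructor parameters so the example runs standalone. Rethrowing after capture is also an assumption, and fingerprinting and the suspending variant are omitted.

```kotlin
// Hypothetical wrapper: `classify` stands in for Throwable.reportable(),
// `sink` stands in for Sentry.captureException plus tag attachment.
class BoundaryCapture(
    private val classify: (Throwable) -> List<Throwable>,
    private val sink: (Throwable, Map<String, String>) -> Unit,
) {
    // Emits one capture per reportable leaf, each tagged with the boundary.
    fun capture(error: Throwable, boundary: String, vararg tags: Pair<String, String>) {
        val allTags = mapOf("boundary" to boundary) + tags
        classify(error).forEach { leaf -> sink(leaf, allTags) }
    }

    // Batch boundary: capture anything that escapes, then rethrow so the
    // caller's own failure handling still runs (assumed behaviour).
    fun <T> runBoundary(label: String, block: () -> T): T =
        try {
            block()
        } catch (t: Throwable) {
            capture(t, "batch", "job" to label)
            throw t
        }
}
```

A job entry point would then read boundaries.runBoundary("nightly-reconcile") { doWork() }, where both the instance name and the label are placeholders. A throwable classified as non-reportable passes through the wrapper without producing an event.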

See workbooks/notebooks/operations-sentry/decisions/dt-006-boundary-capture-and-last-resort.md. Note the partial-supersession banner: the original “no Sentry Logback appender” stance was reversed by DT-008; everything else remains in force.

DT-007 — Frontend tracePropagationTargets — explicit and env-aware

How does the frontend’s distributed-tracing propagation behaviour become explicit and env-aware, so that backend-host changes do not silently break trace continuity?

Configure tracePropagationTargets explicitly in all three frontend Sentry init paths (arda-frontend-app/src/instrumentation-client.ts, arda-frontend-app/sentry.server.config.ts, arda-frontend-app/sentry.edge.config.ts). The allow-list includes:

  • The same-origin patterns the SDK default already covers (localhost, /^\/(?!\/)/).
  • The /monitoring tunnel route used by withSentryConfig’s tunnelRoute.
  • The env-specific backend host, derived from NEXT_PUBLIC_DEPLOY_ENV and the existing API-client host-discovery mechanism (the helper reuses that mechanism rather than introducing a parallel one).

  • Single-PR change in arda-frontend-app. No backend dependency, no common-module change.
  • A small helper file (likely under src/lib/sentry/) computes the allow-list per environment; the three init paths import and use it.
  • Distributed tracing between arda-frontend and platform-be already works today through the BFF same-origin paths; this change locks the configuration down so future direct browser-to-backend calls (if any are added) continue to propagate cleanly.

See workbooks/notebooks/operations-sentry/decisions/dt-007-fe-trace-propagation-targets.md and the supporting end-to-end empirical analysis in …/threads/dt-004-session-tracking/assessment.md (which confirmed via Sentry MCP that trace propagation works today but is implicit and brittle).

DT-008 — Backend Sentry Logback appender for ERROR-level events and exception traces

Should the backend forward log events to Sentry via the sentry-logback appender, in addition to the boundary-driven exception capture from DT-003 / DT-006?

Yes — install the sentry-logback appender in each consuming component’s logback.xml, with common-module providing the dependency:

  • Every log event at level ERROR or above produces a Sentry event.
  • Every log event carrying an attached Throwable, at any level, produces a Sentry event with the throwable’s stack trace.
  • Lower-level log events (INFO and above) feed Sentry as breadcrumbs attached to events captured later in the same request.

The appender does not apply reportable() filtering — log-side capture is intentionally independent of the boundary-side classification. Duplication with boundary capture is accepted; Sentry’s default fingerprinting groups duplicate events into a single Issue.

Wiring mechanism — per-component XML. Each component adds the <appender name="SENTRY" class="io.sentry.logback.SentryAppender"> block plus an <appender-ref ref="SENTRY"/> on the root logger to its own logback.xml. The wiring is visible at the surface where logging is normally configured. The DSN-empty fail-soft behaviour is handled by SentryAppender itself — when Sentry.init runs with no DSN, the appender becomes a no-op (no Hub available, events silently dropped).

  • common-module adds a runtime dependency on io.sentry:sentry-logback. No programmatic appender attach is performed by SentryInit; the SDK init module is unchanged from a Logback standpoint.
  • Each consuming component must add the XML wiring to its logback.xml as part of adopting common-module’s Sentry observability. The how-to (process/craft/operations-and-monitoring/sentry-integration.md) carries the exact XML snippet plus a verification recipe.
  • operations/src/main/resources/logback.xml changes as part of this project’s operations wave — the <appender> + <appender-ref> block is added next to the existing STDOUT / PERF_STDOUT appenders.
  • accounts-component follows the same pattern when it adopts the integration (tracked by PDEV-533).
  • The existing app.log.warn("Error: …") line in StatusPages continues to fire — it produces a Sentry event via the Logback appender, in addition to the boundary-driven event from DT-003. Sentry’s fingerprinting should merge the two into one Issue; if it does not in practice, an MDC marker at the capture site can suppress the log-side event (refinement, not a blocker).
  • Verification adds a post-deploy check: confirm the appender is attached at runtime in the deployed pod (parse logback.xml or trigger a deliberate ERROR log and observe the Sentry event).
  • A failure mode to watch for: a component that adopts common-module’s observability but forgets the XML snippet has working boundary-side capture (DT-003 / DT-006) but no log-forwarding (DT-008). The verification step catches this.

See workbooks/notebooks/operations-sentry/decisions/dt-008-be-log-forwarding.md. This decision supersedes the “no Logback appender installed” line in DT-006’s resolution; all other clauses of DT-006 remain in force. The rationale for accepting the duplication cost and reversing the DT-006 stance is in the dt-008 decision file.

Cross-cutting design topic — PartitionSecrets CDK stack (provides the DT-005 salt)

PartitionSecrets is the new CDK stack that materialises the Sentry scrub salt for DT-005. It is part of the same scope as the project’s other infrastructure changes; it does not have its own DT identifier because the design was settled during implementation planning rather than during the DT-001..DT-008 exploration phase. It is recorded here so that the decision log is the complete project record.

A new CDK stack PartitionSecrets in infrastructure/src/main/cdk/stacks/purpose/partition-secrets.ts is created per partition (purpose) alongside the existing per-purpose stacks (partition-authn, partition-bulk-stores, purpose-storage, purpose-compute, purpose-dns, purpose-ingress, image-storage). It holds an AWS Secrets Manager Secret per partition named {Infrastructure}-{purpose}-SentryScrubSalt (e.g. Alpha001-prod-SentryScrubSalt), with a JSON payload {"salt": "..."} containing a 64-character random string generated by AWS Secrets Manager via generateSecretString (default) or an explicit override.

The override mechanism: a new optional field sentryScrubSaltOverride?: string on PartitionInfo / Partition in infrastructure/src/main/cdk/platforms.ts. ENVIRONMENTS entries default to omitting the field; any partition that needs a known value adds it inline.

  • The stack is wired in infrastructure/src/main/cdk/apps/Al1x/partition.ts’s buildPartition() (stack id ${partitionPrefix}-PartitionSecrets), alongside the other per-partition stack instantiations. No addDependency to other stacks.
  • The CFN export name for each partition’s salt ARN is {publishingPrefix}-API-SentryScrubSaltArn. The -API- infix follows the cross-repo export convention established by PDEV-531 (-API- for exports consumed outside the infrastructure repository; -I- reserved for cross-stack reads within CDK). ESO in EKS reads this export from outside infrastructure, so -API- is correct. stackTypes.publish() recognises -API- and emits two CloudFormation Outputs: a public raw-ARN form on -API-SentryScrubSaltArn (consumed verbatim by ESO) and a private marker-prefixed form on the bare-name (for CDK-to-CDK readImports()).
  • The privacy boundary is per-purpose: cross-component correlation works only within the same purpose. Cross-purpose and cross-Infrastructure correlation is deliberately broken.
  • SandboxKyle002 is deprecated and not configured.
  • infrastructure and common-module are independent PR tracks; the salt-name and JSON-structure spec is the only shared contract between them.

The per-partition scope was settled in conversation during implementation planning; the workbook’s DT-005 entry carries a refinement banner pointing to the published specs for the post-banner design. There is no separate workbook decision file for PartitionSecrets.

| ID     | Topic                                                               | Repositories affected                     |
|--------|---------------------------------------------------------------------|-------------------------------------------|
| DT-001 | SDK + OpenTelemetry agent coexist                                   | common-module, operations                 |
| DT-002 | SDK init in common-module                                           | common-module                             |
| DT-003 | Error filtering via AppError.reportable()                           | common-module                             |
| DT-004 | Session-based release health, sample rates aligned with frontend    | common-module, operations                 |
| DT-005 | PII handling and payload scrubbing in beforeSend                    | common-module, operations, infrastructure |
| DT-006 | Boundary capture topology and JVM-level last-resort handler         | common-module, operations                 |
| DT-007 | Frontend tracePropagationTargets — explicit and env-aware           | arda-frontend-app                         |
| DT-008 | Backend Sentry Logback appender for ERROR-level + exception traces  | common-module, operations                 |
| (none) | PartitionSecrets CDK stack (provides the DT-005 salt)               | infrastructure                            |