Operations Sentry — Decision Log
Purpose
Records the decisions taken during the project’s exploration phase. Each entry captures what was decided and how it bears on this project’s implementation. Read this file to understand the design; read the corresponding workbook entry (workbooks/notebooks/operations-sentry/decisions/<dt-id>-<slug>.md) only when you need the why: the alternatives considered, the journey, and the full rationale.
Each decision keeps a stable identifier (DT-NNN) shared with the workbook so cross-references between the two are unambiguous.
DT-001 — Use both the Sentry SDK and the OpenTelemetry agent
Should Sentry observability rely on the bundled Sentry OpenTelemetry Java agent (already wired), an explicit in-process SDK initialisation, or both?
Decision
Use both. The OpenTelemetry agent stays as-wired today: it provides zero-code APM (Ktor / Netty / Exposed auto-instrumentation), release and environment tagging, and JVM-level attach via -javaagent:. An explicit in-process Sentry SDK initialisation is added on top, enabling exception capture, beforeSend filtering, log forwarding, and session-based release health that the agent alone does not deliver.
The agent loads first at JVM start and initialises the bundled SDK with a default Hub. The in-process Sentry.init { … } reconfigures the same Hub — both cooperate, neither replaces the other.
Implications
- The Jib image continues to bundle sentry-opentelemetry-agent at /app/agents/sentry-otel-agent.jar; no build-side change to the agent path.
- common-module gains dependencies on io.sentry:sentry and io.sentry:sentry-logback, version aligned with the agent’s version pin (8.41.0 today).
- The SDK init must be idempotent and safe when the agent has already loaded the Hub; option re-application must not trigger a re-initialisation.
- Both paths read the same env-var contract (SENTRY_DSN, SENTRY_ENVIRONMENT, SENTRY_RELEASE, SENTRY_TRACES_SAMPLE_RATE); no new secret plumbing is needed for these.
See workbooks/notebooks/operations-sentry/decisions/dt-001-sdk-and-agent.md for alternatives considered (agent-only, SDK-only) and the rationale for the both-on-one-Hub choice.
DT-002 — SDK initialisation lives in common-module
Where should the explicit Sentry SDK initialisation live: inside operations (and each future Kotlin component), or in the shared common-module library?
Decision
In common-module. A single SDK-init module under common-module/lib/src/main/kotlin/cards/arda/common/lib/runtime/observability/ is invoked as part of the standard component bootstrap (Component.build(...)). Every Kotlin/Ktor component consuming common-module (operations today, accounts-component on adoption, future services) inherits the Sentry behaviour transparently with no per-service code.
Implications
- The new package contains SentryInit, BoundaryCapture, PiiScrubber, OpaqueId, Fingerprinting, LogbackBridge, CoroutineExceptionHandlerFactory, and supporting helpers. Final naming is at the implementer’s discretion; the package location is fixed.
- Component.kt’s build(...) calls SentryInit.init() at the start; the Ktor StatusPages handler is replaced with the reportable()-driven wrapper.
- Init is fail-soft on missing SENTRY_DSN: pod starts, no events emitted, no exceptions thrown. Same posture as the existing Helm optional: true secretKeyRef.
- arda-common-version in each consuming component is bumped to the new release that ships the observability module. That is the only operations-side dependency change.
- Tests run with an empty DSN; no separate Sentry test transport is required.
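The fail-soft posture can be illustrated without the SDK. The sketch below is not the actual SentryInit: SentryEnv and readSentryEnv are hypothetical names, and ignoring a malformed sample rate is an assumption. Only the four variable names come from the env-var contract in DT-001.

```kotlin
// Hypothetical holder for the env-var contract named in DT-001.
data class SentryEnv(
    val dsn: String,
    val environment: String?,
    val release: String?,
    val tracesSampleRate: Double?,
)

// Returns null when SENTRY_DSN is missing or blank, so the caller can skip
// SDK initialisation entirely instead of throwing (fail-soft).
fun readSentryEnv(env: Map<String, String> = System.getenv()): SentryEnv? {
    val dsn = env["SENTRY_DSN"]?.trim().orEmpty()
    if (dsn.isEmpty()) return null
    return SentryEnv(
        dsn = dsn,
        environment = env["SENTRY_ENVIRONMENT"]?.takeIf { it.isNotBlank() },
        release = env["SENTRY_RELEASE"]?.takeIf { it.isNotBlank() },
        // Assumption: a malformed rate is ignored rather than failing startup.
        tracesSampleRate = env["SENTRY_TRACES_SAMPLE_RATE"]?.toDoubleOrNull(),
    )
}
```

Passing an explicit map instead of the process environment is what lets tests exercise the empty-DSN path without a Sentry test transport.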
See workbooks/notebooks/operations-sentry/decisions/dt-002-sdk-init-in-common-module.md for the analysis of placement choices and how this aligns with the existing Ktor-plugin-installation pattern in Component.kt:200-247.
DT-003 — Error filtering policy via AppError.reportable()
Which errors reach Sentry as Issues, which are dropped, and where is the classification expressed?
Decision
The classification is encapsulated on the AppError type itself, via a reportable(): List<Throwable> method on the sealed class. The Sentry capture site asks the throwable what to do and iterates the result; it never inspects HTTP status or other transport-layer concerns.
The rule:
- AppError.Internal.* and AppError.Generic inherit the default listOf(this) → captured as a Sentry Issue.
- AppError.Invocation.* overrides to emptyList() → dropped.
- AppError.Composite overrides to causes.flatMap { it.reportable() } → recursion through the cause chain, one Sentry event per reportable leaf.
- Non-AppError Throwable (via a bridging extension) returns listOf(this) → captured.
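The rule can be sketched as a reduced sealed hierarchy. This is an illustration under assumptions, not the real AppError.kt: the leaf types Failure and BadRequest are hypothetical stand-ins, and the actual hierarchy carries more subtypes and fields.

```kotlin
// Reduced stand-in for the sealed hierarchy; leaf names are hypothetical.
sealed class AppError(message: String) : Exception(message) {
    // Default: the error itself is reportable.
    open fun reportable(): List<Throwable> = listOf(this)

    class Generic(message: String) : AppError(message)

    sealed class Internal(message: String) : AppError(message) {
        class Failure(message: String) : Internal(message)
    }

    sealed class Invocation(message: String) : AppError(message) {
        // Caller-side errors are dropped.
        override fun reportable(): List<Throwable> = emptyList()

        class BadRequest(message: String) : Invocation(message)
    }

    class Composite(message: String, val causes: List<Throwable>) : AppError(message) {
        // One entry per reportable leaf; nested composites recurse.
        override fun reportable(): List<Throwable> = causes.flatMap { it.reportable() }
    }
}

// Bridging extension: any non-AppError throwable reports itself. For an
// AppError receiver the member function is called, so overrides apply.
fun Throwable.reportable(): List<Throwable> =
    (this as? AppError)?.reportable() ?: listOf(this)
```

A Composite wrapping one Internal error, one Invocation error, and one plain IllegalStateException yields two reportable leaves: the Invocation branch contributes nothing.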
Implications
- The method lives in common-module/.../lang/errors/AppError.kt alongside the sealed hierarchy.
- The Ktor StatusPages handler runs throwable.reportable().forEach { Sentry.captureException(it) } before the existing HTTP response is emitted. The original app.log.warn(...) line stays (it is consumed by the Logback appender per DT-008).
- Events from a Composite carry a Sentry tag wrapped_in_composite: <composite.message> so the underlying events remain correlatable in Sentry search.
- Fingerprinting is applied at the capture site, not in beforeSend. Default fingerprint: AppError subtype FQCN plus a discriminator from the subtype’s own fields (serviceName for ExternalService / InternalService / InternalTimeout; operationName for NotImplemented). Non-AppError throwables fingerprint by concrete class + first non-framework stack frame.
- Per-fingerprint sampling is controlled at the SDK level (sampleRate), not by Sentry-side ingest quotas.
- The HTTP-status mapping in common-module/.../api/rest/types/HttpResponses.kt is parallel to this policy but explicitly decoupled in code: a future change to HTTP framing does not reroute Sentry behaviour.
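The default fingerprint rule can be sketched as two small functions that return a list of strings, the shape a Sentry fingerprint takes. The function names and the framework prefix list are assumptions for illustration; the log does not specify which packages count as framework frames.

```kotlin
// Package prefixes treated as framework frames. Assumption for illustration.
val frameworkPrefixes = listOf(
    "java.", "javax.", "jdk.", "sun.", "kotlin.", "kotlinx.", "io.ktor.", "io.netty.",
)

// AppError-style fingerprint: subtype FQCN plus a discriminator taken from
// the subtype's own fields (serviceName, operationName), when one exists.
fun appErrorFingerprint(error: Throwable, discriminator: String?): List<String> =
    listOfNotNull(error.javaClass.name, discriminator)

// Fallback fingerprint: concrete class plus the first stack frame whose
// class does not start with any framework prefix.
fun fallbackFingerprint(
    error: Throwable,
    prefixes: List<String> = frameworkPrefixes,
): List<String> {
    val frame = error.stackTrace.firstOrNull { f ->
        prefixes.none { prefix -> f.className.startsWith(prefix) }
    }
    val location = frame?.let { "${it.className}.${it.methodName}" } ?: "unknown-frame"
    return listOf(error.javaClass.name, location)
}
```

Line numbers are left out of the frame component on purpose in this sketch, so that an unrelated edit higher up in the same file does not split an existing Issue; whether the real implementation does the same is not stated here.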
See workbooks/notebooks/operations-sentry/decisions/dt-003-error-filtering.md for the analysis of options (HTTP-status-anchored vs AppError-semantic vs creation-site capture) and why encapsulation on the data type wins.
DT-004 — Session-based release health, sampling inherited from traces
Should the operations component emit Sentry sessions for release-health metrics? If yes, at what granularity?
Decision
Yes: enable Sentry session tracking in non-local environments. The Sentry JVM SDK 8.41.0 has a single auto-session knob (isEnableAutoSessionTracking); it does not expose a SessionMode enum or a sessionSampleRate field. Those concepts exist in Sentry’s browser / mobile SDKs but not on the JVM. The flag emits at most one session per JVM lifecycle (release-mode), not per request. The bundled Sentry OTel Java agent does not contribute its own session emitter, and Sentry publishes no sentry-ktor server plugin (only sentry-ktor-client for HTTP client instrumentation). The documented mechanism for per-request sessions on a non-Spring Java server is therefore manual: Sentry.startSession() / Sentry.endSession() calls at each request boundary.
The project ships a small Ktor application plugin (SentryRequestSession) installed by each consuming component (currently in operations/.../Main.kt) that wraps every request with start/end calls, guarded by Sentry.isEnabled() so it no-ops when the DSN is absent. This plugin should lift into common-module’s Component.build(...) so future consumers inherit it without per-app wiring (tracked separately under PDEV-490). Sessions are sampled in tandem with traces (a session is emitted iff its enclosing trace is sampled); there is no separate session sample rate to set.
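The wrap-and-guard behaviour of the plugin can be shown independently of Ktor and the SDK. In this sketch SessionHooks and withRequestSession are hypothetical names; the three hooks stand in for Sentry.isEnabled(), Sentry.startSession(), and Sentry.endSession(), and the real plugin attaches to the Ktor pipeline rather than taking a lambda.

```kotlin
// Stand-ins for the three SDK calls, injected so the sketch has no dependencies.
class SessionHooks(
    val isEnabled: () -> Boolean,
    val start: () -> Unit,
    val end: () -> Unit,
)

// Wraps one request: no-op when the SDK is disabled (DSN absent), otherwise
// start a session, run the handler, and always end the session afterwards.
fun <T> withRequestSession(hooks: SessionHooks, handle: () -> T): T {
    if (!hooks.isEnabled()) return handle()
    hooks.start()
    try {
        return handle()
    } finally {
        // Runs for both normal completion and a thrown exception.
        hooks.end()
    }
}
```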
Per-environment trace sampling (the bump that delivers the FE-aligned rate):
| Environment | tracesSampleRate | Effective session sampling |
|---|---|---|
| dev | 1.0 | 1.0 |
| stage | 1.0 | 1.0 |
| demo | 0.2 (from 0.1) | 0.2 |
| prod | 0.2 (from 0.1) | 0.2 |
| local | off | off |
The demo/prod bump from 0.1 to 0.2 aligns with the frontend’s existing prod sample rate so frontend-initiated traces consistently survive on the backend side.
Implications
- The Helm chart gains oam.performance.sentry.sessions.{ enabled }: a single-key sub-object, no mode or sampleRate field.
- The deployment template emits both SENTRY_ENABLE_AUTO_SESSION_TRACKING (the SDK-canonical env var read by the agent at boot) and SENTRY_AUTO_SESSION_TRACKING (the legacy name common-module 8.3.0’s SentryInit reads) when sessions.enabled is true. The dual emission is a temporary workaround until common-module’s SentryInit is patched to read the canonical name (tracked under PDEV-538); after that, the legacy entry can be removed.
- SentryInit reads SENTRY_AUTO_SESSION_TRACKING and sets options.isEnableAutoSessionTracking accordingly (defaults true). No sessionMode or sessionSampleRate calls: those properties do not exist on SentryOptions in this SDK version.
- Each consuming component installs the SentryRequestSession Ktor application plugin in its Application configurer (or wherever the Ktor pipeline is set up) via install(SentryRequestSession) guarded by pluginOrNull(SentryRequestSession) == null. The guard makes the install idempotent so the future common-module move does not require a coordinated removal in each consumer.
- The backend release tag stays {appName}@{Chart.AppVersion} (set by the existing deployment template). The frontend keeps the Next.js SDK’s release tag. The two schemes are independent and the divergence is intentional.
- User-tagging on sessions is deferred to DT-005; sessions are untagged until that decision settles.
- A small empirical verification (a single dev deploy + observation of the Release Health tab) confirms session emission post-implementation. Not a prerequisite.
Note on the workbook
The workbook’s DT-004 entry assumed Sentry’s session API had a mode toggle and a separate sampleRate. That framing was wrong on the JVM SDK, as confirmed by reading sentry-java/sentry/src/main/java/io/sentry/SentryOptions.java at the 8.41.0 tag (no SessionMode enum exists in the JVM SDK source; only enableAutoSessionTracking plus timing knobs sessionTrackingIntervalMillis and sessionFlushTimeoutMillis). This decision-log entry carries the corrected design; the workbook stays as the historical record of the exploration (not edited per the project’s no-touch-workbook rule).
See workbooks/notebooks/operations-sentry/decisions/dt-004-session-tracking.md and the companion …/threads/dt-004-session-tracking/assessment.md for the end-to-end analysis that drove the sample-rate alignment and the framing of release health as a backend-side capability complementary to frontend session replay.
DT-005 — PII handling and payload scrubbing in beforeSend
What data is permitted in Sentry payloads from the backend, and what scrubbing must run before any event leaves the pod?
Decision
Scrubbing runs in beforeSend / beforeSendTransaction callbacks registered in SentryInit in common-module. The policy:
- User identification: opaque. user.id is HMAC-SHA-256(salt, JWT-subject-claim) truncated to 16 hex chars. user.email, user.username, user.ip_address are unset. sendDefaultPii = false.
- HTTP request body: capture-with-redaction. Bodies remain captured; a regex pass masks JWT-shaped substrings, AWS access-key patterns, and Sentry-DSN-like URLs with ***.
- HTTP headers: deny by default, narrow allow-list. X-Request-Id and X-Forwarded-For pass through unchanged. X-Tenant-Id is removed from the headers and replaced with a Sentry tag tenant_hash whose value is HMAC-SHA-256(salt, tenant-id) truncated. Every other header (including Authorization, Cookie, Set-Cookie) is stripped.
- AppError.context lambdas are scrubbed by the same body-redaction regex set before they land as event extras.
- DB statement attributes (db.statement) get literal redaction. A regex pass replaces single-quoted string literals with '?' and numeric literals adjacent to operators with ?. Exposed parameterises every statement today, so the redaction is effectively a safety net for any raw-SQL escape hatch.
- Defensive backstop. A final regex pass over the serialised event JSON catches JWT, AWS access-key, and DSN-like patterns as a last resort.
- Salt scope: per partition (purpose). The salt is {Infrastructure}-{purpose}-SentryScrubSalt in AWS Secrets Manager (e.g. Alpha001-prod-SentryScrubSalt). Same purpose → same salt across components (cross-component correlation works); different purposes deliberately uncorrelatable (privacy boundary). The salt is materialised by the new PartitionSecrets CDK stack in infrastructure (see DT-005 implications below); 1Password mirroring is not required because the salt is not credential-grade.
- Compliance baseline: US-first, with GDPR-aligned defaults adopted where they impose no diagnostic cost (already the case for everything above).
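The redaction and allow-list rules in this policy can be sketched as pure functions. Everything below is illustrative: the regular expressions are simplified approximations that would need review before production use, and the names redactSecrets, redactSql, scrubHeaders, and ScrubbedHeaders are hypothetical.

```kotlin
// Simplified token shapes; assumptions, not the project's actual patterns.
val jwtLike = Regex("""eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+""")
val awsAccessKeyLike = Regex("""\b(?:AKIA|ASIA)[0-9A-Z]{16}\b""")
val dsnLike = Regex("""https?://[0-9a-fA-F]+@[A-Za-z0-9.-]+/\d+""")

// Body redaction: mask token-shaped substrings with ***.
fun redactSecrets(text: String): String =
    listOf(jwtLike, awsAccessKeyLike, dsnLike).fold(text) { acc, re -> re.replace(acc, "***") }

val sqlStringLiteral = Regex("""'(?:[^']|'')*'""")
val sqlNumberAfterOperator = Regex("""([=<>]\s*)\d+(?:\.\d+)?""")

// SQL literal redaction: string literals become '?', numbers after an
// operator become ? (the operator and spacing are kept via group 1).
fun redactSql(sql: String): String =
    sqlNumberAfterOperator.replace(sqlStringLiteral.replace(sql, "'?'"), "\$1?")

val passThroughHeaders = setOf("x-request-id", "x-forwarded-for")

data class ScrubbedHeaders(val headers: Map<String, String>, val tags: Map<String, String>)

// Deny by default: keep only allow-listed headers, and move the tenant id
// into a hashed tag. hashTenant stands in for the salted HMAC helper.
fun scrubHeaders(headers: Map<String, String>, hashTenant: (String) -> String): ScrubbedHeaders {
    val kept = headers.filterKeys { it.lowercase() in passThroughHeaders }
    val tenant = headers.entries
        .firstOrNull { it.key.equals("X-Tenant-Id", ignoreCase = true) }
        ?.value
    val tags = if (tenant != null) mapOf("tenant_hash" to hashTenant(tenant)) else emptyMap()
    return ScrubbedHeaders(kept, tags)
}
```

Keeping these as pure string and map functions means the same redactSecrets pass can serve request bodies, AppError.context extras, and the final backstop over the serialised event.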
Implications
- A new file under common-module/.../runtime/observability/ (likely PiiScrubber.kt) implements the two beforeSend callbacks, an OpaqueId.opaqueId(plaintext) helper, and a HeadersAllowList filter. Sentry org-side data scrubbers stay enabled as defence in depth; the code-side policy is authoritative.
- The Helm chart sources SENTRY_SCRUB_SALT from a K8s Secret be-sentry-scrub-salt (materialised by ESO from {Infrastructure}-{purpose}-SentryScrubSalt), with optional: true for fail-soft startup.
- Empty salt at runtime falls back to a deterministic placeholder ("no-salt:<id>"). Exception capture still works, but opaque-ID values are effectively plaintext. Acceptable for local / test contexts; not desired in deployed environments.
- The salt and the per-partition CDK stack are owned by the infrastructure repository in this project; see the PartitionSecrets cross-cutting topic below for the stack design and naming.
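A minimal sketch of the opaque-ID computation, including the empty-salt fallback, using the JDK's javax.crypto classes. The explicit salt parameter is an assumption made to keep the example self-contained; the helper named in this log takes only the plaintext, so the real one presumably reads the salt from configuration.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// HMAC-SHA-256 keyed by the salt, hex-encoded and truncated to 16 chars.
fun opaqueId(salt: String, plaintext: String, hexChars: Int = 16): String {
    // Empty salt: deterministic placeholder, effectively plaintext.
    if (salt.isEmpty()) return "no-salt:$plaintext"
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(salt.toByteArray(Charsets.UTF_8), "HmacSHA256"))
    val digest = mac.doFinal(plaintext.toByteArray(Charsets.UTF_8))
    return digest.joinToString("") { "%02x".format(it) }.take(hexChars)
}
```

Because the function is deterministic for a given salt, two components in the same partition map the same subject to the same ID, while a different partition's salt produces an unrelated value.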
See workbooks/notebooks/operations-sentry/decisions/dt-005-pii-scrubbing.md for the threat-model analysis that drove the opaque-ID HMAC choice, the headers allow-list rationale, and the DB-statement redaction policy. Note the refinement banner at the top of that file: the ceremony around the salt (originally a 1Password-mirrored formal credential) was dialed down once the threat model clarified — the salt is partition-scoped and not credential-grade.
DT-006 — Boundary capture topology and JVM-level last-resort reporting
Where does the reportable()-driven Sentry capture call run, and what JVM-level safety net catches throwables that escape every boundary?
Decision
Capture happens at each module entry boundary, via a shared low-level helper in common-module plus thin per-transport wrappers:
- Shared helper: Throwable.captureViaReportable(boundary, vararg tags) iterates reportable(), attaches the boundary tag and any extras (route, job label, etc.), applies the fingerprint, and calls Sentry.captureException per leaf.
- Ktor (StatusPages) wrapper is the only transport-specific instrumentation today; it tags boundary: http and route: <path>. Installed by Component.kt automatically.
- Batch / synchronous boundaries use runBoundary(label) { … } and runSuspendingBoundary(label) { … } (suspending variant for coroutine-launched work). Tag boundary: batch, job: <label>. operations-side audit during implementation determines which entry points in system/batch/ adopt them.
- Coroutine escapes are caught by a global CoroutineExceptionHandler factory installed alongside the SDK init. Tag boundary: coroutine.
- JVM-level last resort. Sentry’s UncaughtExceptionHandlerIntegration is enabled (default in 8.x; made explicit). It captures unfiltered: any throwable that escapes every boundary is reported regardless of reportable() classification, on the principle that an escape is itself bug-worthy. Events from this path carry via: uncaught-handler so they are distinguishable.
- The Logback Sentry appender (covered by DT-008) is not treated as a boundary; it forwards on a different filter (log-level + exception presence) without reportable() filtering.
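The shared helper and the batch wrapper can be sketched with the Sentry call replaced by an injected sink. The signatures differ from the ones named in this decision: CaptureSink is hypothetical, the reportable function is passed in rather than resolved from AppError, and rethrowing after capture is an assumption.

```kotlin
// Hypothetical sink; in the real module this role is played by the Sentry
// capture call with tags and fingerprint applied.
fun interface CaptureSink {
    fun capture(leaf: Throwable, tags: Map<String, String>)
}

// Iterates the reportable leaves and forwards each one with boundary tags.
fun captureViaReportable(
    error: Throwable,
    boundary: String,
    extraTags: Map<String, String>,
    reportable: (Throwable) -> List<Throwable>,
    sink: CaptureSink,
) {
    val tags = mapOf("boundary" to boundary) + extraTags
    reportable(error).forEach { leaf -> sink.capture(leaf, tags) }
}

// Runs a batch entry point, reports any escaping throwable, then rethrows so
// the caller's own failure handling is unchanged (assumption).
fun <T> runBoundary(
    label: String,
    reportable: (Throwable) -> List<Throwable>,
    sink: CaptureSink,
    block: () -> T,
): T =
    try {
        block()
    } catch (t: Throwable) {
        captureViaReportable(t, "batch", mapOf("job" to label), reportable, sink)
        throw t
    }
```

Injecting the sink keeps the boundary logic testable without the SDK, which fits the posture that tests run with an empty DSN.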
Implications
- Ktor’s existing StatusPages exception<Throwable> handler in Component.kt:230 gains the capture call before the existing log.warn(...) and HTTP response emission. The log line is preserved (it powers the Logback appender path from DT-008).
- Background-job audit in operations/src/main/kotlin/cards/arda/operations/system/batch/: every entry point that launches work outside the request coroutine wraps the work in runBoundary or runSuspendingBoundary. If the audit finds no out-of-request entry points, the helper ships from common-module and operations adopts no calls in this project; the helper is available for future use.
- Pre-init errors (anything raised before Sentry.init returns) bypass capture. Acceptable today; documented in the how-to.
See workbooks/notebooks/operations-sentry/decisions/dt-006-boundary-capture-and-last-resort.md. Note the partial-supersession banner: the original “no Sentry Logback appender” stance was reversed by DT-008; everything else remains in force.
DT-007 — Frontend tracePropagationTargets — explicit and env-aware
How does the frontend’s distributed-tracing propagation behaviour become explicit and env-aware, so that backend-host changes do not silently break trace continuity?
Decision
Configure tracePropagationTargets explicitly in all three frontend Sentry init paths (arda-frontend-app/src/instrumentation-client.ts, arda-frontend-app/sentry.server.config.ts, arda-frontend-app/sentry.edge.config.ts). The allow-list includes:
- The same-origin patterns the SDK default already covers (localhost, /^\/(?!\/)/).
- The /monitoring tunnel route used by withSentryConfig’s tunnelRoute.
- The env-specific backend host, derived from NEXT_PUBLIC_DEPLOY_ENV and the existing API-client host-discovery mechanism (the helper reuses that mechanism rather than introducing a parallel one).
Implications
- Single-PR change in arda-frontend-app. No backend dependency, no common-module change.
- A small helper file (likely under src/lib/sentry/) computes the allow-list per environment; the three init paths import and use it.
- Distributed tracing between arda-frontend and platform-be already works today through the BFF same-origin paths; this change locks the configuration down so future direct browser-to-backend calls (if any are added) continue to propagate cleanly.
See workbooks/notebooks/operations-sentry/decisions/dt-007-fe-trace-propagation-targets.md and the supporting end-to-end empirical analysis in …/threads/dt-004-session-tracking/assessment.md (which confirmed via Sentry MCP that trace propagation works today but is implicit and brittle).
DT-008 — Backend Sentry Logback appender for ERROR-level events and exception traces
Should the backend forward log events to Sentry via the sentry-logback appender, in addition to the boundary-driven exception capture from DT-003 / DT-006?
Decision
Yes: install the sentry-logback appender in each consuming component’s logback.xml, with common-module providing the dependency:
- Every log event at level ERROR or above produces a Sentry event.
- Every log event carrying an attached Throwable, at any level, produces a Sentry event with the throwable’s stack trace.
- Lower-level log events (INFO and above) feed Sentry as breadcrumbs attached to events captured later in the same request.
The appender does not apply reportable() filtering — log-side capture is intentionally independent of the boundary-side classification. Duplication with boundary capture is accepted; Sentry’s default fingerprinting groups duplicate events into a single Issue.
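The three forwarding rules can be written as one decision function. This models the policy as stated in this decision, not the internals of SentryAppender; Level, SentryAction, and classify are hypothetical names, and how the appender is configured to realise the policy is left to the how-to.

```kotlin
// Log levels in ascending severity, so enum comparison follows severity.
enum class Level { TRACE, DEBUG, INFO, WARN, ERROR }

enum class SentryAction { EVENT, BREADCRUMB, IGNORE }

// Rule order matters: the two event rules are checked before the
// breadcrumb rule, so a WARN with a throwable becomes an event.
fun classify(level: Level, hasThrowable: Boolean): SentryAction = when {
    level >= Level.ERROR -> SentryAction.EVENT
    hasThrowable -> SentryAction.EVENT
    level >= Level.INFO -> SentryAction.BREADCRUMB
    else -> SentryAction.IGNORE
}
```

Under this model the StatusPages warn line, which carries a throwable, is classified as an event, which is why it duplicates the boundary-side capture.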
Wiring mechanism — per-component XML. Each component adds the <appender name="SENTRY" class="io.sentry.logback.SentryAppender"> block plus an <appender-ref ref="SENTRY"/> on the root logger to its own logback.xml. The wiring is visible at the surface where logging is normally configured. The DSN-empty fail-soft behaviour is handled by SentryAppender itself — when Sentry.init runs with no DSN, the appender becomes a no-op (no Hub available, events silently dropped).
Implications
- common-module adds a runtime dependency on io.sentry:sentry-logback. No programmatic appender attach is performed by SentryInit; the SDK init module is unchanged from a Logback standpoint.
- Each consuming component must add the XML wiring to its logback.xml as part of adopting common-module’s Sentry observability. The how-to (process/craft/operations-and-monitoring/sentry-integration.md) carries the exact XML snippet plus a verification recipe.
- operations/src/main/resources/logback.xml changes as part of this project’s operations wave: the <appender> + <appender-ref> block is added next to the existing STDOUT / PERF_STDOUT appenders. accounts-component follows the same pattern when it adopts the integration (tracked by PDEV-533).
- The existing app.log.warn("Error: …") line in StatusPages continues to fire: it produces a Sentry event via the Logback appender, in addition to the boundary-driven event from DT-003. Sentry’s fingerprinting should merge the two into one Issue; if it does not in practice, an MDC marker at the capture site can suppress the log-side event (refinement, not a blocker).
- Verification adds a post-deploy check: confirm the appender is attached at runtime in the deployed pod (parse logback.xml or trigger a deliberate ERROR log and observe the Sentry event).
- A failure mode to watch for: a component that adopts common-module’s observability but forgets the XML snippet has working boundary-side capture (DT-003 / DT-006) but no log-forwarding (DT-008). The verification step catches this.
See workbooks/notebooks/operations-sentry/decisions/dt-008-be-log-forwarding.md. This decision supersedes the “no Logback appender installed” line in DT-006’s resolution; all other clauses of DT-006 remain in force. The rationale for accepting the duplication cost and reversing the DT-006 stance is in the dt-008 decision file.
Cross-cutting design topic — PartitionSecrets CDK stack (provides the DT-005 salt)
PartitionSecrets is the new CDK stack that materialises the Sentry scrub salt for DT-005. It is part of the same scope as the project’s other infrastructure changes; it does not have its own DT identifier because the design was settled during implementation planning rather than during the DT-001..DT-008 exploration phase. Recorded here so the decision-log is the complete project record.
Decision
A new CDK stack PartitionSecrets in infrastructure/src/main/cdk/stacks/purpose/partition-secrets.ts is created per partition (purpose) alongside the existing per-purpose stacks (partition-authn, partition-bulk-stores, purpose-storage, purpose-compute, purpose-dns, purpose-ingress, image-storage). It holds an AWS Secrets Manager Secret per partition named {Infrastructure}-{purpose}-SentryScrubSalt (e.g. Alpha001-prod-SentryScrubSalt), with a JSON payload {"salt": "..."} containing a 64-character random string generated by AWS Secrets Manager via generateSecretString (default) or an explicit override.
The override mechanism: a new optional field sentryScrubSaltOverride?: string on PartitionInfo / Partition in infrastructure/src/main/cdk/platforms.ts. ENVIRONMENTS entries default to omitting the field; any partition that needs a known value adds it inline.
Implications
- The stack is wired in infrastructure/src/main/cdk/apps/Al1x/partition.ts’s buildPartition() (stack id ${partitionPrefix}-PartitionSecrets), alongside the other per-partition stack instantiations. No addDependency to other stacks.
- The CFN export name for each partition’s salt ARN is {publishingPrefix}-API-SentryScrubSaltArn. The -API- infix follows the cross-repo export convention established by PDEV-531 (-API- for exports consumed outside the infrastructure repository; -I- reserved for cross-stack reads within CDK). ESO in EKS reads this export from outside infrastructure, so -API- is correct. stackTypes.publish() recognises -API- and emits two CloudFormation Outputs: a public raw-ARN form on -API-SentryScrubSaltArn (consumed verbatim by ESO) and a private marker-prefixed form on the bare-name (for CDK-to-CDK readImports()).
- The privacy boundary is per-purpose: cross-component correlation works only within the same purpose. Cross-purpose and cross-Infrastructure correlation is deliberately broken.
- SandboxKyle002 is deprecated and not configured.
- infrastructure and common-module are independent PR tracks; the salt-name and JSON-structure spec is the only shared contract between them.
The per-partition scope was settled in conversation during implementation planning; the workbook’s DT-005 entry carries a refinement banner pointing to the published specs for the post-banner design. There is no separate workbook decision file for PartitionSecrets.
Topic cross-reference
| ID | Topic | Repositories affected |
|---|---|---|
| DT-001 | SDK + OpenTelemetry agent coexist | common-module, operations |
| DT-002 | SDK init in common-module | common-module |
| DT-003 | Error filtering via AppError.reportable() | common-module |
| DT-004 | Session-based release health, sample rates aligned with frontend | common-module, operations |
| DT-005 | PII handling and payload scrubbing in beforeSend | common-module, operations, infrastructure |
| DT-006 | Boundary capture topology and JVM-level last-resort handler | common-module, operations |
| DT-007 | Frontend tracePropagationTargets — explicit and env-aware | arda-frontend-app |
| DT-008 | Backend Sentry Logback appender for ERROR-level + exception traces | common-module, operations |
| — | PartitionSecrets CDK stack (provides the DT-005 salt) | infrastructure |
Copyright: © Arda Systems 2025-2026, All rights reserved