Operations Sentry — Alternatives Considered

The Design Topics under workbooks/notebooks/operations-sentry/decisions/ carry the full exploration record (DT-001..DT-008). This file consolidates the most consequential not-taken alternatives for posterity.

Sentry OTel Java Agent as the primary instrumentation surface

Considered. Use the bundled sentry-otel-agent.jar to instrument Ktor routes, DB queries, and HTTP clients without any application-level code change. This was attractive because it requires no common-module changes for performance/APM signals — the agent picks up Ktor, Exposed, OkHttp, etc. by class-path scanning.

Not taken as the primary surface. The agent is still adopted as the source of auto-instrumented http.client and DB spans, which is well within its strengths. It was rejected as the primary surface because:

  1. The agent only sees errors that propagate to the JVM as unhandled throwables. That misses the boundary semantics we wanted: AppError.Invocation.* would be reported as errors even though they are caller-driven validation failures, polluting the issue feed.
  2. The agent’s OTLP exporter is wired in by default and had to be suppressed explicitly (OTEL_*_EXPORTER=none). A pure-agent path would have left this trap unmitigated.
  3. The agent does not implement session tracking; release health would still need the per-request plugin.

Outcome: hybrid. Manual BoundaryCapture + Sentry.startSession/endSession plugin for errors and sessions; auto-agent for routes and outbound HTTP. Best of both, modest complexity cost.

Capture in the StatusPages plugin at the route layer

Considered. Hook Sentry into the StatusPages plugin at the route layer, mapping HTTP status to Sentry severity. Conceptually clean: capture happens at the response boundary.

Not taken. Two problems: (1) background paths (CSV upload, scheduled jobs, library callbacks) have no HTTP response so the policy can’t apply; (2) the API layer’s view of a throwable is post-classification — by the time StatusPages sees an AppError.Invocation.NotFound it’s already a 404 in flight, and “drop or send” is a coarse switch that requires re-classification at every consumer.

Outcome: pushed reportability into AppError itself as reportable(): List<Throwable>. The capture site becomes a for (t in err.reportable()) Sentry.captureException(t) loop, identical across boundary, Logback appender, and last-resort handler. The classification is owned once by the error type.
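The shape of this outcome can be sketched in a few lines. This is a simplified illustration, not the actual arda-common code: the hierarchy is reduced to two cases, `Internal` is a hypothetical name for a server-side failure, and `capture` is a stand-in for Sentry.captureException so the sketch runs without the SDK.

```kotlin
// Hypothetical, simplified stand-in for the real AppError hierarchy.
sealed class AppError(message: String, cause: Throwable? = null) : Exception(message, cause) {

    // Each error type decides once which throwables are worth reporting.
    abstract fun reportable(): List<Throwable>

    sealed class Invocation(message: String) : AppError(message) {
        // Caller-driven failures (bad input, missing resource) are never reported.
        override fun reportable(): List<Throwable> = emptyList()

        class NotFound(what: String) : Invocation("not found: $what")
    }

    // Assumed name for a server-side failure; not taken from the source.
    class Internal(message: String, cause: Throwable? = null) : AppError(message, cause) {
        override fun reportable(): List<Throwable> = listOf(this)
    }
}

// Capture site shared by every consumer.
// `capture` stands in for Sentry.captureException in this sketch.
fun report(err: AppError, capture: (Throwable) -> Unit) {
    for (t in err.reportable()) capture(t)
}
```

With this shape, an AppError.Invocation.NotFound yields an empty list and nothing is sent, while the hypothetical Internal error yields itself. The boundary, the Logback appender, and the last-resort handler can all call the same loop without re-classifying the error.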

Plaintext user identifiers in Sentry user context

Considered. Send userId as-is (UUID strings, internal IDs). Sentry treats user.id as a free-form string and would happily store and index it.

Not taken. The platform stores user IDs that map to merchant identities and, depending on the entry point, to PII surfaces (Amazon-listed seller names, payment-tied identifiers). Sending them plaintext into a third-party data store is a compliance and privacy risk that isn’t worth the small ergonomic gain of clickable IDs in Sentry.

Outcome: PiiScrubber hashes the identifier with the per-partition SENTRY_SCRUB_SALT. The hash is deterministic — same input always produces the same hash — so a Sentry user looking at recurring issues for the same merchant sees a consistent opaque ID. Joining hash → real user requires database access plus the salt, which lives only on the pod. See DT-005.
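A minimal sketch of the salted, deterministic hashing described above. The source does not state which construction PiiScrubber uses, so HMAC-SHA256 is an assumption here, and `scrubUserId` is a hypothetical function name. The salt would come from SENTRY_SCRUB_SALT; it is passed as a parameter to keep the sketch self-contained.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Sketch only: HMAC-SHA256 is an assumed construction, not confirmed by the source.
// The salt must be non-empty, because SecretKeySpec rejects an empty key.
fun scrubUserId(userId: String, salt: String): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(salt.toByteArray(Charsets.UTF_8), "HmacSHA256"))
    val digest = mac.doFinal(userId.toByteArray(Charsets.UTF_8))
    // Lowercase hex: 64 characters for the 32-byte digest.
    return digest.joinToString("") { "%02x".format(it) }
}
```

Calling `scrubUserId` twice with the same identifier and salt returns the same opaque string, which is what keeps recurring issues grouped under one user in Sentry. The same identifier hashed with a different salt produces an unrelated value, so hashes from one partition cannot be matched against another.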

Single SENTRY_SCRUB_SALT across all partitions

Considered. One salt per environment dimension (one for dev, one for stage, etc.), or one global salt re-used everywhere.

Not taken. A per-partition salt provides better isolation: a leak of the dev salt can’t be combined with a prod hash to reverse-engineer prod identities. Cost is low — partition-scoped Secrets Manager items are already a pattern (DB credentials, API keys), and ESO provisioning is uniform.

Outcome: four salts, one per partition. Stored in Alpha00X-PartitionSecrets and projected via ESO. Documented in DT-005 and the how-to.

Session tracking via the OTel agent

Considered. Hoped the agent would emit per-request sessions when configured with OTEL_TRACES_SAMPLER_ARG=traceidratio=....

Not taken, on empirical grounds. The agent emits spans, not sessions. Session tracking is a Sentry-specific concept (Release Health) for which the OTel ecosystem has no analogue. This was discovered during dev verification, when zero sessions appeared on the Release Health tab despite traces flowing normally. See DT-004 supersession.

Outcome: manual SentryRequestSession Ktor application plugin.
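The core of a per-request session plugin is a start/end pair around each request. The sketch below shows only that pairing and is not the SentryRequestSession implementation. `SessionApi` is a hypothetical seam standing in for Sentry.startSession and Sentry.endSession so the code runs without the SDK, and `handler` stands in for the rest of the Ktor call pipeline.

```kotlin
// Hypothetical seam over the two SDK calls, so the sketch has no Sentry dependency.
interface SessionApi {
    fun startSession()
    fun endSession()
}

// Wraps one request: the session is closed even when the handler throws.
fun <T> withRequestSession(sessions: SessionApi, handler: () -> T): T {
    sessions.startSession()
    try {
        return handler()
    } finally {
        sessions.endSession()
    }
}
```

Ending the session in `finally` means a request that fails with an exception still closes its session instead of leaving it open.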

Logback appender at INFO level

Considered. Logback appender at INFO level so all structured logs flow to Sentry as breadcrumbs and ERROR-level logs become issues.

Not taken, for reasons of Sentry quota and signal-to-noise. INFO-level logs include the request-log line for every served request, which would inflate event volume by ~100x with no proportional diagnostic value (the data is already in the trace span attached to the same request). Breadcrumbs are useful, but the JVM SDK already captures Log4j/Logback-routed events as breadcrumbs internally when the scope is right.

Outcome: appender at ERROR level. Breadcrumb capture relies on the SDK’s internal Logback integration.

Separate sentry-common module

Considered. Spin out a new module inside common-module (sentry-common) so consumers can pull just the observability primitives without taking the rest of arda-common.

Not taken. arda-common is consumed wholesale already. Splitting now creates a versioning surface (which sentry-common works with which arda-common) for a benefit that doesn’t materialise — every Kotlin component already takes the whole library. If a third-party consumer of just the observability layer ever appears, the split becomes worthwhile; today it’s premature.

Outcome: primitives live under cards.arda.common.observability in the existing arda-common artifact (8.3.0).