PDEV-442 — Sentry organisation configuration for Arda services

Recommendation for how to evolve the arda-systems Sentry organisation as we instrument the JVM components (operations, future accounts-component). The design lands at two projects total: arda-frontend (existing) and platform-be (new — covers all back-end services). Component-level differentiation inside platform-be uses the OpenTelemetry service.name attribute, set by the Helm chart from the existing application.name helper. Pairs with pod_capacity.md § Recommendation #5, which covers the operations-side Helm wiring.

Arda’s deployed runtime today has the following layers, of which only four need Sentry instrumentation:

| Layer | Tech | Sentry needed? | Notes |
| --- | --- | --- | --- |
| Front-end SPA | Next.js (React) in browser | Yes (already instrumented) | Project arda-frontend (existing) |
| Front-end BFF | Next.js on Amplify SSR Compute (Lambda) | Yes (same project as SPA) | Shares codebase + deploy unit |
| Cognito | AWS managed | No | CloudTrail covers auditing |
| API Gateway | AWS managed | No | CloudWatch metrics |
| EKS pods: operations | Kotlin/Ktor on Fargate JVM | Yes (new) | Subject of this work |
| EKS pods: accounts | Kotlin/Ktor on Fargate JVM | Yes (new) | Same pattern as operations |
| Aurora Postgres | RDS Performance Insights | No | DB-level signals via PI / pg_stat_statements |

Two partitions, four environments:

| Partition | AWS account | Environments |
| --- | --- | --- |
| Alpha001 | production-grade | prod, demo |
| Alpha002 | non-production | stage, dev |

Repository / runtime mapping:

| Repository | Runtime presence | Sentry project? |
| --- | --- | --- |
| arda-frontend-app | Amplify (SPA + BFF) | arda-frontend (existing) |
| operations | EKS Fargate JVM | platform-be (new, shared with accounts-component) |
| accounts-component | EKS Fargate JVM | platform-be (same project as operations; differentiated via service.name tag) |
| common-module | Library, no runtime | None (init lives in host app) |
| infrastructure | IaC, no runtime | None |
| ux-prototype | Storybook, no production runtime | None |

One project: arda-frontend, covering both the SPA and the BFF. They share a codebase and a deploy unit, so this is correct. Environment values flow in via the SDK’s environment setting in the existing instrumentation.

Proposed structure — 2 projects in the arda-systems org

| Project | Purpose | SDK | Runtime |
| --- | --- | --- | --- |
| arda-frontend (existing, keep) | SPA + BFF | @sentry/nextjs | Browser + Amplify Lambda |
| platform-be (new) | All back-end JVM services (operations, future accounts-component, …) | OTel Java agent (pod_capacity.md §#5) | EKS Fargate JVM |

Earlier drafts of this doc proposed one project per component (platform-operations, platform-accounts). We reverted to a single platform-be project after weighing the trade-offs against Arda’s current scale.

The single-project choice trades quota isolation between back-end components for operational simplicity. The risk given up is that a runaway operations logger could in principle drown out accounts visibility. The gain is one set of alert rules, one ownership file, one source-map upload pipeline, and one Discover scope for cross-component queries. With a single back-end team today, the duplication cost of per-component projects is real and recurring, while the quota-isolation risk is hypothetical: operations dominates event volume, and the org quota is the binding constraint, not per-project caps.

Component-level differentiation inside the shared project is done via the OpenTelemetry service.name attribute (see § Component differentiation via service.name below). Sentry surfaces it as a tag in Issues, Performance, and Discover — the same dimension every alert rule and dashboard widget would scope by.

The frontend stays in its own project because its SDK, deploy unit, and ownership are genuinely separate; it would not benefit from sharing back-end’s alert rules or quotas.

This is not a one-way door. If/when team ownership diverges (e.g. accounts gets a dedicated owner) or event volume grows past shared-quota tolerance, splitting platform-be into per-component projects is straightforward: stand up the new project, point the new component’s DSN at it, leave existing issues in platform-be searchable. The current single-project decision should be revisited at either of those trigger points.

Component differentiation via service.name


Set in the Helm chart’s templates/deployment.yaml, outside the Sentry-enabled gate (so it’s available for any future OTel-aware integration, not just Sentry):

- name: OTEL_SERVICE_NAME
  value: {{ include "application.name" . | quote }}

The application.name helper already returns the component name (operations, future accounts) — same value used for SENTRY_RELEASE and K8s labels. Every chart instance therefore self-tags correctly with zero configuration.

Two conventions are load-bearing for the single-project design to remain operable as it grows:

  1. All alert rules must scope by service.name — the rule author has to include service.name:operations (or whichever component) in the rule conditions. Codify this in the alert-rule README inside Sentry’s project settings.
  2. The OTEL_SERVICE_NAME env var must be set by every back-end component’s Helm chart: chart templates should include the snippet above as part of the standard deployment pattern. Make this a chart-review checkpoint.

  • Don’t split arda-frontend into SPA + BFF projects. They share a codebase, deploy together, and stitch better in Sentry as one project. Revisit only if SPA event volume drowns out BFF triage, which it shouldn’t given the relative call rates.
  • Don’t fragment platform-be by component prematurely. See the trade-off discussion above. Wait for an operationally motivated split, not a theoretical one.
  • Don’t create projects per environment. Environment is a tag in Sentry; projects shouldn’t fragment by env.
  • Don’t create projects per partition. Same argument — partition is part of the environment tag.
  • Don’t add a Sentry project for common-module. It’s a library; Sentry instrumentation belongs in the host application.
  • Don’t add Sentry projects for infrastructure or ux-prototype. No runtime presence in deployed envs.

| Tag value | Where |
| --- | --- |
| alpha001-prod | Alpha001 prod |
| alpha001-demo | Alpha001 demo |
| alpha002-stage | Alpha002 stage |
| alpha002-dev | Alpha002 dev |

Why this shape:

  • Four environment values inside each Sentry project, rather than a project per environment: the canonical Sentry pattern.
  • Matches the partition naming used everywhere else in Arda (1Password vaults Arda-{Env}OAM, kubectl contexts Alpha001 / Alpha002, CDK app names).
  • Lets queries scope to “everything prod-like” (environment:*-prod) or “everything on alpha001” (environment:alpha001-*) without needing a separate partition tag.
  • Sentry treats environment as a free-form tag, so future expansion (e.g., alpha003-prod) is zero-cost.

Set per pod via the SENTRY_ENVIRONMENT env var that the Helm chart already templates. For the JVM components this resolves from {{ .Values.global.infrastructure }}-{{ .Values.global.purpose }} (the existing Helm value names match). For the frontend, set the same shape via the Next.js Sentry config.
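As a rough illustration of the templating described above, the fragment below shows one way the environment tag could be rendered in templates/deployment.yaml. It is a sketch under assumptions: the two value names and the oam.performance.sentry.enabled gate come from this document, but the placement inside the gate and the `lower` call (in case the values are stored capitalised, e.g. Alpha001) are guesses, not the chart’s confirmed contents.

```yaml
# templates/deployment.yaml (fragment of the container's env list)
{{- if .Values.oam.performance.sentry.enabled }}
- name: SENTRY_ENVIRONMENT
  # Joins infrastructure and purpose into the {partition}-{env} shape.
  # `lower` is an assumption, added in case the values are capitalised.
  value: {{ printf "%s-%s" .Values.global.infrastructure .Values.global.purpose | lower | quote }}
{{- end }}
```

If the values are already lower-case, the `lower` call is a no-op and can be dropped.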

Suggestion: align the existing arda-frontend env values to the same convention as part of this rollout if they aren’t already — Sentry handles environment renames gracefully.

A Sentry DSN identifies the project, not the environment. Each project has exactly one DSN; that DSN is used by every pod in every environment for any component routed to that project.

| DSN | Used by |
| --- | --- |
| arda-frontend DSN | SPA + Amplify BFF (already in place) |
| platform-be DSN | operations pod and, when it ships, accounts-component pod, in all four envs |

DSNs are not authentication credentials. They identify the project; anyone with the DSN can send events to it but not read them. No rotation required. Storing them in 1Password is purely consistency with the existing operational pattern.

Sentry DSNs are different from most credentials we store in 1Password. A DSN identifies the Sentry project, and the project’s environment-disambiguation is done by the environment tag, not by the DSN. So the same DSN value is used by every pod of a given component across all four deployment environments — it is common by design, not by coincidence or by resource-sharing.

This distinguishes Sentry DSNs from credentials like ArdaApiKey or the operations DB passwords: those happen to be the same value today but could legitimately diverge per environment (e.g., a key rotation in prod that takes a week to propagate to demo). For those, the “same value in all four Arda-{Env}OAM vaults” pattern is the right default. For Sentry DSNs, by contrast, divergence would mean “different Sentry project,” which would defeat the whole point of the design.

Sentry DSNs therefore live in the workspace-wide Arda-SystemsOAM vault (not duplicated across the four partition vaults). The single project covering all back-end services means a single multi-field item:

| Vault | Item | Fields |
| --- | --- | --- |
| Arda-SystemsOAM | be-sentry-dsn | dsn, project-slug (platform-be), sentry-org (arda-systems) |

The pipeline from 1P to K8s is the standard 1P → amm.sh → AWS Secrets Manager → ESO → K8s pattern (see pod_capacity.md § Provisioning pipeline). The AWS SM secret is infrastructure-scoped at {Infrastructure}-SentryDsn; the K8s secret materialised by ESO inside each component’s pod namespace is be-sentry-dsn (key dsn). Both operations and a future accounts-component chart read from the same AWS SM secret — the namespace boundary makes the K8s secrets unique without needing a component qualifier in the resource name.
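The last hop of that pipeline (AWS SM → ESO → K8s) might look roughly like the ExternalSecret below. This is a sketch, not the chart’s actual manifest: the file name, the store kind and name, the refresh interval, and the API version are all placeholders to be matched to the cluster’s ESO installation, and it assumes the AWS SM secret holds the DSN as a plain string and that .Values.global.infrastructure carries the same token used in the {Infrastructure}-SentryDsn name.

```yaml
# templates/externalsecret.yaml (hypothetical file name)
{{- if .Values.oam.performance.sentry.enabled }}
apiVersion: external-secrets.io/v1beta1   # assumption: match the installed ESO version
kind: ExternalSecret
metadata:
  name: be-sentry-dsn
spec:
  refreshInterval: 1h                     # assumption
  secretStoreRef:
    kind: ClusterSecretStore              # assumption: placeholder store kind
    name: aws-secrets-manager             # assumption: placeholder store name
  target:
    name: be-sentry-dsn                   # K8s secret created in the release namespace
  data:
    - secretKey: dsn                      # key inside the K8s secret
      remoteRef:
        # Infrastructure-scoped AWS SM secret, e.g. {Infrastructure}-SentryDsn
        key: {{ printf "%s-SentryDsn" .Values.global.infrastructure | quote }}
{{- end }}
```

Because the manifest carries no component qualifier, an accounts-component chart could reuse it unchanged; the namespace keeps the resulting K8s secrets distinct.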

When would a Sentry value belong in a partition vault?


If we ever introduce per-partition Sentry configuration — for example, Sentry monitor URLs, deploy-hook tokens, or alert-rule webhook secrets that are environment-scoped — those would follow the standard four-vault pattern. The DSN is the explicit exception because it identifies a project that, by design, spans all environments.

| Project | Release value |
| --- | --- |
| arda-frontend | arda-frontend@{package.json version} (or arda-frontend@{git short SHA} if the existing setup uses commits) |
| platform-be | {component}@{Chart.AppVersion} per chart instance, e.g. operations@1.2.3, future accounts@0.5.1 (rendered by Helm; see pod_capacity.md § Rec #5) |

Sentry uses this to compute first-seen / regression info and to anchor releases to deploys. Set via SENTRY_RELEASE env var. With the single-project design, the {component}@… prefix in the release value is what disambiguates operations releases from accounts releases inside the shared project’s release timeline — the same way service.name disambiguates events. The chart helper application.name provides the prefix at zero configuration cost.
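A minimal sketch of how the release value could be rendered, assuming it sits in the same env list as the other Sentry variables in templates/deployment.yaml (the exact location in the real chart is not confirmed here):

```yaml
# templates/deployment.yaml (fragment of the container's env list)
- name: SENTRY_RELEASE
  # {component}@{Chart.AppVersion}: the component prefix comes from the
  # existing application.name helper, the version from Chart.yaml's appVersion.
  value: {{ printf "%s@%s" (include "application.name" .) .Chart.AppVersion | quote }}
```

Reusing the helper keeps the release prefix and the service.name tag in lock-step, since both derive from the same value.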

| Project | Env | tracesSampleRate |
| --- | --- | --- |
| arda-frontend | dev / stage | 1.0 |
| arda-frontend | demo / prod | existing (likely 0.1–0.3) |
| platform-be | dev / stage | 1.0 (full sampling for debugging) |
| platform-be | demo | 0.1 (production-facing demo env) |
| platform-be | prod | 0.1 (CPU-conscious cap; agent overhead at this rate is ~1–3 % CPU on the 2 vCPU prod pod) |

Sampling decisions are made at the head of the trace, which is usually the SPA and occasionally the BFF for server-initiated work. Downstream services honor the parent decision when the sentry-trace header is present. For example, setting the BFF to 5 % in prod effectively samples the JVM tier at 5 % too for SPA-initiated traffic; the JVM’s own tracesSampleRate only matters for traffic that starts inside the back end (scheduled jobs, gRPC, intra-cluster calls).
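One possible way to express the back-end rates in the chart is sketched below. It rests on several assumptions: the per-env values file name and the tracesSampleRate key under oam.performance.sentry are hypothetical, and the two OTEL_* variables are the standard OpenTelemetry SDK sampler settings, which apply only if the agent in use is configured through them and a propagator that understands the incoming trace header is active. A Sentry-bundled agent may take its rate from a different setting.

```yaml
# values-alpha001-prod.yaml (hypothetical per-env values file)
oam:
  performance:
    sentry:
      enabled: true
      tracesSampleRate: 0.1   # 1.0 in dev / stage, 0.1 in demo and prod
---
# templates/deployment.yaml (fragment of the env list, inside the Sentry gate)
- name: OTEL_TRACES_SAMPLER
  # Parent-based: follow the upstream decision when one arrives,
  # fall back to the ratio below for traces that start in the JVM.
  value: parentbased_traceidratio
- name: OTEL_TRACES_SAMPLER_ARG
  value: {{ .Values.oam.performance.sentry.tracesSampleRate | quote }}
```

A parent-based sampler mirrors the behaviour described above: the ratio only takes effect for internally initiated work.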

Two Sentry teams worth creating up front:

  • frontend — owns arda-frontend
  • platform — owns platform-be

Today these are the same set of humans so this creates no friction. As the team grows, alert-rule routing and issue assignment become trivial — Sentry routes issues to the team that owns the project. If accounts-component eventually gets a separate owner, the single-project design’s first-revisit trigger fires (see § Why one project for the whole back end above).

  1. (done) Create the platform-be project in the arda-systems org. DSN captured.
  2. Pre-create the 4 environment values in platform-be via Sentry’s Environments page (alpha001-prod, alpha001-demo, alpha002-stage, alpha002-dev) so they appear correctly the first time events arrive. (Sentry auto-creates them on first event too; pre-creating is purely cosmetic so the dropdown is ordered correctly.)
  3. Migrate arda-frontend’s existing environment values to the {partition}-{env} convention if not already aligned. Sentry handles renames gracefully.
  4. (done) Add the DSN to the workspace-wide Arda-SystemsOAM 1Password vault as be-sentry-dsn (fields: dsn, project-slug = platform-be, sentry-org = arda-systems). Single entry total — Sentry DSN is common across environments by design.
  5. Infrastructure CDK / amm.sh wiring (separate ticket per infrastructure-improvements.md §4) — provision the {Infrastructure}-SentryDsn AWS Secrets Manager secret per infrastructure, value sourced from op://Arda-SystemsOAM/be-sentry-dsn/dsn via amm.sh and passed to CDK via cdk.CfnParameter.
  6. ESO sources the {Infrastructure}-SentryDsn AWS SM secret and materialises a K8s secret named be-sentry-dsn (key dsn) in each operations namespace. The ExternalSecret is declared in the operations chart and gated by oam.performance.sentry.enabled.
  7. Ship the Helm chart wiring (PDEV-488 #5) with oam.performance.sentry.enabled: true in all four envs day-one (blanket telemetry policy — see pod_capacity.md § Rec #5 per-env enablement), and OTEL_SERVICE_NAME set outside the Sentry gate (component differentiation works regardless of whether Sentry is on).
  8. No per-env flip-on step. Step 7 ships with enabled: true in all four envs; activation is uniform day-one. If a specific env later needs to be dialed back (quota, noise from a misbehaving integration), flip tracesSampleRate or enabled with a one-line values change.

If steps 5-7 land after a deploy (the upstream AWS SM secret isn’t yet in place when the chart rolls out), the pod still starts; Sentry runs in disabled mode until the secret materialises. The chart uses secretKeyRef: optional: true so the pod is fail-soft against the missing K8s secret. No outage, no rollback. See pod_capacity.md § Recommendation #5 Failure Modes for full behaviour.
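The fail-soft reference described above would look something like the following. The secret name and key are the ones named in this document; the SENTRY_DSN variable name is the conventional one and is assumed here, not confirmed from the chart.

```yaml
# templates/deployment.yaml (fragment of the container's env list)
- name: SENTRY_DSN            # assumption: conventional variable name
  valueFrom:
    secretKeyRef:
      name: be-sentry-dsn     # K8s secret materialised by ESO
      key: dsn
      optional: true          # pod starts even if the secret does not exist yet
```

With optional: true the variable is simply left unset when the secret is missing, so the container starts without a DSN.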

Once the agent is loaded (oam.performance.sentry.enabled: true in any env), the team gets:

  • Distributed traces that span SPA → BFF → API Gateway → operations / accounts. The hand-correlation we did during the PDEV-442 investigation becomes a single trace view in Sentry, with each back-end span carrying its service.name tag for component attribution.
  • Stack traces on 5xx without an operations code change — the 37 % silent-500 path on kanban-card/details (PDEV-490 OP3) becomes visible immediately; OP3’s contract fix can land on its own merits but is no longer the only way to get observability on those errors.
  • JVM runtime metrics (heap, GC, thread counts) per pod, shipped by the OTel agent — complements the JFR work in PDEV-488 #4. The metrics are tagged by service.name so the same platform-be project surfaces per-component charts.
  • Release-anchored issue tracking — first-seen / regressed-in data tied to deploys via the {component}@{Chart.AppVersion} release tag (SENTRY_RELEASE), so we can see whether a particular Helm bump introduced a new error and which component it landed in.

Sentry is the primary observability surface for application-layer signals; CloudWatch / Performance Insights remain the home for infrastructure-layer signals. Three categories, decided once so we don’t relitigate per ticket:

Lands in Sentry (free with the OTel agent in PDEV-488 #5)

  • Operations request latency p50/p95/p99 — server spans (Ktor / Netty).
  • DB query latency + per-request slow-query attribution — JDBC spans with statement text and duration. Different from pg_stat_statements (which is aggregate / server-side); for “which endpoint is slow because of which query?”, the JDBC-span view is more useful.
  • Uncaught exceptions and 5xx attribution.
  • Distributed traces SPA → BFF → API Gateway → operations → DB.
  • JVM runtime metrics: heap, GC pause duration, GC frequency, thread counts, code cache, classloader. Emitted by the OTel agent’s JVM metrics module (otel.instrumentation.runtime-telemetry.enabled=true, default in recent agent versions).
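If the agent version in use does not enable the JVM metrics module by default, the property above can be set explicitly. The sketch below uses the environment-variable form, derived from the property name by the usual OpenTelemetry convention (upper-case, dots and hyphens replaced by underscores); whether it is needed at all depends on the agent version, which this document does not pin.

```yaml
# templates/deployment.yaml (fragment of the container's env list)
- name: OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_ENABLED
  value: "true"   # equivalent to -Dotel.instrumentation.runtime-telemetry.enabled=true
```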

The expected setup is one cross-project Sentry Performance Dashboard, with widgets sourced from arda-frontend and platform-be (back-end widgets filtered by service.name, so accounts needs no separate project). Build it after PDEV-488 lands in dev and the data starts flowing; no separate ticket needed.

Possible in Sentry with caveats

  • JVM GC pauses. Use the OTel-emitted jvm.gc.duration histogram, not the raw -Xlog:gc* text log. PDEV-488 #3 still ships the GC log to CloudWatch for forensic depth, but the dashboard source is the OTel metric.
  • Continuous profiling. Sentry Continuous Profiling for JVM uses async-profiler, which does not work on EKS Fargate (same constraint that made us pick JFR for PDEV-488 #4). Profiling centralization is blocked until we leave Fargate (recommendation #6 in pod_capacity.md, deferred). Until then, JFR remains the profiling mechanism, and analysis stays in JMC / IntelliJ.

Stays outside Sentry

  • pg_stat_statements server-side aggregates → Aurora / RDS Performance Insights.
  • Aurora slow-query log lines → CloudWatch Logs.
  • Pod-level CPU / memory from metrics-server → CloudWatch Container Insights (or kubectl top).
  • HPA scaling events / K8s events → CloudWatch / kubectl get events.

These signals fire when the Sentry view points “down the stack” (DB or pod-resource bottlenecks); at our current scale they don’t warrant duplicating into Sentry via custom shipping.

  • Sentry billing / quota negotiation. The new platform-be project will add to the org’s event quota. A one-line confirmation with the Sentry billing owner is worth doing before flipping prod, but no detailed cost modeling is in scope here.
  • Sentry Crons / Heartbeats for scheduled JVM tasks. Worth revisiting if operations introduces periodic jobs that warrant uptime monitoring; not needed today.
  • Sentry Continuous Profiling integration with JFR. The OTel agent in PDEV-488 #5 emits standard JVM metrics; Sentry’s profiling product can ingest JFR but requires additional configuration. Separate follow-up.
  • Source-map / debug-symbol upload for SPA / Kotlin. Frontend already uploads source maps as part of the Amplify build; Kotlin debug symbol upload via sentry-cli is a small follow-on to PDEV-488 #5 if symbolic stack traces are needed.