Design: Email Integration Phase 5a -- Component Library Updates

Phase 5a ships five additive helpers in common-module consumed by the Phase 5b Email module. This design document covers four of them; the fifth — the idempotency helpers — has its own carved-out design. The four covered here are:

  • AppError.Application (§ 1) — a new third top-level branch under AppError, peer to Internal and Invocation. Three concrete subtypes (PreconditionFailed, PolicyRejected, ConflictingState) carry the “well-formed call, healthy system, application state doesn’t permit this operation” category that today gets misclassified as Invocation.GeneralValidation or Internal.IncompatibleState.
  • Internal.IncompatibleState reclassification sweep (§ 2) — a methodology-driven sweep through the 62+ construction sites of Internal.IncompatibleState in common-module/lib/src/main, classifying each as kept (genuine bug-class), moved to Application.ConflictingState (recoverable application outcome), or moved to Invocation.GeneralValidation (caller error).
  • sanitizeHeader (§ 3) — value-cleaning primitive for inbound HTTP headers; composes downstream of the existing HeadersAllowList (name-based observability scoping) to provide value-based hard-rejection / silent-drop / clean for persistence.
  • TokenCipher + Hmac (§ 4) — application-layer encrypted-field primitive implementing the DQ-R1-019 two-axis envelope, plus a small Hmac helper that DRYs two existing JDK-Mac call sites.

The idempotency helpers (§ 5) live in their own design document. They were a deep enough design exercise to warrant separation: a RawIdempotencyStore operating natively on JsonElement with a typed wrapper IdempotencyStore<Req, Res> produced by an inline fun typedAs() extension. Schema-evolution becomes a per-caller responsibility via the typed wrapper’s Json configuration.

All four helpers in §§ 1-4 are mostly independent — you can read one section without reading the others. Reviewers wanting a phased pass can take §§ 1-4 in order; § 1 is the foundation everything else assumes (Application.ConflictingState is referenced from the idempotency design).

| # | Decision | Choice |
|---|---|---|
| DQ-R1-019 | Per-partition email server-token encryption key | Two-axis envelope a{N}.k{SM-VERSION-ID}; SM-native versioning; MaterialRegistry populated from a single ESO-projected JSON map carrying every live key-material version. TokenCipher ships in common-module. |
| DQ-R1-027 | AppError.Application shape | sealed class Application with three concrete subtypes; reportable() = emptyList() at branch root; no HTTP-status hints on subtypes. |
| DQ-R1-028 | Sweep methodology and PR sequencing | Discovery-then-classify; sweep lands as the final Phase 5a PR; common-module scope only; major bump. |
| DQ-R1-029 | sanitizeHeader placement and shape | New lib/api/headers/ package; Result<String?> for accept / silent-drop / hard-reject; composes downstream of HeadersAllowList. |
| DQ-R1-030 | TokenCipher factory + decrypt-failure classification + Hmac DRY | companion operator fun invoke(info, materials, currentVersionId) returning Result<TokenCipher>; auth-tag failure -> Internal.IncompatibleState; unknown versionId -> Transient.FailoverFailed; Hmac extracted and shared. |

All Kotlin sketches in this document conform to the workspace kotlin-coding standards: Result<T> on every fallible method, single-exit, when over if, no !! / getOrThrow / getOrNull, DI for dependencies, @JvmInline value classes for primitive type-safety.

The existing AppError hierarchy splits caller error from system error:

  • Internal — bugs and operational signals (Implementation, Infrastructure, IncompatibleState, InternalService, InternalTimeout, Transient). reportable() returns listOf(this); these page on-call.
  • Invocation — caller’s fault (ArgumentValidation, NullArgument, NotFound, Duplicate, Authorization). Not bug-worthy; reportable() returns empty list.

Neither captures the third real category: the call was well-formed, the system is healthy, but the application’s current state does not allow this operation right now. Today these get squeezed into:

  • Invocation.GeneralValidation — which lies; the caller did nothing wrong, the application’s state did.
  • Internal.IncompatibleState — which lies in the other direction; the system is fine, just not in the state the caller assumed; pages on-call when it shouldn’t.

Per DQ-R1-027, AppError.Application is added as a third top-level branch with three concrete subtypes:

  • PreconditionFailed — the operation requires prior state the system doesn’t have (“cannot send before tenant is verified”, “cannot ship before order is paid”).
  • PolicyRejected — the operation is disallowed by policy (“tenant suspended”, “rate limit exceeded for partition”).
  • ConflictingState — the operation race-lost or its expectation drifted (“expected status pending, found committed”, “version conflict on optimistic update”).

Extends the existing cards.arda.common.lib.lang.errors.AppError.kt:

package cards.arda.common.lib.lang.errors

// existing imports + existing AppError sealed hierarchy unchanged ...

sealed class AppError(...) : Throwable(...) {
    // ... existing Composite / Generic / Internal / Invocation branches ...

    /**
     * Expected application-domain outcomes that are neither caller errors nor
     * system bugs. The call was well-formed, the system is healthy, but the
     * application's current state does not allow this operation.
     *
     * [reportable] returns an empty list for every [Application] subtype:
     * these are not bug-worthy and must not page on-call.
     *
     * REST mapping is the single responsibility of the L4 mapping table
     * (typically `HttpErrorResponses.kt`); [Application] subtypes do NOT
     * carry HTTP-status hints.
     */
    sealed class Application(
        override val message: String,
        override val context: LazyMessage? = null,
        override val cause: Throwable? = null,
    ) : AppError(message, cause, context) {
        override fun reportable(): List<Throwable> = emptyList()

        // The subtypes are nested so that they resolve as Application.PreconditionFailed etc.

        /**
         * The operation requires prior state the system does not have.
         * Example: "cannot send notification before tenant email configuration is verified".
         */
        data class PreconditionFailed(
            override val message: String,
            override val context: LazyMessage? = null,
            override val cause: Throwable? = null,
        ) : Application(message, context, cause)

        /**
         * The operation is disallowed by policy.
         * Example: "tenant suspended", "rate limit exceeded for partition".
         */
        data class PolicyRejected(
            override val message: String,
            override val context: LazyMessage? = null,
            override val cause: Throwable? = null,
        ) : Application(message, context, cause)

        /**
         * The operation race-lost or its expectation drifted.
         * Example: "expected status `pending`, found `committed`".
         */
        data class ConflictingState(
            override val message: String,
            override val context: LazyMessage? = null,
            override val cause: Throwable? = null,
        ) : Application(message, context, cause)
    }
}

The L4 mapping table in HttpErrorResponses.kt gains entries:

  • Application.PreconditionFailed -> HTTP 409 Conflict (or 412 Precondition Failed depending on the resource semantics).
  • Application.PolicyRejected -> HTTP 403 Forbidden.
  • Application.ConflictingState -> HTTP 409 Conflict.

The exact mapping is the L4 layer’s responsibility; AppError.Application subtypes do not carry HTTP-status hints. A future L4 dispatcher (gRPC, SQS) maps differently without changing the AppError types.
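As a rough illustration of that separation, the mapping could be an exhaustive `when` at L4. This is a minimal sketch under assumptions: the `Application` hierarchy below is a simplified stand-in for the real types, and `httpStatusFor` is a hypothetical name, not the actual contents of `HttpErrorResponses.kt`.

```kotlin
// Simplified stand-in for AppError.Application; the real types extend AppError.
sealed class Application(message: String) : Exception(message) {
    class PreconditionFailed(message: String) : Application(message)
    class PolicyRejected(message: String) : Application(message)
    class ConflictingState(message: String) : Application(message)
}

// Hypothetical L4 helper: the status lives here, not on the error types.
// The `when` is exhaustive over the sealed branch, so adding a subtype fails
// compilation instead of silently falling through to a default 500.
fun httpStatusFor(error: Application): Int = when (error) {
    is Application.PreconditionFailed -> 409 // or 412, depending on resource semantics
    is Application.PolicyRejected -> 403
    is Application.ConflictingState -> 409
}
```

A gRPC or SQS dispatcher would carry its own equivalent of `httpStatusFor` against the same types.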

| Surface | Test type | What it asserts |
|---|---|---|
| Application.PreconditionFailed ctor | Pure Kotlin | Construction with message, message + context, message + cause, all three. |
| Application.PolicyRejected ctor | Pure Kotlin | Same shapes as PreconditionFailed. |
| Application.ConflictingState ctor | Pure Kotlin | Same shapes. |
| Application.reportable() | Pure Kotlin | All three subtypes return emptyList<Throwable>(). |
| AppErrorReportableTest.kt extension | Pure Kotlin | Existing tests pass; new tests in the same file cover the three subtypes. |
| L4 mapping (HttpErrorResponses.kt) | Pure Kotlin | Each subtype maps to the documented HTTP status; no subtype falls through to a default 500. |

2. Internal.IncompatibleState reclassification sweep

DQ-R1-027 introduces AppError.Application; this sweep applies it across the codebase. The 62+ existing construction sites of Internal.IncompatibleState in common-module/lib/src/main each need a per-site judgement.

Per DQ-R1-028, the sweep:

  • Lands as the last Phase 5a PR (sequenced after the four Added-only helpers).
  • Is the major-bump PR — consumers doing exhaustive when over AppError.Internal see reclassified sites move out, which is a Changed-category release.
  • Covers common-module only. The matching sweep within operations is Phase 5b’s consumer adoption work.

Each site classifies into one of three buckets:

  • Bucket A -- keep as Internal.IncompatibleState -- genuine bug-class invariant violation. The system’s internal state contradicts an invariant the code expects. Examples: a Persistence op finds an entity with two head versions when the bitemporal invariant forbids it; a Universe finds an orphaned reference; a StateEngine finds a transition the state graph doesn’t allow.
  • Bucket B -- move to Application.ConflictingState -- recoverable application outcome. The caller asked for an operation against an expected state, and the system’s current state disagrees, but neither side is buggy. Examples: optimistic-update version mismatch; idempotency-key replay finds a different prior request body (the Mismatch case, although that is surfaced as an outcome, not an error); a draft commit finds the draft already discarded by another session.
  • Bucket C -- move to Invocation.GeneralValidation -- the caller passed input that the system can detect is invalid. Less common than the other two buckets, but it exists. Example: a state-transition request names a transition that the state graph defines but that is not outbound from the current state (the caller knew the transitions and picked the wrong one).

2.3 Discovery and classification methodology

The sweep PR’s first task is discovery — not modification. Build the inventory of Internal.IncompatibleState call sites with a grep pass, then walk each:

# run from the common-module worktree root; scans lib/src/main/kotlin
mkdir -p scratch
grep -rn 'IncompatibleState(' lib/src/main/kotlin > scratch/incompatible-state-inventory.txt

For each site, the classifying engineer writes a one-line rationale next to the site number (in a scratch file, before any code change). The rationale answers two questions:

  1. What invariant is being checked? If the answer involves “the system’s own data shape” or “the code’s own contract”, that’s bucket A.
  2. Whose fault is the failure? If the caller asked for something against an expected state -> bucket B. If the system is internally inconsistent -> bucket A. If the caller passed input the call signature accepted but the body rejects -> bucket C.

The rationale per site is preserved in the PR description — the PR’s blast radius matters; reviewers need to see why each site moved (or didn’t).
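The two questions above can be read as a small decision procedure over the grep inventory. The sketch below is illustrative only: `Fault`, `Bucket`, `Site`, `parseGrepLine`, and `checklistRow` are hypothetical names, and only the three target types come from this design.

```kotlin
// Hypothetical answer to question 2, "whose fault is the failure?".
enum class Fault { SYSTEM_INCONSISTENT, CALLER_EXPECTED_OTHER_STATE, CALLER_INPUT_REJECTED }

enum class Bucket(val target: String) {
    A("Internal.IncompatibleState"),
    B("Application.ConflictingState"),
    C("Invocation.GeneralValidation"),
}

data class Site(val path: String, val line: Int, val rationale: String, val fault: Fault)

fun bucketFor(fault: Fault): Bucket = when (fault) {
    Fault.SYSTEM_INCONSISTENT -> Bucket.A
    Fault.CALLER_EXPECTED_OTHER_STATE -> Bucket.B
    Fault.CALLER_INPUT_REJECTED -> Bucket.C
}

// Splits one `grep -rn` line ("path:line:text") into path and line number.
// Returns null for lines that do not have that shape.
fun parseGrepLine(raw: String): Pair<String, Int>? {
    val parts = raw.split(":", limit = 3)
    val lineNo = parts.getOrNull(1)?.toIntOrNull()
    return lineNo?.takeIf { parts.size == 3 }?.let { parts[0] to it }
}

// One checklist row per site, ready to paste into the PR description.
fun checklistRow(index: Int, site: Site): String {
    val bucket = bucketFor(site.fault)
    return "${index}. ${site.path}:${site.line} -> bucket ${bucket.name} (${bucket.target}): ${site.rationale}"
}
```

The rationale text stays hand-written; the code only keeps the site list and the bucket labels consistent.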

Each reclassified site changes:

  • The IncompatibleState(...) constructor call -> the appropriate replacement.
  • The error message text may need adjustment (an Application.ConflictingState message reads more naturally as “Expected status pending, found committed” than as the more terse internal-error wording).
  • The context and cause parameters carry forward unchanged.
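For illustration, a bucket B migration of an optimistic-update check might look like the following. The error classes are simplified stand-ins with assumed constructor shapes, and `checkVersionBefore` / `checkVersionAfter` are hypothetical call sites, not code from common-module.

```kotlin
// Simplified stand-ins for the AppError branches; only reportable() behaviour is modelled.
sealed class AppError(message: String) : Exception(message) {
    abstract fun reportable(): List<Throwable>

    class IncompatibleState(message: String) : AppError(message) {
        override fun reportable(): List<Throwable> = listOf(this)
    }

    class ConflictingState(message: String) : AppError(message) {
        override fun reportable(): List<Throwable> = emptyList()
    }
}

// Before the sweep: a version mismatch is reported as a bug and pages on-call.
fun checkVersionBefore(expected: Long, found: Long): Result<Unit> = when (expected) {
    found -> Result.success(Unit)
    else -> Result.failure(AppError.IncompatibleState("version mismatch: $expected != $found"))
}

// After the sweep (bucket B): same check, recoverable outcome, reworded message.
fun checkVersionAfter(expected: Long, found: Long): Result<Unit> = when (expected) {
    found -> Result.success(Unit)
    else -> Result.failure(AppError.ConflictingState("Expected version $expected, found $found"))
}
```

The observable difference for on-call is the `reportable()` result of the failure, which goes from one element to none.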

The migrated test sites get assertion updates:

  • Tests that asserted is Internal.IncompatibleState change to is Application.ConflictingState (or whichever bucket).
  • Tests that asserted reportable() = listOf(this) change to reportable() = emptyList() for migrated sites (the on-call paging change).

Out of scope for this sweep:

  • The operations IncompatibleState sweep -- Phase 5b’s consumer adoption owns it.
  • infrastructure and arda-frontend-app -- not consumers of AppError; no impact.

HeadersAllowList (existing, in lib/runtime/observability/) controls which headers are safe to log. sanitizeHeader controls what values are safe to read into business logic or persist. The two concerns are independent and compose at L4 inbound:

The composition pattern at L4 inbound has two stages. The allowlist filters by name first for observability scoping; sanitizeHeader then cleans or rejects by value before any survivor enters L3 / persistence.

[Diagram: L4 inbound composition, with the HeadersAllowList name filter followed by sanitizeHeader value cleaning]

New package cards.arda.common.lib.api.headers:

package cards.arda.common.lib.api.headers
import cards.arda.common.lib.lang.errors.AppError
/**
* Clean an inbound HTTP header value for persistence or business-logic use.
*
* Returns:
* - [Result.success] of [String] -- cleaned value, safe to persist or pass to L3.
* - [Result.success] of `null` -- header is policy-rejected; caller drops it
* silently (no error). Use for headers that
* the application doesn't accept but that
* common HTTP clients may emit (e.g.,
* opportunistic correlation headers).
* - [Result.failure] -- value violates a hard constraint (control
* characters, oversize, charset). Caller
* MUST reject the request.
*
* Hard-rejection categories return [AppError.Invocation.GeneralValidation]
* with the offending header name in the message.
*/
fun sanitizeHeader(name: String, value: String): Result<String?> = ...

Cleaning rules (v1):

  • Trim leading and trailing whitespace.
  • Reject (Result.failure) if the trimmed value contains control characters: C0 (0x00-0x1F), DEL (0x7F), or C1 (0x80-0x9F). HTAB (0x09) is the one exception and is preserved when it appears inside the value.
  • Reject (Result.failure) if the trimmed value length exceeds the per-header cap (default 1024 chars; configurable in v2+).
  • Reject (Result.failure) if the value is not valid UTF-8.
  • Otherwise return Result.success(cleanedValue).

The function signature returns Result<String?>; the null channel is reserved for the future silent-drop path. v1 only emits Result.success(cleanedValue) (accept) or Result.failure (hard-reject) — the null return is never produced by v1. Callers may flatten Result<String?> to Result<String> defensively in v1, or treat null as “drop silently” once v2+ activates the path.
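The v1 rules can be sketched as below. This is a minimal sketch, not the shipped implementation: `GeneralValidation` is a stand-in for `AppError.Invocation.GeneralValidation`, the constant name is hypothetical, and "not valid UTF-8" is interpreted here as "the in-memory string cannot be encoded as UTF-8" (for example, an unpaired surrogate).

```kotlin
import java.nio.charset.StandardCharsets

// Stand-in for AppError.Invocation.GeneralValidation (assumed shape).
class GeneralValidation(message: String) : Exception(message)

const val MAX_HEADER_VALUE_CHARS = 1024 // v1 default cap

// C0 (0x00-0x1F), DEL (0x7F), and C1 (0x80-0x9F), with HTAB exempted.
private fun isForbiddenControl(c: Char): Boolean =
    c != '\t' && (c.code in 0x00..0x1F || c.code in 0x7F..0x9F)

fun sanitizeHeader(name: String, value: String): Result<String?> {
    val trimmed = value.trim()
    // Strings are UTF-16 in memory; canEncode is false for unpaired surrogates.
    val encodable = StandardCharsets.UTF_8.newEncoder().canEncode(trimmed)
    return when {
        trimmed.any(::isForbiddenControl) ->
            Result.failure(GeneralValidation("Header '$name' contains control characters"))
        trimmed.length > MAX_HEADER_VALUE_CHARS ->
            Result.failure(GeneralValidation("Header '$name' exceeds $MAX_HEADER_VALUE_CHARS characters"))
        !encodable ->
            Result.failure(GeneralValidation("Header '$name' is not valid UTF-8"))
        // v1 never returns success(null); that channel is reserved for v2+ silent-drop.
        else -> Result.success(trimmed)
    }
}
```

Note that Kotlin's `trim()` also strips leading and trailing line breaks and tabs, so a value such as `"abc\n"` is accepted as `"abc"`, while the same characters in the middle of a value (other than HTAB) are rejected.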

HeadersAllowList continues to operate exactly as today (observability scoping; no changes). At L4 inbound, callers compose them:

// L4 inbound handler -- inside the transaction
val filtered = HeadersAllowList.filter(rawHeaders)
val cleaned: Result<Map<String, String>> = filtered.headers
    .toList()
    .foldRight(Result.success(emptyMap<String, String>())) { (name, value), acc ->
        acc.flatMap { soFar ->
            sanitizeHeader(name, value).map { cleanedValue ->
                // v1 never produces success(null); v2+ silent-drop collapses here
                cleanedValue?.let { soFar + (name to it) } ?: soFar
            }
        }
    }
// L4 maps a Result.failure to HTTP 400 via the standard error-mapping pipeline.
// cleaned.getOrElse { ... } is the L4 entry point's responsibility, not this helper's.

The two helpers are deliberately independent. HeadersAllowList is a Sentry-shaping concern; sanitizeHeader is a persistence-safety concern. Future work that needs only one (e.g., a non-HTTP transport adopting the value cleaning) reuses sanitizeHeader alone without dragging in observability defaults.

| Surface | Test type | What it asserts |
|---|---|---|
| Happy path | Pure Kotlin | Plain ASCII values, mixed-case values, trim-edges values all return Result.success(cleaned). |
| Control characters | Pure Kotlin | Embedded NUL, ESC, BEL, DEL all return Result.failure(AppError.Invocation.GeneralValidation). HTAB (0x09) is preserved. |
| Length cap | Pure Kotlin | 1024-char value accepted; 1025-char value rejected. |
| UTF-8 | Pure Kotlin | Valid multi-byte UTF-8 accepted; invalid byte sequences rejected. |
| Composition example | Pure Kotlin | A canned input map containing one allow-listed clean header, one allow-listed dirty header, one disallowed header passes through HeadersAllowList.filter + sanitizeHeader correctly. |

DQ-R1-019 pinned the per-partition encryption-key design:

  • Two-axis envelope a{N}.k{SM-VERSION-ID}:<base64-payload>.
  • a{N} — algorithm version (v1 ships only a1 — AES-256-GCM + HKDF-SHA256). Code-indexed; never retired.
  • k{SM-VERSION-ID} — AWS Secrets Manager versionId of the source key material; runtime-indexed via a single ExternalSecret mount projecting a JSON map of every live key-material version into the in-memory MaterialRegistry. The cipher does not make any application-side calls to AWS Secrets Manager.
  • HKDF-SHA256 derivation from a 64-byte SM input (DQ-203 in the application-layer set).

TokenCipher is the Phase 5a primitive implementing this envelope. Hmac is a small adjacent helper that DRYs two existing JDK-Mac call sites and serves as TokenCipher’s internal HKDF building block.

Per DQ-R1-030:

  • Factory shape — companion operator fun invoke(...): Result<TokenCipher>. Constructor-shaped call site; Result<T> carries validation failures.
  • Auth-tag failure classification — AppError.Internal.IncompatibleState. Bug-worthy; pages on-call.
  • Hmac extraction — shared between TokenCipher, OpaqueId.kt, and S3AssetService.kt.

New package cards.arda.common.lib.crypto:

crypto/
├── TokenCipher.kt -- envelope cipher; public
├── Hmac.kt -- HmacSHA256 wrapper; public
├── EnvelopeAlgorithm.kt -- internal interface for algorithm-version dispatch
└── EnvelopeAlgorithmA1.kt -- internal v1 implementation

Hmac is exposed as a sibling helper because the two existing JDK-Mac call sites are public callers of the same primitive; HKDF stays internal to TokenCipher in v1 (DT-003 deferred).

package cards.arda.common.lib.crypto

import cards.arda.common.lib.lang.errors.AppError

/**
 * Thin wrapper over [javax.crypto.Mac] for HmacSHA256.
 *
 * Existing call sites that this wrapper replaces:
 * - cards.arda.common.lib.runtime.observability.OpaqueId (HMAC of the
 *   tenant identifier with the Sentry-scrub salt).
 * - cards.arda.common.lib.infra.storage.S3AssetService (HMAC of the
 *   S3 object key with a per-bucket secret).
 *
 * v1 exposes HmacSHA256 only. Other HMAC algorithms can be added as
 * additional public companion factories without breaking existing callers.
 */
class Hmac private constructor(...) {
    fun mac(input: ByteArray): Result<ByteArray> = ...

    companion object {
        /** SHA-256 HMAC keyed with [key]. Returns failure on empty key. */
        fun sha256(key: ByteArray): Result<Hmac> = ...
    }
}

package cards.arda.common.lib.crypto

import cards.arda.common.lib.lang.errors.AppError
import java.util.UUID

/**
 * Application-layer encrypted-field primitive matching the DQ-R1-019
 * two-axis envelope: `a{N}.k{SM-VERSION-ID}:<base64-payload>`.
 *
 * The envelope's algorithm-version axis (`a{N}`) is code-indexed; v1 ships
 * only `a1` (AES-256-GCM + HKDF-SHA256). The material-version axis
 * (`k{SM-VERSION-ID}`) is runtime-indexed via the [MaterialRegistry]
 * supplied at construction time.
 */
class TokenCipher private constructor(
    private val info: String,
    private val materials: MaterialRegistry,
    private val currentVersionId: UUID,
    private val algorithms: EnvelopeAlgorithmRegistry,
) {
    fun encrypt(plaintext: ByteArray): Result<String> = ...
    fun decrypt(envelope: String): Result<ByteArray> = ...

    companion object {
        /**
         * Construct a [TokenCipher] for a given purpose.
         *
         * @param info HKDF `info` constant (per-purpose; "email-server-token",
         *   "card-claim-token", etc.). Non-empty.
         * @param materials Registry mapping `versionId` -> 64-byte key material. Holds
         *   every key-material version the cipher must decrypt against.
         *   Pre-populated by the caller; may be mutated at runtime
         *   (e.g., by a caller-managed file watcher reacting to ESO
         *   refreshes). Must have at least one entry.
         * @param currentVersionId The `versionId` to use when encrypting new envelopes. Must
         *   be present in [materials] at construction time.
         */
        operator fun invoke(
            info: String,
            materials: MaterialRegistry,
            currentVersionId: UUID,
        ): Result<TokenCipher> = ...
    }
}

/** Registry mapping `versionId` -> 64-byte key material. */
class MaterialRegistry private constructor(...) {
    fun get(versionId: UUID): ByteArray? = ...
    fun add(versionId: UUID, material: ByteArray): Result<Unit> = ...

    companion object {
        fun of(initial: Map<UUID, ByteArray>): Result<MaterialRegistry> = ...
    }
}

The cipher does not consult any external system at runtime. The MaterialRegistry is the single source of key material. The registry is populated at construction time by the caller — typically Phase 5b’s EmailConfigurationService parsing a JSON map projected by ESO from the partition’s EmailEncryptionKey AWS Secrets Manager secret. The caller may mutate the registry at runtime in response to ESO refresh events (file watcher, periodic re-read); the cipher reads only what is currently in the registry.
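One way the registry could support runtime mutation safely is a concurrent map with defensive copies. This is a sketch under assumptions: only the `get`, `add`, and `of` signatures come from the design; the `ConcurrentHashMap` backing, the enforced 64-byte length check, and the `IllegalArgumentException` stand-in for an `AppError` are illustrative choices.

```kotlin
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

class MaterialRegistry private constructor(
    private val entries: ConcurrentHashMap<UUID, ByteArray>,
) {
    // Returns a copy so callers cannot mutate the stored material.
    fun get(versionId: UUID): ByteArray? = entries[versionId]?.copyOf()

    // Safe to call from a refresh thread while the cipher reads concurrently.
    fun add(versionId: UUID, material: ByteArray): Result<Unit> = when (material.size) {
        KEY_MATERIAL_BYTES -> {
            entries[versionId] = material.copyOf()
            Result.success(Unit)
        }
        else -> Result.failure(
            IllegalArgumentException("Key material for $versionId must be $KEY_MATERIAL_BYTES bytes"),
        )
    }

    companion object {
        const val KEY_MATERIAL_BYTES = 64

        // Stops at the first invalid entry and reports it as the failure.
        fun of(initial: Map<UUID, ByteArray>): Result<MaterialRegistry> {
            val registry = MaterialRegistry(ConcurrentHashMap())
            val firstFailure = initial.entries.firstNotNullOfOrNull { (id, material) ->
                registry.add(id, material).exceptionOrNull()
            }
            return when (firstFailure) {
                null -> Result.success(registry)
                else -> Result.failure(firstFailure)
            }
        }
    }
}
```

A caller-managed watcher would call `add` for each new version it sees in the projected JSON map; parsing that map is left out here.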

The envelope is a single string composed of three segments. The algorithm version and the SM versionId are joined with "."; the prefix and the base64 payload are joined with ":". Example:

a1.k01234567-89ab-cdef-0123-456789abcdef:<base64-of(IV || ciphertext || tag)>

| Segment | Example | Meaning |
|---|---|---|
| Algorithm version | a1 | Envelope-algorithm version. v1 ships only a1 (AES-256-GCM + HKDF-SHA256). |
| SM versionId | k01234567-89ab-cdef-0123-456789abcdef | AWS Secrets Manager versionId of the source material, k-prefixed. UUID format. |
| Payload | base64 of the concatenation of IV, ciphertext, and tag | AES-256-GCM output: 12-byte IV, then the ciphertext, then the 16-byte tag. |

Round-trip property: decrypt(encrypt(plaintext)) == plaintext for any plaintext, given the material referenced by the envelope’s versionId is reachable.

Parsing: split on ":", then split the prefix on ".". Reject anything that doesn’t match the expected three-segment shape with AppError.Invocation.GeneralValidation.
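The parsing rule can be sketched as follows. `ParsedEnvelope` and `parseEnvelope` are hypothetical names, `GeneralValidation` is a stand-in for `AppError.Invocation.GeneralValidation`, and the sketch accepts only `a1` because that is the only algorithm version v1 ships.

```kotlin
import java.util.Base64
import java.util.UUID

// Hypothetical holder for the three parsed segments.
class ParsedEnvelope(val algorithm: String, val versionId: UUID, val payload: ByteArray)

// Stand-in for AppError.Invocation.GeneralValidation (assumed shape).
class GeneralValidation(message: String) : Exception(message)

fun parseEnvelope(envelope: String): Result<ParsedEnvelope> {
    val outer = envelope.split(":")          // prefix and base64 payload
    val prefix = outer.first().split(".")    // algorithm version and k-prefixed versionId
    val wellFormed = outer.size == 2 && prefix.size == 2 &&
        prefix[0] == "a1" && prefix[1].startsWith("k")
    return when {
        !wellFormed ->
            Result.failure(GeneralValidation("Envelope does not match a{N}.k{versionId}:<payload>"))
        else -> runCatching {
            ParsedEnvelope(
                algorithm = prefix[0],
                versionId = UUID.fromString(prefix[1].removePrefix("k")),
                // The JDK basic decoder accepts input without '=' padding.
                payload = Base64.getDecoder().decode(outer[1]),
            )
        }.fold(
            onSuccess = { Result.success(it) },
            onFailure = {
                Result.failure<ParsedEnvelope>(GeneralValidation("Envelope has a malformed versionId or payload"))
            },
        )
    }
}
```

Neither a UUID nor standard base64 contains ":" or ".", so splitting on those separators is unambiguous for well-formed envelopes.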

Encryption:

  1. Pick the current versionId from [MaterialRegistry] (caller-supplied; typically AWSCURRENT).
  2. HKDF-SHA256 over (material, info, salt = empty) -> 32-byte AES key.
  3. Generate random 12-byte IV.
  4. AES-256-GCM with the derived key, the IV, and the plaintext; output = 12-byte IV || ciphertext || 16-byte tag.
  5. base64-encode output (no padding); concatenate a1.k{versionId}: prefix.

Decryption:

  1. Parse the envelope; extract versionId and base64 payload.
  2. Look up versionId in [MaterialRegistry]. If missing, return Result.failure(AppError.Transient.FailoverFailed(...)) (see § 4.7).
  3. HKDF-SHA256 over (material, info, salt = empty) -> 32-byte AES key.
  4. Split payload into IV (12), ciphertext (variable), tag (16).
  5. AES-256-GCM decrypt with the derived key. If the auth-tag check fails, return Result.failure(AppError.Internal.IncompatibleState(...)) (see § 4.7).
  6. Return Result.success(plaintext).
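The encryption and decryption steps above can be sketched with JDK primitives only. This is an illustration under assumptions, not the `EnvelopeAlgorithmA1` implementation: the function names are hypothetical, the two error classes are stand-ins for the `AppError` subtypes, a plain `Map` stands in for `MaterialRegistry`, and envelope shape validation is skipped.

```kotlin
import java.security.SecureRandom
import java.util.Base64
import java.util.UUID
import javax.crypto.AEADBadTagException
import javax.crypto.Cipher
import javax.crypto.Mac
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

// Stand-ins for AppError.Internal.IncompatibleState and AppError.Transient.FailoverFailed.
class IncompatibleState(message: String, cause: Throwable) : Exception(message, cause)
class FailoverFailed(cause: Throwable) : Exception(cause)

private const val IV_BYTES = 12
private const val TAG_BITS = 128

private fun hmacSha256(key: ByteArray, data: ByteArray): ByteArray =
    Mac.getInstance("HmacSHA256").apply { init(SecretKeySpec(key, "HmacSHA256")) }.doFinal(data)

// HKDF-SHA256 (RFC 5869). An empty salt is defined as 32 zero bytes, and a
// 32-byte output needs only the first expand block: T(1) = HMAC(PRK, info || 0x01).
fun deriveAesKey(material: ByteArray, info: String): SecretKeySpec {
    val prk = hmacSha256(ByteArray(32), material)
    val okm = hmacSha256(prk, info.toByteArray(Charsets.UTF_8) + 0x01.toByte())
    return SecretKeySpec(okm, "AES")
}

fun encryptA1(plaintext: ByteArray, material: ByteArray, info: String, versionId: UUID): String {
    val iv = ByteArray(IV_BYTES).also { SecureRandom().nextBytes(it) }
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, deriveAesKey(material, info), GCMParameterSpec(TAG_BITS, iv))
    // The JDK appends the 16-byte tag to the ciphertext, so this is IV || ciphertext || tag.
    val output = iv + cipher.doFinal(plaintext)
    return "a1.k$versionId:" + Base64.getEncoder().withoutPadding().encodeToString(output)
}

// Shape validation is omitted: the envelope is assumed to be well-formed here.
fun decryptA1(envelope: String, materials: Map<UUID, ByteArray>, info: String): Result<ByteArray> {
    val versionId = UUID.fromString(envelope.substringBefore(':').substringAfter(".k"))
    val payload = Base64.getDecoder().decode(envelope.substringAfter(':'))
    return when (val material = materials[versionId]) {
        null -> Result.failure(
            FailoverFailed(IllegalStateException("Key material for version $versionId not present in registry")),
        )
        else -> runCatching {
            val cipher = Cipher.getInstance("AES/GCM/NoPadding")
            val spec = GCMParameterSpec(TAG_BITS, payload, 0, IV_BYTES)
            cipher.init(Cipher.DECRYPT_MODE, deriveAesKey(material, info), spec)
            cipher.doFinal(payload, IV_BYTES, payload.size - IV_BYTES)
        }.fold(
            onSuccess = { Result.success(it) },
            onFailure = { e ->
                when (e) {
                    is AEADBadTagException ->
                        Result.failure<ByteArray>(IncompatibleState("Auth-tag check failed for $versionId", e))
                    else -> Result.failure<ByteArray>(e)
                }
            },
        )
    }
}
```

With the JDK API, the "split payload" step reduces to separating the 12-byte IV from the rest, because `doFinal` expects the ciphertext and tag together and raises `AEADBadTagException` when the tag does not verify.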

Two failure modes on decrypt, classified into different AppError families because the operational response differs.

Auth-tag mismatch -> Result.failure(AppError.Internal.IncompatibleState(...)). Genuinely bug-class:

  • The versionId was found, so we’re using the correct key material.
  • The auth tag failed, so either (a) the ciphertext was corrupted in storage, (b) the IV or tag bytes were truncated / shifted, or (c) the envelope was tampered with.
  • None of these are normal operational outcomes. Auth-tag failure is rare; when it happens, an engineer needs to look. Internal.IncompatibleState’s reportable() returns listOf(this), so this surfaces in on-call alerting.

Unknown versionId -> Result.failure(AppError.Transient.FailoverFailed(cause)) where cause is a synthetic Throwable whose message names the missing versionId (e.g. IllegalStateException("Key material for version $versionId not present in registry")). Bounded transient:

  • The pod’s MaterialRegistry has not yet observed a version that the secret store knows about. In production this happens during the brief window between an ESO refresh of the projected JSON map and the application’s reaction to that refresh (or, less commonly, during pod startup before the first projection has been read).
  • Existing retry layers — Postmark webhook delivery retries, the application’s outbound idempotency + retry on AppError.Transient.*, L4 client retries on 5xx — fire after timescales that exceed ESO’s reconciliation interval. By the time the retry attempt arrives, the registry has been refreshed and the decrypt succeeds.
  • Transient’s reportable() returns empty list, so this does NOT page on-call.
  • Class-name caveat: AppError.Transient.FailoverFailed was originally defined for Aurora failover scenarios; its name is observability noise here. Diagnostic information lives in the cause’s message and in structured logging at the catch site. A more specific subtype (Transient.PropagationLag or similar) is intentionally NOT added because adding to the sealed Transient hierarchy would force consumers’ exhaustive when blocks to update — a breaking change disproportionate to the observability gain. Reuse of the existing subtype is the deliberate trade-off.

If operational reality reveals spurious auth-tag failures (e.g., a real ESO sync producing brief windows of stale key material in a mount), the auth-tag classification can be revisited — but the starting position is bug-worthy.

4.8 Adjacent refactor — OpaqueId and S3AssetService

Two existing call sites inline the JDK-Mac dance for HmacSHA256:

// OpaqueId.kt:67
val mac = Mac.getInstance("HmacSHA256").apply { init(SecretKeySpec(salt, "HmacSHA256")) }
// S3AssetService.kt:143-144
Mac.getInstance("HmacSHA256")
.apply { init(SecretKeySpec(key, "HmacSHA256")) }

Both migrate to:

val macResult = Hmac.sha256(key).flatMap { hmac -> hmac.mac(input) }
// ... macResult continues inside the existing Result chain (no getOrThrow, per the coding standards)

The behaviour at both sites is byte-identical before and after; the migration is a private call-site refactor with no external API change. The PR’s CHANGELOG entry stays Added-only (the new helper is the addition; the migration is internal).
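A minimal sketch of what the wrapper might look like internally, to make the byte-equivalence claim concrete. Only `sha256(key)` and `mac(input)` come from the design; the stored `SecretKeySpec`, the per-call `Mac` instance, the `IllegalArgumentException` stand-in for an `AppError`, and the `hmacSha256` convenience function are assumptions. The standard-library `fold` is used for chaining here because the workspace `flatMap` extension is not part of the Kotlin stdlib.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

class Hmac private constructor(private val keySpec: SecretKeySpec) {
    // Mac instances are stateful and not thread-safe, so a fresh one is created per call.
    fun mac(input: ByteArray): Result<ByteArray> = runCatching {
        Mac.getInstance(ALGORITHM).apply { init(keySpec) }.doFinal(input)
    }

    companion object {
        private const val ALGORITHM = "HmacSHA256"

        // SecretKeySpec itself rejects an empty key; checking first keeps that on the Result channel.
        fun sha256(key: ByteArray): Result<Hmac> = when {
            key.isEmpty() -> Result.failure(IllegalArgumentException("HMAC key must not be empty"))
            else -> Result.success(Hmac(SecretKeySpec(key, ALGORITHM)))
        }
    }
}

// Hypothetical call-site shape: both fallible steps chained without getOrThrow.
fun hmacSha256(key: ByteArray, input: ByteArray): Result<ByteArray> =
    Hmac.sha256(key).fold(
        onSuccess = { it.mac(input) },
        onFailure = { Result.failure<ByteArray>(it) },
    )
```

Because the wrapper calls the same `Mac.getInstance("HmacSHA256")` and `SecretKeySpec` pair as the inlined code, its output for a given key and input matches what the two call sites produce today.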

| Surface | Test type | What it asserts |
|---|---|---|
| Hmac.sha256 ctor | Pure Kotlin | Empty key returns Result.failure; non-empty key returns Result.success. |
| Hmac.mac round-trip | Pure Kotlin | Known-answer test against a canned (key, input, expected) triple. |
| TokenCipher.invoke ctor | Pure Kotlin | Empty info returns Result.failure; empty registry returns Result.failure; currentVersionId not in materials returns Result.failure; valid args return Result.success. |
| TokenCipher encrypt-decrypt round-trip | Pure Kotlin | Random plaintexts of various lengths (0, 1, 16, 1024, 65536 bytes) round-trip cleanly within the same MaterialRegistry. |
| Material-version transition | Pure Kotlin | Envelope written with versionId=A decrypts cleanly when versionId=A is still in the registry. |
| Tampered envelope | Pure Kotlin | Flipping a single byte of the base64 payload causes decrypt to return Result.failure(AppError.Internal.IncompatibleState). |
| Unknown versionId | Pure Kotlin | Envelope referencing a versionId not in the registry returns Result.failure(AppError.Transient.FailoverFailed) whose cause’s message names the missing version. |
| Bad envelope shape | Pure Kotlin | Envelopes missing the . or : separators return Result.failure(AppError.Invocation.GeneralValidation). |
| OpaqueId regression | Pure Kotlin | Existing OpaqueIdTest continues to pass post-migration. |
| S3AssetService regression | Pure Kotlin | Existing S3AssetService tests continue to pass post-migration. |

Carved out into a separate design document due to depth and surface area. See idempotency-design.md for:

  • API sketches (RawIdempotencyStore, IdempotencyStore<Req, Res>, IdempotencyStoreFactory, IdempotencyKeyMinter).
  • Package layout (cards.arda.common.lib.runtime.idempotency).
  • DB schema and concurrency strategy (INSERT ON CONFLICT + follow-up SELECT, no row locks).
  • Canonical-JSON helper for stable hashing.
  • Test plan (ContainerizedPostgres lifecycle, Mismatch detection, InFlight, purgeExpired).

Phase 5b L3 services consume the typed view (IdempotencyStore<EmailSendRequest, EmailJob>) and the IdempotencyKeyMinter for outbound Postmark retry safety. The idempotency_record Flyway migration ships in Phase 5b’s consumer adoption (operations), not in common-module.

All five helpers follow the workspace kotlin-coding standards:

  • Every fallible method returns Result<T>; single-exit composition with flatMap / mapCatching.
  • No !!, getOrThrow, getOrNull. Tests are the exception (Result<T>.getOrThrow() is permissible to surface unexpected failures as test failures).
  • when over if for branching on type or status.
  • DI for all external dependencies (Postgres connection, AWS SDK clients, JsonConfig).
  • @JvmInline value classes for primitive type-safety (ConsumerNamespace, IdempotencyKey in the idempotency design; Hmac and TokenCipher are classes because they hold mutable internal state).

JsonConfig.standardJson (at cards.arda.common.lib.lang.serialization.Json.kt) is the canonical Json instance for all kotlinx serialization in common-module. Phase 5a’s error_payload projection and the idempotency canonical-JSON helper use it directly. For canonical hashing (where prettyPrint = false is needed), refine via:

private val canonicalJson = JsonConfig.refine { prettyPrint = false }

JsonConfig.refine(block) returns a fresh Json instance with the standard configuration applied first, then block applied. See Json.kt:31.

common-module publishes to the workspace’s GitHub Packages repository. Each Phase 5a PR’s CHANGELOG entry version bump becomes the published artefact version. Phase 5b’s gradle.properties bump consumes the published version once Phase 5a’s final PR has merged.

Consumer-side authentication uses the workspace GITHUB_TOKEN pattern (per workspace memory: GITHUB_TOKEN=$(gh auth token) npm install for npm; equivalent Gradle pattern for common-module).

Phase 5a’s helpers must compose cleanly with constraints set in earlier rounds of the project’s design. The four below are not restated as decisions in this document (they were settled before Phase 5a started) but every helper in this design honours them:

  • DQ-204 — STS role chain for outbound AWS calls (decision-log, DQ-R1-020). TokenCipher makes no outbound AWS calls: MaterialRegistry is populated by the caller from an ESO-projected JSON map (see § 4.4). common-module stays AWS-SDK-agnostic. AppError classification for STS-class failures (authorization-shaped) is the L1 / L2 caller’s responsibility in Phase 5b; Phase 5a’s Application.PolicyRejected is available for that classification but is not wired here.
  • DQ-206 -- Outbound encryption-key handling (decision-log). Plaintext lives only on the call stack. TokenCipher.encrypt and TokenCipher.decrypt neither log nor cache plaintext; the MaterialRegistry holds source key material keyed by SM versionId, never plaintext and never derived keys. The HKDF derivation runs in-stack per call; derived keys are not cached.
  • DQ-208 — Async-tx boundaries (decision-log). L3 services own transactions; common-module helpers must not open or close transactions on the caller’s behalf. The IdempotencyStoreFactory’s inTransaction(connection) / inConnection(connection) / withTx(tx) shape (mirroring DatabaseBackedMap) binds the store to the caller’s transaction without owning it.
  • Cross-Universe rule (information-model-design). Entities owned by different services must not share foreign keys or transactions. The shared idempotency_record table is partitioned by ConsumerNamespace (see idempotency-design.md § 3.1); the schema has no foreign keys to consumer-owned tables; per-consumer rows never cross service boundaries.

Five PRs land in common-module. Per DQ-R1-028, four are Added-only minors and one (the sweep) is Changed major. The sweep lands last so consumers absorb one combined gradle.properties bump.

| # | Deliverable | Source design | Release | Independence |
|---|---|---|---|---|
| 1 | AppError.Application introduction (the three subtypes + reportable() override) | § 1 | Added; 9.2.0 | No predecessors. PR #2 depends on this. |
| 3 | sanitizeHeader (lib/api/headers/) | § 3 | Added; 9.3.0 | No predecessors. Parallelisable with #1, #4, #5. |
| 4 | TokenCipher + Hmac (lib/crypto/) + OpaqueId / S3AssetService migration | § 4 | Added; 9.4.0 | No predecessors. Parallelisable with #1, #3, #5. |
| 5 | Idempotency helpers (lib/runtime/idempotency/) | idempotency-design.md | Added; 9.5.0 | No predecessors. Parallelisable with #1, #3, #4. |
| 2 | Internal.IncompatibleState reclassification sweep | § 2 | Changed; 10.0.0 | Requires PR #1 merged. Lands last. |

PRs #1, #3, #4, #5 are parallelisable in any order. PR #2 is sequenced last so it lands as the major-bump consolidation.

PR-by-PR base: each PR opens off origin/main. There is no integration branch in common-module; the five PRs are independent contributions that merge in their own order. The Phase 5b consumer adoption PR sees the cumulative effect via a single gradle.properties lift to 10.0.0.

| Risk | Mitigation |
|---|---|
| Adding AppError.Application as a sealed-class peer of Internal/Invocation causes source-incompatibility for consumers doing exhaustive when over AppError. | This is a known cost of sealed-class additions and the workspace already lives with it. The PR is Added-only because it adds a new branch; consumers update their when clauses opportunistically. (The sweep PR, which reclassifies sites into the new branch, is the Changed/major-bump release.) |
| The IncompatibleState sweep mis-classifies a site (kept when it should have moved, or vice versa). | Per-site one-line rationale captured in the PR description; the sweep is reviewable as a checklist; mis-classifications are correctable in a follow-up PR (additive within the same release line). |
| TokenCipher auth-tag failures fire spuriously in production due to ESO sync gaps producing brief windows of stale key material. | Operational dashboards (Sentry on Internal.IncompatibleState) surface the failure rate; if non-trivial, the classification revisits with operational data. The genuine “key not present yet” case is the Transient.FailoverFailed path (§ 4.7), not the auth-tag path; spurious ESO-sync windows should not produce auth-tag mismatches (the material itself is intact when projected). |
| The Hmac extraction breaks the OpaqueId / S3AssetService behaviour subtly. | Existing tests for both files continue to pass after the migration; known-answer tests in HmacTest.kt confirm byte-equivalent output for canned inputs. |
| Phase 5b consumes the new helpers before common-module 10.0.0 is published. | Phase 5b’s implementation merge is gated on the publication. The Phase 5a release sequencing ensures 10.0.0 is the final release of Phase 5a; Phase 5b’s gradle.properties bumps to that exact version. |

  • goal.md — Phase 5a goal, success criteria, repository scope.
  • task-plan.md — six-PR execution plan with worktree strategy.
  • idempotency-design.md — carved-out idempotency-helpers design.
  • Inherited decisions (DQ-201..208, DQ-012, DQ-R1-019) live in ../../decision-log.md; see § 6.4 above for the four constraints this design honours implicitly.
  • decision-log.md -- DQ-R1-027 through DQ-R1-031 for this phase; DQ-R1-019 for the encryption-envelope source.
  • cards.arda.common.lib.lang.errors.AppError (AppError.kt) — existing hierarchy that § 1 extends.
  • cards.arda.common.lib.lang.serialization.JsonConfig (Json.kt) — canonical Json instance; refine(block) helper at line 31.
  • cards.arda.common.lib.runtime.observability.HeadersAllowList (HeadersAllowList.kt) — composition counterpart for § 3.
  • cards.arda.common.lib.runtime.observability.OpaqueId (OpaqueId.kt:67) — HmacSHA256 migration target.
  • cards.arda.common.lib.infra.storage.S3AssetService (S3AssetService.kt:143) — second HmacSHA256 migration target.
  • cards.arda.common.lib.persistence.keystore.DatabaseBackedMap (DatabaseBackedMap.kt) — factory pattern (inTransaction / inConnection / withTx) the idempotency factory mirrors.
  • kotlin-coding -- Result<T>, single-exit, when over if, no !! / getOrThrow / getOrNull.
  • plantuml-guide — diagram conventions (validated; named colors; prose summary).