Design: Per-Partition Email Server-Token Encryption Key

This document specifies the runtime data-encryption key that protects per-tenant Postmark server tokens in the database. Phase 4 creates this key per partition; Phase 5b’s EmailConfigurationService consumes it via HKDF derivation and uses it to encrypt server tokens with AES-256-GCM before INSERT and to decrypt them on demand. It closes the open sub-questions left by DQ-012, DQ-202, and DQ-203, and pins the three Phase 4 sub-decisions that together resolve DQ-R1-019.

Decision Summary

#	Sub-question	Resolution
1	How is the secret initially generated and named?	CDK `aws_secretsmanager.Secret` with `generateSecretString: { passwordLength: 64 }`, `RemovalPolicy.RETAIN`, named `{fqn}-I-EmailEncryptionKey` per partition (single SM secret name; matches the established Arda convention where `-I-` is the marker even for ESO-consumed secrets). No version suffix in the resource name; version tracking is handled by AWS Secrets Manager’s native versioning.
2	What identifies a particular ciphertext’s encryption material?	A two-axis envelope `a{N}.k{SM-VERSION-ID}:<base64-payload>`. `a{N}` is the algorithm version (rare; code-indexed; bumps require a release). `k{SM-VERSION-ID}` is the AWS Secrets Manager `versionId` of the secret material used at write time (frequent; runtime-indexed; bumps on every rotation). Algorithm and material lifecycles are profoundly different; coupling them in one marker would churn the code-side dispatch table on every rotation.
3	How is the secret rotated?	Rotation = writing a new value to the single SM secret with `aws secretsmanager put-secret-value` (procedure in Sub-decision 3). AWS creates a new versionId, moves `AWSCURRENT` to it, and demotes the prior version to `AWSPREVIOUS`. ESO mounts both `AWSCURRENT` and `AWSPREVIOUS` into the operations component pod via two `ExternalSecret` resources. The Email module’s `TokenCipher` holds both derived keys and dispatches on the envelope’s `k{versionId}` prefix. Migration is lazy + coroutine mop-up (Sub-decision 3). Rare ciphertexts older than `AWSPREVIOUS` trigger a direct AWS SM SDK fetch via the `EmailEncryptionKeyFallbackRole` that the operations pod assumes from its IRSA-bound pod role (DQ-R1-020 STS chain; cache-miss path). Fetched material is cached for the pod’s lifetime.

Full rationale below. The Round R1-Phase4 decision-log entry for DQ-R1-019 (yet to be written) summarises this design.

Context

What’s already pinned by prior decisions

Decision	What it locks
DQ-012	Per-tenant Postmark server tokens are encrypted application-side with a partition-wide symmetric key before INSERT; key lives in AWS Secrets Manager; key delivered to pods via ESO as `extras.email.encryptionKey` (HOCON); read only by `EmailConfigurationService` (L3). No per-tenant SM writes. No KMS-CMK envelope.
DQ-202	On-disk format: AES-256-GCM versioned envelope, base64 in `email_configuration.server_token_encrypted` (text column). The “versioned” qualifier carries forward to envelope-format version.
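DQ-203	The SM value is a 64-character high-entropy `GeneratedSecret` (64 bytes once UTF-8 encoded). The application HKDF-derives the actual AES-256 key from that secret in `TokenCipher`’s constructor, with `info = "arda.email.serverToken.v1"`. The `v1` in the info string is the version qualifier that Sub-decision 2 re-expresses as the algorithm axis `a{N}`, so the info string for `a1` becomes `"arda.email.serverToken.a1"`.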
DQ-205	Persist-first lifecycle; decryption only on demand by `getUnlockedConfiguration()`. Plaintext lives only in the in-memory call stack during a send.

Threat model boundary (already-resolved scope)

In scope: defense against DB exposure with read-only privilege — backups leaving the trust boundary, SQL-injection-style read leaks, analyst sessions. A SELECT * on email_configuration yields ciphertext only.
Out of scope: pod / process compromise (attacker has both key and DB); insider with both extras.email.encryptionKey HOCON access and DB write privilege; Postmark account compromise. Platform-level controls (IRSA, network segmentation, container hardening) own these.

This boundary informs every sub-decision below — in particular, why we do not invest in elaborate rotation machinery in v1: an attacker who can already read the DB cannot decrypt without also compromising the pod or the HOCON config, in which case rotation is moot until the pod-compromise vector is also closed.

Sub-questions left open for DQ-R1-019

Three: (1) name + creation of the SM secret in CDK; (2) the precise meaning of vN in the envelope; (3) rotation procedure.

Sub-decision 1 — Naming and creation

Recommendation

Resource name: {fqn}-I-EmailEncryptionKey per partition. The -I- marker matches the convention as practiced (see “Naming convention” below). No version suffix in the resource name — version tracking is delegated to AWS Secrets Manager’s native per-secret versioning.

CDK declaration (in stacks/purpose/partition-email.ts, Phase 4 deliverable):

new aws_secretsmanager.Secret(this, "EmailEncryptionKey", {
  secretName: `${fqn}-I-EmailEncryptionKey`,
  description:
    "Per-partition data-encryption key (HKDF input) for tenant Postmark " +
    "server tokens. Consumed by the operations component via two ESO " +
    "mounts (AWSCURRENT and AWSPREVIOUS). Rotation is `aws " +
    "secretsmanager update-secret`. DO NOT delete versions or the " +
    "secret itself without explicit operator action — any retired " +
    "version that still appears in an `email_configuration." +
    "server_token_encrypted` envelope's `k{...}` tag is unrecoverable " +
    "for that row.",
  generateSecretString: {
    passwordLength: 64,
    excludeCharacters: '"@/\\',
    excludePunctuation: false,
  },
  removalPolicy: cdk.RemovalPolicy.RETAIN,
});

One secret per partition. AWS SM versions carry the per-rotation history; no sibling CDK constructs per version. Rotation = aws secretsmanager put-secret-value against this single name (Sub-decision 3).
No cross-partition sharing. No derivation from a parent secret. Each partition’s blast radius is bounded by its own SM secret.

CDK lifecycle invariants (no auto-rotation)

CDK’s generateSecretString synthesizes to CloudFormation’s GenerateSecretString property, which has “generate-if-missing” semantics — the random value is produced only when the secret resource doesn’t yet exist. Subsequent cdk deploy runs (and the amm.sh re-runs that drive them) are no-ops on this secret’s value. Specifically:

Event	Behavior on this secret
First `cdk deploy` (secret doesn’t exist)	CFN calls `secretsmanager:GetRandomPassword` and creates the secret
Subsequent `cdk deploy` (CDK config unchanged)	No-op; secret value unchanged
`amm.sh` re-run for a deployed partition	No-op on this secret
Operator rotation (`aws secretsmanager get-random-password` → `put-secret-value`)	Rotation per Sub-decision 3 — works; CDK does not fight it (CDK doesn’t track the secret’s value)
Next `cdk deploy` after manual rotation	Still no-op
Resource `Name` change	Resource replacement (delete + create) → new value generated. Avoided: stack name and resource name are immutable per the project’s CFN-name discipline.
Any field of `generateSecretString` changes (e.g., `passwordLength: 64 → 65`)	CFN may regenerate the value, silently rotating it without going through the Sub-decision 3 procedure. Do not modify these fields post-launch.

The generateSecretString configuration must be treated as immutable once the secret has been deployed to any partition. If a future algorithm change requires different key derivation parameters, bump a{N} (Sub-decision 2) instead — the algorithm registry carries the HKDF parameters, and a routine update-secret rotation produces the new material under the existing generation config.

The construct’s CDK ID at the call site ("EmailEncryptionKey" in the code block above) is similarly immutable: changing it would alter the CFN logical ID and force a resource replacement, regenerating the value and orphaning every prior k{...} envelope. The construct’s secretName property (CloudFormation Name; {fqn}-I-EmailEncryptionKey) is immutable for the same reason.

This invariant is load-bearing for the rotation model in Sub-decision 3: that model presumes the only way the SM secret’s value changes is via operator-driven update-secret. Accidental CDK-driven regeneration would create a new versionId without going through the dual-ESO-mount migration path, breaking every prior a1.k{...} envelope in flight.

Naming convention

The Arda repo uses -I- as the marker on every existing partition-scoped AWS resource name, including secrets consumed by ESO (which is technically a non-CDK consumer). Examples from operations/src/main/helm/templates/secrets.yaml:

{infra}-{partition}-I-ArdaApiKey
{infra}-{partition}-I-DocumintApiKey
{infra}-{partition}-I-GhcrPullSecret
{infra}-{partition}-I-PrimaryDb (RDS-managed)
{infra}-{partition}-I-EmailPostmarkAccountToken (Phase 4 sibling secret declared in cross-cutting-design.md)

So the -I- / -API- distinction documented in cdk-infrastructure.md § Export naming is interpreted in practice as: -I- for intra-partition / intra-account resources; -API- for cross-account or cross-partition exports. The encryption key is intra-partition (one partition’s operations pod reads one partition’s SM secret via its own IRSA role; no cross-account flow), so -I- is correct. The earlier draft of this design that proposed -API- is retracted.

Why `RemovalPolicy.RETAIN`

Losing the SM secret resource (not just a single version) loses every encrypted tenant token in the partition (irreversible). cdk destroy must not delete this. Operator-initiated deletion is a deliberate step through the AWS console / CLI, never CDK-triggered. Deletion of individual SM versions follows the same discipline — see Sub-decision 3’s retirement procedure.

What the secret value is

64 characters of high-entropy material drawn from the printable ASCII set that GenerateSecretString uses, which is 64 bytes once UTF-8 encoded (the form TokenCipher reads). This is not the AES-256 key directly; it is the HKDF input. The application derives a 32-byte AES-256 key from it via HKDF-SHA256 with info = "arda.email.serverToken.a{N}" at first use (DQ-203, with the version qualifier updated to track the algorithm axis a{N} introduced in Sub-decision 2). Choosing 64 characters (rather than 32) leaves headroom for future algorithm changes that need additional derived keys (e.g., an HMAC key for envelope integrity) without changing the SM secret shape.

Sub-decision 2 — Two-axis envelope (`a{N}.k{SM-VERSION-ID}`)

Recommendation

The envelope carries two version markers rather than one:

a{N}.k{SM-VERSION-ID}:<base64(NONCE || CIPHERTEXT || TAG)>

a{N} — algorithm version. A short monotonic counter (a1, a2, …). Bumps rarely — only when the algorithm / KDF / nonce size / envelope layout / HKDF info string changes. Each bump requires a code release. Indexed in code.
k{SM-VERSION-ID} — secret material version. The AWS Secrets Manager versionId (a 36-char UUID like EXAMPLE1-90ab-cdef-fedc-ba0987654321) of the SM version that held AWSCURRENT at write time. Bumps on every rotation. Indexed at runtime, via SM.

The two axes are decoupled because their lifecycles are profoundly different:

Axis	Cadence	Triggers	Indexed where	Entries retired when
`a{N}`	Years apart	Algorithm/KDF/envelope change (a code release)	`EnvelopeAlgorithmRegistry` (compile-time dispatch table in `operations`)	Never — entries stay in code forever so historical envelopes remain decodable
`k{...}`	Rotation cadence (compliance / on-compromise / scheduled)	`aws secretsmanager update-secret`	`SecretMaterialRegistry` (runtime map populated from ESO; cache-miss falls back to AWS SM SDK)	After the migration pass drains every row tagged with that versionId AND the operator explicitly deletes the SM version

Coupling them under a single vN (the previous draft of this design) would churn the code-side algorithm registry on every routine rotation. The two-axis split keeps the rare/code-bound axis stable while the frequent/material axis tracks SM-native versioning.
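A minimal sketch of splitting an envelope into its two axes, written in TypeScript only so the example is self-contained and runnable (Phase 5b's implementation is Kotlin). The `EnvelopeHeader` shape and the regular expression are illustrative assumptions; the design pins the `a{N}.k{SM-VERSION-ID}:<payload>` layout but not a parser.

```typescript
// Parsed form of "a{N}.k{SM-VERSION-ID}:<base64-payload>".
interface EnvelopeHeader {
  algoTag: string;      // algorithm axis, e.g. "a1"
  keyVersionId: string; // SM versionId, without the leading "k"
  payload: string;      // base64 payload, still encoded
}

// Illustrative pattern (an assumption): "a" + positive integer, ".k",
// a versionId containing no ":", then ":" and a non-empty payload.
const ENVELOPE_PATTERN = /^(a[1-9][0-9]*)\.k([^:]+):(.+)$/;

function parseHeader(envelope: string): EnvelopeHeader {
  const match = ENVELOPE_PATTERN.exec(envelope);
  if (match === null) {
    throw new Error("malformed envelope: expected a{N}.k{versionId}:<payload>");
  }
  return { algoTag: match[1], keyVersionId: match[2], payload: match[3] };
}

const header = parseHeader("a1.kEXAMPLE1-90ab-cdef-fedc-ba0987654321:AAAA");
console.log(header.algoTag, header.keyVersionId);
// prints "a1 EXAMPLE1-90ab-cdef-fedc-ba0987654321"
```

Because the two axes come out as separate fields, the algorithm lookup and the material lookup can fail independently, which is what the two registries described later rely on.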

Concrete envelope format (a1)

a1.k<uuid>:<base64(NONCE || CIPHERTEXT || TAG)>

Where:

<uuid> — the AWS Secrets Manager versionId of the secret version used at write time.
NONCE: 12 random bytes per write (AES-GCM standard; uniform-random 12-byte nonces are safe within ~2^32 writes per derived key, comfortable margin).
CIPHERTEXT: AES-256-GCM(key, NONCE, plaintext, aad=∅) — no associated data in a1. A future a2 could bind AAD to the row’s email_configuration_id.
TAG: 16 bytes, the AES-GCM authentication tag.
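The ~2^32 figure quoted for NONCE follows from the birthday bound on 96-bit random values. A short worked version, assuming n independent, uniformly random nonces under one derived key:

```latex
\[
  P(\text{nonce collision}) \;\approx\; \frac{n(n-1)}{2 \cdot 2^{96}} \;\approx\; \frac{n^{2}}{2^{97}},
  \qquad
  n = 2^{32} \;\Rightarrow\; P \approx \frac{2^{64}}{2^{97}} = 2^{-33}.
\]
```

This stays under the 2^-32 collision probability that NIST SP 800-38D sets as the ceiling for randomly generated GCM IVs. The count is per derived key, so it restarts whenever a rotation introduces a new k{...}.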

Key derivation for a1: key = HKDF-SHA256(material = SM-version-bytes, info = "arda.email.serverToken.a1", length = 32). The HKDF info string carries the algorithm axis (a1) so the derived AES key changes if the algorithm version bumps, even with constant material — a defense against weak-by-construction algorithm migrations.
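The a1 derivation and envelope layout can be sketched end to end with Node's built-in `crypto` module. This is an illustration under stated assumptions, not the Phase 5b implementation (which is Kotlin): the function names are hypothetical, and the empty HKDF salt is an assumption because the design does not pin a salt.

```typescript
import { createCipheriv, createDecipheriv, hkdfSync, randomBytes } from "node:crypto";

const HKDF_INFO_A1 = "arda.email.serverToken.a1";
const NONCE_LEN = 12; // bytes, per the a1 format
const TAG_LEN = 16;   // bytes, AES-GCM authentication tag

// Derive the 32-byte AES-256 key from the SM version's material.
// Assumption: zero-length salt (the design does not specify one).
function deriveKeyA1(material: Buffer): Buffer {
  return Buffer.from(hkdfSync("sha256", material, Buffer.alloc(0), HKDF_INFO_A1, 32));
}

// Produce "a1.k{versionId}:base64(NONCE || CIPHERTEXT || TAG)" with no AAD.
function encryptA1(key: Buffer, plaintext: Buffer, keyVersionId: string): string {
  const nonce = randomBytes(NONCE_LEN); // fresh nonce on every write
  const cipher = createCipheriv("aes-256-gcm", key, nonce);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const payload = Buffer.concat([nonce, ciphertext, cipher.getAuthTag()]);
  return `a1.k${keyVersionId}:${payload.toString("base64")}`;
}

// Split the payload back into NONCE / CIPHERTEXT / TAG and verify the tag.
function decryptA1(key: Buffer, envelope: string): Buffer {
  const payload = Buffer.from(envelope.slice(envelope.indexOf(":") + 1), "base64");
  if (payload.length < NONCE_LEN + TAG_LEN) throw new Error("envelope payload too short");
  const nonce = payload.subarray(0, NONCE_LEN);
  const tag = payload.subarray(payload.length - TAG_LEN);
  const ciphertext = payload.subarray(NONCE_LEN, payload.length - TAG_LEN);
  const decipher = createDecipheriv("aes-256-gcm", key, nonce);
  decipher.setAuthTag(tag);
  // final() throws if the tag does not match (wrong key or tampered payload).
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}

const key = deriveKeyA1(Buffer.from("example-material-not-a-real-secret", "utf8"));
const envelope = encryptA1(key, Buffer.from("postmark-server-token", "utf8"),
  "EXAMPLE1-90ab-cdef-fedc-ba0987654321");
console.log(decryptA1(key, envelope).toString("utf8")); // prints "postmark-server-token"
```

Decrypting with a key derived from different material fails at the tag check, which is the behaviour the dispatch on k{...} depends on: picking the wrong SM version surfaces as an error rather than as garbage plaintext.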

When each axis bumps (worked examples)

Event	`a{N}` action	`k{...}` action	Re-encryption pass needed?
Phase 4 initial deploy	Establish `a1`	Establish `k{uuid-1}` (first AWSCURRENT)	No (no existing rows)
Routine secret rotation	unchanged (`a1`)	`k{uuid-1}` → `k{uuid-2}` via `update-secret`	Yes
Compliance-driven secret rotation	unchanged	`k{uuid-N}` → `k{uuid-N+1}`	Yes
Algorithm migration (e.g., to ChaCha20-Poly1305)	`a1` → `a2`	new SM versionId at the same time	Yes
Adding AAD to the envelope	`a1` → `a2`	new SM versionId	Yes
Pod config refresh, same SM secret	unchanged	unchanged	No
ESO refresh interval change	unchanged	unchanged	No

The two-axis design means a hypothetical “algorithm change without material rotation” (a1.k{X} → a2.k{X}) is structurally supported but operationally unusual — algorithm migrations almost always pair with a fresh rotation for hygiene. The framework supports either pattern with no code change.

Application-side dispatch (the two registries)

The envelope’s a{N}.k{...} is the join key into two parallel registries inside TokenCipher. Both must contain the corresponding entry for a ciphertext to be readable.

Index 1 — Algorithm registry (code-side, compile-time)

A sealed interface EnvelopeAlgorithm with one implementation per a{N}. Each implementation binds:

Envelope parser: how to decode the a{N}.k{...}:<payload> bytes (split NONCE / CIPHERTEXT / TAG; or any future layout).
KDF + info string: how to derive the symmetric key from the SM-version material (HKDF-SHA256 with info = "arda.email.serverToken.a{N}" for a1; may differ for a2+).
Cipher operations: encrypt/decrypt pair (AES-256-GCM for a1; potentially ChaCha20-Poly1305 etc. for a2+).
Nonce policy: 12 random bytes per write; regenerate per encryption.
AAD policy: empty for a1; could bind to email_configuration_id for a2+.

Conceptual shape (Phase 5b will implement):

sealed interface EnvelopeAlgorithm {
  val tag: String                                    // "a1"
  fun deriveKey(material: ByteArray): ByteArray      // HKDF-SHA256 with the algorithm's info string
  fun encrypt(key: ByteArray, plaintext: ByteArray, keyVersionId: String): String
  fun decrypt(key: ByteArray, envelope: String): ByteArray
}

object EnvelopeAlgorithmRegistry {
  private val byTag = mapOf(
    "a1" to AesGcmHkdfSha256A1,
    // "a2" to AesGcmHkdfSha256A2_WithAad,   // hypothetical: same cipher + AAD
    // "a3" to ChaCha20Poly1305A3,           // hypothetical: cipher change
  )
  fun resolve(tag: String): EnvelopeAlgorithm =
    byTag[tag] ?: throw UnknownAlgorithmVersion(tag)
}

Lifecycle: entries are never removed from the code. Once a1 is supported, the implementation stays in the binary forever — otherwise an archived backup or a forgotten row at a1 would become un-decryptable.

Index 2 — Secret material registry (runtime, mutable)

A startup-populated ConcurrentHashMap<String /* SM versionId */, ByteArray /* HKDF material */>. Two normal entry sources:

The AWSCURRENT ESO mount: K8s Secret email-encryption-key-current carries the material plus the SM versionId as separate fields. The application registers an entry keyed by the versionId.
The AWSPREVIOUS ESO mount: K8s Secret email-encryption-key-previous carries material + versionId. If no AWSPREVIOUS exists (initial deploy, before any rotation), the K8s Secret resolves empty and the application skips the registration.

Each ESO ExternalSecret is configured to project both the SecretString and the versionId metadata into its K8s Secret. The Helm chart’s HOCON template aggregates these into the application-visible list.

HOCON shape (concrete proposal for Phase 5b):

extras.email.encryptionKeys = [
  { versionId = ${?ARDA_EMAIL_KEY_CURRENT_VERSION_ID},  material = ${?ARDA_EMAIL_KEY_CURRENT_MATERIAL} }
  { versionId = ${?ARDA_EMAIL_KEY_PREVIOUS_VERSION_ID}, material = ${?ARDA_EMAIL_KEY_PREVIOUS_MATERIAL} }
]
extras.email.currentVersionId = ${?ARDA_EMAIL_KEY_CURRENT_VERSION_ID}
extras.email.secretArn        = ${?ARDA_EMAIL_KEY_SECRET_ARN}

The list is filtered at HOCON-load time to drop entries with empty/missing versionIds (handles AWSPREVIOUS-absent state). secretArn is needed for the SDK-fallback path.
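The load-time filtering can be pictured as follows. This is a TypeScript sketch under assumptions: `KeyEntry` and `buildMaterialMap` are hypothetical names, and the input stands in for the already-parsed `extras.email.encryptionKeys` list.

```typescript
// One element of extras.email.encryptionKeys after HOCON substitution.
// Either field may be absent or empty when the backing env var is unset.
interface KeyEntry {
  versionId?: string;
  material?: string;
}

// Keep only entries that carry both a versionId and material,
// keyed by SM versionId (hypothetical helper, not a pinned API).
function buildMaterialMap(entries: KeyEntry[]): Map<string, string> {
  const materials = new Map<string, string>();
  for (const entry of entries) {
    if (!entry.versionId || !entry.material) continue; // AWSPREVIOUS-absent case
    materials.set(entry.versionId, entry.material);
  }
  return materials;
}

// Initial deploy: AWSCURRENT is present, AWSPREVIOUS resolves empty.
const loaded = buildMaterialMap([
  { versionId: "EXAMPLE1-90ab-cdef-fedc-ba0987654321", material: "current-material" },
  { versionId: "", material: "" },
]);
console.log(loaded.size); // prints 1
```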

How `TokenCipher` combines them

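import java.util.concurrent.ConcurrentHashMap
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient
import software.amazon.awssdk.services.secretsmanager.model.ResourceNotFoundException

class TokenCipher(
  private val algorithms: EnvelopeAlgorithmRegistry,
  private val materials: ConcurrentHashMap<String, ByteArray>,   // SM versionId → 64-byte HKDF material
  private val derivedKeys: ConcurrentHashMap<Pair<String, String>, ByteArray> = ConcurrentHashMap(),
                                                                  // (algoTag, versionId) → derived AES key
  val currentVersionId: String,                                  // public: the L3 migration check reads it
  private val currentAlgorithmTag: String,                       // compile-time constant for the writer algorithm
  private val secretArn: String,
  private val sdkClient: SecretsManagerClient,
) {
  fun encrypt(plaintext: ByteArray): String {
    val algo = algorithms.resolve(currentAlgorithmTag)
    val key  = keyFor(algo, currentVersionId)
    return algo.encrypt(key, plaintext, currentVersionId)
  }

  fun decrypt(envelope: String): ByteArray {
    val (algoTag, keyVersionId) = parseHeader(envelope)            // splits "a1.k{...}:..."
    val algo = algorithms.resolve(algoTag)                         // → UnknownAlgorithmVersion if missing
    val key  = keyFor(algo, keyVersionId)                          // → RetiredSecretVersion if SM also unknown
    return algo.decrypt(key, envelope)
  }

  private fun keyFor(algo: EnvelopeAlgorithm, versionId: String): ByteArray =
    derivedKeys.computeIfAbsent(algo.tag to versionId) {
      val material = materials[versionId] ?: fetchFromSdk(versionId)
      algo.deriveKey(material)
    }

  private fun fetchFromSdk(versionId: String): ByteArray {
    val resp = try {
      sdkClient.getSecretValue { it.secretId(secretArn).versionId(versionId) }
    } catch (e: ResourceNotFoundException) {
      // SM no longer has this version. Constructor shape of RetiredSecretVersion is illustrative.
      throw RetiredSecretVersion(versionId)
    }
    val material = resp.secretString().toByteArray(Charsets.UTF_8)
    materials[versionId] = material                                 // cache for the pod's lifetime
    return material
  }

  companion object {
    // Splits "a{N}.k{versionId}:<payload>" into (algoTag, versionId).
    fun parseHeader(envelope: String): Pair<String, String> {
      val header = envelope.substringBefore(':')
      val algoTag = header.substringBefore(".k")
      val versionId = header.substringAfter(".k", missingDelimiterValue = "")
      require(algoTag.isNotEmpty() && versionId.isNotEmpty()) { "Malformed envelope header" }
      return algoTag to versionId
    }
  }
}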

Both indices contribute their checks:

Algorithm registry miss → UnknownAlgorithmVersion. Means an old deploy is reading a row written by a newer deploy that introduced a{N+1}. Should not happen if deploys go forward-only; surface as a deploy-version alarm.
Secret material registry miss with successful SDK fallback → cache populated; future reads of the same k{...} avoid the SDK call. The operations pod federates into its IRSA-bound pod role at startup, then performs sts:AssumeRole into the partition’s EmailEncryptionKeyFallbackRole (per DQ-R1-020) to obtain secretsmanager:GetSecretValue on the encryption-key SM ARN (no versionId restriction). The pod role itself does not carry the SM permission; permissions live on the purpose-specific fallback role.
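SDK fallback fails with ResourceNotFoundException → RetiredSecretVersion. Means the SM version no longer exists while rows still reference it (for example, its staging labels were removed and the service has since pruned it). Surface as an alarm; runbook covers manual remediation.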

Failure-mode matrix

State	Algorithm registry	Secret-material registry	SDK fallback can recover?	Effect
Normal `a1` operation	has `a1`	has `k{current}`	n/a	Encrypt + decrypt with `a1.k{current}`
Mid-rotation; AWSPREVIOUS exists	has `a1`	has `k{current}` and `k{previous}`	n/a	Reads of both succeed; writes use `k{current}`
Read of row at `k{very-old}` (older than AWSPREVIOUS)	has `a1`	does not have `k{very-old}`	Yes, if SM still has the version	One-off SDK call; cached for the pod’s lifetime; row then lazily migrated to `k{current}` (Sub-decision 3)
Read of row at `k{deleted}` (SM version no longer exists)	has `a1`	does not have `k{deleted}`	No (`ResourceNotFoundException`)	Throw `RetiredSecretVersion`; alarm; runbook
Read of row at `a99` (unknown algorithm tag)	no `a99`	n/a	n/a	Throw `UnknownAlgorithmVersion`; means stale deploy reading newer-deploy rows

Sub-decision 3 — Rotation and migration

Rotation procedure (Phase 5b onward, with tenant rows in flight)

Operator decides to rotate (triggers below).
Generate the new key material and write it to the SM secret via two AWS CLI calls (neither update-secret nor put-secret-value can generate a random value on its own; generation is the separate get-random-password subcommand):
```
NEW=$(aws secretsmanager get-random-password \
  --password-length 64 --exclude-characters '"@/\' \
  --require-each-included-type \
  --output text --query RandomPassword)
aws secretsmanager put-secret-value \
  --secret-id "{fqn}-I-EmailEncryptionKey" --secret-string "$NEW"
```
put-secret-value creates a new versionId, automatically promotes it to AWSCURRENT, and demotes the prior version to AWSPREVIOUS. (Future automation: an AWS SM Rotation Lambda performs the same two-step flow on a schedule.)
ESO refresh. Within the ClusterSecretStore’s refreshInterval (~1 minute today), ESO re-pulls both AWSCURRENT and AWSPREVIOUS. The two K8s Secrets update; the operations pod sees the new versionIds on its next HOCON refresh or restart.
Pod refresh. Either rolling-restart the operations component, or — if Phase 5b implements TokenCipher.reload() on a periodic supplier — wait for the next refresh tick. After the refresh, currentVersionId points to the new SM version; the previous version is still registered for read-dispatch; new writes encrypt as a1.k{new}:….
Migration runs automatically. Triggered on the next non-up-to-date read; see “Migration model” below. No manual job to dispatch.
Verify migration completion. SELECT COUNT(*) FROM email_configuration WHERE server_token_encrypted NOT LIKE 'a1.k${currentVersionId}:%' returns zero. (Phase 5b can expose this query as an admin endpoint, e.g., GET /admin/email-key-rotation/status.)
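Optionally retire the old SM version. aws secretsmanager update-secret-version-stage ... --remove-from-version-id <old> removes the AWSPREVIOUS label; a version with no staging label is deprecated but stays in SM history (still SDK-fetchable by versionId) until the service prunes it. Secrets Manager has no per-version delete call: delete-secret takes no --version-id and schedules the entire secret, with every version, for deletion, so it must not be used for retirement. Deprecated versions are pruned by the service in the background once the secret holds more than 100 versions. The runbook explains this; routine cadence is “leave the prior version stage-less but in history for at least one more rotation cycle”.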

The procedure works without taking writes offline — the pod’s dispatch table holds both keys throughout the migration; reads of pre- and post-migrated rows both succeed; writes commit in the new format.

Migration model — synchronous on first stale read + coroutine mop-up

Triggered by EmailConfigurationService.getUnlockedConfiguration(id):

// inside L3 service
val row = repo.findById(id)
val plaintext = tokenCipher.decrypt(row.serverTokenEncrypted)
val (_, keyVersionId) = TokenCipher.parseHeader(row.serverTokenEncrypted)
if (keyVersionId != tokenCipher.currentVersionId) {
  // Out-of-date: synchronously re-encrypt this row's token before returning to the caller.
  val newEnvelope = tokenCipher.encrypt(plaintext)
  repo.updateServerToken(id, newEnvelope)
  // Mop up the rest of the partition asynchronously.
  migrationCoroutine.launchIfNotRunning(scope = appScope) {
    repo.findAllNotAtVersion(tokenCipher.currentVersionId).forEach { staleRow ->
      val stalePlaintext = tokenCipher.decrypt(staleRow.serverTokenEncrypted)
      val freshEnvelope  = tokenCipher.encrypt(stalePlaintext)
      repo.updateServerToken(staleRow.id, freshEnvelope)
    }
  }
}
return plaintext

Properties:

First non-up-to-date read triggers the mop-up. No scheduled job needed; whoever sends an email next initiates the partition-wide migration.
Synchronous self-healing on the read path. The current caller’s row is always re-encrypted before plaintext is returned, so subsequent reads of the same row are fast.
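launchIfNotRunning guards the coroutine with a Mutex.tryLock() (or boolean flag), ensuring one in-flight mop-up per pod. Idempotent: re-entering after a crash is safe because each pass selects only rows that are not yet at currentVersionId (findAllNotAtVersion), so rows migrated by an earlier pass are not rewritten.
Multi-pod coordination via idempotency. Each pod’s coroutine selects rows whose envelope prefix is not a1.k{currentVersionId}: (the same predicate as the verification query); concurrent pods walking overlapping rows converge, because a row that two pods re-encrypt at the same time still ends up under currentVersionId, and rows migrated earlier are excluded from the next selection.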
Coroutine scope. Uses the pod-lifetime appScope, not the request scope — the mop-up outlives the request that triggered it but dies cleanly on pod shutdown. The next non-up-to-date read on a restarted pod re-triggers.
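The two properties that make the mop-up safe (one pass in flight per pod, and a second pass being a no-op) can be sketched as below. This is TypeScript with promises standing in for Kotlin coroutines; `Row`, `mopUp`, and `SingleFlight` are hypothetical names, and the rows live in memory rather than in `email_configuration`.

```typescript
interface Row {
  id: number;
  envelope: string;
}

// Same predicate as the verification query: is the row already at a1.k{current}?
function isAtVersion(envelope: string, currentVersionId: string): boolean {
  return envelope.startsWith(`a1.k${currentVersionId}:`);
}

// Re-encrypt every stale row; returns how many rows were rewritten.
function mopUp(rows: Row[], currentVersionId: string,
               reencrypt: (oldEnvelope: string) => string): number {
  let migrated = 0;
  for (const row of rows) {
    if (isAtVersion(row.envelope, currentVersionId)) continue; // already migrated
    row.envelope = reencrypt(row.envelope);
    migrated += 1;
  }
  return migrated;
}

// Boolean-flag guard: at most one task in flight per instance (per pod).
class SingleFlight {
  private running = false;

  // Returns true if this call started the task, false if one is already running.
  launchIfNotRunning(task: () => Promise<void>): boolean {
    if (this.running) return false;
    this.running = true;
    const release = (): void => { this.running = false; };
    // Deferred start, so a synchronous throw inside the task also releases the flag.
    // Failures are swallowed here; a real implementation would log them.
    Promise.resolve().then(task).then(release, release);
    return true;
  }
}
```

Running `mopUp` twice over the same rows rewrites nothing on the second pass, which is the idempotency that lets a restarted pod simply re-trigger the migration.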

Cache-miss path (rows older than AWSPREVIOUS)

If a read encounters k{very-old-uuid} that is neither AWSCURRENT nor AWSPREVIOUS:

TokenCipher.decrypt() looks up materials[very-old-uuid] → miss.
Falls through to fetchFromSdk(very-old-uuid). The operations component performs sts:AssumeRole into EmailEncryptionKeyFallbackRole (per DQ-R1-020), which holds secretsmanager:GetSecretValue on the encryption-key SM ARN (no versionId restriction). The pod’s IRSA-bound pod role does not carry the SM permission directly — the assume-role hop is mandatory.
If SM has the version: material is fetched, HKDF-derived, cached for the pod’s lifetime. The row is then migrated to currentVersionId per the lazy migration path above.
If SM has no such version (its staging labels were removed and Secrets Manager has since pruned it): throw RetiredSecretVersion. The operations component emits an alarm; the row is unrecoverable from the application’s perspective. Runbook covers manual remediation: (a) check whether the material is still recoverable (Secrets Manager offers no restore for a pruned individual version; restore-secret only cancels a pending deletion of the whole secret), (b) clear the row from out-of-band records, or (c) treat the row as compromised and re-provision the tenant.

The SDK fallback is rare by construction: routine rotation always demotes prior to AWSPREVIOUS, which the pod has continuously mounted. Two consecutive rotations within the same un-migrated window are the only natural failure mode for the cache. Even then, the operator can preemptively label additional past versions with custom stages and mount them in Helm; the SDK fallback is the safety net, not the primary path.
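The cache-then-fallback lookup described in this sub-section can be sketched as follows. This is TypeScript under assumptions: `MaterialRegistry` and `VersionFetcher` are hypothetical, and the fetcher is modelled as a synchronous callback for brevity, standing in for the STS-chained `GetSecretValue` call.

```typescript
class RetiredSecretVersion extends Error {
  constructor(versionId: string) {
    super(`secret version ${versionId} is no longer available in Secrets Manager`);
    this.name = "RetiredSecretVersion";
  }
}

// Hypothetical stand-in for the SDK call: returns the material for a versionId,
// or undefined when Secrets Manager no longer has that version.
type VersionFetcher = (versionId: string) => string | undefined;

class MaterialRegistry {
  private readonly materials: Map<string, string>;
  private readonly fetchVersion: VersionFetcher;

  constructor(mounted: Map<string, string>, fetchVersion: VersionFetcher) {
    this.materials = new Map(mounted); // AWSCURRENT + AWSPREVIOUS from the ESO mounts
    this.fetchVersion = fetchVersion;
  }

  materialFor(versionId: string): string {
    const known = this.materials.get(versionId);
    if (known !== undefined) return known; // normal path: no SDK call
    const fetched = this.fetchVersion(versionId); // cache-miss path
    if (fetched === undefined) throw new RetiredSecretVersion(versionId);
    this.materials.set(versionId, fetched); // cached for the registry's lifetime
    return fetched;
  }
}
```

A second read of the same old versionId is served from the map, so each pod pays for the fallback at most once per version.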

Rotation triggers

Trigger	Action
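Suspected key compromise	Immediate rotation. Migration pass proceeds normally. Secrets Manager cannot delete a single version on demand (`delete-secret` removes the whole secret and must not be used here), so once migration completes the operator should remove every staging label from the compromised version; that marks it deprecated, although it remains fetchable by versionId until the service prunes it.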
Insider with HOCON-config access departs	Routine rotation on off-boarding if the departing party had access to `extras.email.encryptionKeys` on a partition’s pod.
Compliance / hygiene cadence	No fixed schedule in this design. Compliance reviews (SOC 2 etc.) may later require a cadence (90-day, annual, etc.); the procedure applies regardless of trigger.
Algorithm migration (`a{N}` bump)	Same procedure; the code release introducing `a{N+1}` lands first, then a rotation creates a new SM versionId, then the migration pass re-encrypts existing rows under `a{N+1}.k{new}`.

Automated rotation: enabled by design

The dual-mount design is AWS Rotation Lambda-compatible. AWS SM Rotation Lambdas implement the standard four-step contract (createSecret, setSecret, testSecret, finishSecret); for an HKDF input, the createSecret step calls GetRandomPassword and PutSecretValue (with VersionStages: ["AWSPENDING"]), and finishSecret calls UpdateSecretVersionStage to promote AWSPENDING → AWSCURRENT (demoting the prior to AWSPREVIOUS). A Rotation Lambda that:

Generates a new random password and calls PutSecretValue to stage it as AWSPENDING, then UpdateSecretVersionStage to promote it to AWSCURRENT.
Optionally triggers an admin endpoint on the operations component to seed the lazy-migration coroutine (avoids waiting for the first tenant read).
Polls the completion query.
Optionally cleans up older stages once migration finishes.

…can be added at any time without changing the application code. Phase 4 ships only the initial SM secret; the Rotation Lambda is a future deliverable.
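The four-step contract can be outlined as a dispatcher. This is a TypeScript sketch, not a deployable Lambda: `SecretsFacade` is a hypothetical, synchronous stand-in for the Secrets Manager calls named above, and treating setSecret and testSecret as no-ops is an assumption based on the secret being a bare HKDF input with no external system to update.

```typescript
type RotationStep = "createSecret" | "setSecret" | "testSecret" | "finishSecret";

interface RotationEvent {
  Step: RotationStep;
  SecretId: string;
  ClientRequestToken: string; // becomes the versionId of the new version
}

// Hypothetical facade over the SM operations the text names; not a real SDK type.
interface SecretsFacade {
  getRandomPassword(length: number): string;
  putPendingValue(secretId: string, versionId: string, value: string): void; // stage AWSPENDING
  currentVersionId(secretId: string): string;
  moveCurrentStage(secretId: string, toVersionId: string, fromVersionId: string): void;
}

function handleRotationStep(event: RotationEvent, sm: SecretsFacade): void {
  switch (event.Step) {
    case "createSecret":
      // A production handler would first check whether AWSPENDING already exists.
      sm.putPendingValue(event.SecretId, event.ClientRequestToken, sm.getRandomPassword(64));
      break;
    case "setSecret":
    case "testSecret":
      break; // assumed no-ops: nothing outside SM holds this value
    case "finishSecret":
      // Promote AWSPENDING to AWSCURRENT; SM demotes the prior version to AWSPREVIOUS.
      sm.moveCurrentStage(event.SecretId, event.ClientRequestToken,
        sm.currentVersionId(event.SecretId));
      break;
  }
}
```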

Phase 4 era (pre-Phase-5b, zero tenant rows)

If rotation is required between Phase 4’s deploy and Phase 5b’s first tenant onboarding (unlikely but possible — e.g., suspected compromise of the initial material):

Operator runs the same get-random-password → put-secret-value pair against the SM secret. AWS rotates versionIds and stages.
No application is consuming the secret yet, so no further coordination is needed. Phase 5b’s eventual deploy reads whichever is AWSCURRENT (the rotated version) and AWSPREVIOUS (the original) and proceeds normally.

Operational details

ESO refresh latency. Driven by the ClusterSecretStore’s refreshInterval (~1 minute today; confirm at Phase 5b time).
Pod refresh strategy. Phase 5b’s TokenCipher may load the key map at startup (requires pod restart on rotation) or via a refreshable supplier (no restart needed; ESO refresh + periodic TokenCipher.reload() covers it). Decide at Phase 5b implementation time; this design is unchanged either way.
Multi-pod coordination during rolling restart. Hot-swap dispatch makes mixed-key pods operationally fine: every pod holds both AWSCURRENT and AWSPREVIOUS keys throughout, so a read can land on any pod regardless of rotation state.
Multi-region partitions. Phase 4 partitions are single-region; cross-region replication of the SM secret is out of scope. AWS SM’s built-in cross-region replication can be enabled on the resource without changing this design if needed later.
SM version retention. SM never removes versions that carry a staging label; versions with no label are deprecated and are pruned by the service once the secret holds more than 100 versions (versions created in the last 24 hours are exempt). At one rotation per quarter, we hit the limit in 25 years; no practical concern. If a future compliance cadence pushes rotation frequency to daily, the limit becomes relevant (~3 months): a version can be pruned while rows still reference it, so the runbook should confirm migration completion well inside that window.
AAD support, when a2+ introduces it. A future envelope can include AAD bound to the row’s email_configuration_id to make ciphertext non-portable across rows.
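The retention arithmetic in the "SM version retention" item above is a single division, shown here as a small TypeScript helper so other cadences can be checked quickly. The function name is hypothetical; the 100-version figure is the one quoted above.

```typescript
// Version count at which old versions stop being retained, as quoted above.
const VERSION_LIMIT = 100;

// Years of rotations that fit under the limit at a given cadence.
function yearsUntilVersionLimit(rotationsPerYear: number): number {
  if (rotationsPerYear <= 0) throw new RangeError("rotationsPerYear must be positive");
  return VERSION_LIMIT / rotationsPerYear;
}

console.log(yearsUntilVersionLimit(4));        // quarterly rotation: prints 25
console.log(yearsUntilVersionLimit(365) * 12); // daily rotation, expressed in months
```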

Implications

Phase 4 deliverables (this design’s outputs)

Per-partition aws_secretsmanager.Secret (4 partitions: prod, demo, dev, stage; kyle suspended per DQ-R1-021) in partition-email stacks. Name: {fqn}-I-EmailEncryptionKey. RemovalPolicy.RETAIN. Single resource per partition; versioning delegated to AWS Secrets Manager.
EmailEncryptionKeyFallbackRole per partition (per DQ-R1-020) — fresh purpose-specific IAM role with secretsmanager:GetSecretValue (no versionId restriction) on the encryption-key SM secret’s ARN. Trust policy = account principal + ArnLike on {fqn}-* so the partition pod role can sts:AssumeRole it (mirrors ImageUploadPreSigningRole pattern). The pod role itself is not extended with the SM permission; the SDK cache-miss fallback depends on the assume-role hop.
CFN exports consumed by the operations Helm chart (non-CDK consumer → -API- marker per cdk-infrastructure.md § Export naming; mirrors image-storage.ts):
- ${publishingPrefix}-API-EmailEncryptionKeyArn — SM secret ARN; consumed by both ESO ExternalSecret mounts (AWSCURRENT + AWSPREVIOUS reference the same ARN with different versionStage selectors).
- ${publishingPrefix}-API-EmailEncryptionKeyFallbackRoleArn — fallback role ARN; consumed by the operations component’s STS-AssumeRole call at the cache-miss path. The SM resource name ({fqn}-I-EmailEncryptionKey) keeps its -I- marker because that marker indicates intra-partition AWS scope per the partitionSecrets.cfn.yaml resource-naming convention — distinct from the CFN export-name marker.
cross-cutting-design.md Secret-handling table row for EmailEncryptionKey updated: two-axis envelope, two-ExternalSecret consumption (AWSCURRENT + AWSPREVIOUS) referencing the -API- SM-secret-ARN export, lazy + coroutine migration, STS-chained EmailEncryptionKeyFallbackRole (also -API- exported) for older versions. (The existing -I- on the SM resource name is correct; the earlier draft of this design that proposed -API- for the resource name is retracted. The -API- marker now applies to the CFN export names, not the SM resource name.)
DQ-R1-019 decision-log entry referencing this design.
Operator runbook entry (location TBD; under current-system/oam/postmark-service/ or a new current-system/runtime/email-encryption-key.md): rotation procedure, migration verification query, retirement procedure, SDK-fallback alarm playbook.

Phase 5b consumers (informed by this design)

The Helm chart in the operations component declares two ExternalSecret resources (one each for AWSCURRENT and AWSPREVIOUS) referencing the SM secret by ARN. Both ExternalSecret definitions read their target ARN from the CFN export ${publishingPrefix}-API-EmailEncryptionKeyArn (Phase 4 deliverable), surfaced into Helm values by the deployment pipeline (likely amm.sh calling aws cloudformation list-exports and writing the value into a values overlay). Each ExternalSecret projects into a corresponding Kubernetes Secret carrying two fields: material (the 64-character generated value) and versionId (the SM UUID).
HOCON path extras.email.encryptionKeys is a list of {versionId, material} objects; extras.email.currentVersionId names the AWSCURRENT versionId; extras.email.secretArn names the SM secret ARN (from the -API- export) for the SDK-fallback path. The list is filtered at HOCON-load time to drop empty/missing entries (handles AWSPREVIOUS-absent state).
extras.email.fallbackRoleArn names the EmailEncryptionKeyFallbackRole ARN (from ${publishingPrefix}-API-EmailEncryptionKeyFallbackRoleArn); the operations component’s SecretsManagerClient is wired with an STS-AssumeRole credential provider targeting this role for the cache-miss SDK path.
EmailConfigurationService (L3) reads these at startup; constructs TokenCipher with the EnvelopeAlgorithmRegistry, the per-versionId materials map, and a SecretsManagerClient for the cache-miss SDK fallback.
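The load-time filtering and the per-versionId materials map described above could be sketched roughly as follows. This is a minimal illustration only: the KeyEntry type and the buildMaterialsMap function are hypothetical names, and the concrete shape of extras.email.encryptionKeys remains a Phase 5b deliverable.

```kotlin
// Hypothetical shape of one entry in extras.email.encryptionKeys.
// Both fields are nullable because the AWSPREVIOUS mount may project nothing.
data class KeyEntry(val versionId: String?, val material: String?)

// Drops entries whose versionId or material is missing or empty
// (for example the AWSPREVIOUS slot before the first rotation)
// and indexes the remaining entries by versionId.
fun buildMaterialsMap(entries: List<KeyEntry>): Map<String, String> =
    entries.mapNotNull { entry ->
        val id = entry.versionId?.trim().orEmpty()
        val material = entry.material.orEmpty()
        if (id.isEmpty() || material.isEmpty()) null else id to material
    }.toMap()

fun main() {
    val entries = listOf(
        KeyEntry("version-current", "current-material"),
        KeyEntry("", ""), // AWSPREVIOUS absent: nothing was projected
    )
    println(buildMaterialsMap(entries).keys)
}
```

Filtering before the map is built means TokenCipher never sees a half-populated entry; whether a missing AWSCURRENT entry should abort startup is left open here.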
TokenCipher’s API:
- encrypt(plaintext: ByteArray): String — returns a1.k{currentVersionId}:<base64...>.
- decrypt(envelope: String): ByteArray — parses prefix, dispatches on a{N}, fetches/derives key for k{...} (cache-then-SDK-fallback), decrypts; throws on tag failure / unknown algorithm version / retired secret version.
- currentVersionId(): String — for migration tooling.
Per-tenant token plaintext lives only on the call stack of getUnlockedConfiguration(), never persisted or logged.
The migration coroutine, guarded per pod by Mutex.tryLock(), drives the background mop-up that complements lazy migration on read.
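To make the TokenCipher API listed above more concrete, here is a minimal sketch of envelope encoding and dispatch. Several details are assumptions rather than part of this design: HKDF is instantiated with SHA-256, the salt and info label are placeholders, the a1 payload is laid out as a 12-byte nonce followed by ciphertext and tag, and the class name TokenCipherSketch is hypothetical. The algorithm registry is reduced to a single a1 check and the SDK cache-miss fallback is omitted.

```kotlin
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.Mac
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

// HKDF-SHA256 (RFC 5869) limited to one 32-byte output block.
fun hkdfSha256(ikm: ByteArray, salt: ByteArray, info: ByteArray): ByteArray {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(salt, "HmacSHA256"))
    val prk = mac.doFinal(ikm)                 // extract step
    mac.init(SecretKeySpec(prk, "HmacSHA256"))
    mac.update(info)
    mac.update(0x01.toByte())
    return mac.doFinal()                       // expand step, T(1)
}

class TokenCipherSketch(
    private val materials: Map<String, ByteArray>, // versionId -> secret material
    private val currentId: String,
) {
    private val random = SecureRandom()

    private fun keyFor(versionId: String): SecretKeySpec {
        // A real implementation would try the SDK fallback before failing.
        val material = materials[versionId]
            ?: throw IllegalStateException("unknown or retired secret version: $versionId")
        return SecretKeySpec(hkdfSha256(material, ByteArray(32), INFO), "AES")
    }

    fun encrypt(plaintext: ByteArray): String {
        val nonce = ByteArray(12).also { random.nextBytes(it) }
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.ENCRYPT_MODE, keyFor(currentId), GCMParameterSpec(128, nonce))
        val payload = nonce + cipher.doFinal(plaintext) // nonce || ciphertext || tag
        return "a1.k$currentId:" + Base64.getEncoder().encodeToString(payload)
    }

    fun decrypt(envelope: String): ByteArray {
        val match = ENVELOPE.matchEntire(envelope)
            ?: throw IllegalArgumentException("malformed envelope")
        val (algorithm, versionId, encoded) = match.destructured
        if (algorithm != "1") throw IllegalArgumentException("unknown algorithm version a$algorithm")
        val payload = Base64.getDecoder().decode(encoded)
        require(payload.size >= 12 + 16) { "payload too short" }
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.DECRYPT_MODE, keyFor(versionId), GCMParameterSpec(128, payload, 0, 12))
        // doFinal throws AEADBadTagException when the GCM tag does not verify.
        return cipher.doFinal(payload, 12, payload.size - 12)
    }

    fun currentVersionId(): String = currentId

    private companion object {
        val ENVELOPE = Regex("""a(\d+)\.k([^:]+):([A-Za-z0-9+/=]+)""")
        val INFO = "email-server-token/a1".toByteArray() // placeholder info label
    }
}

fun main() {
    val cipher = TokenCipherSketch(
        mapOf("v-new" to "new-material".toByteArray(), "v-old" to "old-material".toByteArray()),
        "v-new",
    )
    val envelope = cipher.encrypt("postmark-token".toByteArray())
    println(envelope.substringBefore(':'))            // a1.kv-new
    println(cipher.decrypt(envelope).decodeToString()) // postmark-token
}
```

Because the versionId travels in the envelope prefix, a row written under AWSPREVIOUS decrypts with the same instance as long as its material is still in the map; the migration tooling can compare that prefix against currentVersionId() to find stale rows.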

Phase 5a open question (consumes-or-not for crypto primitives)

The EnvelopeAlgorithm family (HKDF wrapper, AES-256-GCM wrapper, envelope codec) could live in:

common-module as a generic crypto-primitives helper consumed by operations (Phase 5a deliverable). Pro: any future component needing the same envelope format reuses the helper. Con: locks the envelope format into a library that’s harder to evolve.
operations as Email-module-local code. Pro: implementation co-evolves with the use case; no premature abstraction. Con: a future second consumer would duplicate.

To be decided at Phase 5a planning time. The two-axis envelope design above is consumable from either location; this design doesn’t constrain the choice.

Open follow-ups

Items intentionally not decided here, queued for later:

The concrete shape of the HOCON extras.email.encryptionKeys list and the Helm/ESO mapping that produces it (Phase 5b deliverable).
Whether TokenCipher loads the key map at startup (pod-restart on rotation) or via a refreshable supplier (no restart needed) — Phase 5b implementation detail.
A concrete rotation schedule (driven by compliance requirements not yet identified; the procedure is design-ready when the schedule is set).
Cross-region replication of the SM secret (driven by multi-region partition deployments not yet planned).
Automated rotation infrastructure (AWS SM Rotation Lambda) — straightforward to add once Phase 5b ships; deferred until the operational case is clear.
AAD support (a2+) binding ciphertext to the email_configuration_id row.
Application-level audit logging on each decryption call (operational hygiene; depends on the structured-logging conventions Phase 5b adopts).
Phase 5a / 5b ownership of the EnvelopeAlgorithm family of classes (see above).

References

Phase 4 goal — Open Design Questions table row 3 (DQ-R1-019) summary.
Project decision log — DQ-012, DQ-202, DQ-203, DQ-205 (load-bearing prior decisions); forthcoming DQ-R1-019 entry references this design.
Cross-cutting design — § “Authentication” and the Secret-handling table; the EmailEncryptionKey row is updated alongside this design to reflect two-axis envelope + AWSCURRENT+AWSPREVIOUS mounts + lazy migration.
CDK Infrastructure reference — -API- vs -I- naming convention (Export naming section); the convention-as-practiced retains -I- for the encryption key.
CloudFrontSigningKeyGroup — the existing repo precedent for a rotatable signing key (not directly mirrored here; the email encryption key uses SM-native versioning rather than sibling secrets, but the RemovalPolicy.RETAIN + operator-driven retirement disciplines are the same).
Existing Helm template (in the operations component repo) at src/main/helm/templates/secrets.yaml — pattern of -I- marker + AWSCURRENT version for ESO ExternalSecret resources. Phase 5b’s encryption-key Helm spec extends this pattern by adding a second ExternalSecret for AWSPREVIOUS.
