Design: Per-Partition Email Server-Token Encryption Key
The runtime data-encryption key that protects per-tenant Postmark server tokens in the database. Phase 4 creates this key per partition; Phase 5b’s EmailConfigurationService consumes it via HKDF derivation and uses it to AES-256-GCM-encrypt server tokens before INSERT and decrypt them on demand. This document closes the open sub-questions left by DQ-012, DQ-202, DQ-203, and pins the three Phase 4 sub-decisions that together resolve DQ-R1-019.
Decision Summary
| # | Sub-question | Resolution |
|---|---|---|
| 1 | How is the secret initially generated and named? | CDK aws_secretsmanager.Secret with generateSecretString({ passwordLength: 64 }), RemovalPolicy.RETAIN, named {fqn}-I-EmailEncryptionKey per partition (single SM secret name; matches the established Arda convention where -I- is the marker even for ESO-consumed secrets). No version suffix in the resource name — version tracking is handled by AWS Secrets Manager’s native versioning. |
| 2 | What identifies a particular ciphertext’s encryption material? | A two-axis envelope a{N}.k{SM-VERSION-ID}:<base64-payload>. a{N} is the algorithm version (rare; code-indexed; bumps require a release). k{SM-VERSION-ID} is the AWS Secrets Manager versionId of the secret material used at write time (frequent; runtime-indexed; bumps on every rotation). Algorithm and material lifecycles are profoundly different; coupling them in one marker would churn the code-side dispatch table on every rotation. |
| 3 | How is the secret rotated? | Rotation = aws secretsmanager update-secret against the single SM secret. AWS creates a new versionId, moves AWSCURRENT to it, demotes the prior to AWSPREVIOUS. ESO mounts both AWSCURRENT and AWSPREVIOUS into the operations component pod via two ExternalSecret resources. The Email module’s TokenCipher holds both derived keys, dispatches on the envelope’s k{versionId} prefix. Migration is lazy + coroutine mop-up (next sub-section). Rare ciphertexts older than AWSPREVIOUS trigger a direct AWS SM SDK fetch via the EmailEncryptionKeyFallbackRole that the operations pod assumes from its IRSA-bound pod role (DQ-R1-020 STS chain; cache-miss path). Fetched material is cached for the pod’s lifetime. |
Full rationale below. The Round R1-Phase4 decision-log entry for DQ-R1-019 (yet to be written) summarises this design.
Context
What’s already pinned by prior decisions
| Decision | What it locks |
|---|---|
| DQ-012 | Per-tenant Postmark server tokens are encrypted application-side with a partition-wide symmetric key before INSERT; key lives in AWS Secrets Manager; key delivered to pods via ESO as extras.email.encryptionKey (HOCON); read only by EmailConfigurationService (L3). No per-tenant SM writes. No KMS-CMK envelope. |
| DQ-202 | On-disk format: AES-256-GCM versioned envelope, base64 in email_configuration.server_token_encrypted (text column). The “versioned” qualifier carries forward to envelope-format version. |
| DQ-203 | The SM value is a 64-byte high-entropy GeneratedSecret. The application HKDF-derives the actual AES-256 key from that secret in TokenCipher’s constructor, with info = "arda.email.serverToken.v1". The v1 in the info string is the same vN discussed below. |
| DQ-205 | Persist-first lifecycle; decryption only on demand by getUnlockedConfiguration(). Plaintext lives only in the in-memory call stack during a send. |
Threat model boundary (already-resolved scope)
- In scope: defense against DB exposure with read-only privilege (backups leaving the trust boundary, SQL-injection-style read leaks, analyst sessions). A `SELECT *` on `email_configuration` yields ciphertext only.
- Out of scope: pod / process compromise (attacker has both key and DB); insider with both `extras.email.encryptionKey` HOCON access and DB write privilege; Postmark account compromise. Platform-level controls (IRSA, network segmentation, container hardening) own these.
This boundary informs every sub-decision below — in particular, why we do not invest in elaborate rotation machinery in v1: an attacker who can already read the DB cannot decrypt without also compromising the pod or the HOCON config, in which case rotation is moot until the pod-compromise vector is also closed.
Sub-questions left open for DQ-R1-019
Three: (1) name + creation of the SM secret in CDK; (2) the precise meaning of vN in the envelope; (3) rotation procedure.
Sub-decision 1 — Naming and creation
Recommendation
- Resource name: `{fqn}-I-EmailEncryptionKey` per partition. The `-I-` marker matches the convention as practiced (see “Naming convention” below). No version suffix in the resource name; version tracking is delegated to AWS Secrets Manager’s native per-secret versioning.
- CDK declaration (in `stacks/purpose/partition-email.ts`, Phase 4 deliverable):

  ```ts
  import * as cdk from "aws-cdk-lib";
  import { aws_secretsmanager } from "aws-cdk-lib";

  // Inside the partition-email stack; `this` and `fqn` come from the enclosing stack.
  new aws_secretsmanager.Secret(this, "EmailEncryptionKey", {
    secretName: `${fqn}-I-EmailEncryptionKey`,
    description:
      "Per-partition data-encryption key (HKDF input) for tenant Postmark " +
      "server tokens. Consumed by the operations component via two ESO " +
      "mounts (AWSCURRENT and AWSPREVIOUS). Rotation is `aws " +
      "secretsmanager update-secret`. DO NOT delete versions or the " +
      "secret itself without explicit operator action; any retired " +
      "version that still appears in an `email_configuration." +
      "server_token_encrypted` envelope's `k{...}` tag is unrecoverable " +
      "for that row.",
    generateSecretString: {
      passwordLength: 64,
      excludeCharacters: '"@/\\',
      excludePunctuation: false,
    },
    // cdk destroy must never delete this secret.
    removalPolicy: cdk.RemovalPolicy.RETAIN,
  });
  ```
- One secret per partition. AWS SM versions carry the per-rotation history; no sibling CDK constructs per version. Rotation = `aws secretsmanager update-secret` against this single name.
- No cross-partition sharing. No derivation from a parent secret. Each partition’s blast radius is bounded by its own SM secret.
CDK lifecycle invariants (no auto-rotation)
CDK’s generateSecretString synthesizes to CloudFormation’s GenerateSecretString property, which has “generate-if-missing” semantics: the random value is produced only when the secret resource doesn’t yet exist. Subsequent cdk deploy runs (and the amm.sh re-runs that drive them) are no-ops on this secret’s value. Specifically:
| Event | Behavior on this secret |
|---|---|
| First `cdk deploy` (secret doesn’t exist) | CFN calls `secretsmanager:GetRandomPassword` and creates the secret |
| Subsequent `cdk deploy` (CDK config unchanged) | No-op; secret value unchanged |
| `amm.sh` re-run for a deployed partition | No-op on this secret |
| Operator rotation (`aws secretsmanager get-random-password` → `put-secret-value`) | Rotation per Sub-decision 3; works, and CDK does not fight it (CDK doesn’t track the secret’s value) |
| Next `cdk deploy` after manual rotation | Still no-op |
| Resource `Name` change | Resource replacement (delete + create) → new value generated. Avoided: stack name and resource name are immutable per the project’s CFN-name discipline. |
| Any field of `generateSecretString` changes (e.g., `passwordLength: 64 → 65`) | CFN may regenerate the value, silently rotating it without going through the Sub-decision 3 procedure. Do not modify these fields post-launch. |
The generateSecretString configuration must be treated as immutable once the secret has been deployed to any partition. If a future algorithm change requires different key derivation parameters, bump a{N} (Sub-decision 2) instead — the algorithm registry carries the HKDF parameters, and a routine update-secret rotation produces the new material under the existing generation config.
The construct’s CDK ID at the call site ("EmailEncryptionKey" in the code block above) is similarly immutable: changing it would alter the CFN logical ID and force a resource replacement, regenerating the value and orphaning every prior k{...} envelope. The construct’s Name property ({fqn}-I-EmailEncryptionKey) is immutable for the same reason.
This invariant is load-bearing for the rotation model in Sub-decision 3: that model presumes the only way the SM secret’s value changes is via operator-driven update-secret. Accidental CDK-driven regeneration would create a new versionId without going through the dual-ESO-mount migration path, breaking every prior a1.k{...} envelope in flight.
Naming convention
The Arda repo uses `-I-` as the marker on every existing partition-scoped AWS resource name, including secrets consumed by ESO (which is technically a non-CDK consumer). Examples from `operations/src/main/helm/templates/secrets.yaml`:
- `{infra}-{partition}-I-ArdaApiKey`
- `{infra}-{partition}-I-DocumintApiKey`
- `{infra}-{partition}-I-GhcrPullSecret`
- `{infra}-{partition}-I-PrimaryDb` (RDS-managed)
- `{infra}-{partition}-I-EmailPostmarkAccountToken` (Phase 4 sibling secret declared in `cross-cutting-design.md`)
So the -I- / -API- distinction documented in cdk-infrastructure.md § Export naming is interpreted in practice as: -I- for intra-partition / intra-account resources; -API- for cross-account or cross-partition exports. The encryption key is intra-partition (one partition’s operations pod reads one partition’s SM secret via its own IRSA role; no cross-account flow), so -I- is correct. The earlier draft of this design that proposed -API- is retracted.
Why RemovalPolicy.RETAIN
Losing the SM secret resource (not just a single version) loses every encrypted tenant token in the partition (irreversible). cdk destroy must not delete this. Operator-initiated deletion is a deliberate step through the AWS console / CLI, never CDK-triggered. Deletion of individual SM versions follows the same discipline; see Sub-decision 3’s retirement procedure.
What the secret value is
64 printable characters of high-entropy material (64 bytes once UTF-8 encoded, since `generateSecretString` produces a character string). This is not the AES-256 key directly; it is the HKDF input. The application derives a 32-byte AES-256 key from it via HKDF-SHA256 with info = "arda.email.serverToken.a{N}" at first use (DQ-203, with the version qualifier updated to track the algorithm axis a{N} introduced in Sub-decision 2). Choosing 64 bytes (rather than 32) leaves headroom for future algorithm changes that need additional derived keys (e.g., an HMAC key for envelope integrity) without changing the SM secret shape.
Sub-decision 2 — Two-axis envelope (a{N}.k{SM-VERSION-ID})
Recommendation
The envelope carries two version markers rather than one:

```
a{N}.k{SM-VERSION-ID}:<base64(NONCE || CIPHERTEXT || TAG)>
```
- `a{N}`: algorithm version. A short monotonic counter (`a1`, `a2`, …). Bumps rarely: only when the algorithm / KDF / nonce size / envelope layout / HKDF `info` string changes. Each bump requires a code release. Indexed in code.
- `k{SM-VERSION-ID}`: secret material version. The AWS Secrets Manager `versionId` (a 36-char UUID like `EXAMPLE1-90ab-cdef-fedc-ba0987654321`) of the SM version that held `AWSCURRENT` at write time. Bumps on every rotation. Indexed at runtime, via SM.
The two axes are decoupled because their lifecycles are profoundly different:
| Axis | Cadence | Triggers | Indexed where | Entries retired when |
|---|---|---|---|---|
| `a{N}` | Years apart | Algorithm/KDF/envelope change (a code release) | `EnvelopeAlgorithmRegistry` (compile-time dispatch table in operations) | Never; entries stay in code forever so historical envelopes remain decodable |
| `k{...}` | Rotation cadence (compliance / on-compromise / scheduled) | `aws secretsmanager update-secret` | `SecretMaterialRegistry` (runtime map populated from ESO; cache-miss falls back to AWS SM SDK) | After the migration pass drains every row tagged with that versionId AND the operator explicitly retires the SM version |
Coupling them under a single vN (the previous draft of this design) would churn the code-side algorithm registry on every routine rotation. The two-axis split keeps the rare/code-bound axis stable while the frequent/material axis tracks SM-native versioning.
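To make the two-axis header concrete, the following is a minimal TypeScript sketch of splitting an envelope into its algorithm tag, key version, and payload. The regular expression, the `EnvelopeHeader` type, and the error message are illustrative assumptions, not part of the design; Phase 5b's Kotlin `parseHeader` may differ.

```typescript
// Illustrative only: names and the exact grammar are assumptions of this sketch.
interface EnvelopeHeader {
  algoTag: string;      // e.g. "a1"
  keyVersionId: string; // the SM versionId recorded at write time
  payload: string;      // base64(NONCE || CIPHERTEXT || TAG)
}

// a{N} "." "k" {versionId} ":" {base64 payload}
const ENVELOPE_RE = /^(a[1-9][0-9]*)\.k([^:]+):([A-Za-z0-9+/]+={0,2})$/;

function parseHeader(envelope: string): EnvelopeHeader {
  const m = ENVELOPE_RE.exec(envelope);
  if (m === null) {
    throw new Error("malformed envelope header");
  }
  return { algoTag: m[1], keyVersionId: m[2], payload: m[3] };
}
```

Because the two markers are parsed independently, a lookup keyed on `algoTag` never needs to change when only `keyVersionId` changes.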
Concrete envelope format (a1)
```
a1.k<uuid>:<base64(NONCE || CIPHERTEXT || TAG)>
```
Where:
- `<uuid>`: the AWS Secrets Manager `versionId` of the secret version used at write time.
- `NONCE`: 12 random bytes per write (AES-GCM standard; uniform-random 12-byte nonces are safe within ~2^32 writes per derived key, comfortable margin).
- `CIPHERTEXT`: AES-256-GCM(key, NONCE, plaintext, aad=∅); no associated data in `a1`. A future `a2` could bind AAD to the row’s `email_configuration_id`.
- `TAG`: 16 bytes, the AES-GCM authentication tag.
Key derivation for a1: key = HKDF-SHA256(material = SM-version-bytes, info = "arda.email.serverToken.a1", length = 32). The HKDF info string carries the algorithm axis (a1) so the derived AES key changes if the algorithm version bumps, even with constant material — a defense against weak-by-construction algorithm migrations.
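The `a1` layout and key derivation can be sketched end to end with Node's built-in `crypto` module. This is a hedged illustration rather than the Phase 5b implementation: the function names are hypothetical, and the empty HKDF salt is an assumption (the design pins only the `info` string and the output length).

```typescript
import { createCipheriv, createDecipheriv, hkdfSync, randomBytes } from "node:crypto";

const INFO_A1 = "arda.email.serverToken.a1";

// HKDF-SHA256 over the SM version bytes. The empty salt is an assumption of this sketch.
function deriveKeyA1(material: Buffer): Buffer {
  return Buffer.from(hkdfSync("sha256", material, Buffer.alloc(0), INFO_A1, 32));
}

// Produces a1.k<versionId>:<base64(NONCE || CIPHERTEXT || TAG)> with no AAD.
function encryptA1(key: Buffer, versionId: string, plaintext: Buffer): string {
  const nonce = randomBytes(12); // fresh random nonce per write
  const cipher = createCipheriv("aes-256-gcm", key, nonce);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16 bytes by default
  const payload = Buffer.concat([nonce, ciphertext, tag]).toString("base64");
  return `a1.k${versionId}:${payload}`;
}

// Splits the payload back into NONCE / CIPHERTEXT / TAG and verifies the tag.
function decryptA1(key: Buffer, envelope: string): Buffer {
  const payload = Buffer.from(envelope.slice(envelope.indexOf(":") + 1), "base64");
  const nonce = payload.subarray(0, 12);
  const ciphertext = payload.subarray(12, payload.length - 16);
  const tag = payload.subarray(payload.length - 16);
  const decipher = createDecipheriv("aes-256-gcm", key, nonce);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}
```

A round trip would derive the key once from the 64-byte material, call `encryptA1` with the current `versionId`, and pass the resulting string to `decryptA1`; any change to the stored payload makes `decipher.final()` throw.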
When each axis bumps (worked examples)
| Event | a{N} action | k{...} action | Re-encryption pass needed? |
|---|---|---|---|
| Phase 4 initial deploy | Establish a1 | Establish k{uuid-1} (first AWSCURRENT) | No (no existing rows) |
| Routine secret rotation | unchanged (a1) | k{uuid-1} → k{uuid-2} via update-secret | Yes |
| Compliance-driven secret rotation | unchanged | k{uuid-N} → k{uuid-N+1} | Yes |
| Algorithm migration (e.g., to ChaCha20-Poly1305) | a1 → a2 | new SM versionId at the same time | Yes |
| Adding AAD to the envelope | a1 → a2 | new SM versionId | Yes |
| Pod config refresh, same SM secret | unchanged | unchanged | No |
| ESO refresh interval change | unchanged | unchanged | No |
The two-axis design means a hypothetical “algorithm change without material rotation” (a1.k{X} → a2.k{X}) is structurally supported but operationally unusual — algorithm migrations almost always pair with a fresh rotation for hygiene. The framework supports either pattern with no code change.
Application-side dispatch (the two registries)
The envelope’s a{N}.k{...} is the join key into two parallel registries inside TokenCipher. Both must contain the corresponding entry for a ciphertext to be readable.
Index 1 — Algorithm registry (code-side, compile-time)
A sealed interface `EnvelopeAlgorithm` with one implementation per `a{N}`. Each implementation binds:
- Envelope parser: how to decode the `a{N}.k{...}:<payload>` bytes (split NONCE / CIPHERTEXT / TAG; or any future layout).
- KDF + info string: how to derive the symmetric key from the SM-version material (HKDF-SHA256 with `info = "arda.email.serverToken.a{N}"` for `a1`; may differ for `a2+`).
- Cipher operations: encrypt/decrypt pair (AES-256-GCM for `a1`; potentially ChaCha20-Poly1305 etc. for `a2+`).
- Nonce policy: 12 random bytes per write; regenerate per encryption.
- AAD policy: empty for `a1`; could bind to `email_configuration_id` for `a2+`.
Conceptual shape (Phase 5b will implement):
```kotlin
sealed interface EnvelopeAlgorithm {
    val tag: String // "a1"
    fun deriveKey(material: ByteArray): ByteArray // HKDF-SHA256 with the algorithm's info string
    fun encrypt(key: ByteArray, plaintext: ByteArray, keyVersionId: String): String
    fun decrypt(key: ByteArray, envelope: String): ByteArray
}

object EnvelopeAlgorithmRegistry {
    private val byTag = mapOf(
        "a1" to AesGcmHkdfSha256A1,
        // "a2" to AesGcmHkdfSha256A2_WithAad, // hypothetical: same cipher + AAD
        // "a3" to ChaCha20Poly1305A3,         // hypothetical: cipher change
    )

    fun resolve(tag: String): EnvelopeAlgorithm =
        byTag[tag] ?: throw UnknownAlgorithmVersion(tag)
}
```
Lifecycle: entries are never removed from the code. Once `a1` is supported, the implementation stays in the binary forever; otherwise an archived backup or a forgotten row at `a1` would become un-decryptable.
Index 2 — Secret material registry (runtime, mutable)
A startup-populated `ConcurrentHashMap<String /* SM versionId */, ByteArray /* HKDF material */>`. Two normal entry sources:
- The AWSCURRENT ESO mount: K8s Secret `email-encryption-key-current` carries the material plus the SM versionId as separate fields. The application registers an entry keyed by the versionId.
- The AWSPREVIOUS ESO mount: K8s Secret `email-encryption-key-previous` carries material + versionId. If no AWSPREVIOUS exists (initial deploy, before any rotation), the K8s Secret resolves empty and the application skips the registration.
Each ESO ExternalSecret is configured to project both the SecretString and the versionId metadata into its K8s Secret. The Helm chart’s HOCON template aggregates these into the application-visible list.
HOCON shape (concrete proposal for Phase 5b):
```hocon
extras.email.encryptionKeys = [
  { versionId = ${?ARDA_EMAIL_KEY_CURRENT_VERSION_ID}, material = ${?ARDA_EMAIL_KEY_CURRENT_MATERIAL} }
  { versionId = ${?ARDA_EMAIL_KEY_PREVIOUS_VERSION_ID}, material = ${?ARDA_EMAIL_KEY_PREVIOUS_MATERIAL} }
]
extras.email.currentVersionId = ${?ARDA_EMAIL_KEY_CURRENT_VERSION_ID}
extras.email.secretArn = ${?ARDA_EMAIL_KEY_SECRET_ARN}
```
The list is filtered at HOCON-load time to drop entries with empty/missing versionIds (handles AWSPREVIOUS-absent state). `secretArn` is needed for the SDK-fallback path.
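The load-time filtering of the key list can be illustrated with a small TypeScript sketch. The `KeyEntry` shape mirrors the two HOCON fields, while the function name and the choice to also drop entries with empty material are assumptions of this example.

```typescript
// Mirrors one element of extras.email.encryptionKeys; both fields may be absent
// when the optional ${?...} substitution resolves to nothing.
interface KeyEntry {
  versionId?: string;
  material?: string;
}

// Builds the versionId -> material registry, skipping incomplete entries
// (for example the AWSPREVIOUS slot before the first rotation).
function usableKeyEntries(entries: KeyEntry[]): Map<string, string> {
  const registry = new Map<string, string>();
  for (const entry of entries) {
    if (entry.versionId && entry.material) {
      registry.set(entry.versionId, entry.material);
    }
  }
  return registry;
}
```

On an initial deploy this yields a one-entry registry, so read dispatch simply has no `AWSPREVIOUS` key to consult.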
How TokenCipher combines them
```kotlin
import java.util.concurrent.ConcurrentHashMap
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient
import software.amazon.awssdk.services.secretsmanager.model.ResourceNotFoundException

class TokenCipher(
    private val algorithms: EnvelopeAlgorithmRegistry,
    private val materials: ConcurrentHashMap<String, ByteArray>, // SM versionId → 64-byte HKDF material
    private val derivedKeys: ConcurrentHashMap<Pair<String, String>, ByteArray> = ConcurrentHashMap(), // (algoTag, versionId) → derived AES key
    val currentVersionId: String, // public: the L3 service compares row versions against it
    private val currentAlgorithmTag: String, // compile-time constant for the writer algorithm
    private val secretArn: String,
    private val sdkClient: SecretsManagerClient,
) {
    fun encrypt(plaintext: ByteArray): String {
        val algo = algorithms.resolve(currentAlgorithmTag)
        val key = keyFor(algo, currentVersionId)
        return algo.encrypt(key, plaintext, currentVersionId)
    }

    fun decrypt(envelope: String): ByteArray {
        val (algoTag, keyVersionId) = parseHeader(envelope) // splits "a1.k{...}:..."
        val algo = algorithms.resolve(algoTag)              // → UnknownAlgorithmVersion if missing
        val key = keyFor(algo, keyVersionId)                // → RetiredSecretVersion if SM also unknown
        return algo.decrypt(key, envelope)
    }

    // Derives each (algorithm, version) key at most once; material comes from ESO or the SDK fallback.
    private fun keyFor(algo: EnvelopeAlgorithm, versionId: String): ByteArray =
        derivedKeys.computeIfAbsent(algo.tag to versionId) {
            val material = materials[versionId] ?: fetchFromSdk(versionId)
            algo.deriveKey(material)
        }

    private fun fetchFromSdk(versionId: String): ByteArray {
        val resp = try {
            sdkClient.getSecretValue { it.secretId(secretArn).versionId(versionId) }
        } catch (e: ResourceNotFoundException) {
            throw RetiredSecretVersion(versionId) // SM no longer holds this version
        }
        val material = resp.secretString().toByteArray(Charsets.UTF_8)
        materials[versionId] = material // cache for the pod's lifetime
        return material
    }
}
```
Both indices contribute their checks:
- Algorithm registry miss → `UnknownAlgorithmVersion`. Means an old deploy is reading a row written by a newer deploy that introduced `a{N+1}`. Should not happen if deploys go forward-only; surface as a deploy-version alarm.
- Secret material registry miss with successful SDK fallback → cache populated; future reads of the same `k{...}` avoid the SDK call. The operations pod federates into its IRSA-bound pod role at startup, then performs `sts:AssumeRole` into the partition’s `EmailEncryptionKeyFallbackRole` (per DQ-R1-020) to obtain `secretsmanager:GetSecretValue` on the encryption-key SM ARN (no versionId restriction). The pod role itself does not carry the SM permission; permissions live on the purpose-specific fallback role.
- SDK fallback finds no such version (Secrets Manager answers with `ResourceNotFoundException`) → `RetiredSecretVersion`. Means the SM version no longer exists while rows still reference it. Surface as an alarm; runbook covers manual remediation.
Failure-mode matrix
| State | Algorithm registry | Secret-material registry | SDK fallback can recover? | Effect |
|---|---|---|---|---|
| Normal `a1` operation | has `a1` | has `k{current}` | n/a | Encrypt + decrypt with `a1.k{current}` |
| Mid-rotation; AWSPREVIOUS exists | has `a1` | has `k{current}` and `k{previous}` | n/a | Reads of both succeed; writes use `k{current}` |
| Read of row at `k{very-old}` (older than AWSPREVIOUS) | has `a1` | does not have `k{very-old}` | Yes, if SM still has the version | One-off SDK call; cached for the pod’s lifetime; row then lazily migrated to `k{current}` (Sub-decision 3) |
| Read of row at `k{deleted}` (SM version no longer exists) | has `a1` | does not have `k{deleted}` | No (`ResourceNotFoundException`) | Throw `RetiredSecretVersion`; alarm; runbook |
| Read of row at `a99` (unknown algorithm tag) | no `a99` | n/a | n/a | Throw `UnknownAlgorithmVersion`; means stale deploy reading newer-deploy rows |
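The matrix above can be exercised with a compact TypeScript sketch of the two-registry lookup. The in-memory `Set`/`Map`, the injected `fetchFromSm` callback (a stand-in for the STS-chained SDK call), and the `kind` discriminator on the error classes are all assumptions made for illustration.

```typescript
class UnknownAlgorithmVersion extends Error {
  readonly kind = "UnknownAlgorithmVersion";
}
class RetiredSecretVersion extends Error {
  readonly kind = "RetiredSecretVersion";
}

// Stand-in for the SDK fallback: returns undefined when SM no longer has the version.
type FetchVersion = (versionId: string) => Buffer | undefined;

class MaterialResolver {
  constructor(
    private readonly algorithms: Set<string>,        // code-side registry of a{N} tags
    private readonly materials: Map<string, Buffer>, // runtime registry seeded from ESO
    private readonly fetchFromSm: FetchVersion,
  ) {}

  resolve(algoTag: string, versionId: string): Buffer {
    if (!this.algorithms.has(algoTag)) {
      throw new UnknownAlgorithmVersion(algoTag);
    }
    const cached = this.materials.get(versionId);
    if (cached !== undefined) {
      return cached;
    }
    const fetched = this.fetchFromSm(versionId); // cache-miss path
    if (fetched === undefined) {
      throw new RetiredSecretVersion(versionId);
    }
    this.materials.set(versionId, fetched); // kept for the process lifetime
    return fetched;
  }
}
```

Each row of the matrix corresponds to one branch: a cached hit, a one-off fetch that is cached afterwards, and the two distinct failure types.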
Sub-decision 3 — Rotation and migration
Rotation procedure (Phase 5b onward, with tenant rows in flight)
1. Operator decides to rotate (triggers below).
2. Generate the new key material and write it to the SM secret via two AWS CLI calls (the AWS CLI’s `update-secret` subcommand does not accept `--generate-random-password`; random generation is the separate `get-random-password` subcommand):

   ```sh
   NEW=$(aws secretsmanager get-random-password \
     --password-length 64 --exclude-characters '"@/\' \
     --require-each-included-type \
     --output text --query RandomPassword)
   aws secretsmanager put-secret-value \
     --secret-id "{fqn}-I-EmailEncryptionKey" --secret-string "$NEW"
   ```

   `put-secret-value` creates a new `versionId`, automatically promotes it to `AWSCURRENT`, and demotes the prior version to `AWSPREVIOUS`. (Future automation: an AWS SM Rotation Lambda performs the same two-step flow on a schedule.)
3. ESO refresh. Within the `ClusterSecretStore`’s `refreshInterval` (~1 minute today), ESO re-pulls both `AWSCURRENT` and `AWSPREVIOUS`. The two K8s Secrets update; the operations pod sees the new versionIds on its next HOCON refresh or restart.
4. Pod refresh. Either rolling-restart the operations component, or (if Phase 5b implements `TokenCipher.reload()` on a periodic supplier) wait for the next refresh tick. After the refresh, `currentVersionId` points to the new SM version; the previous version is still registered for read-dispatch; new writes encrypt as `a1.k{new}:…`.
5. Migration runs automatically. Triggered on the next non-up-to-date read; see “Migration model” below. No manual job to dispatch.
6. Verify migration completion. `SELECT COUNT(*) FROM email_configuration WHERE server_token_encrypted NOT LIKE 'a1.k${currentVersionId}:%'` returns zero. (Phase 5b can expose this query as an admin endpoint, e.g., `GET /admin/email-key-rotation/status`.)
7. Optionally retire the old SM version. `aws secretsmanager update-secret-version-stage ... --remove-from-version-id <old>` removes the `AWSPREVIOUS` label; the version stays in SM history (still SDK-fetchable). Full removal cannot be requested per version: the AWS CLI’s `delete-secret` has no `--version-id` option and schedules the whole secret for deletion (recovery window of 7 to 30 days), so it must not be used for version retirement. A version left without staging labels is deprecated, and Secrets Manager prunes deprecated versions on its own once the secret holds more than 100 versions. The routine cadence is “leave the prior version stage-less but in history for at least one more rotation cycle”.
The procedure works without taking writes offline — the pod’s dispatch table holds both keys throughout the migration; reads of pre- and post-migrated rows both succeed; writes commit in the new format.
Migration model — synchronous on first stale read + coroutine mop-up
Triggered by `EmailConfigurationService.getUnlockedConfiguration(id)`:

```kotlin
// inside L3 service
val row = repo.findById(id)
val plaintext = tokenCipher.decrypt(row.serverTokenEncrypted)
val (_, keyVersionId) = TokenCipher.parseHeader(row.serverTokenEncrypted)
if (keyVersionId != tokenCipher.currentVersionId) {
    // Out-of-date: synchronously re-encrypt this row's token before returning to the caller.
    val newEnvelope = tokenCipher.encrypt(plaintext)
    repo.updateServerToken(id, newEnvelope)
    // Mop up the rest of the partition asynchronously.
    migrationCoroutine.launchIfNotRunning(scope = appScope) {
        repo.findAllNotAtVersion(tokenCipher.currentVersionId).forEach { staleRow ->
            val stalePlaintext = tokenCipher.decrypt(staleRow.serverTokenEncrypted)
            val freshEnvelope = tokenCipher.encrypt(stalePlaintext)
            repo.updateServerToken(staleRow.id, freshEnvelope)
        }
    }
}
return plaintext
```
Properties:
- First non-up-to-date read triggers the mop-up. No scheduled job needed; whoever sends an email next initiates the partition-wide migration.
- Synchronous self-healing on the read path. The current caller’s row is always re-encrypted before plaintext is returned, so subsequent reads of the same row are fast.
- `launchIfNotRunning` guards the coroutine with a `Mutex.tryLock()` (or boolean flag), ensuring one in-flight mop-up per pod. Idempotent: re-entering after a crash is safe because every row is checked against `currentVersionId` before write.
- Multi-pod coordination via idempotency. Each pod’s coroutine filters `WHERE keyVersionId != currentVersionId`; concurrent pods walking overlapping rows converge (already-migrated rows are skipped by the WHERE clause).
- Coroutine scope. Uses the pod-lifetime `appScope`, not the request scope: the mop-up outlives the request that triggered it but dies cleanly on pod shutdown. The next non-up-to-date read on a restarted pod re-triggers.
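The idempotency argument in the properties above can be illustrated with a simplified, synchronous TypeScript sketch. The `Row` type, the `MopUp` class, and the `reencrypt` callback are hypothetical; the real mop-up runs as a Kotlin coroutine against the repository, and the boolean flag here only models the re-entrancy guard.

```typescript
interface Row {
  id: number;
  envelope: string; // "a1.k<versionId>:<payload>"
}

// Extracts <versionId> from "a{N}.k<versionId>:<payload>".
function keyVersionOf(envelope: string): string {
  const header = envelope.slice(0, envelope.indexOf(":"));
  return header.slice(header.indexOf(".k") + 2);
}

class MopUp {
  private running = false;

  constructor(private readonly reencrypt: (envelope: string) => string) {}

  // Returns the number of rows rewritten; returns 0 at once if a pass is in flight.
  run(rows: Row[], currentVersionId: string): number {
    if (this.running) {
      return 0;
    }
    this.running = true;
    try {
      let migrated = 0;
      for (const row of rows) {
        // Rows already at the current version are skipped, which makes reruns harmless.
        if (keyVersionOf(row.envelope) === currentVersionId) {
          continue;
        }
        row.envelope = this.reencrypt(row.envelope);
        migrated++;
      }
      return migrated;
    } finally {
      this.running = false;
    }
  }
}
```

Running the pass a second time over the same rows rewrites nothing, which is the property that lets several pods (or a restarted pod) repeat the walk safely.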
Cache-miss path (rows older than AWSPREVIOUS)
If a read encounters `k{very-old-uuid}` that is neither AWSCURRENT nor AWSPREVIOUS:
- `TokenCipher.decrypt()` looks up `materials[very-old-uuid]` → miss.
- Falls through to `fetchFromSdk(very-old-uuid)`. The operations component performs `sts:AssumeRole` into `EmailEncryptionKeyFallbackRole` (per DQ-R1-020), which holds `secretsmanager:GetSecretValue` on the encryption-key SM ARN (no versionId restriction). The pod’s IRSA-bound pod role does not carry the SM permission directly; the assume-role hop is mandatory.
- If SM has the version: material is fetched, HKDF-derived, cached for the pod’s lifetime. The row is then migrated to `currentVersionId` per the lazy migration path above.
- If SM has no such version (the version was removed from Secrets Manager): throw `RetiredSecretVersion`. The operations component emits an alarm; the row is unrecoverable from the application’s perspective. Runbook covers manual remediation: (a) restore the SM version from AWS retention if still within the window, (b) clear the row from out-of-band records, or (c) treat the row as compromised and re-provision the tenant.
The SDK fallback is rare by construction: routine rotation always demotes prior to AWSPREVIOUS, which the pod has continuously mounted. Two consecutive rotations within the same un-migrated window are the only natural failure mode for the cache. Even then, the operator can preemptively label additional past versions with custom stages and mount them in Helm; the SDK fallback is the safety net, not the primary path.
Rotation triggers
| Trigger | Action |
|---|---|
| Suspected key compromise | Immediate rotation. Migration pass proceeds normally. The operator should also plan how to retire the compromised version once migration completes; `delete-secret` has no `--version-id` option (it deletes the whole secret), so retirement means stripping the version’s staging labels. |
| Insider with HOCON-config access departs | Routine rotation on off-boarding if the departing party had access to `extras.email.encryptionKeys` on a partition’s pod. |
| Compliance / hygiene cadence | No fixed schedule in this design. Compliance reviews (SOC 2 etc.) may later require a cadence (90-day, annual, etc.); the procedure applies regardless of trigger. |
| Algorithm migration (`a{N}` bump) | Same procedure; the code release introducing `a{N+1}` lands first, then a rotation creates a new SM versionId, then the migration pass re-encrypts existing rows under `a{N+1}.k{new}`. |
Automated rotation: enabled by design
The dual-mount design is AWS Rotation Lambda-compatible. AWS SM Rotation Lambdas implement the standard four-step contract (`createSecret`, `setSecret`, `testSecret`, `finishSecret`); for an HKDF input, the `createSecret` step calls `GetRandomPassword` and `PutSecretValue` (with `VersionStages: ["AWSPENDING"]`), and `finishSecret` calls `UpdateSecretVersionStage` to promote AWSPENDING → AWSCURRENT (demoting the prior to AWSPREVIOUS). A Rotation Lambda that:
- Generates a new random password and calls `PutSecretValue` to stage it as `AWSPENDING`, then `UpdateSecretVersionStage` to promote it to `AWSCURRENT`.
- Optionally triggers an admin endpoint on the operations component to seed the lazy-migration coroutine (avoids waiting for the first tenant read).
- Polls the completion query.
- Optionally cleans up older stages once migration finishes.
…can be added at any time without changing the application code. Phase 4 ships only the initial SM secret; the Rotation Lambda is a future deliverable.
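As a rough illustration of the four-step contract for this secret, the sketch below dispatches on the rotation event's `Step`. The `SmPort` interface is a hypothetical narrow wrapper over the Secrets Manager calls named above (not a real SDK type), the handler is synchronous for brevity, and treating `setSecret` and `testSecret` as no-ops is an assumption of this example. A production handler would also make `createSecret` idempotent.

```typescript
type Step = "createSecret" | "setSecret" | "testSecret" | "finishSecret";

interface RotationEvent {
  SecretId: string;
  ClientRequestToken: string; // becomes the new versionId
  Step: Step;
}

// Hypothetical port; a real Lambda would back this with the AWS SDK.
interface SmPort {
  getRandomPassword(length: number): string;
  putSecretValue(secretId: string, token: string, value: string, stages: string[]): void;
  currentVersionId(secretId: string): string;
  updateSecretVersionStage(secretId: string, stage: string, moveTo: string, removeFrom: string): void;
}

function handleRotation(event: RotationEvent, sm: SmPort): void {
  switch (event.Step) {
    case "createSecret":
      // Stage fresh 64-character material as AWSPENDING under the request token.
      sm.putSecretValue(event.SecretId, event.ClientRequestToken, sm.getRandomPassword(64), ["AWSPENDING"]);
      break;
    case "setSecret":
    case "testSecret":
      // Assumed no-ops: an HKDF input has no external system to update or probe.
      break;
    case "finishSecret":
      // Promote the pending version; the prior AWSCURRENT is demoted to AWSPREVIOUS.
      sm.updateSecretVersionStage(
        event.SecretId,
        "AWSCURRENT",
        event.ClientRequestToken,
        sm.currentVersionId(event.SecretId),
      );
      break;
  }
}
```

Because the application only ever reads `AWSCURRENT` and `AWSPREVIOUS`, a handler of this shape leaves the pod-side dispatch untouched.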
Phase 4 era (pre-Phase-5b, zero tenant rows)
If rotation is required between Phase 4’s deploy and Phase 5b’s first tenant onboarding (unlikely but possible, e.g., suspected compromise of the initial material):
- Operator runs `aws secretsmanager update-secret` against the SM secret. AWS rotates versionIds and stages.
- No application is consuming the secret yet, so no further coordination is needed. Phase 5b’s eventual deploy reads whichever is `AWSCURRENT` (the rotated version) and `AWSPREVIOUS` (the original) and proceeds normally.
Operational details
- ESO refresh latency. Driven by the `ClusterSecretStore`’s `refreshInterval` (~1 minute today; confirm at Phase 5b time).
- Pod refresh strategy. Phase 5b’s `TokenCipher` may load the key map at startup (requires pod restart on rotation) or via a refreshable supplier (no restart needed; ESO refresh + periodic `TokenCipher.reload()` covers it). Decide at Phase 5b implementation time; this design is unchanged either way.
- Multi-pod coordination during rolling restart. Hot-swap dispatch makes mixed-key pods operationally fine: every pod holds both `AWSCURRENT` and `AWSPREVIOUS` keys throughout, so a read can land on any pod regardless of rotation state.
- Multi-region partitions. Phase 4 partitions are single-region; cross-region replication of the SM secret is out of scope. AWS SM’s built-in cross-region replication can be enabled on the resource without changing this design if needed later.
- SM version retention. SM keeps prior versions until the per-secret limit of roughly 100 versions is reached, after which it prunes deprecated (label-less) versions; there is no per-version delete with a recovery window (the 7-day minimum window applies only to deleting the whole secret). At one rotation per quarter, we hit the limit in 25 years (100 versions / 4 per year); no practical concern. If a future compliance cadence pushes rotation frequency to daily, the limit becomes relevant (about 100 days, ~3 months) and the runbook should include periodic pruning.
- AAD support, when `a2+` introduces it. A future envelope can include AAD bound to the row’s `email_configuration_id` to make ciphertext non-portable across rows.
Implications
Phase 4 deliverables (this design’s outputs)
- Per-partition aws_secretsmanager.Secret (4 partitions: prod, demo, dev, stage; kyle suspended per DQ-R1-021) in partition-email stacks. Name: {fqn}-I-EmailEncryptionKey. RemovalPolicy.RETAIN. Single resource per partition; versioning delegated to AWS Secrets Manager.
- EmailEncryptionKeyFallbackRole per partition (per DQ-R1-020): a fresh purpose-specific IAM role with secretsmanager:GetSecretValue (no versionId restriction) on the encryption-key SM secret’s ARN. Trust policy = account principal + ArnLike on {fqn}-* so the partition pod role can sts:AssumeRole it (mirrors the ImageUploadPreSigningRole pattern). The pod role itself is not extended with the SM permission; the SDK cache-miss fallback depends on the assume-role hop.
- CFN exports consumed by the operations Helm chart (non-CDK consumer → -API- marker per cdk-infrastructure.md § Export naming; mirrors image-storage.ts):
  - ${publishingPrefix}-API-EmailEncryptionKeyArn: SM secret ARN; consumed by both ESO ExternalSecret mounts (AWSCURRENT + AWSPREVIOUS reference the same ARN with different versionStage selectors).
  - ${publishingPrefix}-API-EmailEncryptionKeyFallbackRoleArn: fallback role ARN; consumed by the operations component’s STS-AssumeRole call at the cache-miss path.
  The SM resource name ({fqn}-I-EmailEncryptionKey) keeps its -I- marker because that marker indicates intra-partition AWS scope per the partitionSecrets.cfn.yaml resource-naming convention, which is distinct from the CFN export-name marker.
- cross-cutting-design.md Secret-handling table row for EmailEncryptionKey updated: two-axis envelope, two-ExternalSecret consumption (AWSCURRENT + AWSPREVIOUS) referencing the -API- SM-secret-ARN export, lazy + coroutine migration, STS-chained EmailEncryptionKeyFallbackRole (also -API- exported) for older versions. (The existing -I- on the SM resource name is correct; the earlier draft of this design that proposed -API- for the resource name is retracted. The -API- marker now applies to the CFN export names, not the SM resource name.)
- DQ-R1-019 decision-log entry referencing this design.
- Operator runbook entry (location TBD; under current-system/oam/postmark-service/ or a new current-system/runtime/email-encryption-key.md): rotation procedure, migration verification query, retirement procedure, SDK-fallback alarm playbook.
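The two policy documents behind EmailEncryptionKeyFallbackRole might look roughly like the sketch below. It is not the CDK construct itself: the account ID, region, fqn value, and secret ARN suffix are placeholders, the helper names are hypothetical, and the use of aws:PrincipalArn as the ArnLike condition key is an assumption (the design states only "account principal + ArnLike on {fqn}-*").

```python
import json

# Placeholder values, assumed for illustration only.
ACCOUNT_ID = "123456789012"
FQN = "example-partition"
SECRET_ARN = (
    f"arn:aws:secretsmanager:us-east-1:{ACCOUNT_ID}:secret:"
    f"{FQN}-I-EmailEncryptionKey-AbCdEf"
)


def fallback_role_trust_policy(account_id: str, fqn: str) -> dict:
    """Trust policy: account principal, narrowed to roles named {fqn}-*."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
            "Action": "sts:AssumeRole",
            # Assumed condition key; the design names only the ArnLike operator.
            "Condition": {"ArnLike": {
                "aws:PrincipalArn": f"arn:aws:iam::{account_id}:role/{fqn}-*",
            }},
        }],
    }


def fallback_role_permissions(secret_arn: str) -> dict:
    """Permissions: GetSecretValue on the one secret, any versionId."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": secret_arn,
        }],
    }


trust = fallback_role_trust_policy(ACCOUNT_ID, FQN)
permissions = fallback_role_permissions(SECRET_ARN)
print(json.dumps(trust, indent=2))
print(json.dumps(permissions, indent=2))
```

Keeping the permission on the fallback role rather than the pod role means the cache-miss path is the only route to versions older than AWSPREVIOUS, which is the property the assume-role hop is meant to provide.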
Phase 5b consumers (informed by this design)
- Helm chart in the operations component declares two ExternalSecret resources (one each for AWSCURRENT and AWSPREVIOUS) referencing the SM secret by ARN. Both ExternalSecret definitions read their target ARN from the CFN export ${publishingPrefix}-API-EmailEncryptionKeyArn (Phase 4 deliverable), surfaced into Helm values by the deployment pipeline (likely amm.sh calling aws cloudformation list-exports and writing the value into a values overlay). Each projects into a corresponding Kubernetes Secret carrying both material (the 64-byte value) and versionId (the SM UUID) fields.
- HOCON path extras.email.encryptionKeys is a list of {versionId, material} objects; extras.email.currentVersionId names the AWSCURRENT versionId; extras.email.secretArn names the SM secret ARN (from the -API- export) for the SDK-fallback path. The list is filtered at HOCON-load time to drop empty/missing entries (handles AWSPREVIOUS-absent state).
- extras.email.fallbackRoleArn names the EmailEncryptionKeyFallbackRole ARN (from ${publishingPrefix}-API-EmailEncryptionKeyFallbackRoleArn); the operations component’s SecretsManagerClient is wired with an STS-AssumeRole credential provider targeting this role for the cache-miss SDK path.
- EmailConfigurationService (L3) reads these at startup; constructs TokenCipher with the EnvelopeAlgorithmRegistry, the per-versionId materials map, and a SecretsManagerClient for the cache-miss SDK fallback.
- TokenCipher’s API:
  - encrypt(plaintext: ByteArray): String. Returns a1.k{currentVersionId}:<base64...>.
  - decrypt(envelope: String): ByteArray. Parses prefix, dispatches on a{N}, fetches/derives key for k{...} (cache-then-SDK-fallback), decrypts; throws on tag failure / unknown algorithm version / retired secret version.
  - currentVersionId(): String. For migration tooling.
- Per-tenant token plaintext lives only on the call stack of getUnlockedConfiguration(), never persisted or logged.
- The migration coroutine (per-pod Mutex.tryLock()-guarded) drives the lazy mop-up.
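The envelope parsing and cache-then-fallback key lookup described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the Phase 5b implementation: it is written in Python for brevity, all class and function names are hypothetical, the payload is treated as opaque bytes, the AES-256-GCM step is omitted, and the fetch callable merely stands in for the STS-chained SecretsManagerClient.

```python
import base64
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Envelope shape from this design: a{N}.k{versionId}:<base64-payload>
_ENVELOPE = re.compile(r"a(\d+)\.k([^:]+):(.*)")


@dataclass(frozen=True)
class Envelope:
    algorithm: int    # a{N}: code-indexed algorithm version
    version_id: str   # k{...}: Secrets Manager versionId of the material
    payload: bytes    # decoded payload; its layout is algorithm-specific


def format_envelope(algorithm: int, version_id: str, payload: bytes) -> str:
    encoded = base64.b64encode(payload).decode("ascii")
    return f"a{algorithm}.k{version_id}:{encoded}"


def parse_envelope(text: str) -> Envelope:
    match = _ENVELOPE.fullmatch(text)
    if match is None:
        raise ValueError("malformed envelope")
    payload = base64.b64decode(match.group(3), validate=True)
    return Envelope(int(match.group(1)), match.group(2), payload)


class KeyMaterialResolver:
    """Mounted materials first; fetch older versions once and cache them."""

    def __init__(self, mounted: List[Dict[str, str]], fetch: Callable[[str], bytes]):
        # Drop empty or missing entries, e.g. when AWSPREVIOUS does not exist yet.
        self._cache: Dict[str, bytes] = {
            entry["versionId"]: entry["material"].encode("utf-8")
            for entry in mounted
            if entry.get("versionId") and entry.get("material")
        }
        self._fetch = fetch  # stand-in for the SDK fallback (assumption)

    def material_for(self, version_id: str) -> bytes:
        if version_id not in self._cache:
            # Cache-miss path: kept for the lifetime of this object.
            self._cache[version_id] = self._fetch(version_id)
        return self._cache[version_id]


# Usage with made-up version IDs and a stub fetcher.
fetch_calls: List[str] = []


def stub_fetch(version_id: str) -> bytes:
    fetch_calls.append(version_id)
    return b"older-material"


resolver = KeyMaterialResolver(
    mounted=[
        {"versionId": "v-current", "material": "current-material"},
        {"versionId": "", "material": ""},  # AWSPREVIOUS absent
    ],
    fetch=stub_fetch,
)
envelope = parse_envelope(format_envelope(1, "v-current", b"opaque-ciphertext"))
current_material = resolver.material_for(envelope.version_id)
```

Treating the versionId as an opaque string keeps the parser independent of how Secrets Manager formats its identifiers; the only requirement in this sketch is that the identifier contains no colon.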
Phase 5a open question (consumes-or-not for crypto primitives)
The EnvelopeAlgorithm family (HKDF wrapper, AES-256-GCM wrapper, envelope codec) could live in:
- common-module as a generic crypto-primitives helper consumed by operations (Phase 5a deliverable). Pro: any future component needing the same envelope format reuses the helper. Con: locks the envelope format into a library that’s harder to evolve.
- operations as Email-module-local code. Pro: implementation co-evolves with the use case; no premature abstraction. Con: a future second consumer would have to duplicate the code.
To be decided at Phase 5a planning time. The two-axis envelope design above is consumable from either location; this design doesn’t constrain the choice.
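To make the scope of the HKDF wrapper concrete, here is a minimal sketch of HKDF (RFC 5869) plus a dispatch table keyed by the a{N} algorithm version. Several details are assumptions that this design does not pin: the choice of HMAC-SHA-256, the empty salt, the info label, and every name in the block. It is written in Python for brevity, and wherever the family ends up living it would presumably lean on a vetted library rather than hand-rolled code.

```python
import hashlib
import hmac
from typing import Callable, Dict


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF per RFC 5869 with HMAC-SHA-256: extract, then expand."""
    hash_len = hashlib.sha256().digest_size
    if not 0 < length <= 255 * hash_len:
        raise ValueError("invalid output length")
    # Extract: an empty salt is replaced by hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i), concatenated and truncated.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


KeyDeriver = Callable[[bytes], bytes]

# Hypothetical registry: one entry per algorithm version, changed only by a release.
ALGORITHMS: Dict[int, KeyDeriver] = {
    # 32-byte output matches an AES-256 key; salt and info are assumed values.
    1: lambda material: hkdf_sha256(material, salt=b"", info=b"email-server-token/a1"),
}


def derive_key(algorithm: int, material: bytes) -> bytes:
    """Derive the data key for a{algorithm} from one version's secret material."""
    try:
        deriver = ALGORITHMS[algorithm]
    except KeyError:
        raise ValueError(f"unknown algorithm version a{algorithm}") from None
    return deriver(material)


key = derive_key(1, b"example-secret-material")
```

Because the registry is indexed only by algorithm version, rotating the secret material never touches it, which is the separation of lifecycles the two-axis envelope is built around.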
Open follow-ups
Items intentionally not decided here, queued for later:
- The concrete shape of the HOCON extras.email.encryptionKeys list and the Helm/ESO mapping that produces it (Phase 5b deliverable).
- Whether TokenCipher loads the key map at startup (pod-restart on rotation) or via a refreshable supplier (no restart needed): Phase 5b implementation detail.
- A concrete rotation schedule (driven by compliance requirements not yet identified; the procedure is design-ready when the schedule is set).
- Cross-region replication of the SM secret (driven by multi-region partition deployments not yet planned).
- Automated rotation infrastructure (AWS SM Rotation Lambda): straightforward to add once Phase 5b ships; deferred until the operational case is clear.
- AAD support (a2+) binding ciphertext to the email_configuration_id row.
- Application-level audit logging on each decryption call (operational hygiene; depends on the structured-logging conventions Phase 5b adopts).
- Phase 5a / 5b ownership of the EnvelopeAlgorithm family of classes (see above).
References
- Phase 4 goal: Open Design Questions table row 3 (DQ-R1-019) summary.
- Project decision log: DQ-012, DQ-202, DQ-203, DQ-205 (load-bearing prior decisions); forthcoming DQ-R1-019 entry references this design.
- Cross-cutting design: § “Authentication” and the Secret-handling table; the EmailEncryptionKey row is updated alongside this design to reflect two-axis envelope + AWSCURRENT+AWSPREVIOUS mounts + lazy migration.
- CDK Infrastructure reference: -API- vs -I- naming convention (Export naming section); the convention-as-practiced retains -I- for the encryption key.
- CloudFrontSigningKeyGroup: the existing repo precedent for a rotatable signing key (not directly mirrored here; the email encryption key uses SM-native versioning rather than sibling secrets, but the RemovalPolicy.RETAIN + operator-driven retirement disciplines are the same).
- Existing Helm template (in the operations component repo) at src/main/helm/templates/secrets.yaml: pattern of -I- marker + AWSCURRENT version for ESO ExternalSecret resources. Phase 5b’s encryption-key Helm spec extends this pattern by adding a second ExternalSecret for AWSPREVIOUS.
Copyright: (c) Arda Systems 2025-2026, All rights reserved