PDEV-442 — Infrastructure changes scope
Infrastructure-layer changes needed to land the slow-responses
remediation in production. Application-layer changes are scoped in
sibling analyses (pod_capacity.md, operations-bottlenecks.md,
code-dig.md, sentry-configuration.md); this file is the union of
everything that lives in the infrastructure repo plus the manual
console work that can’t.
[!important] Amplify is out of scope for infrastructure improvements and not in the at-a-glance table below. Three reasons:
- Compute config not configurable. AWS Amplify Hosting exposes no path (CloudFormation, CLI, or console) to configure SSR runtime memory, vCPU, reserved concurrency, or provisioned concurrency on the auto-generated Amplify Hosting Compute Lambda. Verified against the full `AWS::Amplify::App` CFN property reference (only `JobConfig.BuildComputeType` is exposed, and that controls the build CI instance, not the SSR runtime). Community workarounds via direct `aws lambda update-function-configuration` against the Amplify-managed Lambda are non-contractual and reset on next deploy.
- `CacheConfig` not a useful lever. `AWS::Amplify::App.CacheConfig` is configurable (`AMPLIFY_MANAGED` default vs. `AMPLIFY_MANAGED_NO_COOKIES`), but it controls only whether the edge cache key includes cookies. Switching to `NO_COOKIES` would cross-serve cached SSR responses to different authenticated users, which is a correctness/security regression in an auth-gated multi-tenant app. The default (`AMPLIFY_MANAGED`) is correct and the only safe choice; there is no useful tuning available within the property’s value space. Static-asset caching at the CDN tier is unaffected by `CacheConfig` and already works as expected.
- Application-level levers go elsewhere. ISR adoption, BFF JS bundle reduction, BFF route-handler parallelisation, and per-row N+1 elimination are all application changes (Next.js, React, route handlers); they live under PDEV-489, not the infrastructure repo.
A future “migrate SSR off Amplify Hosting onto a custom Lambda + CloudFront stack” project would unlock these controls as first-class CDK resources, but is out of scope for PDEV-442.
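
To make the cross-serving risk concrete, the following is a minimal sketch of how a cache key behaves with and without cookies. The `cacheKey` function and the `SsrRequest` shape are hypothetical illustrations and do not describe Amplify’s actual cache implementation.

```typescript
// Hypothetical model of an edge cache key; not Amplify's real implementation.
type CacheMode = "AMPLIFY_MANAGED" | "AMPLIFY_MANAGED_NO_COOKIES";

interface SsrRequest {
  path: string;
  cookies: Record<string, string>;
}

function cacheKey(req: SsrRequest, mode: CacheMode): string {
  if (mode === "AMPLIFY_MANAGED_NO_COOKIES") {
    // Cookies are ignored, so every user maps to the same key for a given path.
    return req.path;
  }
  // Cookies are part of the key, so each session gets its own cache entry.
  const cookiePart = Object.keys(req.cookies)
    .sort()
    .map((cookieName) => `${cookieName}=${req.cookies[cookieName]}`)
    .join(";");
  return `${req.path}|${cookiePart}`;
}

const alice: SsrRequest = { path: "/dashboard", cookies: { session: "alice-token" } };
const bob: SsrRequest = { path: "/dashboard", cookies: { session: "bob-token" } };

// Same key for both users: Bob could be served Alice's cached page.
const sharedWithoutCookies =
  cacheKey(alice, "AMPLIFY_MANAGED_NO_COOKIES") === cacheKey(bob, "AMPLIFY_MANAGED_NO_COOKIES");
// Distinct keys: responses stay per-session.
const sharedWithCookies =
  cacheKey(alice, "AMPLIFY_MANAGED") === cacheKey(bob, "AMPLIFY_MANAGED");
```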
At a glance
| # | Change | Repo / surface | Linear | Status |
|---|---|---|---|---|
| 1 | EKS metrics-server managed addon | infrastructure (EKS construct) | PDEV-491 | not started |
| 2 | Aurora custom parameter group — pg_stat_statements, slow-query log | infrastructure (RDS construct) | PDEV-479 | not started |
| 3 | Aurora instance class + max_connections review | infrastructure (RDS construct) | PDEV-479 | not started |
| 4 | Infrastructure-scoped Sentry DSN secret ({Infra}-SentryDsn) | infrastructure (new CDK construct + amm.sh extension) | new ticket | not started |
| 5 | RDS Proxy decision (yes / no / defer) | infrastructure (new construct) or no-op | PDEV-499 | Triage (gated on PDEV-488 in stage) |
Tags audit across these resources is intentionally not scoped here — it’s a workspace-wide effort that doesn’t belong piecemeal inside the slow-responses project.
1. EKS metrics-server (PDEV-491)
Prerequisite for the HPA shipped under PDEV-488. Without
metrics-server, `kubectl top pod` returns no data and the HPA’s
CPU-target metric is unavailable, so HPA scaling is silently
disabled regardless of the chart-level `enabled: true` flag.
- Action: install the `metrics-server` managed addon via the existing EKS cluster CDK construct. Both Alpha001 and Alpha002 clusters.
- Verification: `kubectl --context Alpha001 top pod -A` returns numbers; HPA on `operations` reports `current/target` instead of `<unknown>/60%`.
- Already tracked in PDEV-491. No new ticket needed.
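
For context on why a missing metric stalls scaling, the Kubernetes HPA derives its desired replica count from the ratio of the observed metric to the target. The sketch below is a simplified version of that rule, assuming the 60% CPU target and the 2 to 8 replica bounds quoted in this scope; the function name and the `null`-for-unknown convention are illustrative and not part of any Kubernetes API.

```typescript
// Illustrative only: simplified form of the documented HPA rule
// desired = ceil(current * currentMetric / targetMetric), clamped to [min, max].
// The real controller also applies a tolerance band and stabilization windows.
function desiredReplicas(
  current: number,
  cpuUtilizationPct: number | null, // null models "<unknown>" (no metrics-server)
  targetPct: number,
  min: number,
  max: number,
): number {
  if (cpuUtilizationPct === null) {
    // No metric available: no ratio can be computed, so the count is left alone.
    return current;
  }
  const raw = Math.ceil((current * cpuUtilizationPct) / targetPct);
  return Math.min(max, Math.max(min, raw));
}

const withoutMetrics = desiredReplicas(2, null, 60, 2, 8); // stays at 2
const underLoad = desiredReplicas(2, 150, 60, 2, 8); // ceil(2 * 150 / 60) = 5
const saturated = desiredReplicas(4, 180, 60, 2, 8); // 12, clamped to the max of 8
```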
2. Aurora custom parameter group (PDEV-479)
The current Aurora cluster uses the default DB cluster parameter
group, which means `pg_stat_statements` is not pre-loaded and
slow-query logging is off. Both of these need IaC, not console
toggles: Aurora parameter changes that touch
`shared_preload_libraries` (which `pg_stat_statements` does) force
an instance restart.
- Action: define a custom DB cluster parameter group in the RDS CDK construct, attach to all four clusters (dev, stage, demo, prod). Parameters:
  - `shared_preload_libraries = pg_stat_statements` (static; needs restart).
  - `pg_stat_statements.track = all`
  - `log_min_duration_statement = 500` (ms; tune after observing).
  - `log_statement = ddl` (catch schema changes; cheap).
- Constraint (outage window). Static parameters require an instance restart. Bundle with the sizing change in #3 (prod only) so we take the restart once where it matters; non-prod restarts are routine.
- `CREATE EXTENSION pg_stat_statements;` is not in scope for this ticket. It runs against the `postgres` admin DB on each Aurora cluster as part of the `postgres-database-initializer` init container, tracked separately under PDEV-498 (sub-issue of PDEV-479). The statement is idempotent (`IF NOT EXISTS`) so it’s safe to run on every operations deploy. Loading the library at the cluster level (this ticket) is the prerequisite; the extension creation lights up the view in the `postgres` DB so PDEV-490’s authors can query the data.
- Verification: after the cluster restart and a subsequent operations deploy that picks up the new initializer image, `SELECT count(*) FROM pg_stat_statements;` against the `postgres` admin DB returns non-zero. CloudWatch slow-query log group populates within an hour at the chosen threshold.
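
The four parameters in this section can be sketched as plain data, with the static/dynamic split made explicit so the restart requirement is visible in review. The `ClusterParameter` type and both helpers are hypothetical names; only `shared_preload_libraries` is stated as static in this scope, and the `dynamic` label on the other three is an assumption based on standard PostgreSQL behaviour that should be checked against the Aurora parameter reference.

```typescript
// Hypothetical shape; the actual RDS construct may model this differently.
interface ClusterParameter {
  name: string;
  value: string;
  applyType: "static" | "dynamic"; // static parameters only take effect after a restart
}

const slowQueryParameters: ClusterParameter[] = [
  { name: "shared_preload_libraries", value: "pg_stat_statements", applyType: "static" },
  // Assumption: the next three are dynamic, as in stock PostgreSQL.
  { name: "pg_stat_statements.track", value: "all", applyType: "dynamic" },
  { name: "log_min_duration_statement", value: "500", applyType: "dynamic" }, // milliseconds
  { name: "log_statement", value: "ddl", applyType: "dynamic" },
];

function requiresRestart(params: ClusterParameter[]): boolean {
  return params.some((p) => p.applyType === "static");
}

// Flatten to the name -> value map that a parameter group definition typically takes.
function toParameterMap(params: ClusterParameter[]): Record<string, string> {
  const map: Record<string, string> = {};
  for (const p of params) {
    map[p.name] = p.value;
  }
  return map;
}

const needsRestart = requiresRestart(slowQueryParameters);
const parameterMap = toParameterMap(slowQueryParameters);
```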
3. Aurora instance class + max_connections (PDEV-479)
Pairs with #2 (prod cluster) and bundles into the same restart
window. Resize is prod only; dev / stage / demo stay on their
current instance classes because they don’t see the fan-out, so a
resize there would add restart cost without benefit. The parameter
group from #2 still attaches to all four clusters (cluster-level
parameters cost nothing extra and keep `pg_stat_statements`
available everywhere).
- Instance class review (prod only). Today the prod cluster runs on `db.t3.medium`. Under HPA fan-out (2 → up to 8 operations pods, plus accounts, plus BFF), `db.t3` burstable credits will be a ceiling. Evaluate `db.r6g.large` (or `db.r7g.large` if available in the region): Graviton, non-burstable, baseline CPU adequate for the projected workload. Decision goes in a short decision-log entry under PDEV-479.
- `max_connections`. Aurora’s default is a function of `DBInstanceClassMemory`. With the projected prod pod count × pool size 20 → ~160 application connections (8 pods × 20), plus headroom for ad-hoc / lambda / RDS Proxy if we add it (#5), the default ceiling may be tight. Set `max_connections = 500` in the parameter group (#2) for the prod cluster, which gives ~3× headroom over projected steady-state, comfortably above HPA-max + ad-hoc.
- Per-cluster sizing parameterisation. Instance class + `max_connections` are cluster-specific decisions. Define them in each cluster’s CDK instance file (e.g. `src/main/cdk/instances/Alpha001/prod.ts`) and pass them as parameters into the shared CDK stack / construct that provisions the Aurora cluster. Don’t hardcode either value inside the shared construct; keep all per-env divergence at the instance-file boundary, consistent with the rest of the CDK architecture.
- Verification: post-change, RDS Performance Insights shows CPU baseline-not-burst, and `SHOW max_connections;` returns `500` on the prod cluster (and the default on dev / stage / demo).
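
The headroom arithmetic and the per-cluster parameterisation can be sketched together. The `AuroraSizing` interface and helper names are hypothetical; `db.r6g.large` appears here only as the candidate under evaluation, not as a decision.

```typescript
// Hypothetical per-cluster sizing shape; field names are illustrative.
interface AuroraSizing {
  instanceClass: string;
  maxConnections?: number; // undefined means "keep the Aurora default"
}

// Would live in the prod instance file and be passed into the shared construct.
// Instance class is the candidate under evaluation, not a confirmed choice.
const prodSizing: AuroraSizing = { instanceClass: "db.r6g.large", maxConnections: 500 };

function projectedAppConnections(pods: number, poolSizePerPod: number): number {
  return pods * poolSizePerPod;
}

function headroomFactor(maxConnections: number, projected: number): number {
  return maxConnections / projected;
}

const projected = projectedAppConnections(8, 20); // 160 at HPA max
const headroom = headroomFactor(prodSizing.maxConnections ?? 0, projected); // 3.125
```

Keeping `maxConnections` optional lets the non-prod instance files omit it and inherit the Aurora default.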
4. Infrastructure-scoped Sentry DSN secret (new ticket)
PDEV-488 #5 wires the Sentry DSN through an ESO ExternalSecret
sourcing {Infrastructure}-SentryDsn from AWS Secrets Manager.
Per the established Arda convention (see pod_capacity.md
§ Provisioning pipeline), the runtime never reads 1Password
directly — amm.sh is the bridge that reads from 1P and passes
the value into a CDK stack parameter; CDK declares the SM
resource, and the existing ESO + secretReader IRSA role
consumes it.
One SM secret per infrastructure, not per partition: the same
DSN serves both partitions of each infrastructure (and would serve
future components like accounts in the same infrastructure if a
distinct DSN per component is ever needed). Provisioned once per
infrastructure (Alpha001-SentryDsn, Alpha002-SentryDsn).
The -I- and -API- markers used elsewhere in the codebase are
CloudFormation export-name conventions only — they appear on
the exported names like Alpha001-I-SentryDsnArn, but never on
the AWS resource names themselves (Alpha001-SentryDsn). This
matches the existing pattern in partitionSecrets.cfn.yaml
(secret Alpha001-prod-ArdaApiKey, export
Alpha001-prod-I-ArdaApiKeyArn / Alpha001-prod-API-ArdaApiKeyArn).
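
The naming split between resource names and export names can be sketched as two small helpers. These functions are hypothetical and written only to illustrate the convention described above; they are not existing repo code.

```typescript
// Hypothetical helpers illustrating the convention; not existing repo code.
type ExportScope = "I" | "API";

function secretResourceName(prefix: string, secretName: string): string {
  // AWS resource names never carry the -I- / -API- marker.
  return `${prefix}-${secretName}`;
}

function secretArnExportName(prefix: string, scope: ExportScope, secretName: string): string {
  // The marker appears only on the CloudFormation export name.
  return `${prefix}-${scope}-${secretName}Arn`;
}

const infraSecret = secretResourceName("Alpha001", "SentryDsn");
const infraExport = secretArnExportName("Alpha001", "I", "SentryDsn");
const partitionSecret = secretResourceName("Alpha001-prod", "ArdaApiKey");
const partitionExport = secretArnExportName("Alpha001-prod", "API", "ArdaApiKey");
```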
Action
- CDK stack, new under `src/main/cdk/stacks/infrastructure/`, class name `InfrastructureSecretsStack` (consistent with the existing PascalCase, scope-prefixed convention: `InfrastructureEksStack`, `NetworkingInfrastructureStack`, `InfrastructureIngress`). Wire it in `src/main/cdk/apps/Al1x/infra.ts` (called from `instances/{Infrastructure}/infra.ts`, already invoked by `amm.sh` lines 304-342) alongside the existing `NetworkingInfrastructureStack` and `InfrastructureEksStack`.

  The SM secret is not a baked-at-synth value; it’s passed at deploy via a CDK / CloudFormation parameter so the synthesized template never embeds the DSN plaintext.

- New OAM construct `ExternallySuppliedSecret` under `src/main/cdk/constructs/oam/externally-supplied-secret.ts`, alongside the existing `PredefinedSecret` and `GeneratedSecret`. Encapsulates the `CfnParameter` + `cdk.SecretValue.cfnParameter` plumbing so consumers don’t have to know about the underlying SDK call (matching the existing OAM convention; those two constructs also wrap a single SDK call). Shape:

  ```typescript
  // constructs/oam/externally-supplied-secret.ts
  import * as cdk from "aws-cdk-lib";
  import { Construct } from "constructs";
  import * as secretsmanager from "aws-cdk-lib/aws-secretsmanager";
  import * as misc from "arda/utils/misc";

  export interface Configuration {
    prefix: string;
    secretName: string;
    // Must be declared with noEcho: true, or SecretValue.cfnParameter throws.
    secretStringParameter: cdk.CfnParameter;
  }

  export interface Props extends Configuration {}

  export interface Built {
    readonly secret: secretsmanager.ISecret;
  }

  export class ExternallySuppliedSecret extends Construct {
    public readonly build: Built;

    static validateProps(id: string, props: Props): Error[] {
      return [];
    }

    constructor(scope: Construct, id: string, props: Props) {
      // Validate before super() so an invalid construct never joins the tree.
      const errors = ExternallySuppliedSecret.validateProps(id, props);
      if (errors.length > 0) {
        throw new misc.MultiError(`Props for ${id} are invalid:`, errors);
      }
      super(scope, id);
      const secret = new secretsmanager.Secret(this, "Secret", {
        secretName: `${props.prefix}-${props.secretName}`,
        // Resolved by CloudFormation at deploy time; never written to cdk.out.
        secretStringValue: cdk.SecretValue.cfnParameter(
          props.secretStringParameter,
        ),
      });
      this.build = { secret };
    }
  }
  ```

  Usage inside `InfrastructureSecretsStack`:

  ```typescript
  // The parameter keeps the id "SentryDsn" because that logical id is what
  // `--parameters SentryDsn=...` targets.
  const sentryDsn = new cdk.CfnParameter(this, "SentryDsn", {
    type: "String",
    noEcho: true,
    minLength: 1,
    description: "Sentry DSN for this infrastructure (passed by amm.sh from 1Password)",
  });
  // Construct ids must be unique within a scope, so the secret construct
  // cannot reuse "SentryDsn"; it gets its own id.
  const sentryDsnSecret = new ExternallySuppliedSecret(this, "SentryDsnSecret", {
    prefix: il.id,
    secretName: "SentryDsn",
    secretStringParameter: sentryDsn,
  });
  new cdk.CfnOutput(this, "SentryDsnArn", {
    value: sentryDsnSecret.build.secret.secretArn,
    exportName: `${il.id}-I-SentryDsnArn`,
  });
  ```

- `amm.sh` extension, added as an infrastructure-level step before the existing infrastructure CDK deploy (lines 304-342; the secret needs to exist before the partition loop’s chart installs reference the ExternalSecret):

  ```bash
  echo ">>>>>>>>> Step N: Resolve infrastructure secrets from 1Password"
  if [[ -z "${SENTRY_DSN:-}" ]]; then
    SENTRY_DSN="$(op read 'op://Arda-SystemsOAM/be-sentry-dsn/dsn')"
    if [[ "${GITHUB_ACTIONS:-}" == "true" ]]; then
      echo "::add-mask::${SENTRY_DSN}"
    fi
    export SENTRY_DSN
  fi
  # Append the parameter to the existing infrastructure CDK invocation
  # (or, if cdk deploy --all is used, scope the override to the secret
  # stack via the Stack:Param=Value form, e.g. SecretsStack:SentryDsn=...).
  infrastructure_cdk_arguments+=("--parameters" "SentryDsn=${SENTRY_DSN}")
  time npx cdk "${infrastructure_cdk_arguments[@]}"
  ```

  The `::add-mask::` registration is required because 1P-sourced values are not auto-redacted by GitHub Actions the way `${{ secrets.* }}` values are; same pattern as the partition loop’s `AMAZON_CREATORS_API_JSON` masking.

- No IAM change. The existing `{Infrastructure}-ReadSecrets` managed policy (attached to the `{Infrastructure}-SecretsManagerReadRole` used by ESO via `secretReader` IRSA) grants `GetSecretValue` on `arn:aws:secretsmanager:${region}:${account}:secret:${il.id}-*`. That wildcard already covers `${Infrastructure}-SentryDsn` alongside the partition-scoped keys.
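
As a quick sanity check on the wildcard claim, here is a minimal sketch that matches example ARNs against the policy’s resource pattern. The matcher is an illustrative glob check rather than IAM’s real evaluation logic, and the region, account ID, and six-character ARN suffixes are placeholders.

```typescript
// Illustrative glob check; IAM's real evaluation is done by AWS, not by this code.
function matchesResourcePattern(pattern: string, arn: string): boolean {
  // Escape regex metacharacters, then turn the IAM-style "*" into ".*".
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`).test(arn);
}

// Region, account ID, and the random ARN suffixes below are placeholders.
const arnBase = "arn:aws:secretsmanager:us-east-1:111122223333:secret:";
const policyResource = `${arnBase}Alpha001-*`;
const sentryArn = `${arnBase}Alpha001-SentryDsn-AbCdEf`;
const partitionArn = `${arnBase}Alpha001-prod-ArdaApiKey-GhIjKl`;
const otherInfraArn = `${arnBase}Alpha002-SentryDsn-MnOpQr`;

const coversSentry = matchesResourcePattern(policyResource, sentryArn);
const coversPartition = matchesResourcePattern(policyResource, partitionArn);
const coversOtherInfra = matchesResourcePattern(policyResource, otherInfraArn);
```

Under these assumptions the infrastructure-scoped and partition-scoped secrets both fall under the same prefix, while a secret from the other infrastructure does not.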
Notes for the implementer
- The new resource lands in CDK, not the legacy `src/main/cfn/templates`. CDK is the preferred IaC technology in this repo; the partition-scoped `partitionSecrets.cfn.yaml` is the legacy exception, not a model to copy.
- `cdk.SecretValue.cfnParameter(...)` is the correct API for “value supplied at deploy”. `cdk.SecretValue.unsafePlainText` bakes the value into the synthesized template (visible in `cdk.out/*.template.json`) and should not be used for secret-grade values sourced from outside the CDK app.
- CDK parameters are passed via `cdk deploy --parameters [Stack:]Name=Value`. If the deploy is `--all`, scope the parameter to the stack that owns the secret (`Stack:Name=Value` syntax) to avoid the CLI failing on unknown parameters in other stacks.
Verification
- After infrastructure-stack deploy:

  ```bash
  aws secretsmanager get-secret-value \
    --secret-id "Alpha001-SentryDsn" \
    --query 'SecretString' --output text
  ```

  returns the DSN URL (do this against a profile that doesn’t log to CI output, since the response will print the secret).
- After operations chart redeploy with `oam.performance.sentry.enabled: true`:

  ```bash
  kubectl --context Alpha001 -n stage-operations \
    get externalsecret be-sentry-dsn -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  ```

  reports `True` (ESO reconciled).
- And the materialised K8s secret exists:

  ```bash
  kubectl --context Alpha001 -n stage-operations get secret be-sentry-dsn
  ```
Failure-mode reminder
If this ticket lags behind PDEV-488: the operations pod still
starts, because the chart declares SENTRY_DSN’s secretKeyRef
with optional: true. The Sentry agent reads an unset DSN and
disables itself with a single WARN line — no outage, no rollback.
Sentry simply stays silent for that environment until the
infrastructure-stack deploy lands. See pod_capacity.md § Rec #5
“Failure modes” for the full behaviour table.
5. RDS Proxy decision (PDEV-499 — evaluation)
When HPA fans operations out from 2 to up to 8 pods, the application connection count grows linearly. The choice is:
- (a) Bump `max_connections` on Aurora (#3) to absorb the load. Simple. Pays per-connection memory tax on the DB.
- (b) Stand up an RDS Proxy in front of Aurora. Multiplexes application connections onto a smaller pool of physical connections. Lower DB-side memory pressure; adds one hop of latency (~1 ms) and one more thing to operate. Has its own IAM and secrets requirements.
Evaluate once we have pod-level connection metrics from PDEV-488
running in stage (Sentry’s JVM metrics include HikariCP gauges via
OTel). If application connections are bumping against the pool
ceiling and Aurora reports max_connections near the cap, RDS
Proxy is worth the operational cost. Otherwise (a) is fine.
- Action — this project: create the ticket, defer the decision to “after PDEV-488 is live in stage.” Do not pre-commit to either path.
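
The evaluation rule described above can be sketched as a small decision helper to run against stage metrics once they exist. Everything here is hypothetical: the `ConnectionSnapshot` fields, the function name, and especially the 0.8 threshold, which is a placeholder rather than an agreed value.

```typescript
// Hypothetical decision helper; the 0.8 threshold is a placeholder, not an agreed value.
interface ConnectionSnapshot {
  poolActive: number; // active HikariCP connections in the busiest pod
  poolMax: number; // pool size per pod (20 in this scope)
  dbConnections: number; // connections Aurora currently reports
  dbMaxConnections: number; // configured max_connections
}

type Recommendation = "rds-proxy" | "bump-max-connections";

function recommend(s: ConnectionSnapshot, threshold = 0.8): Recommendation {
  const poolPressure = s.poolActive / s.poolMax >= threshold;
  const dbPressure = s.dbConnections / s.dbMaxConnections >= threshold;
  // Both signals must be hot before the proxy's operational cost is justified.
  return poolPressure && dbPressure ? "rds-proxy" : "bump-max-connections";
}

const quiet = recommend({ poolActive: 6, poolMax: 20, dbConnections: 90, dbMaxConnections: 500 });
const hot = recommend({ poolActive: 19, poolMax: 20, dbConnections: 470, dbMaxConnections: 500 });
```

Requiring both signals mirrors the text: pool saturation alone points at pool sizing, and only pool saturation combined with a near-capped database argues for the proxy.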
Ordering and outage windows
- PDEV-491 (#1) can ship first: read-only addon, no outage.
- PDEV-488 can ship anytime; HPA gating means it doesn’t matter whether metrics-server is installed first (the HPA stays disabled until it is).
- PDEV-479 (#2 + #3) is the only planned-outage window: Aurora parameter group + sizing change requires an instance restart. Schedule once. Notify product before.
- Infrastructure-scoped Sentry DSN secret (#4) can ship anytime relative to PDEV-488. If it lags, the operations chart’s `secretKeyRef: optional: true` keeps the pod fail-soft and Sentry stays silent for that environment until the infrastructure-stack deploy lands. No outage, no rollback. If it ships first, the secret simply waits unused until PDEV-488’s `ExternalSecret` reconciles against it.
- Amplify (#1a + #1b) can ship independently. 1a is console work; 1b is the next AMM template release.
- RDS Proxy (#5, PDEV-499) is deferred until we have data from PDEV-488 in stage.
Out of scope
- CDK / resource tags audit. Aligning `partition` / `purpose` / `component` tags across the bumped resources is worth doing but is a workspace-wide concern, not a slow-responses concern. Separate effort.
- Karpenter / EC2 migration for operations. Recommendation #6 in `pod_capacity.md`, deferred.
- CloudFront / API Gateway tuning. None of the PDEV-442 findings pointed there.
- Aurora reader-instance routing. Possible future optimization for the bitemporal read workload; not part of this project.
Copyright: © Arda Systems 2025-2026, All rights reserved