PDEV-442 — Infrastructure changes scope

Infrastructure-layer changes needed to land the slow-responses remediation in production. Application-layer changes are scoped in sibling analyses (pod_capacity.md, operations-bottlenecks.md, code-dig.md, sentry-configuration.md); this file is the union of everything that lives in the infrastructure repo plus the manual console work that can’t.

[!important] Amplify is out of scope for infrastructure improvements and not in the at-a-glance table below. Three reasons:

Compute config not configurable. AWS Amplify Hosting exposes no path — CloudFormation, CLI, or console — to configure SSR runtime memory, vCPU, reserved concurrency, or provisioned concurrency on the auto-generated Amplify Hosting Compute Lambda. Verified against the full AWS::Amplify::App CFN property reference (only JobConfig.BuildComputeType is exposed, and that controls the build CI instance, not the SSR runtime). Community workarounds via direct aws lambda update-function-configuration against the Amplify-managed Lambda are non-contractual and reset on next deploy.

CacheConfig not a useful lever. AWS::Amplify::App.CacheConfig is configurable (AMPLIFY_MANAGED default vs. AMPLIFY_MANAGED_NO_COOKIES), but it controls only whether the edge cache key includes cookies. Switching to NO_COOKIES would cross-serve cached SSR responses to different authenticated users — a correctness/security regression in an auth-gated multi-tenant app. The default (AMPLIFY_MANAGED) is correct and the only safe choice; there is no useful tuning available within the property’s value space. Static-asset caching at the CDN tier is unaffected by CacheConfig and already works as expected.

Application-level levers go elsewhere. ISR adoption, BFF JS bundle reduction, BFF route-handler parallelisation, and per-row N+1 elimination are all application changes (Next.js, React, route handlers) — they live under PDEV-489, not the infrastructure repo.

A future “migrate SSR off Amplify Hosting onto a custom Lambda + CloudFront stack” project would unlock these controls as first-class CDK resources, but is out of scope for PDEV-442.

At a glance

| # | Change | Repo / surface | Linear | Status |
|---|--------|----------------|--------|--------|
| 1 | EKS `metrics-server` managed addon | `infrastructure` (EKS construct) | PDEV-491 | not started |
| 2 | Aurora custom parameter group: `pg_stat_statements`, slow-query log | `infrastructure` (RDS construct) | PDEV-479 | not started |
| 3 | Aurora instance class + `max_connections` review | `infrastructure` (RDS construct) | PDEV-479 | not started |
| 4 | Infrastructure-scoped Sentry DSN secret (`{Infra}-SentryDsn`) | `infrastructure` (new CDK construct + `amm.sh` extension) | new ticket | not started |
| 5 | RDS Proxy decision (yes / no / defer) | `infrastructure` (new construct) or no-op | PDEV-499 | Triage (gated on PDEV-488 in stage) |

Tags audit across these resources is intentionally not scoped here — it’s a workspace-wide effort that doesn’t belong piecemeal inside the slow-responses project.

1. EKS metrics-server (PDEV-491)

Prerequisite for the HPA shipped under PDEV-488. Without metrics-server, kubectl top pod returns no data and the HPA’s CPU-target metric is unavailable, so HPA scaling is silently disabled regardless of the chart-level enabled: true flag.

Action: install the metrics-server managed addon via the existing EKS cluster CDK construct. Both Alpha001 and Alpha002 clusters.
Verification: kubectl --context Alpha001 top pod -A returns numbers; HPA on operations reports current/target instead of <unknown>/60%.
Already tracked in PDEV-491. No new ticket needed.
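
To see why a missing metric leaves the HPA inert, it may help to sketch the replica calculation that the Kubernetes HPA documentation describes: desired = ceil(current × currentMetric / targetMetric), clamped to the min/max bounds. The sketch below is a simplified model only. It ignores the controller's tolerance band and stabilisation windows, the 60% target and the 2 to 8 replica bounds are taken from this document, and the type and function names are illustrative rather than anything in the chart.

```typescript
// Simplified model of the HPA replica calculation; not the controller's code.
interface HpaSpec {
  minReplicas: number;
  maxReplicas: number;
  targetCpuPercent: number;
}

// `currentCpuPercent` is undefined when the metrics API has no data,
// which is the situation on a cluster without metrics-server.
function desiredReplicas(
  spec: HpaSpec,
  currentReplicas: number,
  currentCpuPercent: number | undefined,
): number {
  if (currentCpuPercent === undefined) {
    // No metric means no ratio can be computed, so the replica count is left alone.
    return currentReplicas;
  }
  const raw = Math.ceil(
    currentReplicas * (currentCpuPercent / spec.targetCpuPercent),
  );
  // Clamp to the configured bounds.
  return Math.min(spec.maxReplicas, Math.max(spec.minReplicas, raw));
}

// Bounds and target as described in this document for the operations HPA.
const operationsHpa: HpaSpec = {
  minReplicas: 2,
  maxReplicas: 8,
  targetCpuPercent: 60,
};
```

With `currentCpuPercent` undefined the function returns the current replica count whatever the load, which is the "silently disabled" behaviour described above.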

2. Aurora custom parameter group (PDEV-479)

The current Aurora cluster uses the default DB cluster parameter group, which means pg_stat_statements is not pre-loaded and slow-query logging is off. Both of these need IaC, not console toggles — Aurora parameter changes that touch shared_preload_libraries (which pg_stat_statements does) force an instance restart.

Action: define a custom DB cluster parameter group in the RDS CDK construct, attach to all four clusters (dev, stage, demo, prod). Parameters:
- shared_preload_libraries = pg_stat_statements (static — needs restart).
- pg_stat_statements.track = all
- log_min_duration_statement = 500 (ms — tune after observing).
- log_statement = ddl (catch schema changes; cheap).
Constraint: outage window. Static parameters require an instance restart. Bundle with the sizing change in #3 (prod only) so we take the restart once where it matters; non-prod restarts are routine.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements; is not in scope for this ticket. It runs against the postgres admin DB on each Aurora cluster as part of the postgres-database-initializer init container, tracked separately under PDEV-498 (sub-issue of PDEV-479). The IF NOT EXISTS clause makes the statement idempotent, so it’s safe to run on every operations deploy. Loading the library at the cluster level (this ticket) is the prerequisite; the extension creation lights up the view in the postgres DB so PDEV-490’s authors can query the data.
Verification: after the cluster restart and a subsequent operations deploy that picks up the new initializer image, SELECT count(*) FROM pg_stat_statements; against the postgres admin DB returns non-zero. CloudWatch slow-query log group populates within an hour at the chosen threshold.
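
As a rough illustration of which of the four parameters drives the outage window, the sketch below models them as plain data and separates static from dynamic ones. The apply types reflect the usual PostgreSQL-family behaviour (only shared_preload_libraries is static) and are worth confirming against the parameter list for the engine version in use. The type and function names are hypothetical and not part of the RDS construct.

```typescript
// Apply types are an assumption based on common PostgreSQL-family behaviour;
// confirm them against the engine version's parameter reference.
type ApplyType = "static" | "dynamic";

interface ParameterSpec {
  value: string;
  applyType: ApplyType;
}

const clusterParameters: Record<string, ParameterSpec> = {
  shared_preload_libraries: { value: "pg_stat_statements", applyType: "static" },
  "pg_stat_statements.track": { value: "all", applyType: "dynamic" },
  log_min_duration_statement: { value: "500", applyType: "dynamic" }, // milliseconds
  log_statement: { value: "ddl", applyType: "dynamic" },
};

// Flatten to a name -> string-value map, the shape parameter-group
// definitions generally expect.
function toParameterMap(
  specs: Record<string, ParameterSpec>,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(specs)) {
    out[name] = specs[name].value;
  }
  return out;
}

// Names of parameters whose change only takes effect after an instance restart.
function restartRequiredFor(specs: Record<string, ParameterSpec>): string[] {
  return Object.keys(specs).filter(
    (name) => specs[name].applyType === "static",
  );
}
```

Under these assumptions `restartRequiredFor(clusterParameters)` lists only `shared_preload_libraries`, so later tuning of the logging threshold would not need another restart.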

3. Aurora instance class + `max_connections` (PDEV-479)

Pairs with #2 (prod cluster) and bundles into the same restart window. Resize is prod only; dev / stage / demo stay on their current instance classes, since they don’t see the fan-out and a resize there would incur restart cost without benefit. The parameter group from #2 still attaches to all four clusters (cluster-level parameters cost nothing extra and keep pg_stat_statements available everywhere).

Instance class review (prod only). Today the prod cluster runs on db.t3.medium. Under HPA fan-out (2 → up to 8 operations pods, plus accounts, plus BFF), db.t3 burstable credits will be a ceiling. Evaluate db.r6g.large (or db.r7g.large if available in the region) — Graviton, non-burstable, baseline CPU adequate for the projected workload. Decision goes in a short decision-log entry under PDEV-479.
max_connections. Aurora’s default is a function of DBInstanceClassMemory. With the projected prod pod count × pool size 20 → ~160 application connections, plus headroom for ad-hoc / lambda / RDS Proxy if we add it (#5), the default ceiling may be tight. Set max_connections = 500 in the parameter group (#2) for the prod cluster, which gives ~3× headroom over projected steady-state, comfortably above HPA-max + ad-hoc.
Per-cluster sizing parameterisation. Instance class + max_connections are cluster-specific decisions. Define them in each cluster’s CDK instance file (e.g. src/main/cdk/instances/Alpha001/prod.ts) and pass them as parameters into the shared CDK stack / construct that provisions the Aurora cluster. Don’t hardcode either value inside the shared construct — keep all per-env divergence at the instance-file boundary, consistent with the rest of the CDK architecture.
Verification: post-change, RDS Performance Insights shows CPU running at baseline rather than on burst credits, and SHOW max_connections; returns 500 on the prod cluster (and the default on dev / stage / demo).
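
The headroom arithmetic and the per-cluster parameterisation can be sketched together as below. This is a minimal illustration under stated assumptions: the pod count of 8 is inferred from the HPA maximum for operations, db.r6g.large is only the candidate under evaluation, and `ClusterSizing`, `ConnectionBudget` and the helper functions are hypothetical names rather than existing types in the CDK code.

```typescript
// Hypothetical shape for the sizing values an instance file would pass
// into the shared construct. Undefined fields mean "keep the current value".
interface ClusterSizing {
  instanceClass?: string;
  maxConnections?: number;
}

interface ConnectionBudget {
  maxPods: number; // HPA maximum for the service
  poolSizePerPod: number; // connection pool size per pod
}

function projectedAppConnections(budget: ConnectionBudget): number {
  return budget.maxPods * budget.poolSizePerPod;
}

function headroomFactor(
  maxConnections: number,
  budget: ConnectionBudget,
): number {
  return maxConnections / projectedAppConnections(budget);
}

// Prod: candidate instance class plus the explicit max_connections override.
const prodSizing: ClusterSizing = {
  instanceClass: "db.r6g.large", // candidate only; the decision is pending under PDEV-479
  maxConnections: 500,
};

// Non-prod: no overrides, so the current class and engine default apply.
const nonProdSizing: ClusterSizing = {};

// 8 pods x pool size 20 = 160 projected application connections.
const operationsBudget: ConnectionBudget = { maxPods: 8, poolSizePerPod: 20 };
```

With these inputs the projection is 160 connections and 500 / 160 = 3.125, which matches the "~3×" figure above. The budget covers operations only, so accounts, BFF and ad-hoc connections eat into that headroom.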

4. Infrastructure-scoped Sentry DSN secret (new ticket)

PDEV-488 #5 wires the Sentry DSN through an ESO ExternalSecret sourcing {Infrastructure}-SentryDsn from AWS Secrets Manager. Per the established Arda convention (see pod_capacity.md § Provisioning pipeline), the runtime never reads 1Password directly — amm.sh is the bridge that reads from 1P and passes the value into a CDK stack parameter; CDK declares the SM resource, and the existing ESO + secretReader IRSA role consumes it.

[PlantUML diagram: not reproduced in this text.]

One SM secret per infrastructure, not per partition: the same DSN serves both partitions of each infrastructure, and would also serve future components such as accounts in the same infrastructure unless a distinct DSN per component is ever needed. Provisioned once per infrastructure (Alpha001-SentryDsn, Alpha002-SentryDsn).

The -I- and -API- markers used elsewhere in the codebase are CloudFormation export-name conventions only — they appear on the exported names like Alpha001-I-SentryDsnArn, but never on the AWS resource names themselves (Alpha001-SentryDsn). This matches the existing pattern in partitionSecrets.cfn.yaml (secret Alpha001-prod-ArdaApiKey, export Alpha001-prod-I-ArdaApiKeyArn / Alpha001-prod-API-ArdaApiKeyArn).
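
The split between resource names and export names can be illustrated with two small helpers. These are hypothetical functions written for this note, not existing utilities in the repo; they only encode the pattern visible in the examples above.

```typescript
// Markers appear only in CloudFormation export names, never in resource names.
type ExportMarker = "I" | "API";

// AWS resource name: scope and name joined directly, no marker.
function resourceName(scope: string, name: string): string {
  return `${scope}-${name}`;
}

// CloudFormation export name: the marker sits between scope and name.
function exportName(scope: string, marker: ExportMarker, name: string): string {
  return `${scope}-${marker}-${name}`;
}

// Infrastructure-scoped secret from this section.
const sentrySecretName = resourceName("Alpha001", "SentryDsn");
const sentryExportName = exportName("Alpha001", "I", "SentryDsnArn");

// Partition-scoped example from partitionSecrets.cfn.yaml.
const apiKeySecretName = resourceName("Alpha001-prod", "ArdaApiKey");
const apiKeyExportName = exportName("Alpha001-prod", "API", "ArdaApiKeyArn");
```

The scope argument is the infrastructure id for infrastructure-level resources and the infrastructure-partition pair for partition-level ones.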

Action

CDK stack, new under src/main/cdk/stacks/infrastructure/, class name InfrastructureSecretsStack (consistent with the existing PascalCase, scope-prefixed convention — InfrastructureEksStack, NetworkingInfrastructureStack, InfrastructureIngress). Wire it in src/main/cdk/apps/Al1x/infra.ts (called from instances/{Infrastructure}/infra.ts, already invoked by amm.sh lines 304-342) alongside the existing NetworkingInfrastructureStack and InfrastructureEksStack.

The SM secret is not a baked-at-synth value; it’s passed at deploy via a CDK / CloudFormation parameter so the synthesized template never embeds the DSN plaintext.

New OAM construct ExternallySuppliedSecret under src/main/cdk/constructs/oam/externally-supplied-secret.ts, alongside the existing PredefinedSecret and GeneratedSecret. Encapsulates the CfnParameter + cdk.SecretValue.cfnParameter plumbing so consumers don’t have to know about the underlying SDK call (matching the existing OAM convention — those two constructs also wrap a single SDK call). Shape:

```
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as secretsmanager from "aws-cdk-lib/aws-secretsmanager";
import * as misc from "arda/utils/misc";

export interface Configuration {
  // Prepended to secretName to form the Secrets Manager resource name.
  prefix: string;
  secretName: string;
  // Must be declared with noEcho: true; SecretValue.cfnParameter rejects it otherwise.
  secretStringParameter: cdk.CfnParameter;
}
export interface Props extends Configuration {}
export interface Built {
  readonly secret: secretsmanager.ISecret;
}

export class ExternallySuppliedSecret extends Construct {
  public readonly build: Built;
  // No validation rules yet; kept for parity with the other OAM constructs.
  static validateProps(id: string, props: Props): Error[] {
    return [];
  }
  constructor(scope: Construct, id: string, props: Props) {
    const errors = ExternallySuppliedSecret.validateProps(id, props);
    if (errors.length > 0) {
      throw new misc.MultiError(`Props for ${id} are invalid:`, errors);
    }
    super(scope, id);
    // The value is resolved by CloudFormation at deploy time, so the
    // synthesized template only carries a reference to the parameter.
    const secret = new secretsmanager.Secret(this, "Secret", {
      secretName: `${props.prefix}-${props.secretName}`,
      secretStringValue: cdk.SecretValue.cfnParameter(
        props.secretStringParameter,
      ),
    });
    this.build = { secret };
  }
}
```

Usage inside InfrastructureSecretsStack:

```
// The parameter keeps the id "SentryDsn" because amm.sh passes SentryDsn=... at deploy.
const sentryDsn = new cdk.CfnParameter(this, "SentryDsn", {
  type: "String",
  noEcho: true,
  minLength: 1,
  description: "Sentry DSN for this infrastructure (passed by amm.sh from 1Password)",
});

// Construct ids must be unique within a scope, so the secret cannot reuse "SentryDsn".
const sentryDsnSecret = new ExternallySuppliedSecret(this, "SentryDsnSecret", {
  prefix: il.id,
  secretName: "SentryDsn",
  secretStringParameter: sentryDsn,
});

new cdk.CfnOutput(this, "SentryDsnArn", {
  value: sentryDsnSecret.build.secret.secretArn,
  exportName: `${il.id}-I-SentryDsnArn`,
});
```

amm.sh extension, added as an infrastructure-level step before the existing infrastructure CDK deploy (lines 304-342; the secret needs to exist before the partition loop’s chart installs reference the ExternalSecret):

```
echo ">>>>>>>>> Step N: Resolve infrastructure secrets from 1Password"
if [[ -z "${SENTRY_DSN:-}" ]]; then
  SENTRY_DSN="$(op read 'op://Arda-SystemsOAM/be-sentry-dsn/dsn')"
  if [[ "${GITHUB_ACTIONS:-}" == "true" ]]; then
    echo "::add-mask::${SENTRY_DSN}"
  fi
  export SENTRY_DSN
fi

# Append the parameter to the existing infrastructure CDK invocation
# (or, if cdk deploy --all is used, scope the override to the secret
# stack via the Stack:Param=Value form, e.g. SecretsStack:SentryDsn=...).
infrastructure_cdk_arguments+=("--parameters" "SentryDsn=${SENTRY_DSN}")

time npx cdk "${infrastructure_cdk_arguments[@]}"
```

The ::add-mask:: registration is required because 1P-sourced values are not auto-redacted by GitHub Actions the way ${{ secrets.* }} values are; same pattern as the partition loop’s AMAZON_CREATORS_API_JSON masking.

No IAM change. The existing {Infrastructure}-ReadSecrets managed policy (attached to the {Infrastructure}-SecretsManagerReadRole used by ESO via secretReader IRSA) grants GetSecretValue on arn:aws:secretsmanager:${region}:${account}:secret:${il.id}-*. That wildcard already covers ${Infrastructure}-SentryDsn alongside the partition-scoped keys.
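
To make the wildcard claim concrete, the sketch below checks sample secret ARNs against the policy's resource pattern. It is a deliberately minimal matcher that supports only `*`; real IAM evaluation also handles `?` and policy variables. The region, account id and the six-character ARN suffixes are placeholder values chosen for illustration, and `matchesResourcePattern` is a hypothetical helper.

```typescript
// Minimal IAM-style matcher supporting only "*" (any run of characters).
function matchesResourcePattern(pattern: string, arn: string): boolean {
  const escaped = pattern
    .split("*")
    // Escape regex metacharacters in the literal segments.
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`).test(arn);
}

// Placeholder region and account id; only the secret-name prefix matters here.
const arnPrefix = "arn:aws:secretsmanager:us-east-1:123456789012:secret:";
const readSecretsPattern = `${arnPrefix}Alpha001-*`;

// Secrets Manager appends a hyphen and six random characters to the name
// in the ARN; the suffixes below are made up.
const sentryDsnArn = `${arnPrefix}Alpha001-SentryDsn-AbCdEf`;
const partitionKeyArn = `${arnPrefix}Alpha001-prod-ArdaApiKey-GhIjKl`;
const otherInfraArn = `${arnPrefix}Alpha002-SentryDsn-MnOpQr`;
```

Under these assumptions the Alpha001 pattern matches both the new infrastructure-scoped secret and the existing partition-scoped key, and does not match the Alpha002 secret.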

Notes for the implementer

The new resource lands in CDK, not the legacy src/main/cfn/ templates. CDK is the preferred IaC technology in this repo; the partition-scoped partitionSecrets.cfn.yaml is the legacy exception, not a model to copy.
cdk.SecretValue.cfnParameter(...) is the correct API for “value supplied at deploy”. cdk.SecretValue.unsafePlainText bakes the value into the synthesized template (visible in cdk.out/*.template.json) and should not be used for secret-grade values sourced from outside the CDK app.
CDK parameters are passed via cdk deploy --parameters [Stack:]Name=Value. If the deploy is --all, scope the parameter to the stack that owns the secret (Stack:Name=Value syntax) to avoid the CLI failing on unknown parameters in other stacks.
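
The `[Stack:]Name=Value` form can be sketched as a small argument builder. This is an illustration of the string shape only; `StackParameter` and `parameterArgs` are hypothetical names, and `SecretsStack` is the placeholder stack name used in the amm.sh comment above, not necessarily the deployed stack id.

```typescript
interface StackParameter {
  stack?: string; // omit to pass the parameter unscoped
  name: string;
  value: string;
}

// Builds the repeated "--parameters [Stack:]Name=Value" CLI arguments.
function parameterArgs(params: StackParameter[]): string[] {
  const args: string[] = [];
  for (const p of params) {
    const key = p.stack ? `${p.stack}:${p.name}` : p.name;
    args.push("--parameters", `${key}=${p.value}`);
  }
  return args;
}

// Scoped form, for use with a --all deploy.
const scoped = parameterArgs([
  { stack: "SecretsStack", name: "SentryDsn", value: "<dsn>" },
]);

// Unscoped form, for a deploy that targets only the secrets stack.
const unscoped = parameterArgs([{ name: "SentryDsn", value: "<dsn>" }]);
```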

Verification

After infrastructure-stack deploy:
```
aws secretsmanager get-secret-value \
  --secret-id "Alpha001-SentryDsn" \
  --query 'SecretString' --output text
```
returns the DSN URL. Run this from a shell whose output is not captured in CI logs, since the response prints the secret.

After operations chart redeploy with oam.performance.sentry.enabled: true:

```
kubectl --context Alpha001 -n stage-operations \
  get externalsecret be-sentry-dsn -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```

reports True (ESO reconciled).

And the materialised K8s secret exists:

```
kubectl --context Alpha001 -n stage-operations get secret be-sentry-dsn
```

Failure-mode reminder

If this ticket lags behind PDEV-488: the operations pod still starts, because the chart declares SENTRY_DSN’s secretKeyRef with optional: true. The Sentry agent reads an unset DSN and disables itself with a single WARN line — no outage, no rollback. Sentry simply stays silent for that environment until the infrastructure-stack deploy lands. See pod_capacity.md § Rec #5 “Failure modes” for the full behaviour table.

5. RDS Proxy decision (PDEV-499 — evaluation)

When HPA fans operations out from 2 to up to 8 pods, the application connection count grows linearly. The choice is:

(a) Bump max_connections on Aurora (#3) to absorb the load. Simple. Pays per-connection memory tax on the DB.
(b) Stand up an RDS Proxy in front of Aurora. Multiplexes application connections onto a smaller pool of physical connections. Lower DB-side memory pressure; adds one hop of latency (~1 ms) and one more thing to operate. Has its own IAM and secrets requirements.

Evaluate once we have pod-level connection metrics from PDEV-488 running in stage (Sentry’s JVM metrics include HikariCP gauges via OTel). If application connections are bumping against the pool ceiling and Aurora’s connection count is near the max_connections cap, RDS Proxy is worth the operational cost. Otherwise (a) is fine.
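
The evaluation rule above can be written down as a small decision function. This is a sketch only: the 0.8 "near the cap" ratio is an assumed threshold that this document does not specify, and the snapshot fields are hypothetical names for whatever the HikariCP gauges and Aurora metrics end up exposing.

```typescript
interface ConnectionSnapshot {
  poolActive: number; // active connections in a pod's pool
  poolMax: number; // configured pool ceiling
  dbConnections: number; // connections as seen by Aurora
  dbMaxConnections: number; // max_connections on the cluster
}

type Recommendation = "rds-proxy" | "max-connections-only";

// nearCapRatio is an assumed threshold, not a value from this document.
function recommend(
  snapshot: ConnectionSnapshot,
  nearCapRatio = 0.8,
): Recommendation {
  const poolSaturated = snapshot.poolActive / snapshot.poolMax >= nearCapRatio;
  const dbNearCap =
    snapshot.dbConnections / snapshot.dbMaxConnections >= nearCapRatio;
  // Both conditions must hold before the proxy is worth its operational cost.
  return poolSaturated && dbNearCap ? "rds-proxy" : "max-connections-only";
}
```

Requiring both conditions mirrors the text: a saturated pool with plenty of database headroom points at pool sizing rather than at a proxy.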

Action — this project: create the ticket, defer the decision to “after PDEV-488 is live in stage.” Do not pre-commit to either path.

Ordering and outage windows

PDEV-491 (#1) can ship first: read-only addon, no outage.
PDEV-488 can ship anytime; HPA gating means it doesn’t matter whether metrics-server is installed first (the HPA stays disabled until it is).
PDEV-479 (#2 + #3) is the only planned-outage window: the Aurora parameter group + sizing change requires an instance restart. Schedule once. Notify product before.
Infrastructure-scoped Sentry DSN secret (#4) can ship anytime relative to PDEV-488. If it lags, the operations chart’s secretKeyRef: optional: true keeps the pod fail-soft and Sentry stays silent for that environment until the infrastructure-stack deploy lands. No outage, no rollback. If it ships first, the secret simply waits unused until PDEV-488’s ExternalSecret reconciles against it.
Amplify has nothing to sequence here: it is out of scope for infrastructure improvements (see the note at the top of this file).
RDS Proxy (#5, PDEV-499) is deferred until we have data from PDEV-488 in stage.

Out of scope

CDK / resource tags audit. Aligning partition / purpose / component tags across the bumped resources is worth doing but is a workspace-wide concern, not a slow-responses concern. Separate effort.
Karpenter / EC2 migration for operations. Recommendation #6 in pod_capacity.md, deferred.
CloudFront / API Gateway tuning. None of the PDEV-442 findings pointed there.
Aurora reader-instance routing. Possible future optimization for the bitemporal read workload; not part of this project.