PDEV-442 — Infrastructure changes scope

Infrastructure-layer changes needed to land the slow-responses remediation in production. Application-layer changes are scoped in sibling analyses (pod_capacity.md, operations-bottlenecks.md, code-dig.md, sentry-configuration.md); this file is the union of everything that lives in the infrastructure repo plus the manual console work that can’t.

[!important] Amplify is out of scope for infrastructure improvements and not in the at-a-glance table below. Three reasons:

  1. Compute config not configurable. AWS Amplify Hosting exposes no path — CloudFormation, CLI, or console — to configure SSR runtime memory, vCPU, reserved concurrency, or provisioned concurrency on the auto-generated Amplify Hosting Compute Lambda. Verified against the full AWS::Amplify::App CFN property reference (only JobConfig.BuildComputeType is exposed, and that controls the build CI instance, not the SSR runtime). Community workarounds via direct aws lambda update-function-configuration against the Amplify-managed Lambda are non-contractual and reset on next deploy.

  2. CacheConfig not a useful lever. AWS::Amplify::App.CacheConfig is configurable (AMPLIFY_MANAGED default vs. AMPLIFY_MANAGED_NO_COOKIES), but it controls only whether the edge cache key includes cookies. Switching to NO_COOKIES would cross-serve cached SSR responses to different authenticated users — a correctness/security regression in an auth-gated multi-tenant app. The default (AMPLIFY_MANAGED) is correct and the only safe choice; there is no useful tuning available within the property’s value space. Static-asset caching at the CDN tier is unaffected by CacheConfig and already works as expected.

  3. Application-level levers go elsewhere. ISR adoption, BFF JS bundle reduction, BFF route-handler parallelisation, and per-row N+1 elimination are all application changes (Next.js, React, route handlers) — they live under PDEV-489, not the infrastructure repo.

A future “migrate SSR off Amplify Hosting onto a custom Lambda + CloudFront stack” project would unlock these controls as first-class CDK resources, but is out of scope for PDEV-442.

| # | Change | Repo / surface | Linear | Status |
|---|--------|----------------|--------|--------|
| 1 | EKS metrics-server managed addon | infrastructure (EKS construct) | PDEV-491 | not started |
| 2 | Aurora custom parameter group: pg_stat_statements, slow-query log | infrastructure (RDS construct) | PDEV-479 | not started |
| 3 | Aurora instance class + max_connections review | infrastructure (RDS construct) | PDEV-479 | not started |
| 4 | Infrastructure-scoped Sentry DSN secret ({Infra}-SentryDsn) | infrastructure (new CDK construct + amm.sh extension) | new ticket | not started |
| 5 | RDS Proxy decision (yes / no / defer) | infrastructure (new construct) or no-op | PDEV-499 | Triage (gated on PDEV-488 in stage) |

Tags audit across these resources is intentionally not scoped here — it’s a workspace-wide effort that doesn’t belong piecemeal inside the slow-responses project.

1. EKS metrics-server managed addon (PDEV-491)

Prerequisite for the HPA shipped under PDEV-488. Without metrics-server, kubectl top pod returns no data and the HPA’s CPU-target metric is unavailable, so HPA scaling is silently disabled regardless of the chart-level enabled: true flag.

  • Action: install the metrics-server managed addon via the existing EKS cluster CDK construct. Both Alpha001 and Alpha002 clusters.
  • Verification: kubectl --context Alpha001 top pod -A returns numbers; HPA on operations reports current/target instead of <unknown>/60%.
  • Already tracked in PDEV-491. No new ticket needed.
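
The HPA check in the verification step can be scripted. The following is a minimal sketch, assuming the TARGETS value printed by `kubectl get hpa` has the `current/target` form quoted above (for example `<unknown>/60%`), optionally prefixed with the metric name as newer kubectl versions do. The names `parseHpaTargets`, `hpaMetricAvailable`, and `HpaTargetReading` are hypothetical and do not exist in the repo.

```typescript
// Hypothetical helper: interprets one TARGETS entry from `kubectl get hpa`,
// e.g. "<unknown>/60%" or "23%/60%". A "cpu: " style prefix is also accepted.
interface HpaTargetReading {
  currentPercent: number | null; // null while the metric is unavailable
  targetPercent: number;
}

function parseHpaTargets(column: string): HpaTargetReading {
  const match = /^(?:[\w-]+:\s*)?(<unknown>|\d+%)\/(\d+)%$/.exec(column.trim());
  if (match === null) {
    throw new Error(`Unrecognised HPA targets value: ${column}`);
  }
  const current = match[1] ?? "";
  const target = match[2] ?? "";
  return {
    currentPercent: current === "<unknown>" ? null : parseInt(current, 10),
    targetPercent: parseInt(target, 10),
  };
}

// True only when the HPA can actually see a CPU reading.
function hpaMetricAvailable(column: string): boolean {
  return parseHpaTargets(column).currentPercent !== null;
}
```

Before metrics-server is installed, `hpaMetricAvailable("<unknown>/60%")` is false; once the addon is serving metrics, a value such as `23%/60%` makes it true.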

2. Aurora custom parameter group (PDEV-479)

The current Aurora cluster uses the default DB cluster parameter group, which means pg_stat_statements is not pre-loaded and slow-query logging is off. Both of these need IaC, not console toggles — Aurora parameter changes that touch shared_preload_libraries (which pg_stat_statements does) force an instance restart.

  • Action: define a custom DB cluster parameter group in the RDS CDK construct, attach to all four clusters (dev, stage, demo, prod). Parameters:
    • shared_preload_libraries = pg_stat_statements (static — needs restart).
    • pg_stat_statements.track = all
    • log_min_duration_statement = 500 (ms — tune after observing).
    • log_statement = ddl (catch schema changes; cheap).
  • Constraint: outage window. Static parameters require an instance restart. Bundle with the sizing change in #3 (prod only) so we take the restart once where it matters; non-prod restarts are routine.
  • CREATE EXTENSION IF NOT EXISTS pg_stat_statements; is not in scope for this ticket. It runs against the postgres admin DB on each Aurora cluster as part of the postgres-database-initializer init container, tracked separately under PDEV-498 (sub-issue of PDEV-479). The statement is idempotent (thanks to IF NOT EXISTS), so it’s safe to run on every operations deploy. Loading the library at the cluster level (this ticket) is the prerequisite; the extension creation lights up the view in the postgres DB so PDEV-490’s authors can query the data.
  • Verification: after the cluster restart and a subsequent operations deploy that picks up the new initializer image, SELECT count(*) FROM pg_stat_statements; against the postgres admin DB returns non-zero. CloudWatch slow-query log group populates within an hour at the chosen threshold.
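
The parameter list above can be kept as data, so the restart requirement is derived from the list and not remembered separately. This is a sketch only: the type and function names are hypothetical, and the `applyType` values for the three logging and tracking parameters are an assumption based on how PostgreSQL treats them. Only `shared_preload_libraries` is stated as static in this document, so confirm the rest against the Aurora parameter reference for the engine version in use.

```typescript
// Hypothetical data model for the cluster parameter group described above.
type ApplyType = "static" | "dynamic";

interface ClusterParameterSpec {
  key: string;
  value: string;
  applyType: ApplyType; // assumption for everything except shared_preload_libraries
}

const slowQueryParameters: ClusterParameterSpec[] = [
  { key: "shared_preload_libraries", value: "pg_stat_statements", applyType: "static" },
  { key: "pg_stat_statements.track", value: "all", applyType: "dynamic" },
  { key: "log_min_duration_statement", value: "500", applyType: "dynamic" }, // milliseconds
  { key: "log_statement", value: "ddl", applyType: "dynamic" },
];

// A restart window is needed as soon as one static parameter is present.
function needsRestart(specs: ClusterParameterSpec[]): boolean {
  return specs.some((spec) => spec.applyType === "static");
}

// Flattens the specs into a plain key/value map.
function toParameterMap(specs: ClusterParameterSpec[]): Record<string, string> {
  const result: Record<string, string> = {};
  for (const spec of specs) {
    result[spec.key] = spec.value;
  }
  return result;
}
```

The map returned by `toParameterMap` is a string-to-string record, which is the shape CDK parameter-group definitions generally take. How it is wired into the repo's RDS construct is not shown in this document and would need checking.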

3. Aurora instance class + max_connections (PDEV-479)

Pairs with #2 (prod cluster) and bundles into the same restart window. Resize is prod only; dev / stage / demo stay on their current instance classes: they don’t see the fan-out, so a resize would cost a restart without benefit. The parameter group from #2 still attaches to all four clusters (cluster-level parameters cost nothing extra and keep pg_stat_statements available everywhere).

  • Instance class review (prod only). Today the prod cluster runs on db.t3.medium. Under HPA fan-out (2 → up to 8 operations pods, plus accounts, plus BFF), db.t3 burstable credits will be a ceiling. Evaluate db.r6g.large (or db.r7g.large if available in the region) — Graviton, non-burstable, baseline CPU adequate for the projected workload. Decision goes in a short decision-log entry under PDEV-479.
  • max_connections. Aurora’s default is a function of DBInstanceClassMemory. With the projected prod pod count (8 operations pods at HPA max) × pool size 20 → ~160 application connections, plus headroom for ad-hoc / lambda / RDS Proxy if we add it (#5), the default ceiling may be tight. Set max_connections = 500 in the parameter group (#2) for the prod cluster; that gives ~3× headroom over projected steady-state, comfortably above HPA-max + ad-hoc.
  • Per-cluster sizing parameterisation. Instance class + max_connections are cluster-specific decisions. Define them in each cluster’s CDK instance file (e.g. src/main/cdk/instances/Alpha001/prod.ts) and pass them as parameters into the shared CDK stack / construct that provisions the Aurora cluster. Don’t hardcode either value inside the shared construct — keep all per-env divergence at the instance-file boundary, consistent with the rest of the CDK architecture.
  • Verification: post-change, RDS Performance Insights shows CPU baseline-not-burst, and SHOW max_connections; returns 500 on the prod cluster (and the default on dev / stage / demo).
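
The connection arithmetic and the "sizing lives in the instance file" rule can be sketched together. This is a minimal illustration under stated assumptions: `AuroraSizing`, `connectionBudget`, and `maxConnectionsOverride` are hypothetical names, not the shape of the existing CDK code, and `db.r6g.large` is only the candidate class under evaluation. The figures (8 pods at HPA max, pool size 20, a cap of 500) come from the bullets above.

```typescript
// Hypothetical per-cluster sizing record, of the kind an instance file could
// pass into the shared Aurora construct. Field names are illustrative.
interface AuroraSizing {
  instanceClass: string;
  maxConnections?: number; // undefined keeps Aurora's memory-derived default
}

// Projected application connections and the headroom a given cap leaves.
function connectionBudget(maxPods: number, poolSizePerPod: number, maxConnections: number) {
  if (maxPods <= 0 || poolSizePerPod <= 0 || maxConnections <= 0) {
    throw new Error("All inputs must be positive");
  }
  const projected = maxPods * poolSizePerPod;
  return { projected, headroomFactor: maxConnections / projected };
}

// Only clusters that set maxConnections get an explicit parameter override.
function maxConnectionsOverride(sizing: AuroraSizing): Record<string, string> {
  return sizing.maxConnections === undefined
    ? {}
    : { max_connections: String(sizing.maxConnections) };
}

// Candidate prod sizing; the instance class is still under evaluation.
const prodSizing: AuroraSizing = { instanceClass: "db.r6g.large", maxConnections: 500 };
const prodBudget = connectionBudget(8, 20, prodSizing.maxConnections ?? 0);
console.log(prodBudget.projected, prodBudget.headroomFactor.toFixed(1)); // prints: 160 3.1
```

A cluster whose sizing record omits `maxConnections` yields an empty override, which corresponds to dev / stage / demo keeping the default.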

4. Infrastructure-scoped Sentry DSN secret (new ticket)

PDEV-488 #5 wires the Sentry DSN through an ESO ExternalSecret sourcing {Infrastructure}-SentryDsn from AWS Secrets Manager. Per the established Arda convention (see pod_capacity.md § Provisioning pipeline), the runtime never reads 1Password directly — amm.sh is the bridge that reads from 1P and passes the value into a CDK stack parameter; CDK declares the SM resource, and the existing ESO + secretReader IRSA role consumes it.

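[PlantUML diagram: the provisioning flow described above, from 1Password through amm.sh and the CDK stack parameter to Secrets Manager and ESO.]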

One SM secret per infrastructure, not per partition: the same DSN serves both partitions of each infrastructure (and would also serve future components like accounts in the same infrastructure, unless a distinct DSN per component is ever needed). Provisioned once per infrastructure (Alpha001-SentryDsn, Alpha002-SentryDsn).

The -I- and -API- markers used elsewhere in the codebase are CloudFormation export-name conventions only — they appear on the exported names like Alpha001-I-SentryDsnArn, but never on the AWS resource names themselves (Alpha001-SentryDsn). This matches the existing pattern in partitionSecrets.cfn.yaml (secret Alpha001-prod-ArdaApiKey, export Alpha001-prod-I-ArdaApiKeyArn / Alpha001-prod-API-ArdaApiKeyArn).
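
The naming rule in the previous paragraph can be written down as two small functions. This is a sketch of the convention as described here, not code from the repo: the function names and the `ExportScope` type are hypothetical.

```typescript
// Scope markers that appear only in CloudFormation export names.
type ExportScope = "I" | "API";

// AWS resource name: prefix and secret name, no scope marker.
function secretResourceName(prefix: string, secretName: string): string {
  return `${prefix}-${secretName}`;
}

// CloudFormation export name: scope marker between the prefix and the
// secret name, with an "Arn" suffix.
function secretArnExportName(prefix: string, scope: ExportScope, secretName: string): string {
  return `${prefix}-${scope}-${secretName}Arn`;
}
```

With these, `secretResourceName("Alpha001", "SentryDsn")` gives `Alpha001-SentryDsn` and `secretArnExportName("Alpha001", "I", "SentryDsn")` gives `Alpha001-I-SentryDsnArn`, matching the examples in the text.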

  • CDK stack, new under src/main/cdk/stacks/infrastructure/, class name InfrastructureSecretsStack (consistent with the existing PascalCase, scope-prefixed convention — InfrastructureEksStack, NetworkingInfrastructureStack, InfrastructureIngress). Wire it in src/main/cdk/apps/Al1x/infra.ts (called from instances/{Infrastructure}/infra.ts, already invoked by amm.sh lines 304-342) alongside the existing NetworkingInfrastructureStack and InfrastructureEksStack.

    The SM secret is not a baked-at-synth value; it’s passed at deploy via a CDK / CloudFormation parameter so the synthesized template never embeds the DSN plaintext.

  • New OAM construct ExternallySuppliedSecret under src/main/cdk/constructs/oam/externally-supplied-secret.ts, alongside the existing PredefinedSecret and GeneratedSecret. Encapsulates the CfnParameter + cdk.SecretValue.cfnParameter plumbing so consumers don’t have to know about the underlying SDK call (matching the existing OAM convention — those two constructs also wrap a single SDK call). Shape:

    // constructs/oam/externally-supplied-secret.ts
    import * as cdk from "aws-cdk-lib";
    import { Construct } from "constructs";
    import * as secretsmanager from "aws-cdk-lib/aws-secretsmanager";
    import * as misc from "arda/utils/misc";

    export interface Configuration {
      prefix: string;
      secretName: string;
      // Must be declared with noEcho: true; SecretValue.cfnParameter rejects other parameters.
      secretStringParameter: cdk.CfnParameter;
    }

    export interface Props extends Configuration {}

    export interface Built {
      readonly secret: secretsmanager.ISecret;
    }

    export class ExternallySuppliedSecret extends Construct {
      public readonly build: Built;

      static validateProps(id: string, props: Props): Error[] {
        return [];
      }

      constructor(scope: Construct, id: string, props: Props) {
        // Validate before super() so an invalid construct never joins the tree.
        const errors = ExternallySuppliedSecret.validateProps(id, props);
        if (errors.length > 0) {
          throw new misc.MultiError(`Props for ${id} are invalid:`, errors);
        }
        super(scope, id);

        // The value is resolved by CloudFormation at deploy time, not at synth.
        const secret = new secretsmanager.Secret(this, "Secret", {
          secretName: `${props.prefix}-${props.secretName}`,
          secretStringValue: cdk.SecretValue.cfnParameter(
            props.secretStringParameter,
          ),
        });
        this.build = { secret };
      }
    }

    Usage inside InfrastructureSecretsStack:

    const sentryDsn = new cdk.CfnParameter(this, "SentryDsn", {
      type: "String",
      noEcho: true,
      minLength: 1,
      description: "Sentry DSN for this infrastructure (passed by amm.sh from 1Password)",
    });
    const sentryDsnSecret = new ExternallySuppliedSecret(this, "SentryDsn", {
      prefix: il.id,
      secretName: "SentryDsn",
      secretStringParameter: sentryDsn,
    });
    new cdk.CfnOutput(this, "SentryDsnArn", {
      value: sentryDsnSecret.build.secret.secretArn,
      exportName: `${il.id}-I-SentryDsnArn`,
    });

  • amm.sh extension, added as an infrastructure-level step before the existing infrastructure CDK deploy (lines 304-342; the secret needs to exist before the partition loop’s chart installs reference the ExternalSecret):

    echo ">>>>>>>>> Step N: Resolve infrastructure secrets from 1Password"
    if [[ -z "${SENTRY_DSN:-}" ]]; then
      SENTRY_DSN="$(op read 'op://Arda-SystemsOAM/be-sentry-dsn/dsn')"
      if [[ "${GITHUB_ACTIONS:-}" == "true" ]]; then
        echo "::add-mask::${SENTRY_DSN}"
      fi
      export SENTRY_DSN
    fi

    # Append the parameter to the existing infrastructure CDK invocation
    # (or, if cdk deploy --all is used, scope the override to the secret
    # stack via the Stack:Param=Value form, e.g. SecretsStack:SentryDsn=...).
    infrastructure_cdk_arguments+=("--parameters" "SentryDsn=${SENTRY_DSN}")
    time npx cdk "${infrastructure_cdk_arguments[@]}"

    The ::add-mask:: registration is required because 1P-sourced values are not auto-redacted by GitHub Actions the way ${{ secrets.* }} values are; same pattern as the partition loop’s AMAZON_CREATORS_API_JSON masking.

  • No IAM change. The existing {Infrastructure}-ReadSecrets managed policy (attached to the {Infrastructure}-SecretsManagerReadRole used by ESO via secretReader IRSA) grants GetSecretValue on arn:aws:secretsmanager:${region}:${account}:secret:${il.id}-*. That wildcard already covers ${Infrastructure}-SentryDsn alongside the partition-scoped keys.

  • The new resource lands in CDK, not the legacy src/main/cfn/ templates. CDK is the preferred IaC technology in this repo; the partition-scoped partitionSecrets.cfn.yaml is the legacy exception, not a model to copy.
  • cdk.SecretValue.cfnParameter(...) is the correct API for “value supplied at deploy”. cdk.SecretValue.unsafePlainText bakes the value into the synthesized template (visible in cdk.out/*.template.json) and should not be used for secret-grade values sourced from outside the CDK app.
  • CDK parameters are passed via cdk deploy --parameters [Stack:]Name=Value. If the deploy is --all, scope the parameter to the stack that owns the secret (Stack:Name=Value syntax) to avoid the CLI failing on unknown parameters in other stacks.
  • After infrastructure-stack deploy:

    aws secretsmanager get-secret-value \
      --secret-id "Alpha001-SentryDsn" \
      --query 'SecretString' --output text

    returns the DSN URL (do this against a profile that doesn’t log to CI output, since the response will print the secret).
  • After operations chart redeploy with oam.performance.sentry.enabled: true:

    kubectl --context Alpha001 -n stage-operations \
      get externalsecret be-sentry-dsn -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

    reports True (ESO reconciled).
  • And the materialised K8s secret exists:

    kubectl --context Alpha001 -n stage-operations get secret be-sentry-dsn
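
Returning to the parameter-passing note earlier in this list: the two `--parameters` forms (bare `Name=Value` and stack-scoped `Stack:Name=Value`) can be produced by one helper. The sketch below is illustrative only. `cdkParameterArgs` is a hypothetical name, and the stack identifier passed to it must be whatever the secrets stack is actually called at deploy time, which this document does not fix.

```typescript
// Builds the `--parameters` arguments for `cdk deploy`.
// When `stackName` is given, each entry uses the Stack:Name=Value form.
function cdkParameterArgs(values: Record<string, string>, stackName?: string): string[] {
  const args: string[] = [];
  for (const key of Object.keys(values)) {
    const value = values[key] ?? "";
    if (value.length === 0) {
      // Mirrors the minLength: 1 constraint on the CfnParameter.
      throw new Error(`CDK parameter ${key} has no value`);
    }
    const qualifiedKey = stackName === undefined ? key : `${stackName}:${key}`;
    args.push("--parameters", `${qualifiedKey}=${value}`);
  }
  return args;
}
```

Because the returned array contains the secret value, it should be passed straight to the CLI invocation and never echoed or logged.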

If this ticket lags behind PDEV-488: the operations pod still starts, because the chart declares SENTRY_DSN’s secretKeyRef with optional: true. The Sentry agent reads an unset DSN and disables itself with a single WARN line — no outage, no rollback. Sentry simply stays silent for that environment until the infrastructure-stack deploy lands. See pod_capacity.md § Rec #5 “Failure modes” for the full behaviour table.

5. RDS Proxy decision (PDEV-499 — evaluation)

When HPA fans operations out from 2 to up to 8 pods, the application connection count grows linearly. The choice is:

  • (a) Bump max_connections on Aurora (#3) to absorb the load. Simple. Pays per-connection memory tax on the DB.
  • (b) Stand up an RDS Proxy in front of Aurora. Multiplexes application connections onto a smaller pool of physical connections. Lower DB-side memory pressure; adds one hop of latency (~1 ms) and one more thing to operate. Has its own IAM and secrets requirements.

Evaluate once we have pod-level connection metrics from PDEV-488 running in stage (Sentry’s JVM metrics include HikariCP gauges via OTel). If application connections are bumping against the pool ceiling and Aurora reports max_connections near the cap, RDS Proxy is worth the operational cost. Otherwise (a) is fine.
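
The evaluation rule in the paragraph above can be stated as a small decision function. This is a sketch under assumptions: the field names are hypothetical, and the 90% cut-off for "near the ceiling" is an illustrative threshold, not a figure from this project. The real threshold should come from the stage data.

```typescript
// Hypothetical summary of the stage observations the decision depends on.
interface ConnectionObservations {
  poolInUse: number;        // peak active connections in one pod's pool
  poolMax: number;          // configured pool size per pod
  dbConnections: number;    // peak connections reported by Aurora
  dbMaxConnections: number; // max_connections on the cluster
}

type ProxyDecision = "rds-proxy" | "bump-max-connections";

// Both conditions must hold for the proxy to be worth its operational cost;
// otherwise option (a) is sufficient.
function decideProxy(obs: ConnectionObservations, nearCeiling = 0.9): ProxyDecision {
  const poolSaturated = obs.poolInUse / obs.poolMax >= nearCeiling;
  const dbSaturated = obs.dbConnections / obs.dbMaxConnections >= nearCeiling;
  return poolSaturated && dbSaturated ? "rds-proxy" : "bump-max-connections";
}
```

Requiring both signals follows the wording above: a saturated pool alone points at pool sizing, not at the database cap.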

  • Action — this project: create the ticket, defer the decision to “after PDEV-488 is live in stage.” Do not pre-commit to either path.

Sequencing
  1. PDEV-491 (#1) can ship first: read-only addon, no outage.
  2. PDEV-488 can ship anytime; HPA gating means it doesn’t matter whether metrics-server is installed first (the HPA stays disabled until it is).
  3. PDEV-479 (#2 + #3) is the only planned-outage window: the Aurora parameter group + sizing change requires an instance restart. Schedule once. Notify product before.
  4. Infrastructure-scoped Sentry DSN secret (#4) can ship anytime relative to PDEV-488. If it lags, the operations chart’s secretKeyRef: optional: true keeps the pod fail-soft and Sentry stays silent for that environment until the infrastructure-stack deploy lands. No outage, no rollback. If it ships first, the secret simply waits unused until PDEV-488’s ExternalSecret reconciles against it.
  5. Amplify has no infrastructure work to sequence in this project (see the note at the top of this file); its application-level levers are tracked under PDEV-489.
  6. RDS Proxy (#5, PDEV-499) is deferred until we have data from PDEV-488 in stage.

Out of scope

  • CDK / resource tags audit. Aligning partition / purpose / component tags across the bumped resources is worth doing but is a workspace-wide concern, not a slow-responses concern. Separate effort.
  • Karpenter / EC2 migration for operations. Recommendation #6 in pod_capacity.md, deferred.
  • CloudFront / API Gateway tuning. None of the PDEV-442 findings pointed there.
  • Aurora reader-instance routing. Possible future optimization for the bitemporal read workload; not part of this project.