amm.sh Failure Mode Analysis

Deep analysis of the amm.sh deployment orchestration script, focused on potential failure modes, data-loss risks, and scenarios that could leave the target AWS environment in an inconsistent state.

Analysis date: 2026-04-08

Analysed revision: 51952d5 on branch jmpicnic/438 (663 lines).

The script runs with set -eu (line 5) — any unset variable or non-zero exit code aborts immediately. This is good for safety but creates its own failure modes: any unexpected non-zero return (including benign ones like “no changes to deploy”) halts the entire run without cleanup.

Debug mode (RUNNER_DEBUG=1) enables set -xv, which prints every command — including secret values — to stdout/stderr.


C1. amplify start-job fires unconditionally on every run

Location: lines 590–594

Every time the script runs for a CFN-managed partition (e.g., Alpha001:demo), it triggers aws amplify start-job --job-type RELEASE. The --branch-name parameter comes from the AMPLIFY_BRANCH_NAMES map (lines 223–229), which is hardcoded to "main" for every partition. This kicks off a full Amplify build and deploy of main, even if nothing changed. If someone is testing a specific deployed version, this overwrites it with whatever is on the main branch. There is no “only deploy if changed” guard. The --job-reason "Initial deploy after CloudFormation" suggests this was intended as a one-time post-creation step, but nothing prevents it from running on subsequent invocations.

Mitigating factor: The main branch is a protected branch in the affected repositories (arda-frontend-app, kyle-frontend-app), so only reviewed and merged code is ever deployed. The risk is not untested code, but rather an unnecessary redeploy that disrupts an environment where a specific version is intentionally pinned or being validated.

Recommendation: Guard start-job behind a first-deploy check (e.g., query whether the branch already has a successful job) or make it opt-in via a --deploy-frontend flag.
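
A minimal sketch of the first-deploy guard, assuming jq is available (the app-id/branch variables in the usage comment are illustrative). The check reads `aws amplify list-jobs` JSON on stdin so it can be exercised without AWS access; Amplify reports completed jobs with status SUCCEED:

```shell
# Returns 0 if the branch already has at least one successful Amplify job.
# Reads `aws amplify list-jobs` JSON on stdin.
branch_has_successful_job() {
  jq -e '[.jobSummaries[]? | select(.status == "SUCCEED")] | length > 0' >/dev/null
}

# Usage inside the partition loop (not executed here):
#   if aws amplify list-jobs --app-id "${APP_ID}" --branch-name "${BRANCH}" \
#       | branch_has_successful_job; then
#     echo "Branch already deployed; skipping amplify start-job" >&2
#   else
#     aws amplify start-job --app-id "${APP_ID}" --branch-name "${BRANCH}" --job-type RELEASE
#   fi
```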

C2. Secret values are shared across partitions

Location: lines 192–199

ARDA_SIGNUP_KEY, HUBSPOT_CLIENT_KEY, HUBSPOT_PAT, and PYLON_WIDGET_KEY are fetched once from a single 1Password vault (Arda-StageOAM / Arda-ProdOAM) and then deployed to every partition in the loop (lines 506–514). If the script is run with ./amm.sh Alpha001 demo prod, both demo and prod get the same HubSpot key. If the vaults are staging vaults, production gets staging secrets.

Only ARDA_API_KEY is partition-aware (lines 502–504 call resolve_arda_api_key per partition). All other secrets are global.

Recommendation: Make each secret partition-aware via the existing PARTITION_VAULT_MAP pattern, or validate that the vault matches the partition before deploying.
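
A sketch of the partition-aware lookup following the existing PARTITION_VAULT_MAP pattern. The demo/prod-to-vault pairing and the `op read` item path in the usage comment are assumptions for illustration, not names taken from amm.sh:

```shell
# Illustrative partition-to-vault mapping (assumed pairing).
declare -A PARTITION_VAULT_MAP=(
  [demo]="Arda-StageOAM"
  [prod]="Arda-ProdOAM"
)

# Prints the vault for a partition; fails loudly on an unknown partition
# instead of silently reusing a global vault.
vault_for_partition() {
  local p="$1"
  [[ -n "${PARTITION_VAULT_MAP[${p}]:-}" ]] || { echo "Unknown partition: ${p}" >&2; return 1; }
  printf '%s\n' "${PARTITION_VAULT_MAP[${p}]}"
}

# Usage (not executed here; item/field names are hypothetical):
#   HUBSPOT_CLIENT_KEY="$(op read "op://$(vault_for_partition "${partition}")/hubspot/client-key")"
```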

C3. Secrets passed as CloudFormation parameter overrides

Location: lines 506–514

Secret values are passed as --parameter-overrides in plain text. These appear in:

  • CloudFormation stack events (visible in the AWS console)
  • CloudTrail logs
  • The script’s own set -xv debug output when RUNNER_DEBUG=1 is set (line 3)

Recommendation: Use NoEcho: true on CFN parameters (already done in the template), but also consider using SSM SecureString or Secrets Manager references instead of passing values on the command line. Disable set -xv around secret-handling sections.
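
A minimal sketch of suppressing xtrace around secret-handling commands; the special parameter `$-` contains `x` while `set -x` is active, so the wrappers restore tracing only if it was on:

```shell
# Turn xtrace off, remembering whether it was on.
disable_xtrace() { case "$-" in *x*) XTRACE_WAS_ON=1; set +x ;; *) XTRACE_WAS_ON=0 ;; esac; }
# Re-enable xtrace only if disable_xtrace found it enabled.
restore_xtrace() { if [[ "${XTRACE_WAS_ON:-0}" -eq 1 ]]; then set -x; fi; }

# Usage around the secrets deploy (not executed here; parameter name illustrative):
#   disable_xtrace
#   aws cloudformation deploy ... --parameter-overrides "HubspotClientKey=${HUBSPOT_CLIENT_KEY}"
#   restore_xtrace
```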


High — Partial Failure Leaves Inconsistent State

H1. No transactional boundaries in the partition loop

Location: lines 397–659

The partition loop deploys CDK stacks, Helm charts, CloudFormation templates, and Amplify apps in sequence. If any step fails (e.g., CDK deploy succeeds but secrets deploy fails), the partition is left in a half-deployed state. There is no rollback mechanism. Re-running the script from the top repeats all infrastructure and Helm steps, including successful ones, wasting time and risking drift.

Recommendation: Add a --from-step flag to resume from a specific step after a partial failure. Alternatively, make each step independently idempotent so re-runs are safe and fast.
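
A hypothetical --from-step resume wrapper; the FROM_STEP variable, run_step helper, and step functions in the usage comment are illustrative, not names from amm.sh:

```shell
FROM_STEP="${FROM_STEP:-0}"

# Runs the given command only if its step number is at or past FROM_STEP;
# earlier steps are logged and skipped so a re-run resumes cheaply.
run_step() {
  local num="$1"; shift
  if (( num < FROM_STEP )); then
    echo "Skipping step ${num} (resuming from step ${FROM_STEP})" >&2
    return 0
  fi
  "$@"
}

# Usage (not executed here):
#   run_step 3 deploy_cdk_stacks
#   run_step 4 deploy_secrets
```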

H2. CDK --require-approval never with deploy --all

Location: line 249

CDK is invoked with --require-approval never and --all, meaning every stack in the CDK app is deployed without human confirmation. If a developer adds a new stack with destructive changes (e.g., replacing a database), amm.sh will deploy it automatically on the next run.

Recommendation: Use --require-approval broadening (the CDK default) for non-CI runs. Only use never in CI after explicit approval in the PR review.
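
A sketch of selecting the approval mode from the environment, assuming the script's existing GITHUB_ACTIONS check is the CI signal (the helper name is illustrative):

```shell
# Prints the --require-approval value: prompt-free only in CI,
# the CDK default ("broadening") everywhere else.
cdk_approval_mode() {
  if [[ "${GITHUB_ACTIONS:-}" == "true" ]]; then
    echo "never"
  else
    echo "broadening"
  fi
}

# Usage (not executed here):
#   npx cdk deploy --all --require-approval "$(cdk_approval_mode)"
```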

H3. Target group binding cleanup deletes unknown bindings

Location: lines 453–497

The script iterates through targetgroupbinding resources in the namespace, and deletes any that don’t match the expected HTTP/HTTPS ARNs. If someone manually created a target group binding for debugging or a new feature, it gets silently deleted.

Recommendation: Log which bindings are being deleted and require a --cleanup-bindings flag for the deletion step, or at minimum emit a warning before deleting.
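
A hypothetical guard around the binding cleanup; the CLEANUP_BINDINGS flag variable and the helper name are illustrative:

```shell
CLEANUP_BINDINGS="${CLEANUP_BINDINGS:-0}"

# Warns about an unexpected binding by default; deletes it only when the
# operator opted in (e.g., via a --cleanup-bindings flag setting CLEANUP_BINDINGS=1).
maybe_delete_binding() {
  local name="$1" namespace="$2"
  if [[ "${CLEANUP_BINDINGS}" -ne 1 ]]; then
    echo "WARNING: unexpected TargetGroupBinding '${name}' in ${namespace};" \
         "re-run with --cleanup-bindings to delete it" >&2
    return 0
  fi
  echo "Deleting TargetGroupBinding '${name}' in ${namespace}" >&2
  kubectl delete targetgroupbinding "${name}" -n "${namespace}"
}
```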

H4. Helm upgrade --install can silently downgrade

Location: lines 366–378, 384–395, 421–434

Helm chart versions are hardcoded (--version 1.13.4 for LBC, --version 0.19.1 for External Secrets, --version 4.13.0 for nginx). If the cluster has a newer version installed (e.g., from a manual upgrade), running amm.sh silently downgrades it. The --atomic flag means a failed downgrade rolls back, but a successful downgrade sticks.

Recommendation: Query the installed chart version before upgrading. If the installed version is newer than the pinned version, warn and skip (or require an explicit --force-helm-version flag).
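
A sketch of the downgrade guard. The comparison relies on `sort -V` (GNU version sort); the helm/jq query for the installed chart version is shown only as a hedged usage comment, since the exact release names depend on the cluster:

```shell
# version_lt A B: true if version A sorts strictly before version B.
version_lt() {
  [[ "$1" != "$2" ]] && \
    [[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" == "$1" ]]
}

# Usage (not executed here); `.chart` in `helm list` output looks like
# "aws-load-balancer-controller-1.13.4", so strip up to the last dash:
#   installed="$(helm list -n kube-system -o json \
#     | jq -r '.[] | select(.name=="aws-load-balancer-controller") | .chart' \
#     | sed 's/.*-//')"
#   if version_lt "${PINNED_VERSION}" "${installed}"; then
#     echo "Installed chart ${installed} is newer than pin ${PINNED_VERSION}; skipping" >&2
#   fi
```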

H5. null Amplify environment variables crash the jq merge

Location: lines 628–654

If the Amplify app has no environment variables set (a new app, or one whose variables were wiped), aws amplify get-app --query "app.environmentVariables" --output json returns null. The jq merge on line 654 then operates on null instead of an object and exits non-zero; under set -e, this aborts the script mid-partition.

Recommendation: Default EXISTING_ENV to {} when the result is null or empty:

```shell
EXISTING_ENV="$(aws amplify get-app ... || echo '{}')"
[[ "${EXISTING_ENV}" == "null" ]] && EXISTING_ENV="{}"
```

M1. No --no-fail-on-empty-changeset on CloudFormation deploys

Location: lines 267, 506, 550, 563, 601

None of the aws cloudformation deploy calls use --no-fail-on-empty-changeset. In AWS CLI v1, deploy returns exit code 255 when there are no changes. Under set -e, this aborts the script. AWS CLI v2 changed this to return 0 by default, so the behavior is version-dependent.

Recommendation: Add --no-fail-on-empty-changeset to all cloudformation deploy calls for consistent behavior across CLI versions.

M2. cdk bootstrap runs on every invocation

Location: line 284

CDK bootstrap is designed to be idempotent, but running it every time adds ~30 seconds and makes unnecessary S3/CloudFormation API calls. If a bootstrap version change is introduced in a CDK update, this could unexpectedly modify the CDKToolkit stack.

Recommendation: Check whether the CDKToolkit stack exists and is up-to-date before bootstrapping. Skip if already current.
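
A sketch of the existence probe. CHECK_CMD is an injectable command array (an assumption for testability, not a name from amm.sh), defaulting to the real describe-stacks call:

```shell
# Probe for the CDKToolkit stack; succeeds iff describe-stacks does.
CHECK_CMD=(aws cloudformation describe-stacks --stack-name CDKToolkit)
stack_exists() { "${CHECK_CMD[@]}" >/dev/null 2>&1; }

# Usage (not executed here):
#   stack_exists || npx cdk bootstrap
```

Checking the bootstrap *version* as well would need a describe-stacks query on the CDKToolkit outputs, which this sketch leaves out.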

M3. aws sso login runs inside the partition loop

Location: line 408

On non-CI runs, aws sso login is called before each partition iteration. If SSO tokens expire mid-run (e.g., during a long CDK deploy), the next partition gets a fresh login prompt. But if the user is not at the terminal, the script blocks indefinitely. On CI (GITHUB_ACTIONS=true), this is skipped.

Recommendation: Check SSO token validity once at the start. If the token lifetime is shorter than the expected run duration, warn the user.
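
A sketch of a one-time credential check before the loop. An STS get-caller-identity call succeeds for any working credential source (SSO, static keys, instance role); STS_CMD is an injectable stand-in for testability, and the helper name is illustrative:

```shell
STS_CMD=(aws sts get-caller-identity)

# Fails with a clear message if no valid AWS session is available,
# instead of blocking on an interactive login mid-run.
ensure_credentials() {
  if ! "${STS_CMD[@]}" >/dev/null 2>&1; then
    echo "No valid AWS session; run 'aws sso login' before starting" >&2
    return 1
  fi
}
```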

M4. ARDA_API_KEY env var lingers across partitions

Location: lines 502–504

ARDA_API_KEY is set per-partition via resolve_arda_api_key, but it’s exported to the environment. If the first partition sets it and the second partition’s resolve_arda_api_key call fails, set -e catches it. But the env var from the previous partition lingers. If the guard on line 502 (-z "${ARDA_API_KEY:-}") were ever changed to use the existing value as a fallback instead of erroring, the wrong API key would be deployed to the second partition.

Recommendation: Unset ARDA_API_KEY at the start of each partition iteration to prevent cross-partition leakage.
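
The fix is a one-liner at the top of each iteration; a sketch (the partition list and the loop-body helper are illustrative, with the real work elided):

```shell
PARTITIONS=(demo prod)   # illustrative

deploy_partition() {
  local partition="$1"
  unset ARDA_API_KEY          # never inherit the previous partition's key
  # resolve_arda_api_key "${partition}" and the rest of the loop body go here.
}
```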

M5. AWS_DEFAULT_PROFILE vs AWS_PROFILE confusion

Location: lines 189, 203, 127–131

The script checks AWS_DEFAULT_PROFILE on line 189 but AWS SDK v3 requires AWS_PROFILE (not AWS_DEFAULT_PROFILE). The script sets AWS_DEFAULT_PROFILE on line 203, which may not be recognized by all SDK v3 calls. The --profile flag (lines 127–131) sets AWS_PROFILE, creating two different code paths with different profile variable names.

Recommendation: Standardize on AWS_PROFILE throughout. Set both variables for backward compatibility if needed, but prefer AWS_PROFILE.
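
A sketch of a single helper that keeps both variables in lockstep (the function name and the profile value in the test are illustrative):

```shell
# Set the canonical AWS_PROFILE; mirror it into AWS_DEFAULT_PROFILE
# only for legacy consumers that still read the old variable.
set_aws_profile() {
  export AWS_PROFILE="$1"
  export AWS_DEFAULT_PROFILE="$1"
}
```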


L1. No input validation on infrastructure parameter

Location: line 146

The infrastructure value is used directly in file paths (lines 254, 404), CloudFormation stack names, and CloudWatch log entries without sanitization. A value like ../../../etc would construct invalid but potentially confusing paths.

Recommendation: Validate that the infrastructure name matches a known pattern (alphanumeric + limited punctuation).
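
A sketch of an allow-list check; the accepted pattern (alphanumeric plus dash/underscore, starting alphanumeric) is an assumption about what valid infrastructure names look like:

```shell
# Succeeds only for names matching the allow-list pattern;
# rejects path separators, dots, and empty input.
validate_infrastructure() {
  [[ "$1" =~ ^[A-Za-z0-9][A-Za-z0-9_-]*$ ]]
}

# Usage (not executed here):
#   validate_infrastructure "${infrastructure}" \
#     || { echo "Invalid infrastructure name: ${infrastructure}" >&2; exit 2; }
```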

L2. CloudWatch logging failures are silent

Location: lines 13–83

The logging function catches jq absence but not AWS CloudWatch API failures. If put-log-events fails (wrong log stream sequence token, permissions), the error is not fatal (no set -e inside the function) but the deployment metadata is lost.

Recommendation: Log a warning to stderr if put-log-events fails so the operator is aware that audit data was not recorded.
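
A sketch of the non-fatal warning; PUT_LOG_CMD stands in for the real `aws logs put-log-events` invocation so the wrapper is testable, and the function name is illustrative:

```shell
# Attempt the CloudWatch write; on failure, warn on stderr but keep going,
# so a logging outage never aborts a deployment.
log_deploy_event() {
  if ! "${PUT_LOG_CMD[@]}" >/dev/null 2>&1; then
    echo "WARNING: could not record deployment metadata in CloudWatch" >&2
  fi
}
```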

L3. kubectl config set-context without namespace

Location: line 289

Sets the kubectl context to the infrastructure name but doesn’t scope to a namespace. Subsequent kubectl commands in the partition loop use -n explicitly, but any accidental bare kubectl command would target the default namespace.

Recommendation: Minor — current usage is safe since all partition-scoped commands include -n.

L4. Confusing step numbering

The step variable increments globally across the infrastructure phase and all partition phases. The if/else branches for CFN-managed vs. manually-created apps both use ${step}.2.5.* step numbers, making it hard to tell from the logs which code path ran.

Recommendation: Include the code-path name in the step label (e.g., Step N.2.5 [cfn-managed] vs. Step N.2.5 [manual-app]).


| # | Severity | Issue | Risk |
|----|----------|-------|------|
| C1 | Medium | amplify start-job fires on every run | Unnecessary redeploy of main (protected branch — no untested code) |
| C2 | Critical | Most secrets are not partition-aware | Staging secrets deployed to production |
| C3 | Critical | Secrets passed as CFN parameter overrides | Secrets visible in CloudTrail/console |
| H1 | High | No transactional boundaries | Half-deployed partitions on failure |
| H2 | High | --require-approval never --all | Auto-deploys destructive CDK changes |
| H3 | High | Unknown target group bindings deleted | Silently removes manual debug resources |
| H4 | High | Hardcoded Helm versions can downgrade | Silently reverts cluster components |
| H5 | High | null env vars from Amplify crash jq | Script aborts mid-partition |
| M1 | Medium | No --no-fail-on-empty-changeset | CLI-version-dependent abort |
| M2 | Medium | cdk bootstrap every run | Unnecessary; could mutate CDKToolkit |
| M3 | Medium | aws sso login in loop | Blocks if user not at terminal |
| M4 | Medium | ARDA_API_KEY env var lingers | Wrong key if fallback logic changes |
| M5 | Medium | AWS_DEFAULT_PROFILE vs AWS_PROFILE | SDK v3 may not resolve credentials |
| L1 | Low | No input validation on infrastructure | Path traversal in stack names |
| L2 | Low | CloudWatch logging fails silently | Deployment metadata lost |
| L3 | Low | kubectl context without namespace | Risk of accidental default-ns operations |
| L4 | Low | Confusing step numbering | Hard to audit logs |

The most actionable items are C1 (guard amplify start-job behind a first-deploy check), H5 (handle null from get-app), and H4 (pin Helm versions with a minimum-version guard rather than exact version).



Copyright: (c) Arda Systems 2025-2026, All rights reserved