amm.sh Failure Mode Analysis

Deep analysis of the amm.sh deployment orchestration script, focused on potential failure modes, data-loss risks, and scenarios that could leave the target AWS environment in an inconsistent state.

Analysis date: 2026-04-08

Analysed revision: 51952d5 on branch jmpicnic/438 (663 lines).

The script runs with set -eu (line 5) — any unset variable or non-zero exit code aborts immediately. This is good for safety but creates its own failure modes: any unexpected non-zero return (including benign ones like “no changes to deploy”) halts the entire run without cleanup.

Debug mode (RUNNER_DEBUG=1) enables set -xv, which prints every command — including secret values — to stdout/stderr.


C1. amplify start-job fires unconditionally on every run

Location: lines 590–594

Every time the script runs for a CFN-managed partition (e.g., Alpha001:demo), it triggers aws amplify start-job --job-type RELEASE. The --branch-name parameter comes from the AMPLIFY_BRANCH_NAMES map (lines 223–229), which is hardcoded to "main" for every partition. This kicks off a full Amplify build and deploy of main, even if nothing changed. If someone is testing a specific deployed version, this overwrites it with whatever is on the main branch. There is no “only deploy if changed” guard. The --job-reason "Initial deploy after CloudFormation" suggests this was intended as a one-time post-creation step, but nothing prevents it from running on subsequent invocations.

Mitigating factor: The main branch is a protected branch in the affected repositories (arda-frontend-app, kyle-frontend-app), so only reviewed and merged code is ever deployed. The risk is not untested code, but rather an unnecessary redeploy that disrupts an environment where a specific version is intentionally pinned or being validated.

Recommendation: Guard start-job behind a first-deploy check (e.g., query whether the branch already has a successful job) or make it opt-in via a --deploy-frontend flag.
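
A minimal sketch of the first-deploy guard, assuming jq is available (the app-id/branch variables in the usage comment are illustrative). The check reads `aws amplify list-jobs` JSON on stdin so it can be exercised without AWS access; Amplify reports completed jobs with status SUCCEED:

```shell
# Returns 0 if the branch already has at least one successful Amplify job.
# Reads `aws amplify list-jobs` JSON on stdin.
branch_has_successful_job() {
  jq -e '[.jobSummaries[]? | select(.status == "SUCCEED")] | length > 0' >/dev/null
}

# Usage inside the partition loop (not executed here):
#   if aws amplify list-jobs --app-id "${APP_ID}" --branch-name "${BRANCH}" \
#       | branch_has_successful_job; then
#     echo "Branch already deployed; skipping amplify start-job" >&2
#   else
#     aws amplify start-job --app-id "${APP_ID}" --branch-name "${BRANCH}" --job-type RELEASE
#   fi
```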

C2. Secret values are shared across partitions

Location: lines 192–199

ARDA_SIGNUP_KEY, HUBSPOT_CLIENT_KEY, HUBSPOT_PAT, and PYLON_WIDGET_KEY are fetched once from a single 1Password vault (Arda-StageOAM / Arda-ProdOAM) and then deployed to every partition in the loop (lines 506–514). If the script is run with ./amm.sh Alpha001 demo prod, both demo and prod get the same HubSpot key. If the vaults are staging vaults, production gets staging secrets.

Only ARDA_API_KEY is partition-aware (lines 502–504 call resolve_arda_api_key per partition). All other secrets are global.

Recommendation: Make each secret partition-aware via the existing PARTITION_VAULT_MAP pattern, or validate that the vault matches the partition before deploying.
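
A sketch of the partition-aware lookup following the existing PARTITION_VAULT_MAP pattern. The demo/prod-to-vault pairing and the `op read` item path in the usage comment are assumptions for illustration, not names taken from amm.sh:

```shell
# Illustrative partition-to-vault mapping (assumed pairing).
declare -A PARTITION_VAULT_MAP=(
  [demo]="Arda-StageOAM"
  [prod]="Arda-ProdOAM"
)

# Prints the vault for a partition; fails loudly on an unknown partition
# instead of silently reusing a global vault.
vault_for_partition() {
  local p="$1"
  [[ -n "${PARTITION_VAULT_MAP[${p}]:-}" ]] || { echo "Unknown partition: ${p}" >&2; return 1; }
  printf '%s\n' "${PARTITION_VAULT_MAP[${p}]}"
}

# Usage (not executed here; item/field names are hypothetical):
#   HUBSPOT_CLIENT_KEY="$(op read "op://$(vault_for_partition "${partition}")/hubspot/client-key")"
```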

C3. Secrets passed as CloudFormation parameter overrides

Location: lines 506–514

Secret values are passed as --parameter-overrides in plain text. These appear in:

  • CloudFormation stack events (visible in the AWS console)
  • CloudTrail logs
  • The script’s own set -xv debug output when RUNNER_DEBUG=1 is set (line 3)

Recommendation: Use NoEcho: true on CFN parameters (already done in the template), but also consider using SSM SecureString or Secrets Manager references instead of passing values on the command line. Disable set -xv around secret-handling sections.
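
A minimal sketch of suppressing xtrace around secret-handling commands; the special parameter `$-` contains `x` while `set -x` is active, so the wrappers restore tracing only if it was on:

```shell
# Turn xtrace off, remembering whether it was on.
disable_xtrace() { case "$-" in *x*) XTRACE_WAS_ON=1; set +x ;; *) XTRACE_WAS_ON=0 ;; esac; }
# Re-enable xtrace only if disable_xtrace found it enabled.
restore_xtrace() { if [[ "${XTRACE_WAS_ON:-0}" -eq 1 ]]; then set -x; fi; }

# Usage around the secrets deploy (not executed here; parameter name illustrative):
#   disable_xtrace
#   aws cloudformation deploy ... --parameter-overrides "HubspotClientKey=${HUBSPOT_CLIENT_KEY}"
#   restore_xtrace
```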


High — Partial Failure Leaves Inconsistent State

H1. No transactional boundaries in the partition loop

Location: lines 397–659

The partition loop deploys CDK stacks, Helm charts, CloudFormation templates, and Amplify apps in sequence. If any step fails (e.g., CDK deploy succeeds but secrets deploy fails), the partition is left in a half-deployed state. There is no rollback mechanism. Re-running the script from the top repeats all infrastructure and Helm steps, including successful ones, wasting time and risking drift.

Recommendation: Add a --from-step flag to resume from a specific step after a partial failure. Alternatively, make each step independently idempotent so re-runs are safe and fast.
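
A hypothetical --from-step resume wrapper; the FROM_STEP variable, run_step helper, and step functions in the usage comment are illustrative, not names from amm.sh:

```shell
FROM_STEP="${FROM_STEP:-0}"

# Runs the given command only if its step number is at or past FROM_STEP;
# earlier steps are logged and skipped so a re-run resumes cheaply.
run_step() {
  local num="$1"; shift
  if (( num < FROM_STEP )); then
    echo "Skipping step ${num} (resuming from step ${FROM_STEP})" >&2
    return 0
  fi
  "$@"
}

# Usage (not executed here):
#   run_step 3 deploy_cdk_stacks
#   run_step 4 deploy_secrets
```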

H2. CDK --require-approval never with deploy --all

Location: line 249

CDK is invoked with --require-approval never and --all, meaning every stack in the CDK app is deployed without human confirmation. If a developer adds a new stack with destructive changes (e.g., replacing a database), amm.sh will deploy it automatically on the next run.

Recommendation: Use --require-approval broadening (the CDK default) for non-CI runs. Only use never in CI after explicit approval in the PR review.
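
A sketch of selecting the approval mode from the environment, assuming the script's existing GITHUB_ACTIONS check is the CI signal (the helper name is illustrative):

```shell
# Prints the --require-approval value: prompt-free only in CI,
# the CDK default ("broadening") everywhere else.
cdk_approval_mode() {
  if [[ "${GITHUB_ACTIONS:-}" == "true" ]]; then
    echo "never"
  else
    echo "broadening"
  fi
}

# Usage (not executed here):
#   npx cdk deploy --all --require-approval "$(cdk_approval_mode)"
```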

H3. Target group binding cleanup deletes unknown bindings

Location: lines 453–497

The script iterates through targetgroupbinding resources in the namespace, and deletes any that don’t match the expected HTTP/HTTPS ARNs. If someone manually created a target group binding for debugging or a new feature, it gets silently deleted.

Recommendation: Log which bindings are being deleted and require a --cleanup-bindings flag for the deletion step, or at minimum emit a warning before deleting.
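
A hypothetical guard around the binding cleanup; the CLEANUP_BINDINGS flag variable and the helper name are illustrative:

```shell
CLEANUP_BINDINGS="${CLEANUP_BINDINGS:-0}"

# Warns about an unexpected binding by default; deletes it only when the
# operator opted in (e.g., via a --cleanup-bindings flag setting CLEANUP_BINDINGS=1).
maybe_delete_binding() {
  local name="$1" namespace="$2"
  if [[ "${CLEANUP_BINDINGS}" -ne 1 ]]; then
    echo "WARNING: unexpected TargetGroupBinding '${name}' in ${namespace};" \
         "re-run with --cleanup-bindings to delete it" >&2
    return 0
  fi
  echo "Deleting TargetGroupBinding '${name}' in ${namespace}" >&2
  kubectl delete targetgroupbinding "${name}" -n "${namespace}"
}
```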

H4. Helm upgrade --install can silently downgrade

Location: lines 366–378, 384–395, 421–434

Helm chart versions are hardcoded (--version 1.13.4 for LBC, --version 0.19.1 for External Secrets, --version 4.13.0 for nginx). If the cluster has a newer version installed (e.g., from a manual upgrade), running amm.sh silently downgrades it. The --atomic flag means a failed downgrade rolls back, but a successful downgrade sticks.

Recommendation: Query the installed chart version before upgrading. If the installed version is newer than the pinned version, warn and skip (or require an explicit --force-helm-version flag).
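
A sketch of the downgrade guard. The comparison relies on `sort -V` (GNU version sort); the helm/jq query for the installed chart version is shown only as a hedged usage comment, since the exact release names depend on the cluster:

```shell
# version_lt A B: true if version A sorts strictly before version B.
version_lt() {
  [[ "$1" != "$2" ]] && \
    [[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" == "$1" ]]
}

# Usage (not executed here); `.chart` in `helm list` output looks like
# "aws-load-balancer-controller-1.13.4", so strip up to the last dash:
#   installed="$(helm list -n kube-system -o json \
#     | jq -r '.[] | select(.name=="aws-load-balancer-controller") | .chart' \
#     | sed 's/.*-//')"
#   if version_lt "${PINNED_VERSION}" "${installed}"; then
#     echo "Installed chart ${installed} is newer than pin ${PINNED_VERSION}; skipping" >&2
#   fi
```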

H5. null Amplify environment variables crash the jq merge

Location: lines 628–654

If the Amplify app has no environment variables set (a new app, or one whose variables were wiped), aws amplify get-app --query "app.environmentVariables" --output json returns null. The jq merge on line 654 then operates on null instead of an object and exits non-zero; under set -e, this aborts the script mid-partition.

Recommendation: Default EXISTING_ENV to {} when the result is null or empty:

```shell
EXISTING_ENV="$(aws amplify get-app ... || echo '{}')"
[[ "${EXISTING_ENV}" == "null" ]] && EXISTING_ENV="{}"
```

M1. No --no-fail-on-empty-changeset on CloudFormation deploys

Location: lines 267, 506, 550, 563, 601

None of the aws cloudformation deploy calls use --no-fail-on-empty-changeset. In AWS CLI v1, deploy returns exit code 255 when there are no changes. Under set -e, this aborts the script. AWS CLI v2 changed this to return 0 by default, so the behavior is version-dependent.

Recommendation: Add --no-fail-on-empty-changeset to all cloudformation deploy calls for consistent behavior across CLI versions.

M2. cdk bootstrap runs on every invocation

Location: line 284

CDK bootstrap is designed to be idempotent, but running it every time adds ~30 seconds and makes unnecessary S3/CloudFormation API calls. If a bootstrap version change is introduced in a CDK update, this could unexpectedly modify the CDKToolkit stack.

Recommendation: Check whether the CDKToolkit stack exists and is up-to-date before bootstrapping. Skip if already current.
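
A sketch of the existence probe. CHECK_CMD is an injectable command array (an assumption for testability, not a name from amm.sh), defaulting to the real describe-stacks call:

```shell
# Probe for the CDKToolkit stack; succeeds iff describe-stacks does.
CHECK_CMD=(aws cloudformation describe-stacks --stack-name CDKToolkit)
stack_exists() { "${CHECK_CMD[@]}" >/dev/null 2>&1; }

# Usage (not executed here):
#   stack_exists || npx cdk bootstrap
```

Checking the bootstrap *version* as well would need a describe-stacks query on the CDKToolkit outputs, which this sketch leaves out.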

M3. aws sso login runs inside the partition loop

Location: line 408

On non-CI runs, aws sso login is called before each partition iteration. If SSO tokens expire mid-run (e.g., during a long CDK deploy), the next partition gets a fresh login prompt. But if the user is not at the terminal, the script blocks indefinitely. On CI (GITHUB_ACTIONS=true), this is skipped.

Recommendation: Check SSO token validity once at the start. If the token lifetime is shorter than the expected run duration, warn the user.
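
A sketch of a one-time credential check before the loop. An STS get-caller-identity call succeeds for any working credential source (SSO, static keys, instance role); STS_CMD is an injectable stand-in for testability, and the helper name is illustrative:

```shell
STS_CMD=(aws sts get-caller-identity)

# Fails with a clear message if no valid AWS session is available,
# instead of blocking on an interactive login mid-run.
ensure_credentials() {
  if ! "${STS_CMD[@]}" >/dev/null 2>&1; then
    echo "No valid AWS session; run 'aws sso login' before starting" >&2
    return 1
  fi
}
```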

M4. ARDA_API_KEY env var lingers across partitions

Location: lines 502–504

ARDA_API_KEY is set per-partition via resolve_arda_api_key, but it’s exported to the environment. If the first partition sets it and the second partition’s resolve_arda_api_key call fails, set -e catches it. But the env var from the previous partition lingers. If the guard on line 502 (-z "${ARDA_API_KEY:-}") were ever changed to use the existing value as a fallback instead of erroring, the wrong API key would be deployed to the second partition.

Recommendation: Unset ARDA_API_KEY at the start of each partition iteration to prevent cross-partition leakage.
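
The fix is a one-liner at the top of each iteration; a sketch (the partition list and the loop-body helper are illustrative, with the real work elided):

```shell
PARTITIONS=(demo prod)   # illustrative

deploy_partition() {
  local partition="$1"
  unset ARDA_API_KEY          # never inherit the previous partition's key
  # resolve_arda_api_key "${partition}" and the rest of the loop body go here.
}
```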

M5. AWS_DEFAULT_PROFILE vs AWS_PROFILE confusion

Location: lines 189, 203, 127–131

The script checks AWS_DEFAULT_PROFILE on line 189 but AWS SDK v3 requires AWS_PROFILE (not AWS_DEFAULT_PROFILE). The script sets AWS_DEFAULT_PROFILE on line 203, which may not be recognized by all SDK v3 calls. The --profile flag (lines 127–131) sets AWS_PROFILE, creating two different code paths with different profile variable names.

Recommendation: Standardize on AWS_PROFILE throughout. Set both variables for backward compatibility if needed, but prefer AWS_PROFILE.
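
A sketch of a single helper that keeps both variables in lockstep (the function name and the profile value in the test are illustrative):

```shell
# Set the canonical AWS_PROFILE; mirror it into AWS_DEFAULT_PROFILE
# only for legacy consumers that still read the old variable.
set_aws_profile() {
  export AWS_PROFILE="$1"
  export AWS_DEFAULT_PROFILE="$1"
}
```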


L1. No input validation on infrastructure parameter

Location: line 146

The infrastructure value is used directly in file paths (lines 254, 404), CloudFormation stack names, and CloudWatch log entries without sanitization. A value like ../../../etc would construct invalid but potentially confusing paths.

Recommendation: Validate that the infrastructure name matches a known pattern (alphanumeric + limited punctuation).
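
A sketch of an allow-list check; the accepted pattern (alphanumeric plus dash/underscore, starting alphanumeric) is an assumption about what valid infrastructure names look like:

```shell
# Succeeds only for names matching the allow-list pattern;
# rejects path separators, dots, and empty input.
validate_infrastructure() {
  [[ "$1" =~ ^[A-Za-z0-9][A-Za-z0-9_-]*$ ]]
}

# Usage (not executed here):
#   validate_infrastructure "${infrastructure}" \
#     || { echo "Invalid infrastructure name: ${infrastructure}" >&2; exit 2; }
```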

L2. CloudWatch logging failures are silent

Location: lines 13–83

The logging function catches jq absence but not AWS CloudWatch API failures. If put-log-events fails (wrong log stream sequence token, permissions), the error is not fatal (no set -e inside the function) but the deployment metadata is lost.

Recommendation: Log a warning to stderr if put-log-events fails so the operator is aware that audit data was not recorded.
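
A sketch of the non-fatal warning; PUT_LOG_CMD stands in for the real `aws logs put-log-events` invocation so the wrapper is testable, and the function name is illustrative:

```shell
# Attempt the CloudWatch write; on failure, warn on stderr but keep going,
# so a logging outage never aborts a deployment.
log_deploy_event() {
  if ! "${PUT_LOG_CMD[@]}" >/dev/null 2>&1; then
    echo "WARNING: could not record deployment metadata in CloudWatch" >&2
  fi
}
```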

L3. kubectl config set-context without namespace

Location: line 289

Sets the kubectl context to the infrastructure name but doesn’t scope to a namespace. Subsequent kubectl commands in the partition loop use -n explicitly, but any accidental bare kubectl command would target the default namespace.

Recommendation: Minor — current usage is safe since all partition-scoped commands include -n.

L4. Confusing step numbering

The step variable increments globally across the infrastructure phase and all partition phases. The if/else branches for CFN-managed vs. manually-created apps both use ${step}.2.5.* step numbers, making it hard to tell from the logs which code path ran.

Recommendation: Include the code-path name in the step label (e.g., Step N.2.5 [cfn-managed] vs. Step N.2.5 [manual-app]).


| # | Severity | Issue | Risk |
|----|----------|-------|------|
| C1 | Medium | amplify start-job fires on every run | Unnecessary redeploy of main (protected branch — no untested code) |
| C2 | Critical | Most secrets are not partition-aware | Staging secrets deployed to production |
| C3 | Critical | Secrets passed as CFN parameter overrides | Secrets visible in CloudTrail/console |
| H1 | High | No transactional boundaries | Half-deployed partitions on failure |
| H2 | High | --require-approval never --all | Auto-deploys destructive CDK changes |
| H3 | High | Unknown target group bindings deleted | Silently removes manual debug resources |
| H4 | High | Hardcoded Helm versions can downgrade | Silently reverts cluster components |
| H5 | High | null env vars from Amplify crash jq | Script aborts mid-partition |
| M1 | Medium | No --no-fail-on-empty-changeset | CLI-version-dependent abort |
| M2 | Medium | cdk bootstrap every run | Unnecessary; could mutate CDKToolkit |
| M3 | Medium | aws sso login in loop | Blocks if user not at terminal |
| M4 | Medium | ARDA_API_KEY env var lingers | Wrong key if fallback logic changes |
| M5 | Medium | AWS_DEFAULT_PROFILE vs AWS_PROFILE | SDK v3 may not resolve credentials |
| L1 | Low | No input validation on infrastructure | Path traversal in stack names |
| L2 | Low | CloudWatch logging fails silently | Deployment metadata lost |
| L3 | Low | kubectl context without namespace | Risk of accidental default-ns operations |
| L4 | Low | Confusing step numbering | Hard to audit logs |

The most actionable items are C1 (guard amplify start-job behind a first-deploy check), H5 (handle null from get-app), and H4 (pin Helm versions with a minimum-version guard rather than exact version).



Copyright: (c) Arda Systems 2025-2026, All rights reserved