# amm.sh Failure Mode Analysis

Deep analysis of the amm.sh deployment orchestration script, focused on
potential failure modes, data-loss risks, and scenarios that could leave the
target AWS environment in an inconsistent state.

Analysis date: 2026-04-08
Analysed revision: 51952d5 on branch jmpicnic/438 (663 lines).
## Script-Wide Behaviour

The script runs with set -eu (line 5) — any unset variable or non-zero
exit code aborts immediately. This is good for safety but creates its own
failure modes: any unexpected non-zero return (including benign ones like
“no changes to deploy”) halts the entire run without cleanup.
Debug mode (RUNNER_DEBUG=1) enables set -xv, which prints every command
— including secret values — to stdout/stderr.
## Critical — Data Loss / Overwrite Risks

### C1. amplify start-job fires unconditionally on every run

Location: lines 590–594
Every time the script runs for a CFN-managed partition (e.g.,
Alpha001:demo), it triggers aws amplify start-job --job-type RELEASE.
The --branch-name parameter comes from the AMPLIFY_BRANCH_NAMES map
(lines 223–229), which is hardcoded to "main" for every partition. This
kicks off a full Amplify build and deploy of main, even if nothing
changed. If someone is testing a specific deployed version, this overwrites
it with whatever is on the main branch. There is no “only deploy if
changed” guard. The --job-reason "Initial deploy after CloudFormation"
suggests this was intended as a one-time post-creation step, but nothing
prevents it from running on subsequent invocations.
Mitigating factor: The main branch is a protected branch in the
affected repositories (arda-frontend-app, kyle-frontend-app), so only
reviewed and merged code is ever deployed. The risk is not untested code,
but rather an unnecessary redeploy that disrupts an environment where a
specific version is intentionally pinned or being validated.
Recommendation: Guard start-job behind a first-deploy check (e.g.,
query whether the branch already has a successful job) or make it opt-in
via a --deploy-frontend flag.
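The first-deploy check suggested above can be sketched as follows. This is a minimal sketch, not the script's code: the helper name is hypothetical, and a canned JSON sample stands in for the live `aws amplify list-jobs` output.

```shell
# Hypothetical guard: skip start-job when the branch already has a
# successful job. The JSON shape matches what
#   aws amplify list-jobs --app-id "$APP_ID" --branch-name main --output json
# returns; SUCCEED is the Amplify job status for a completed deploy.
branch_has_successful_job() {
  # $1: list-jobs JSON; succeeds if any job reached SUCCEED
  printf '%s' "$1" | grep -q '"status": *"SUCCEED"'
}

# Canned sample stands in for the live aws call:
jobs_json='{"jobSummaries":[{"jobId":"1","status":"SUCCEED"}]}'
if branch_has_successful_job "$jobs_json"; then
  echo "branch already deployed; skipping amplify start-job"
fi
```

On a first run the check fails and the existing start-job call would proceed; on every later run it short-circuits the redeploy.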
### C2. Secret values are shared across partitions

Location: lines 192–199
ARDA_SIGNUP_KEY, HUBSPOT_CLIENT_KEY, HUBSPOT_PAT, and
PYLON_WIDGET_KEY are fetched once from a single 1Password vault
(Arda-StageOAM / Arda-ProdOAM) and then deployed to every partition
in the loop (lines 506–514). If the script is run with
./amm.sh Alpha001 demo prod, both demo and prod get the same HubSpot
key. If the vaults are staging vaults, production gets staging secrets.
Only ARDA_API_KEY is partition-aware (lines 502–504 call
resolve_arda_api_key per partition). All other secrets are global.
Recommendation: Make each secret partition-aware via the existing
PARTITION_VAULT_MAP pattern, or validate that the vault matches the
partition before deploying.
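A partition-aware lookup in the spirit of the existing PARTITION_VAULT_MAP pattern could look like this. It is a sketch only: the helper name and the demo→staging / prod→production assignment are assumptions, with the vault names taken from the text above.

```shell
# Hypothetical helper: resolve the 1Password vault per partition instead of
# fetching all secrets from one global vault. The demo/prod mapping here is
# an assumed example.
vault_for_partition() {
  case "$1" in
    demo) echo "Arda-StageOAM" ;;
    prod) echo "Arda-ProdOAM" ;;
    *)    echo "no vault mapped for partition '$1'" >&2; return 1 ;;
  esac
}

# e.g. (illustrative):
# HUBSPOT_PAT="$(op read "op://$(vault_for_partition "$partition")/HubSpot-PAT/credential")"
```

Unmapped partitions fail loudly, which doubles as the vault-matches-partition validation the recommendation asks for.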
### C3. Secrets passed as CloudFormation parameter overrides

Location: lines 506–514
Secret values are passed as --parameter-overrides in plain text. These
appear in:
- CloudFormation stack events (visible in the AWS console)
- CloudTrail logs
- The script’s own set -xv debug output: the RUNNER_DEBUG flag (line 3)
  enables set -xv, which prints every command, including secret values, to
  stdout/stderr
Recommendation: Use NoEcho: true on CFN parameters (already done in
the template), but also consider using SSM SecureString or Secrets Manager
references instead of passing values on the command line. Disable set -xv
around secret-handling sections.
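Disabling xtrace around secret handling can be done with a small wrapper. A minimal sketch, assuming the wrapper name; it restores the previous xtrace setting rather than unconditionally re-enabling it.

```shell
# Run a command with xtrace suspended so a RUNNER_DEBUG=1 run never echoes
# secret values; xtrace is restored only if it was on beforehand.
without_xtrace() {
  case "$-" in
    *x*) set +x; "$@"; set -x ;;
    *)   "$@" ;;
  esac
}

without_xtrace echo "deploying secrets (values not traced)"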
## High — Partial Failure Leaves Inconsistent State

### H1. No transactional boundaries in the partition loop

Location: lines 397–659
The partition loop deploys CDK stacks, Helm charts, CloudFormation templates, and Amplify apps in sequence. If any step fails (e.g., CDK deploy succeeds but secrets deploy fails), the partition is left in a half-deployed state. There is no rollback mechanism. Re-running the script from the top repeats all infrastructure and Helm steps, including successful ones, wasting time and risking drift.
Recommendation: Add a --from-step flag to resume from a specific step
after a partial failure. Alternatively, make each step independently
idempotent so re-runs are safe and fast.
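The --from-step resume guard could be wired in as below. A sketch only: the flag, variable, and helper names are hypothetical, and `true` stands in for the real deploy commands.

```shell
# Hypothetical resume guard: steps numbered below FROM_STEP are skipped, so
# a re-run after a partial failure can pick up where it stopped.
FROM_STEP="${FROM_STEP:-0}"   # would be set by a --from-step flag

run_step() {
  local num="$1" label="$2"; shift 2
  if [ "$num" -lt "$FROM_STEP" ]; then
    echo "skipping step $num ($label): resuming from step $FROM_STEP"
    return 0
  fi
  echo "running step $num ($label)"
  "$@"
}

run_step 1 "cdk deploy" true
run_step 2 "helm upgrade" true
```

Making each wrapped step idempotent as well keeps re-runs safe even when FROM_STEP is left at 0.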
### H2. CDK --require-approval never with deploy --all

Location: line 249
CDK is invoked with --require-approval never and --all, meaning every
stack in the CDK app is deployed without human confirmation. If a developer
adds a new stack with destructive changes (e.g., replacing a database),
amm.sh will deploy it automatically on the next run.
Recommendation: Use --require-approval broadening (the CDK default)
for non-CI runs. Only use never in CI after explicit approval in the PR
review.
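The CI/non-CI split suggested above can be sketched in a few lines; the echo stands in for the actual cdk invocation.

```shell
# Keep --require-approval never only on CI; interactive runs fall back to
# CDK's default "broadening", which prompts on security-sensitive changes.
if [ "${GITHUB_ACTIONS:-}" = "true" ]; then
  CDK_APPROVAL="never"
else
  CDK_APPROVAL="broadening"
fi
echo "would run: cdk deploy --all --require-approval ${CDK_APPROVAL}"
```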
### H3. Target group binding cleanup deletes unknown bindings

Location: lines 453–497
The script iterates through targetgroupbinding resources in the
namespace, and deletes any that don’t match the expected HTTP/HTTPS ARNs.
If someone manually created a target group binding for debugging or a new
feature, it gets silently deleted.
Recommendation: Log which bindings are being deleted and require a
--cleanup-bindings flag for the deletion step, or at minimum emit a
warning before deleting.
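The log-then-gate behaviour could be factored like this. A sketch: the flag and function names are hypothetical, and the kubectl delete is left as a comment since it needs a live cluster.

```shell
# Hypothetical cleanup gate: always report the unexpected binding, but only
# delete it when --cleanup-bindings was requested.
CLEANUP_BINDINGS="${CLEANUP_BINDINGS:-false}"

handle_unexpected_binding() {
  local name="$1"
  if [ "$CLEANUP_BINDINGS" = "true" ]; then
    echo "deleting unexpected targetgroupbinding: $name"
    # kubectl delete targetgroupbinding "$name" -n "$NAMESPACE"
  else
    echo "WARNING: leaving unexpected targetgroupbinding '$name' in place" \
         "(pass --cleanup-bindings to delete)" >&2
  fi
}

handle_unexpected_binding "manual-debug-binding"
```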
### H4. Helm upgrade --install can silently downgrade

Location: lines 366–378, 384–395, 421–434
Helm chart versions are hardcoded (--version 1.13.4 for LBC,
--version 0.19.1 for External Secrets, --version 4.13.0 for nginx). If
the cluster has a newer version installed (e.g., from a manual upgrade),
running amm.sh silently downgrades it. The --atomic flag means a failed
downgrade rolls back, but a successful downgrade sticks.
Recommendation: Query the installed chart version before upgrading. If
the installed version is newer than the pinned version, warn and skip (or
require an explicit --force-helm-version flag).
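The version guard can be implemented with sort -V. A minimal sketch: the helper name is hypothetical, and the installed version is hardcoded here where the script would query it (e.g. via helm list -o json).

```shell
# Hypothetical downgrade guard: skip the upgrade when the cluster already
# runs a newer chart than the pinned version.
version_lt() {
  # true when $1 sorts strictly before $2 (version-aware sort)
  [ "$1" != "$2" ] && \
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

pinned="1.13.4"
installed="1.14.0"   # stand-in for the version queried from the cluster
if version_lt "$pinned" "$installed"; then
  echo "installed $installed is newer than pinned $pinned; skipping helm upgrade"
fi
```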
### H5. null env vars from Amplify crashes jq

Location: lines 628–654
If the Amplify app has no environment variables set (new app, or wiped),
aws amplify get-app --query "app.environmentVariables" --output json
returns null. The jq merge on line 654 then operates on null input. Plain
jq addition tolerates this (null + {...} yields the other object), but any
expression that iterates the existing keys (e.g. to_entries) errors on null,
and a literal null propagated into the subsequent AWS CLI call is rejected
either way. Under set -e, the resulting non-zero exit aborts the script
mid-partition.
Recommendation: Default EXISTING_ENV to {} when the result is
null or empty:

```sh
EXISTING_ENV="$(aws amplify get-app ... || echo '{}')"
[[ "${EXISTING_ENV}" == "null" ]] && EXISTING_ENV="{}"
```

## Medium — Operational Risks
### M1. No --no-fail-on-empty-changeset on CloudFormation deploys

Location: lines 267, 506, 550, 563, 601
None of the aws cloudformation deploy calls use
--no-fail-on-empty-changeset. In AWS CLI v1, deploy returns exit code
255 when there are no changes. Under set -e, this aborts the script. AWS
CLI v2 changed this to return 0 by default, so the behavior is
version-dependent.
Recommendation: Add --no-fail-on-empty-changeset to all
cloudformation deploy calls for consistent behavior across CLI versions.
### M2. cdk bootstrap runs on every invocation

Location: line 284
CDK bootstrap is designed to be idempotent, but running it every time adds ~30 seconds and makes unnecessary S3/CloudFormation API calls. If a bootstrap version change is introduced in a CDK update, this could unexpectedly modify the CDKToolkit stack.
Recommendation: Check whether the CDKToolkit stack exists and is up-to-date before bootstrapping. Skip if already current.
### M3. aws sso login runs inside the partition loop

Location: line 408
On non-CI runs, aws sso login is called before each partition iteration.
If SSO tokens expire mid-run (e.g., during a long CDK deploy), the next
partition gets a fresh login prompt. But if the user is not at the terminal,
the script blocks indefinitely. On CI (GITHUB_ACTIONS=true), this is
skipped.
Recommendation: Check SSO token validity once at the start. If the token lifetime is shorter than the expected run duration, warn the user.
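The up-front check could look like the sketch below. The function name is hypothetical; sts get-caller-identity exits non-zero when the SSO token has expired, so a single interactive login happens before the loop instead of per partition.

```shell
# Hypothetical one-shot session check, intended to run once before the
# partition loop. Calling it requires the AWS CLI and a configured profile.
ensure_aws_session() {
  local profile="${1:?usage: ensure_aws_session <profile>}"
  aws sts get-caller-identity --profile "$profile" >/dev/null 2>&1 \
    || aws sso login --profile "$profile"
}
```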
### M4. ARDA_API_KEY env var lingers across partitions

Location: lines 502–504
ARDA_API_KEY is set per-partition via resolve_arda_api_key, but it’s
exported to the environment. If the first partition sets it and the second
partition’s resolve_arda_api_key call fails, set -e catches it. But the
env var from the previous partition lingers. If the guard on line 502
(-z "${ARDA_API_KEY:-}") were ever changed to use the existing value as
a fallback instead of erroring, the wrong API key would be deployed to the
second partition.
Recommendation: Unset ARDA_API_KEY at the start of each partition
iteration to prevent cross-partition leakage.
### M5. AWS_DEFAULT_PROFILE vs AWS_PROFILE confusion

Location: lines 189, 203, 127–131
The script checks AWS_DEFAULT_PROFILE on line 189 but AWS SDK v3 requires
AWS_PROFILE (not AWS_DEFAULT_PROFILE). The script sets
AWS_DEFAULT_PROFILE on line 203, which may not be recognized by all SDK v3
calls. The --profile flag (lines 127–131) sets AWS_PROFILE, creating
two different code paths with different profile variable names.
Recommendation: Standardize on AWS_PROFILE throughout. Set both
variables for backward compatibility if needed, but prefer AWS_PROFILE.
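The standardize-on-AWS_PROFILE approach with a backward-compatible mirror can be sketched in two lines; whichever variable the caller set wins, with AWS_PROFILE preferred.

```shell
# Prefer AWS_PROFILE; fall back to a caller-set AWS_DEFAULT_PROFILE, then
# mirror the result into both variables for older tooling.
export AWS_PROFILE="${AWS_PROFILE:-${AWS_DEFAULT_PROFILE:-}}"
export AWS_DEFAULT_PROFILE="${AWS_PROFILE}"
```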
## Low — Robustness Gaps

### L1. No input validation on infrastructure parameter

Location: line 146
The infrastructure value is used directly in file paths (lines 254, 404),
CloudFormation stack names, and CloudWatch log entries without sanitization.
A value like ../../../etc would construct invalid but potentially confusing
paths.
Recommendation: Validate that the infrastructure name matches a known pattern (alphanumeric + limited punctuation).
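A minimal sketch of that validation (helper name hypothetical), rejecting anything beyond alphanumerics, dashes, and underscores before the value reaches file paths or stack names:

```shell
# Hypothetical validator for the infrastructure argument; path-traversal
# strings like ../../../etc fail the character-class check.
validate_infra_name() {
  case "$1" in
    ""|*[!A-Za-z0-9_-]*)
      echo "invalid infrastructure name: '$1'" >&2
      return 1 ;;
  esac
}

validate_infra_name "Alpha001" && echo "Alpha001 ok"
```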
### L2. log_run_metadata can fail silently

Location: lines 13–83
The logging function catches jq absence but not AWS CloudWatch API
failures. If put-log-events fails (wrong log stream sequence token,
permissions), the error is not fatal (no set -e inside the function) but
the deployment metadata is lost.
Recommendation: Log a warning to stderr if put-log-events fails so
the operator is aware that audit data was not recorded.
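A non-fatal but visible wrapper could take this shape. A sketch: the wrapper name is hypothetical, and `false` stands in for the aws logs put-log-events call, which needs live credentials.

```shell
# Hypothetical wrapper: audit logging failures warn on stderr but never
# abort the deployment.
log_or_warn() {
  "$@" || echo "WARNING: failed to record run metadata: $*" >&2
}

log_or_warn false   # stand-in for: log_or_warn aws logs put-log-events ...
echo "deployment continues despite logging failure"
```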
### L3. kubectl config set-context without namespace

Location: line 289
Sets the kubectl context to the infrastructure name but doesn’t scope to a
namespace. Subsequent kubectl commands in the partition loop use -n
explicitly, but any accidental bare kubectl command would target the
default namespace.
Recommendation: Minor — current usage is safe since all partition-scoped
commands include -n.
### L4. Step numbering in logs is ambiguous

The step variable increments globally across the infrastructure phase and
all partition phases. The if/else branches for CFN-managed vs.
manually-created apps both use ${step}.2.5.* step numbers, making it hard
to distinguish which code path ran when reviewing logs.
Recommendation: Include the code-path name in the step label (e.g.,
Step N.2.5 [cfn-managed] vs. Step N.2.5 [manual-app]).
## Summary

| # | Severity | Issue | Risk |
|---|---|---|---|
| C1 | Medium | amplify start-job fires on every run | Unnecessary redeploy of main (protected branch — no untested code) |
| C2 | Critical | Most secrets are not partition-aware | Staging secrets deployed to production |
| C3 | Critical | Secrets passed as CFN parameter overrides | Secrets visible in CloudTrail/console |
| H1 | High | No transactional boundaries | Half-deployed partitions on failure |
| H2 | High | --require-approval never --all | Auto-deploys destructive CDK changes |
| H3 | High | Unknown target group bindings deleted | Silently removes manual debug resources |
| H4 | High | Hardcoded Helm versions can downgrade | Silently reverts cluster components |
| H5 | High | null env vars from Amplify crashes jq | Script aborts mid-partition |
| M1 | Medium | No --no-fail-on-empty-changeset | CLI-version-dependent abort |
| M2 | Medium | cdk bootstrap every run | Unnecessary; could mutate CDKToolkit |
| M3 | Medium | aws sso login in loop | Blocks if user not at terminal |
| M4 | Medium | ARDA_API_KEY env var lingers | Wrong key if fallback logic changes |
| M5 | Medium | AWS_DEFAULT_PROFILE vs AWS_PROFILE | SDK v3 may not resolve credentials |
| L1 | Low | No input validation on infrastructure | Path traversal in stack names |
| L2 | Low | CloudWatch logging fails silently | Deployment metadata lost |
| L3 | Low | kubectl context without namespace | Risk of accidental default-ns operations |
| L4 | Low | Confusing step numbering | Hard to audit logs |
The most actionable items are C1 (guard amplify start-job behind a
first-deploy check), H5 (handle null from get-app), and H4 (pin
Helm versions with a minimum-version guard rather than exact version).
Copyright: (c) Arda Systems 2025-2026, All rights reserved