Phase 2 Design — Tiered Gates, Merge Queue, Deploy Quality Gate

This document describes the design of the Phase 2 pipeline, a follow-up to the Phase 1 Frontend Pipeline work that migrated deployment off Amplify branch-sync and onto GitHub Actions. Phase 2 has five goals:

  1. Reduce serialization of merges by adopting GitHub’s merge queue: PRs can be queued in parallel, batched, and tested against the rebased merge commit.
  2. Tier the checks so PR authors get fast feedback (~5 min) on every push and the expensive E2E coverage runs only inside the queue.
  3. Automate the release flow end-to-end: PR-body CHANGELOG → assembly commit on main → CLQ validation → GitHub Release → Deploy Frontend triggered by workflow_run.
  4. Gate prod deployment on Extended E2E + quarantine budget, without coupling the deploy workflow to a separate post-merge workflow.
  5. Preserve quarantine-test signal post-merge for the weekly flaky-test aggregation, without letting quarantined failures block deployment.

| Workflow file | Trigger | Purpose |
| --- | --- | --- |
| ci.yaml | pull_request, merge_group | CI Fast Gate: lint, build, unit-tests-coverage |
| e2e.yaml | pull_request, merge_group, workflow_dispatch | E2E Queue Gate: sanity + acceptance shards (skipped on PR, real on merge_group) |
| changelog-check.yaml | pull_request, issue_comment | Validates ## CHANGELOG in PR body, rejects direct edits to CHANGELOG.md |
| changelog-assembly.yaml | push (main) | Assemble PR-body CHANGELOG entries, compute version, run CLQ, create GitHub Release |
| deploy.yaml | workflow_run (Changelog Assembly), workflow_dispatch | Deploy Frontend (dev → stage → demo + prod, gated by inline quality gate) |
| nightly-e2e.yaml | schedule (nightly) | WebKit + Mobile Safari E2E |
| metrics.yaml | schedule (weekly) | Pipeline health metrics → tracking issue |
| flaky-test-aggregation.yaml | schedule (weekly) | Aggregate flaky-signals* artifacts → flaky-test issues |

The previously separate post-merge-e2e.yaml was removed in PR #805 when its responsibilities moved into deploy.yaml’s quality gate. See decisions.md, DQ-PIPELINE-002.

The Fast Gate runs on every PR push. It fails fast and gives the author actionable feedback before approval.

| Required check | Source workflow | What it does |
| --- | --- | --- |
| lint | ci.yaml | ESLint |
| build | ci.yaml | Next.js production build (includes typecheck) |
| unit-tests-coverage | ci.yaml | Jest with coverageThreshold enforcement |
| changelog-check | changelog-check.yaml | Validates ## CHANGELOG in PR body or author comments |
| e2e | e2e.yaml | Pass-through summary; shard jobs skip on pull_request |
| quarantine-check | ci.yaml | Reusable composite action: validates @quarantine tags and budget |
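
The pass-through behavior of the e2e check can be sketched roughly as below. This is an illustration under stated assumptions, not the actual e2e.yaml: the job layout, the single shard shown, and the EVENT_NAME / SHARD_RESULT variable names are hypothetical.

```yaml
# Hypothetical sketch: a required check that reports on both events.
name: E2E Queue Gate
on:
  pull_request:
  merge_group:

jobs:
  e2e-sanity-alpha:
    # Shards do real work only inside the merge queue.
    if: ${{ github.event_name == 'merge_group' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --grep "@sanity" --grep-invert "@quarantine"

  e2e:
    # Summary job: always() makes it report even when the shard is skipped,
    # so the required check never stays pending on a PR.
    needs: [e2e-sanity-alpha]
    if: ${{ always() }}
    runs-on: ubuntu-latest
    steps:
      - name: Summarize shard results
        env:
          EVENT_NAME: ${{ github.event_name }}
          SHARD_RESULT: ${{ needs.e2e-sanity-alpha.result }}
        run: |
          if [ "$EVENT_NAME" = "pull_request" ]; then
            echo "Shards are skipped on pull_request; passing through."
            exit 0
          fi
          # In the queue, the summary passes only if the shard succeeded.
          [ "$SHARD_RESULT" = "success" ]
```

On pull_request the shard result is `skipped`, so the summary has to branch on the event name rather than on the shard result alone.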

The Queue Gate runs inside the GitHub merge queue against the rebased merge commit. Lint, build, and unit-tests-coverage re-run there but add no wall time, because they finish before the E2E shards.

| Required check | Source workflow | Notes |
| --- | --- | --- |
| e2e-sanity-{alpha,bravo} | e2e.yaml | --grep "@sanity" --grep-invert "@quarantine" |
| e2e-acceptance-{alpha,bravo,charlie} | e2e.yaml | --grep "@acceptance" --grep-invert "@quarantine" |

The build step runs once per queue entry and uploads a tar artifact; all shard jobs download it instead of re-building.
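
A build-once, download-many arrangement along these lines might look like the fragment below. It is a sketch only: the artifact name next-build, the `npm run build` script, and the job bodies are assumptions, and the `on:` block is omitted.

```yaml
# Hypothetical fragment: one build job feeds every shard job.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      - run: npm run build   # assumed name of the build script
      # Pack .next/ into a tarball first: upload-artifact@v4 skips hidden directories.
      - run: tar -cf next-build.tar .next
      - uses: actions/upload-artifact@v4
        with:
          name: next-build
          path: next-build.tar

  e2e-sanity-alpha:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      # Download the tarball into the workspace and unpack it instead of re-building.
      - uses: actions/download-artifact@v4
        with:
          name: next-build
      - run: tar -xf next-build.tar
      - run: npx playwright install --with-deps
      - run: npx playwright test --grep "@sanity" --grep-invert "@quarantine"
```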

After a merge to main, the assembly workflow runs first; validation, release, and deployment cascade from it.

    push to main (PR-body merge commit)
    └── changelog-assembly.yaml
        ├── extract `## CHANGELOG` entries (PR description, last comment wins)
        ├── compute SemVer bump from `.github/clq/changemap.json`
        ├── update CHANGELOG.md, package.json, package-lock.json
        ├── commit `[changelog-assembly]` (using CHANGELOG_ASSEMBLY_TOKEN PAT, not GITHUB_TOKEN)
        ├── push to main → triggers a re-entry to changelog-assembly.yaml that SKIPS via the marker check
        ├── run CLQ validation
        ├── create GitHub Release
        └── (workflow conclusion: success) → triggers Deploy Frontend via workflow_run

The [changelog-assembly] marker in the commit message is the loop-prevention guard: workflows that would re-run on the assembly commit (lint, build, e2e, etc.) check for the marker and skip.
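
One way to express such a marker check is a job-level condition on the pushed commit message. The fragment below is a sketch of that idea, not the repository's actual guard; the job name and step body are placeholders.

```yaml
# Hypothetical fragment: skip the job when the push is the assembly commit.
on:
  push:
    branches: [main]

jobs:
  assemble:
    # contains() is evaluated by the expression engine, so the commit message
    # never reaches a shell here.
    if: ${{ !contains(github.event.head_commit.message, '[changelog-assembly]') }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: echo "Assembling CHANGELOG entries"
```

Because the whole job is skipped, the re-entry run finishes without doing any work and without pushing again.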

The deploy workflow runs the deploy chain and the quality gate in parallel, joining at deploy-demo/deploy-prod:

    Changelog Assembly succeeds (t=0)
    ├── source-info (~5s)
    │   └── deploy-dev (~6 min)
    │       └── deploy-stage (~6 min)
    │           ├── deploy-demo (after evaluate; ~3 min Amplify deploy)
    │           └── deploy-prod (after evaluate; environment-approval gated)
    └── quality-gate-build (~3 min, parallel with source-info dependents)
        ├── quality-gate-alpha (~8 min)
        ├── quality-gate-bravo (~8 min)
        ├── quality-gate-quarantine (continue-on-error, ~5–10 min)
        └── quality-gate-evaluate (depends on shards; ~10s)
            ├── checks quarantine BUDGET via reusable action
            ├── creates GitHub issue if E2E shards failed
            └── exit 1 if E2E failed OR budget exhausted (blocks demo/prod via `needs`)

workflow_run events fire with github.event.workflow_run.head_sha set to the assembly commit. Without an explicit ref:, actions/checkout defaults to the workflow’s branch HEAD, which can drift if further commits land on main between assembly and gate execution. All quality-gate jobs (build, shards, quarantine, evaluate) therefore set explicitly:

    - uses: actions/checkout@v5
      with:
        ref: ${{ github.event_name == 'workflow_run' && github.event.workflow_run.head_sha || github.sha }}

so the gate evaluates the same commit being deployed.

The earlier polling design (PR #803) emitted a prod_blocked boolean output, and consumers gated on if: ... != 'true'. That worked, but it coupled the dependency graph to a string output. PR #805 replaced it with exit 1 on failure: GitHub then automatically skips the dependent jobs (deploy-demo, deploy-prod) through the needs chain. stage-annotation runs only when evaluate fails and surfaces a warning on the run summary.
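
The exit-1 gating can be sketched as a reduced job graph. Everything below is illustrative: the step bodies are placeholders, deploy-stage stands in for the whole source-info, deploy-dev, deploy-stage chain, the environment name prod is an assumption, and the `!cancelled()` condition on the evaluate job is one assumed way to let it run after a shard failure.

```yaml
# Hypothetical, reduced sketch of the deploy graph.
name: Deploy Frontend
on:
  workflow_run:
    workflows: ["Changelog Assembly"]
    types: [completed]
  workflow_dispatch:

jobs:
  deploy-stage:
    # workflow_run fires on any conclusion, so filter for success explicitly.
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy to stage"

  quality-gate-alpha:
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "run E2E shard alpha"

  quality-gate-bravo:
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "run E2E shard bravo"

  quality-gate-evaluate:
    needs: [quality-gate-alpha, quality-gate-bravo]
    # Run even when a shard failed, so the failure can be reported.
    if: ${{ !cancelled() }}
    runs-on: ubuntu-latest
    steps:
      - name: Fail the gate when any shard did not succeed
        if: ${{ needs.quality-gate-alpha.result != 'success' || needs.quality-gate-bravo.result != 'success' }}
        run: exit 1

  deploy-demo:
    # No explicit if: a failed evaluate job skips this job through needs.
    needs: [deploy-stage, quality-gate-evaluate]
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy to demo"

  deploy-prod:
    needs: [deploy-stage, quality-gate-evaluate]
    environment: prod   # assumed environment name carrying the approval rule
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy to prod"
```

The design choice is that the dependency graph itself carries the gate: no consumer has to read or compare an output string.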

The .npmrc reads _authToken=${GITHUB_TOKEN} to authenticate against npm.pkg.github.com for @arda-cards/* packages. PR #808 added the missing workflow-level

    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

block to deploy.yaml, matching ci.yaml and e2e.yaml. The omission was masked by warm ~/.npm cache hits (the package tarball was served locally and never required a registry hit) and surfaced when the cache was cold.

A spec is quarantined by adding @quarantine(YYYY-MM-DD, #issue) to its title:

    test('TC-NAV-004 sidebar toggle @acceptance @quarantine(2026-05-04, #795)', async () => { ... });

.github/actions/quarantine-check/action.yml validates every @quarantine(...) tag in e2e/specs/ against e2e/quarantine.config.json:

| Output | Meaning |
| --- | --- |
| budget_used | Number of currently quarantined tests |
| budget_max | Configured maximum (default: 5) |
| budget_exhausted | true iff budget_used > budget_max |
| violations | Count of tags with missing fields, expired dates, expiry too far in the future, or over-budget |
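
A consumer of these outputs might look like the fragment below. It assumes the composite action needs no inputs and exposes its outputs as strings; the job name and the step id quarantine are placeholders.

```yaml
# Hypothetical fragment: read the composite action's outputs in a later step.
jobs:
  quality-gate-evaluate:
    runs-on: ubuntu-latest
    steps:
      # A local action can only be resolved after the repository is checked out.
      - uses: actions/checkout@v5
      - id: quarantine
        uses: ./.github/actions/quarantine-check
      - name: Report budget usage
        env:
          USED: ${{ steps.quarantine.outputs.budget_used }}
          MAX: ${{ steps.quarantine.outputs.budget_max }}
        run: echo "Quarantine budget $USED of $MAX" >> "$GITHUB_STEP_SUMMARY"
      - name: Block when the budget is exhausted
        # Step outputs are strings, so compare against 'true'.
        if: ${{ steps.quarantine.outputs.budget_exhausted == 'true' }}
        run: exit 1
```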

| Phase | Behavior |
| --- | --- |
| Local | Test runs (no grep filter applied locally) |
| Fast Gate (PR) | E2E shards are skipped (pass-through summary). quarantine-check validates tags. |
| Queue Gate (merge_group) | Sanity + acceptance shards run with --grep-invert "@quarantine"; quarantined tests are excluded from the merge gate. |
| Post-merge quality-gate-shards | Excludes @quarantine (same --grep-invert). |
| Post-merge quality-gate-quarantine | Runs ONLY @quarantine tests. Non-blocking (continue-on-error: true). Emits a step summary table and a quarantine-results.json artifact for future metrics. |
| Post-merge quality-gate-evaluate | Checks the quarantine BUDGET (count + violations) but ignores the quarantine job’s pass/fail. |
| Nightly | All tests run (no exclusion); failures create issues. |

The previous post-merge-e2e.yaml workflow excluded @quarantine from its shards entirely, which conflicted with the documented lifecycle (“Post-merge: Run”). PR #807 restores the post-merge run via a dedicated job. It is non-blocking by design:

  • Job-level continue-on-error: true prevents quarantined failures from flipping the workflow conclusion.
  • No downstream needs references the job — deploy-demo / deploy-prod are unaffected.
  • Failures upload playwright/test-results/ artifacts when steps.run-quarantine.outcome == 'failure' (not if: failure(), which would not match because of the job-level continue-on-error).

This satisfies the “still run them post-merge” requirement from the lifecycle doc without re-coupling deploys to flaky tests.
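
The non-blocking job described above could be sketched as follows. The artifact name and install steps are assumptions, and the `!cancelled()` guard is an addition of this sketch so that the outcome condition is still evaluated after the test step fails; the actual workflow may arrange this differently.

```yaml
# Hypothetical fragment: a non-blocking job that runs only quarantined tests.
jobs:
  quality-gate-quarantine:
    runs-on: ubuntu-latest
    # A failure here does not flip the workflow conclusion.
    continue-on-error: true
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      - run: npx playwright install --with-deps
      - id: run-quarantine
        run: npx playwright test --grep "@quarantine"
      - name: Upload failure artifacts
        # Keyed on the step's own outcome rather than on failure().
        if: ${{ !cancelled() && steps.run-quarantine.outcome == 'failure' }}
        uses: actions/upload-artifact@v4
        with:
          name: quarantine-test-results
          path: playwright/test-results/
```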

The following emerged during Phase 2 implementation and are worth documenting for future workflow changes:

  1. CLQ validates the entire CHANGELOG, not just new entries. The version chain must be strictly sequential. Manual catch-up entries must use the correct category-to-bump mapping.
  2. actions/upload-artifact@v4 skips hidden directories (.next/) and rejects colons in filenames. Use tar cf before upload.
  3. ${{ github.event.head_commit.message }} breaks shell on em-dashes, backticks, and similar. Pass via an env: variable instead.
  4. GITHUB_TOKEN pushes do not trigger downstream workflows. The assembly uses CHANGELOG_ASSEMBLY_TOKEN (a PAT) so the push does trigger Deploy Frontend.
  5. GITHUB_TOKEN cannot access GitHub Projects. Use ARDA_GH_ACTION_PROJECT_WRITER for gh project item-add.
  6. Required checks must report on both pull_request and merge_group. Use pass-through patterns (e2e summary auto-passes on PR; changelog-check pass-through in queue).
  7. Workflow renames break workflow_run triggers. When Deploy Frontend was wired up, its trigger initially referenced "ci" after that workflow had been renamed to "CI Fast Gate".
  8. Pushing to a queued branch is blocked. Defer fixes to a follow-up PR when the branch is in the merge queue.
  9. New workflow files trigger on push even without a push trigger configured (GitHub one-time detection on first appearance).
  10. continue-on-error: true makes failure() unreliable. Use steps.<id>.outcome == 'failure' for per-step conditional uploads.
  11. gh has no -C <path> flag. Use gh -R <owner>/<repo> to target a repository without changing directory.
  12. Always run the full local check suite before pushing any workflow change that exercises bash/jq scripts. The Phase 2 follow-up bug on issue #795 was caused by a single quote inside an inline jq comment, which closed the surrounding shell quoting early.
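
Two of these lessons (passing the commit message through env:, and using -R with gh) can be illustrated in a short fragment. The job name, step names, and the COMMIT_MESSAGE variable are hypothetical, and the `on:` block is omitted.

```yaml
# Hypothetical fragment illustrating lessons 3 and 11.
jobs:
  inspect:
    runs-on: ubuntu-latest
    steps:
      - name: Read the commit message without inlining it into the script
        env:
          COMMIT_MESSAGE: ${{ github.event.head_commit.message }}
        run: |
          # The shell sees an ordinary variable, so backticks and quotes
          # in the message are never parsed as shell syntax.
          printf '%s\n' "$COMMIT_MESSAGE" | head -n 1
      - name: Target a repository with -R instead of changing directory
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: gh issue list -R "$GITHUB_REPOSITORY" --limit 5
```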

| File | Purpose |
| --- | --- |
| .github/workflows/ci.yaml | CI Fast Gate |
| .github/workflows/e2e.yaml | E2E Queue Gate |
| .github/workflows/changelog-check.yaml | PR-body CHANGELOG validator |
| .github/workflows/changelog-assembly.yaml | Post-merge assembly + CLQ + Release |
| .github/workflows/deploy.yaml | Deploy Frontend |
| .github/workflows/nightly-e2e.yaml | Nightly WebKit + Mobile Safari |
| .github/workflows/metrics.yaml | Weekly pipeline metrics |
| .github/workflows/flaky-test-aggregation.yaml | Weekly flaky-test aggregation |
| .github/actions/quarantine-check/action.yml | Reusable quarantine budget check |
| .github/clq/changemap.json | Category → SemVer bump mapping |
| e2e/quarantine.config.json | Quarantine budget + expiry config |
| scripts/quarantine-validator.sh | Quarantine tag validator (used by composite action) |
| scripts/flaky-signal-collector.sh | Flaky signal collector |
| knowledge-base/pr-body-changelog.md | PR-body CHANGELOG process docs |
| knowledge-base/flaky-test-quarantine.md | Quarantine system docs |
| playwright.config.ts | retries: 2 in CI |