Phase 2 Design — Tiered Gates, Merge Queue, Deploy Quality Gate
This document describes the design of the Phase 2 pipeline. It is a follow-up to the Phase 1 Frontend Pipeline work, which migrated deployment off Amplify branch-sync and onto GitHub Actions.
- Reduce serialization of merges by adopting GitHub’s merge queue: PRs can be queued in parallel, batched, and tested against the rebased merge commit.
- Tier the checks so PR authors get fast feedback (~5 min) on every push and the expensive E2E coverage runs only inside the queue.
- Automate the release flow end-to-end: PR-body CHANGELOG → assembly commit on `main` → CLQ validation → GitHub Release → Deploy Frontend triggered by `workflow_run`.
- Gate `prod` deployment on Extended E2E + quarantine budget, without coupling the deploy workflow to a separate post-merge workflow.
- Preserve quarantine-test signal post-merge for the weekly flaky-test aggregation, without letting quarantined failures block deployment.
Workflow inventory
| Workflow file | Trigger | Purpose |
|---|---|---|
| `ci.yaml` | `pull_request`, `merge_group` | CI Fast Gate: lint, build, unit-tests-coverage |
| `e2e.yaml` | `pull_request`, `merge_group`, `workflow_dispatch` | E2E Queue Gate: sanity + acceptance shards (skipped on PR, real on `merge_group`) |
| `changelog-check.yaml` | `pull_request`, `issue_comment` | Validates `## CHANGELOG` in PR body, rejects direct edits to `CHANGELOG.md` |
| `changelog-assembly.yaml` | `push` (main) | Assemble PR-body CHANGELOG entries, compute version, run CLQ, create GitHub Release |
| `deploy.yaml` | `workflow_run` (Changelog Assembly), `workflow_dispatch` | Deploy Frontend (dev → stage → demo + prod, gated by inline quality gate) |
| `nightly-e2e.yaml` | `schedule` (nightly) | WebKit + Mobile Safari E2E |
| `metrics.yaml` | `schedule` (weekly) | Pipeline health metrics → tracking issue |
| `flaky-test-aggregation.yaml` | `schedule` (weekly) | Aggregate `flaky-signals*` artifacts → flaky-test issues |
The previously separate post-merge-e2e.yaml was removed in PR #805 when its responsibilities moved into deploy.yaml’s quality gate. See decisions.md, DQ-PIPELINE-002.
Tiered gates
Fast Gate (PR push, ~5 min)
Runs on every PR push. Fails fast and gives the author actionable feedback before approval.

| Required check | Source workflow | What it does |
|---|---|---|
| `lint` | `ci.yaml` | ESLint |
| `build` | `ci.yaml` | Next.js production build (includes typecheck) |
| `unit-tests-coverage` | `ci.yaml` | Jest with `coverageThreshold` enforcement |
| `changelog-check` | `changelog-check.yaml` | Validates `## CHANGELOG` in PR body or author comments |
| `e2e` | `e2e.yaml` | Pass-through summary; shard jobs skip on `pull_request` |
| `quarantine-check` | `ci.yaml` | Reusable composite action: validates `@quarantine` tags and budget |
Queue Gate (merge_group, ~10–15 min)
Runs inside the GitHub merge queue against the rebased merge commit. Lint/build/unit-tests-coverage re-run with zero added wall time, because they finish before the E2E shards.

| Required check | Source workflow | Notes |
|---|---|---|
| `e2e-sanity-{alpha,bravo}` | `e2e.yaml` | `--grep "@sanity" --grep-invert "@quarantine"` |
| `e2e-acceptance-{alpha,bravo,charlie}` | `e2e.yaml` | `--grep "@acceptance" --grep-invert "@quarantine"` |
The build step runs once per queue entry and uploads a tar artifact; all shard jobs download it instead of re-building.
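The shard filters in the table above combine an include pattern with an exclude pattern. The sketch below models that selection in a simplified way; `selectedByShard` is a hypothetical helper, and it assumes the patterns are matched against the test title only, which is narrower than what Playwright actually matches against.

```typescript
// Simplified model of `--grep <grep> --grep-invert <grepInvert>`.
// Assumption: both patterns are regular expressions tested against
// the test title only.
function selectedByShard(
  title: string,
  grep: string,
  grepInvert: string
): boolean {
  const included = new RegExp(grep).test(title);
  const excluded = new RegExp(grepInvert).test(title);
  // A test runs in the shard only if it matches the include pattern
  // and does not match the exclude pattern.
  return included && !excluded;
}
```

Under this model, a title tagged `@acceptance @quarantine(...)` is rejected by the acceptance shards even though it matches `@acceptance`.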
Post-Merge (push to main)
The assembly workflow runs first; everything else cascades from it.

```
push to main (PR-body merge commit)
└── changelog-assembly.yaml
    ├── extract `## CHANGELOG` entries (PR description, last comment wins)
    ├── compute SemVer bump from `.github/clq/changemap.json`
    ├── update CHANGELOG.md, package.json, package-lock.json
    ├── commit `[changelog-assembly]` (using CHANGELOG_ASSEMBLY_TOKEN PAT, not GITHUB_TOKEN)
    ├── push to main → triggers a re-entry to changelog-assembly.yaml that SKIPS via the marker check
    ├── run CLQ validation
    ├── create GitHub Release
    └── (workflow conclusion: success) → triggers Deploy Frontend via workflow_run
```

The `[changelog-assembly]` marker in the commit message is the loop-prevention guard: workflows that would re-run on the assembly commit (lint, build, e2e, etc.) check for the marker and skip.
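The loop-prevention guard amounts to a substring check on the head commit message. A minimal sketch of that check follows; `isAssemblyCommit` is a hypothetical name, and the real workflows may well express the same test as an `if:` condition rather than in code.

```typescript
// Marker text as it appears in the assembly commit message.
const ASSEMBLY_MARKER = "[changelog-assembly]";

// Returns true when the commit was produced by the changelog assembly,
// in which case the calling workflow should skip its work.
// `commitMessage` is a hypothetical input standing in for the head
// commit message from the event payload.
function isAssemblyCommit(commitMessage: string): boolean {
  return commitMessage.indexOf(ASSEMBLY_MARKER) !== -1;
}
```

A plain substring match is used here so that the marker can appear anywhere in the message.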
Deploy Frontend quality gate
The deploy workflow runs the deploy chain and the quality gate in parallel, joining at `deploy-demo`/`deploy-prod`:

```
Changelog Assembly succeeds (t=0)
│
├── source-info (~5s)
│   └── deploy-dev (~6 min)
│       └── deploy-stage (~6 min)
│           └── deploy-demo (after evaluate; ~3 min Amplify deploy)
│           └── deploy-prod (after evaluate; environment-approval gated)
│
└── quality-gate-build (~3 min, parallel with source-info dependents)
    ├── quality-gate-alpha (~8 min)
    ├── quality-gate-bravo (~8 min)
    ├── quality-gate-quarantine (continue-on-error, ~5–10 min)
    └── quality-gate-evaluate (depends on shards; ~10s)
        ├── checks quarantine BUDGET via reusable action
        ├── creates GitHub issue if E2E shards failed
        └── exit 1 if E2E failed OR budget exhausted (blocks demo/prod via `needs`)
```

Why pin checkouts to the deploy SHA
`workflow_run` events fire with `github.event.workflow_run.head_sha` set to the assembly commit. Without an explicit `ref:`, `actions/checkout` defaults to the workflow's branch HEAD, which can drift if later commits land on `main` between assembly and gate execution. All quality-gate jobs (build, shards, quarantine, evaluate) explicitly set:

```yaml
- uses: actions/checkout@v5
  with:
    ref: ${{ github.event_name == 'workflow_run' && github.event.workflow_run.head_sha || github.sha }}
```

so the gate evaluates the same commit being deployed.
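The `a && b || c` expression in the `ref:` value can be read as a small decision function. The sketch below restates it; `GateContext` and `resolveCheckoutRef` are hypothetical names, and the context shape is a trimmed-down assumption rather than the real Actions context.

```typescript
// Hypothetical, trimmed-down view of the GitHub Actions context.
interface GateContext {
  eventName: string; // github.event_name
  sha: string; // github.sha
  workflowRunHeadSha?: string; // github.event.workflow_run.head_sha
}

// Mirrors `event_name == 'workflow_run' && head_sha || sha`.
// An empty or missing head_sha is falsy, so it falls through to `sha`,
// just as it would in the expression.
function resolveCheckoutRef(ctx: GateContext): string {
  if (ctx.eventName === "workflow_run" && ctx.workflowRunHeadSha) {
    return ctx.workflowRunHeadSha;
  }
  return ctx.sha;
}
```

For a `workflow_dispatch` run there is no `workflow_run` payload, so the function returns `sha`, which matches the fallback branch of the expression.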
Why quality-gate-evaluate exits 1
The earlier polling design (PR #803) emitted a `prod_blocked` boolean output, and consumers gated on `if: ... != 'true'`. That worked but coupled the dependency graph to a string output. PR #805 replaced it with `exit 1` on failure: GitHub then automatically skips dependent jobs (`deploy-demo`, `deploy-prod`) via the `needs` chain. `stage-annotation` runs only when evaluate fails, surfacing a warning on the run summary.
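The evaluate step's decision can be summarized as a pure function from shard results and budget state to an exit code. This is a sketch under assumptions: `GateInputs` and `evaluateGate` are hypothetical names, and only a shard result of `"failure"` is treated as a failed E2E run.

```typescript
type JobResult = "success" | "failure" | "cancelled" | "skipped";

// Hypothetical inputs to the evaluate step.
interface GateInputs {
  shardResults: JobResult[]; // blocking E2E shards
  budgetExhausted: boolean; // from the quarantine-check action
  quarantineResult?: JobResult; // accepted but deliberately ignored
}

// Returns the process exit code: 1 blocks deploy-demo/deploy-prod
// through the `needs` chain, 0 lets them proceed.
function evaluateGate(inputs: GateInputs): number {
  const e2eFailed = inputs.shardResults.some((r) => r === "failure");
  return e2eFailed || inputs.budgetExhausted ? 1 : 0;
}
```

Note that `quarantineResult` never influences the return value, which reflects the rule that quarantined failures do not block deployment.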
Workflow-level GITHUB_TOKEN for npm ci
The `.npmrc` reads `_authToken=${GITHUB_TOKEN}` to authenticate against `npm.pkg.github.com` for `@arda-cards/*` packages. PR #808 added the missing workflow-level

```yaml
env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

block to `deploy.yaml`, matching `ci.yaml` and `e2e.yaml`. The omission was masked by warm `~/.npm` cache hits (the package tarball was served locally and never required a registry hit) and surfaced when the cache was cold.
Quarantine system
A spec is quarantined by adding `@quarantine(YYYY-MM-DD, #issue)` to its title:

```typescript
test('TC-NAV-004 sidebar toggle @acceptance @quarantine(2026-05-04, #795)', async () => { ... });
```

Reusable composite action
`.github/actions/quarantine-check/action.yml` validates every `@quarantine(...)` tag in `e2e/specs/` against `e2e/quarantine.config.json`:

| Output | Meaning |
|---|---|
| `budget_used` | Number of currently quarantined tests |
| `budget_max` | Configured maximum (default: 5) |
| `budget_exhausted` | `true` iff `budget_used > budget_max` |
| `violations` | Count of tags with missing fields, expired dates, expiry too far in the future, or over-budget |
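To make the outputs above concrete, here is a rough sketch of how such a check could be computed from test titles. It is not the composite action's implementation: `parseQuarantineTag` and `checkBudget` are hypothetical names, only two violation kinds are modeled (malformed tags and expired dates), and treating a tag as expired when its date is strictly before today is an assumption.

```typescript
interface QuarantineTag {
  expires: string; // YYYY-MM-DD
  issue: number;
}

// Matches `@quarantine(YYYY-MM-DD, #123)` inside a test title.
const TAG_PATTERN = /@quarantine\((\d{4}-\d{2}-\d{2}),\s*#(\d+)\)/;

function parseQuarantineTag(title: string): QuarantineTag | null {
  const m = TAG_PATTERN.exec(title);
  if (m === null) return null;
  return { expires: m[1], issue: Number(m[2]) };
}

interface BudgetReport {
  budgetUsed: number;
  budgetMax: number;
  budgetExhausted: boolean;
  violations: number;
}

// `today` is passed in as YYYY-MM-DD so the function stays deterministic.
function checkBudget(
  titles: string[],
  budgetMax: number,
  today: string
): BudgetReport {
  let used = 0;
  let violations = 0;
  for (const title of titles) {
    if (title.indexOf("@quarantine") === -1) continue; // not quarantined
    const tag = parseQuarantineTag(title);
    if (tag === null) {
      violations += 1; // tag present but missing or malformed fields
      continue;
    }
    used += 1;
    // ISO dates in YYYY-MM-DD form order correctly as plain strings.
    if (tag.expires < today) violations += 1;
  }
  return {
    budgetUsed: used,
    budgetMax,
    budgetExhausted: used > budgetMax,
    violations,
  };
}
```

The `budgetExhausted` comparison uses the strict `>` stated in the table, so a count equal to the maximum does not exhaust the budget.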
Lifecycle phases
| Phase | Behavior |
|---|---|
| Local | Test runs (no grep filter applied locally) |
| Fast Gate (PR) | E2E shards are skipped (pass-through summary). `quarantine-check` validates tags. |
| Queue Gate (`merge_group`) | Sanity + acceptance shards run with `--grep-invert "@quarantine"`; quarantined tests are excluded from the merge gate. |
| Post-merge `quality-gate-shards` | Excludes `@quarantine` (same `--grep-invert`). |
| Post-merge `quality-gate-quarantine` | Runs ONLY `@quarantine` tests. Non-blocking (`continue-on-error: true`). Emits a step summary table and a `quarantine-results.json` artifact for future metrics. |
| Post-merge `quality-gate-evaluate` | Checks the quarantine BUDGET (count + violations) but ignores the quarantine job’s pass/fail. |
| Nightly | All tests run (no exclusion); failures create issues. |
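The lifecycle table can be condensed into a single predicate over a test title and a phase. The sketch below is illustrative only: the `Phase` names and `phaseRunsTest` are hypothetical, and it looks solely at the `@quarantine` tag, ignoring the `@sanity`/`@acceptance` filters and browser selection that the real jobs also apply.

```typescript
// Hypothetical phase names for the rows of the lifecycle table
// that actually execute E2E tests (or skip them).
type Phase =
  | "local"
  | "fast-gate"
  | "queue-gate"
  | "post-merge-shards"
  | "post-merge-quarantine"
  | "nightly";

// Returns true when a test with this title would run in the phase,
// considering only whether it carries a @quarantine tag.
function phaseRunsTest(title: string, phase: Phase): boolean {
  const quarantined = title.indexOf("@quarantine") !== -1;
  if (phase === "fast-gate") {
    return false; // E2E shards are skipped on pull_request
  }
  if (phase === "queue-gate" || phase === "post-merge-shards") {
    return !quarantined; // --grep-invert "@quarantine"
  }
  if (phase === "post-merge-quarantine") {
    return quarantined; // runs ONLY quarantined tests
  }
  return true; // local and nightly apply no exclusion
}
```

Read this way, a quarantined test still runs locally, in the dedicated post-merge job, and nightly, but never inside a blocking gate.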
Why the quarantine job is non-blocking
The previous `post-merge-e2e.yaml` workflow excluded `@quarantine` from its shards entirely, which conflicted with the documented lifecycle (“Post-merge: Run”). PR #807 restores the post-merge run via a dedicated job. It is non-blocking by design:
- Job-level `continue-on-error: true` prevents quarantined failures from flipping the workflow conclusion.
- No downstream `needs` references the job, so `deploy-demo`/`deploy-prod` are unaffected.
- Failures upload `playwright/test-results/` artifacts when `steps.run-quarantine.outcome == 'failure'` (not `if: failure()`, which would not match because of the job-level `continue-on-error`).
This satisfies the “still run them post-merge” requirement from the lifecycle doc without re-coupling deploys to flaky tests.
Lessons learned (gotchas)
The following emerged during Phase 2 implementation and are worth documenting for future workflow changes:
- CLQ validates the entire CHANGELOG, not just new entries. The version chain must be strictly sequential. Manual catch-up entries must use the correct category-to-bump mapping.
- `actions/upload-artifact@v4` skips hidden directories (`.next/`) and rejects colons in filenames. Use `tar cf` before upload.
- `${{ github.event.head_commit.message }}` breaks shell on em-dashes, backticks, and similar. Pass via an `env:` variable instead.
- `GITHUB_TOKEN` pushes do not trigger downstream workflows. The assembly uses `CHANGELOG_ASSEMBLY_TOKEN` (a PAT) so the push does trigger `Deploy Frontend`.
- `GITHUB_TOKEN` cannot access GitHub Projects. Use `ARDA_GH_ACTION_PROJECT_WRITER` for `gh project item-add`.
- Required checks must report on both `pull_request` and `merge_group`. Use pass-through patterns (`e2e` summary auto-passes on PR; `changelog-check` pass-through in queue).
- Workflow renames break `workflow_run` triggers. When `Deploy Frontend` was wired up, its trigger initially referenced `"ci"` after that workflow had been renamed to `"CI Fast Gate"`.
- Pushing to a queued branch is blocked. Defer fixes to a follow-up PR when the branch is in the merge queue.
- New workflow files trigger on push even without a `push` trigger configured (GitHub one-time detection on first appearance).
- `continue-on-error: true` makes `failure()` unreliable. Use `steps.<id>.outcome == 'failure'` for per-step conditional uploads.
- `gh` has no `-C <path>` flag. Use `gh -R <owner>/<repo>` to target a repository without changing directory.
- Always run the full local check suite before any push to a workflow that exercises bash/jq scripts. The Phase 2 follow-up bug on issue #795 was caused by a single-quote inside an inline jq comment that escaped shell quoting.
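The first gotcha depends on a category-to-bump mapping. The sketch below shows one way such a mapping could drive the version computation. Everything in it is an assumption made for illustration: the shape of the map, the category names used in the usage example, the default of `patch` for unknown categories, and the helper names `highestBump` and `applyBump`. The actual format of `.github/clq/changemap.json` may differ.

```typescript
type Bump = "major" | "minor" | "patch";

// Ordering used to pick the strongest bump among several categories.
const BUMP_RANK: Record<Bump, number> = { patch: 0, minor: 1, major: 2 };

// Picks the strongest bump implied by the entry categories.
// Assumption: unknown categories contribute nothing and the
// result defaults to "patch".
function highestBump(
  categories: string[],
  changemap: Record<string, Bump>
): Bump {
  let best: Bump = "patch";
  for (const category of categories) {
    const bump = changemap[category];
    if (bump !== undefined && BUMP_RANK[bump] > BUMP_RANK[best]) {
      best = bump;
    }
  }
  return best;
}

// Applies a bump to a plain MAJOR.MINOR.PATCH version string.
function applyBump(version: string, bump: Bump): string {
  const parts = version.split(".").map(Number);
  const major = parts[0];
  const minor = parts[1];
  const patch = parts[2];
  if (bump === "major") return `${major + 1}.0.0`;
  if (bump === "minor") return `${major}.${minor + 1}.0`;
  return `${major}.${minor}.${patch + 1}`;
}
```

With a hypothetical map of `{ Added: "minor", Fixed: "patch" }`, entries in both categories on top of `1.4.2` would yield `1.5.0`, which is the kind of result a manual catch-up entry has to reproduce by hand.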
Files of record
| File | Purpose |
|---|---|
| `.github/workflows/ci.yaml` | CI Fast Gate |
| `.github/workflows/e2e.yaml` | E2E Queue Gate |
| `.github/workflows/changelog-check.yaml` | PR-body CHANGELOG validator |
| `.github/workflows/changelog-assembly.yaml` | Post-merge assembly + CLQ + Release |
| `.github/workflows/deploy.yaml` | Deploy Frontend |
| `.github/workflows/nightly-e2e.yaml` | Nightly WebKit + Mobile Safari |
| `.github/workflows/metrics.yaml` | Weekly pipeline metrics |
| `.github/workflows/flaky-test-aggregation.yaml` | Weekly flaky-test aggregation |
| `.github/actions/quarantine-check/action.yml` | Reusable quarantine budget check |
| `.github/clq/changemap.json` | Category → SemVer bump mapping |
| `e2e/quarantine.config.json` | Quarantine budget + expiry config |
| `scripts/quarantine-validator.sh` | Quarantine tag validator (used by composite action) |
| `scripts/flaky-signal-collector.sh` | Flaky signal collector |
| `knowledge-base/pr-body-changelog.md` | PR-body CHANGELOG process docs |
| `knowledge-base/flaky-test-quarantine.md` | Quarantine system docs |
| `playwright.config.ts` | `retries: 2` in CI |
Copyright: © Arda Systems 2025-2026, All rights reserved