Phase 2 Design — Tiered Gates, Merge Queue, Deploy Quality Gate

This document describes the design of the Phase 2 pipeline. It is a follow-up to the Phase 1 Frontend Pipeline work, which migrated deployment off Amplify branch-sync and onto GitHub Actions.

Goals

Reduce serialization of merges by adopting GitHub’s merge queue: PRs can be queued in parallel, batched, and tested against the rebased merge commit.
Tier the checks so PR authors get fast feedback (~5 min) on every push and the expensive E2E coverage runs only inside the queue.
Automate the release flow end-to-end: PR-body CHANGELOG → assembly commit on main → CLQ validation → GitHub Release → Deploy Frontend triggered by workflow_run.
Gate demo and prod deployment on Extended E2E + quarantine budget, without coupling the deploy workflow to a separate post-merge workflow.
Preserve quarantine-test signal post-merge for the weekly flaky-test aggregation, without letting quarantined failures block deployment.

Workflow inventory

Workflow file	Trigger	Purpose
`ci.yaml`	`pull_request`, `merge_group`	CI Fast Gate — `lint`, `build`, `unit-tests-coverage`
`e2e.yaml`	`pull_request`, `merge_group`, `workflow_dispatch`	E2E Queue Gate — sanity + acceptance shards (skipped on PR, real on `merge_group`)
`changelog-check.yaml`	`pull_request`, `issue_comment`	Validates `## CHANGELOG` in PR body, rejects direct edits to `CHANGELOG.md`
`changelog-assembly.yaml`	`push` (main)	Assemble PR-body CHANGELOG entries, compute version, run CLQ, create GitHub Release
`deploy.yaml`	`workflow_run` (Changelog Assembly), `workflow_dispatch`	Deploy Frontend (dev → stage → demo + prod, gated by inline quality gate)
`nightly-e2e.yaml`	`schedule` (nightly)	WebKit + Mobile Safari E2E
`metrics.yaml`	`schedule` (weekly)	Pipeline health metrics → tracking issue
`flaky-test-aggregation.yaml`	`schedule` (weekly)	Aggregate `flaky-signals*` artifacts → flaky-test issues

The previously separate post-merge-e2e.yaml was removed in PR #805 when its responsibilities moved into deploy.yaml’s quality gate. See decisions.md, DQ-PIPELINE-002.

Tiered gates

Fast Gate (PR push, ~5 min)

Runs on every PR push. Fails fast and gives the author actionable feedback before approval.

Required check	Source workflow	What it does
`lint`	`ci.yaml`	ESLint
`build`	`ci.yaml`	Next.js production build (includes typecheck)
`unit-tests-coverage`	`ci.yaml`	Jest with `coverageThreshold` enforcement
`changelog-check`	`changelog-check.yaml`	Validates `## CHANGELOG` in PR body or author comments
`e2e`	`e2e.yaml`	Pass-through summary; shard jobs skip on `pull_request`
`quarantine-check`	`ci.yaml`	Reusable composite action — validates `@quarantine` tags and budget
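
The `e2e` row above relies on a pass-through: the required check has to report on `pull_request` even though the shards only run on `merge_group`. The following is a minimal sketch of one way to get that behavior, not the actual contents of `e2e.yaml`. The single shard job, the trimmed step list, and the shell check are assumptions made for illustration.

```yaml
# Sketch only: shard layout and step bodies are illustrative assumptions.
name: E2E Queue Gate
on:
  pull_request:
  merge_group:

jobs:
  e2e-sanity-alpha:
    # Real work happens only inside the merge queue; on pull_request the job is skipped.
    if: github.event_name == 'merge_group'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      - run: npx playwright test --grep "@sanity" --grep-invert "@quarantine" --shard=1/2

  e2e:
    # Summary job: always() makes it report on both events, so the required check resolves.
    needs: [e2e-sanity-alpha]
    if: ${{ always() }}
    runs-on: ubuntu-latest
    steps:
      - name: Fail only if a shard failed or was cancelled
        env:
          SHARD_RESULT: ${{ needs.e2e-sanity-alpha.result }}
        run: |
          # "skipped" (PR) and "success" (queue) both pass; anything else fails.
          case "$SHARD_RESULT" in
            success|skipped) echo "e2e: $SHARD_RESULT" ;;
            *) echo "e2e shard result: $SHARD_RESULT"; exit 1 ;;
          esac
```

Checking `needs.<job>.result` explicitly matters here: with `always()` the summary job would otherwise pass even when a shard failed.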

Queue Gate (`merge_group`, ~10–15 min)

Runs inside the GitHub merge queue against the rebased merge commit. Lint, build, and unit-tests-coverage re-run here but add no wall time, because they finish before the E2E shards.

Required check	Source workflow	Notes
`e2e-sanity-{alpha,bravo}`	`e2e.yaml`	`--grep "@sanity" --grep-invert "@quarantine"`
`e2e-acceptance-{alpha,bravo,charlie}`	`e2e.yaml`	`--grep "@acceptance" --grep-invert "@quarantine"`

The build step runs once per queue entry and uploads a tar artifact; all shard jobs download it instead of re-building.
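
A rough sketch of the build-once, download-everywhere pattern described above. The job names, the artifact name `next-build`, and the `npm run build` script are assumptions; only the tar-before-upload approach comes from this document.

```yaml
# Sketch only: job names, artifact name, and commands are illustrative assumptions.
on:
  merge_group:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      - run: npm run build
      # Pack .next/ into a single archive: upload-artifact@v4 skips hidden
      # directories by default and rejects some characters in file names.
      - run: tar -cf next-build.tar .next
      - uses: actions/upload-artifact@v4
        with:
          name: next-build
          path: next-build.tar
          retention-days: 1

  e2e-shard:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      - uses: actions/download-artifact@v4
        with:
          name: next-build
      # Restore .next/ in place so the shard can start the app without rebuilding.
      - run: tar -xf next-build.tar
```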

Post-Merge (push to `main`)

The assembly workflow runs first; everything else cascades from it.

push to main (PR-body merge commit)
└── changelog-assembly.yaml
      ├── extract `## CHANGELOG` entries (PR description, last comment wins)
      ├── compute SemVer bump from `.github/clq/changemap.json`
      ├── update CHANGELOG.md, package.json, package-lock.json
      ├── commit `[changelog-assembly]` (using CHANGELOG_ASSEMBLY_TOKEN PAT, not GITHUB_TOKEN)
      ├── push to main → triggers a re-entry to changelog-assembly.yaml that SKIPS via the marker check
      ├── run CLQ validation
      ├── create GitHub Release
      └── (workflow conclusion: success) → triggers Deploy Frontend via workflow_run

The [changelog-assembly] marker in the commit message is the loop-prevention guard: workflows that would re-run on the assembly commit (lint, build, e2e, etc.) check for the marker and skip.
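
One plausible shape for the marker check is a job-level `if:` on the head commit message. This is a sketch under that assumption; the job name `assemble` and the placeholder step are hypothetical, and the real workflows may implement the guard differently.

```yaml
# Sketch only: a job-level guard on the commit message marker.
on:
  push:
    branches: [main]

jobs:
  assemble:
    # Skip when the head commit is the assembly commit itself (loop prevention).
    if: ${{ !contains(github.event.head_commit.message, '[changelog-assembly]') }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          # PAT instead of GITHUB_TOKEN, as described in the diagram above.
          token: ${{ secrets.CHANGELOG_ASSEMBLY_TOKEN }}
      - run: echo "assemble changelog here"
```

Evaluating the message inside an expression keeps it out of any shell script, so special characters in the commit message cannot break the guard.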

Deploy Frontend quality gate

The deploy workflow runs the deploy chain and the quality gate in parallel, joining at deploy-demo/deploy-prod:

Changelog Assembly succeeds (t=0)
    │
    ├── source-info (~5s)
    │     └── deploy-dev (~6 min)
    │           └── deploy-stage (~6 min)
    │                 └── deploy-demo  (after evaluate; ~3 min Amplify deploy)
    │                 └── deploy-prod  (after evaluate; environment-approval gated)
    │
    └── quality-gate-build (~3 min, parallel with source-info dependents)
          ├── quality-gate-alpha (~8 min)
          ├── quality-gate-bravo (~8 min)
          ├── quality-gate-quarantine (continue-on-error, ~5–10 min)
          └── quality-gate-evaluate (depends on shards; ~10s)
                ├── checks quarantine BUDGET via reusable action
                ├── creates GitHub issue if E2E shards failed
                └── exit 1 if E2E failed OR budget exhausted (blocks demo/prod via `needs`)

Why pin checkouts to the deploy SHA

workflow_run events fire with github.event.workflow_run.head_sha set to the assembly commit. Without an explicit ref:, actions/checkout defaults to the workflow’s branch HEAD, which can drift if subsequent commits land on main between assembly and gate execution. All four quality-gate job groups (build, shards, quarantine, evaluate) explicitly set:

- uses: actions/checkout@v5
  with:
    ref: ${{ github.event_name == 'workflow_run' && github.event.workflow_run.head_sha || github.sha }}

so the gate evaluates the same commit being deployed.

Why `quality-gate-evaluate` exits 1

The earlier polling design (PR #803) emitted a prod_blocked boolean output, and consumers gated on if: ... != 'true'. That worked, but it coupled the dependency graph to a string output. PR #805 replaced it with exit 1 on failure: GitHub then automatically skips the dependent jobs (deploy-demo, deploy-prod) via the needs chain. The stage-annotation job runs only when evaluate fails and surfaces a warning on the run summary.
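
The needs-based gating can be sketched roughly as follows. Job names follow the diagram earlier in this section; the step bodies, the placeholder environment values, and the `prod` environment name are assumptions for illustration only.

```yaml
# Sketch only: step bodies are placeholders, not the real deploy.yaml.
name: Deploy Frontend
on:
  workflow_run:
    workflows: ["Changelog Assembly"]   # must match the upstream workflow name exactly
    types: [completed]
  workflow_dispatch:

jobs:
  quality-gate-evaluate:
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - name: Evaluate gate
        env:
          E2E_FAILED: "false"          # placeholder; the real job derives this from shard results
          BUDGET_EXHAUSTED: "false"    # placeholder; see the quarantine-check outputs
        run: |
          if [ "$E2E_FAILED" = "true" ] || [ "$BUDGET_EXHAUSTED" = "true" ]; then
            echo "Quality gate failed"
            exit 1
          fi

  deploy-prod:
    # No string-output check: if evaluate fails, this job is skipped automatically.
    needs: [quality-gate-evaluate]
    runs-on: ubuntu-latest
    environment: prod
    steps:
      - run: echo "deploy to prod"

  stage-annotation:
    needs: [quality-gate-evaluate]
    # always() lets the job run after a failed dependency; the result test limits it to failures.
    if: ${{ always() && needs.quality-gate-evaluate.result == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - run: |
          echo "::warning::Quality gate failed, demo and prod deploys were skipped"
```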

Workflow-level `GITHUB_TOKEN` for `npm ci`

The .npmrc reads _authToken=${GITHUB_TOKEN} to authenticate against npm.pkg.github.com for @arda-cards/* packages. PR #808 added the missing workflow-level

env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

block to deploy.yaml, matching ci.yaml and e2e.yaml. The omission was masked by warm ~/.npm cache hits (the package tarball was served locally and never required a registry hit) and surfaced when the cache was cold.

Quarantine system

Tag

A spec is quarantined by adding @quarantine(YYYY-MM-DD, #issue) to its title:

test('TC-NAV-004 sidebar toggle @acceptance @quarantine(2026-05-04, #795)', async () => { ... });

Reusable composite action

.github/actions/quarantine-check/action.yml validates every @quarantine(...) tag in e2e/specs/ against e2e/quarantine.config.json:

Output	Meaning
`budget_used`	Number of currently quarantined tests
`budget_max`	Configured maximum (default: 5)
`budget_exhausted`	`true` iff `budget_used > budget_max`
`violations`	Count of tags with missing fields, expired dates, expiry too far in the future, or over-budget
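
A hedged sketch of how a job might consume these outputs. The output names are taken from the table above; the step id `quarantine`, the assumption that the action needs no inputs, and the summary wording are illustrative.

```yaml
# Sketch only: step id and messages are illustrative; output names come from the table above.
jobs:
  quality-gate-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - id: quarantine
        uses: ./.github/actions/quarantine-check
      - name: Report and enforce the budget
        env:
          USED: ${{ steps.quarantine.outputs.budget_used }}
          MAX: ${{ steps.quarantine.outputs.budget_max }}
          EXHAUSTED: ${{ steps.quarantine.outputs.budget_exhausted }}
          VIOLATIONS: ${{ steps.quarantine.outputs.violations }}
        run: |
          echo "Quarantine budget: ${USED}/${MAX}, violations: ${VIOLATIONS}" >> "$GITHUB_STEP_SUMMARY"
          if [ "$EXHAUSTED" = "true" ]; then
            echo "Quarantine budget exhausted"
            exit 1
          fi
```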

Lifecycle phases

Phase	Behavior
Local	Test runs (no grep filter applied locally)
Fast Gate (PR)	E2E shards are skipped (pass-through summary). `quarantine-check` validates tags.
Queue Gate (`merge_group`)	Sanity + acceptance shards run with `--grep-invert "@quarantine"` — quarantined tests are excluded from the merge gate.
Post-merge `quality-gate-shards`	Excludes `@quarantine` (same `--grep-invert`).
Post-merge `quality-gate-quarantine`	Runs ONLY `@quarantine` tests. Non-blocking (`continue-on-error: true`). Emits a step summary table and a `quarantine-results.json` artifact for future metrics.
Post-merge `quality-gate-evaluate`	Checks the quarantine BUDGET (count + violations) but ignores the quarantine job’s pass/fail.
Nightly	All tests run (no exclusion); failures create issues.

Why the quarantine job is non-blocking

The previous post-merge-e2e.yaml workflow excluded @quarantine from its shards entirely, which conflicted with the documented lifecycle (“Post-merge: Run”). PR #807 restores the post-merge run via a dedicated job. It is non-blocking by design:

Job-level continue-on-error: true prevents quarantined failures from flipping the workflow conclusion.
No downstream needs references the job — deploy-demo / deploy-prod are unaffected.
Failures upload playwright/test-results/ artifacts when steps.run-quarantine.outcome == 'failure' (not if: failure(), which would not match because of the job-level continue-on-error).

This satisfies the “still run them post-merge” requirement from the lifecycle doc without re-coupling deploys to flaky tests.
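
The three properties above could be combined along these lines. This is a sketch, not the actual job: the Playwright command and the artifact name are assumptions, while the step id `run-quarantine` and the results path come from the text.

```yaml
# Sketch only: commands and artifact name are illustrative assumptions.
jobs:
  quality-gate-quarantine:
    runs-on: ubuntu-latest
    continue-on-error: true   # a failing quarantined test must not fail the workflow
    steps:
      - uses: actions/checkout@v5
      - run: npm ci
      - id: run-quarantine
        run: npx playwright test --grep "@quarantine"
      - name: Upload results for a failed quarantine run
        # !cancelled() replaces the implicit success() check, so this step is still
        # evaluated after a failure; the outcome test then selects failures only.
        if: ${{ !cancelled() && steps.run-quarantine.outcome == 'failure' }}
        uses: actions/upload-artifact@v4
        with:
          name: quarantine-test-results
          path: playwright/test-results/
```

Because no other job lists `quality-gate-quarantine` in its `needs`, its result never reaches the deploy jobs.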

Lessons learned (gotchas)

The following emerged during Phase 2 implementation and are worth documenting for future workflow changes:

CLQ validates the entire CHANGELOG, not just new entries. The version chain must be strictly sequential. Manual catch-up entries must use the correct category-to-bump mapping.
actions/upload-artifact@v4 skips hidden directories (.next/) and rejects colons in filenames. Use tar cf before upload.
${{ github.event.head_commit.message }} breaks shell on em-dashes, backticks, and similar. Pass via an env: variable instead.
GITHUB_TOKEN pushes do not trigger downstream workflows. The assembly uses CHANGELOG_ASSEMBLY_TOKEN (a PAT) so the push does trigger Deploy Frontend.
GITHUB_TOKEN cannot access GitHub Projects. Use ARDA_GH_ACTION_PROJECT_WRITER for gh project item-add.
Required checks must report on both pull_request and merge_group. Use pass-through patterns (e2e summary auto-passes on PR; changelog-check pass-through in queue).
Workflow renames break workflow_run triggers. When Deploy Frontend was wired up, its trigger initially referenced "ci" after that workflow had been renamed to "CI Fast Gate".
Pushing to a queued branch is blocked. Defer fixes to a follow-up PR when the branch is in the merge queue.
New workflow files trigger on push even without a push trigger configured (GitHub one-time detection on first appearance).
continue-on-error: true makes failure() unreliable. Use steps.<id>.outcome == 'failure' for per-step conditional uploads.
gh has no -C <path> flag. Pass -R <owner>/<repo> (--repo) to the subcommand to target a repository without changing directory.
Always run the full local check suite before pushing any change to a workflow that exercises bash/jq scripts. The Phase 2 follow-up bug on issue #795 was caused by a single quote inside an inline jq comment, which ended the shell’s single-quoted string early.
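
For the head_commit.message item in the list above, the env: approach looks roughly like this. The job name and the grep check are assumptions chosen for illustration.

```yaml
# Sketch only: the job name and the grep check are illustrative assumptions.
on:
  push:
    branches: [main]

jobs:
  inspect-commit:
    runs-on: ubuntu-latest
    steps:
      - name: Read the commit message safely
        env:
          # The expression is expanded into the environment, not into the script text.
          COMMIT_MESSAGE: ${{ github.event.head_commit.message }}
        run: |
          # Quoting "$COMMIT_MESSAGE" keeps backticks, quotes, and dashes literal.
          if printf '%s' "$COMMIT_MESSAGE" | grep -qF '[changelog-assembly]'; then
            echo "assembly commit detected"
          fi
```

Interpolating the expression directly inside run: would paste the raw message into the script before the shell parses it, which is what breaks on backticks and quotes.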

Files of record

File	Purpose
`.github/workflows/ci.yaml`	CI Fast Gate
`.github/workflows/e2e.yaml`	E2E Queue Gate
`.github/workflows/changelog-check.yaml`	PR-body CHANGELOG validator
`.github/workflows/changelog-assembly.yaml`	Post-merge assembly + CLQ + Release
`.github/workflows/deploy.yaml`	Deploy Frontend
`.github/workflows/nightly-e2e.yaml`	Nightly WebKit + Mobile Safari
`.github/workflows/metrics.yaml`	Weekly pipeline metrics
`.github/workflows/flaky-test-aggregation.yaml`	Weekly flaky-test aggregation
`.github/actions/quarantine-check/action.yml`	Reusable quarantine budget check
`.github/clq/changemap.json`	Category → SemVer bump mapping
`e2e/quarantine.config.json`	Quarantine budget + expiry config
`scripts/quarantine-validator.sh`	Quarantine tag validator (used by composite action)
`scripts/flaky-signal-collector.sh`	Flaky signal collector
`knowledge-base/pr-body-changelog.md`	PR-body CHANGELOG process docs
`knowledge-base/flaky-test-quarantine.md`	Quarantine system docs
`playwright.config.ts`	`retries: 2` in CI