
Goal: Product Slow Responses

Lifecycle: Completed

Investigation and remediation of user-visible slowness in the Arda product, anchored by PDEV-442. The /items page is the dominant source of slowness; root causes are distributed across the frontend (N+1 fan-out), the operations service (bitemporal SQL and CPU-starved pods), the EKS cluster configuration (missing metrics-server, undersized Fargate tasks), and the database (missing observability + sub-optimal instance class).

This is a lightweight project: detailed scope and acceptance criteria live in the Linear issues. Detailed analyses live alongside this file in analysis/; per-sub-issue goal docs live in sibling sub-directories (pdev-490/goal.md, etc.) as those strands open their formal project phase.

Parent and the active / shipped sub-issues, all in PDEV:

  • PDEV-442 — Product is slow: parent investigation; anchors the work and carries the cross-component findings comments.
  • PDEV-488 — Pod Capacity Update: resize operations Fargate pods, parameterise Helm chart, ship a disabled HPA, add JVM memory tuning + GC logging + JFR. Status: shipped (operations chart 2.24.0).
  • PDEV-489 — Front-End Performance Improvements: eliminate the per-row kanban N+1 on /items. Status: active. Resolution path settled on a frontend-only consolidation onto the existing /v1/kanban/kanban-card/query route with Filter.In; the back-end prerequisite (composite index on kanban_card) ships from PDEV-490.
  • PDEV-490 — Operations Performance Improvements: composite bitemporal indexes on kanban_card and item; AWS Advanced JDBC Wrapper adoption for reader-endpoint routing and graceful Aurora failover; HTTP 503 + Retry-After for transient SQL; cardsForItem cleanup (drop wasted COUNT). Status: design complete; implementation pending. See pdev-490/goal.md for the formal project goal.
  • PDEV-491 — EKS Performance Metrics: install the metrics-server managed addon via the CDK construct, unblocking the HPA shipped under PDEV-488.
  • PDEV-479 — Review prod Aurora DB configuration: Aurora hygiene — instance-class sizing, parameter group, slow-query logging. Status: shipped.
  • PDEV-498 — pg_stat_statements extension: install pg_stat_statements in every application database in every partition. Status: shipped (postgres-database-initializer:2.5.0).
  • PDEV-500 — {Infra}-SentryDsn infrastructure secret: Sentry DSN as an InfrastructureSecretsStack resource, plus the ExternallySuppliedSecret CDK construct and the amm.sh 1P-read extension. Status: shipped.
  • PDEV-509 — Graceful failover degradation: sub-issue absorbed by PDEV-490. Its three application-side improvements (JVM DNS TTL, transient SQL → HTTP 503, retry-on-transient) are subsumed by the JDBC wrapper adoption; closes with PDEV-490.
  • PDEV-499 — RDS Proxy evaluation: closed as won’t-do — JDBC wrapper is incompatible with RDS Proxy by design; the connection-multiplexing benefit RDS Proxy would have provided is moot now that reader-endpoint routing absorbs the writer-pool ceiling pressure.
  • PDEV-534 — Evaluate lowering transactionIsolation: new sub-issue under PDEV-490, deferred. Runs after PDEV-490 ships so the post-wrapper, post-index baseline is the reference point.
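
The PDEV-489 consolidation described above replaces one kanban request per row with a single query that filters on the full set of item ids. A minimal sketch of that idea follows. The `KanbanCard` and `InFilter` shapes, the `itemId` field name, and the `queryCards` function are hypothetical stand-ins: the real request and response format of /v1/kanban/kanban-card/query and of Filter.In is not shown in this document.

```typescript
// Hypothetical shapes; the real payload of the kanban-card query route is not documented here.
interface KanbanCard {
  id: string;
  itemId: string;
}

// Hypothetical stand-in for a Filter.In clause: match rows whose field is in the list.
interface InFilter {
  field: string;
  in: string[];
}

// Injected transport so the sketch stays independent of any HTTP client.
type QueryCards = (filter: InFilter) => Promise<KanbanCard[]>;

// Issue one query for every visible row, then group the cards per item on the client.
async function cardsByItem(
  itemIds: string[],
  queryCards: QueryCards,
): Promise<Map<string, KanbanCard[]>> {
  const grouped = new Map<string, KanbanCard[]>();
  for (const id of itemIds) {
    grouped.set(id, []); // items without cards still get an empty entry
  }
  if (itemIds.length === 0) {
    return grouped; // nothing to ask for, so no request is sent
  }
  const uniqueIds = Array.from(new Set(itemIds));
  const cards = await queryCards({ field: "itemId", in: uniqueIds });
  for (const card of cards) {
    grouped.get(card.itemId)?.push(card); // ignore cards for ids we did not request
  }
  return grouped;
}
```

With this shape the number of kanban requests per page load stays at one (or zero for an empty page) no matter how many rows are rendered, which is the property the per-row fan-out lacked.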
| Repository | Role | Planned changes |
| --- | --- | --- |
| arda-frontend-app | Required | PDEV-489: frontend consolidation of the per-row kanban N+1 onto the existing /v1/kanban/kanban-card/query route with Filter.In; strip dead BFF logging; fold counts into the SSRM response. |
| operations | Required | PDEV-488 Helm chart (shipped); PDEV-490: Flyway migrations adding composite bitemporal indexes on kanban_card and item, the missing tenant_id index on kanban_card, the cardsForItem cleanup, and the operations-component consumer wiring (common-module pin bump, application.conf jdbcUrl scheme + retry knobs). |
| common-module | Required | PDEV-490 (single coordinated release): AWS Advanced JDBC Wrapper adoption, AppError.Transient + HTTP 503 + retry policy, decorative TENANT_ID_INDEX declaration removal. |
| infrastructure | Required | PDEV-491 CDK construct: metrics-server managed addon; PDEV-479 Aurora parameter group + sizing (shipped); PDEV-500 InfrastructureSecretsStack + ExternallySuppliedSecret construct + amm.sh Sentry-DSN extension (shipped). |
| postgres-database-initializer | Required | PDEV-498: CREATE EXTENSION IF NOT EXISTS pg_stat_statements; (shipped as postgres-database-initializer:2.5.0). |
| documentation | Required | This project’s docs (goal + per-sub-issue goals + analyses); PDEV-490 architecture pages and operational runbooks. |
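
PDEV-490 has the operations service answer transient SQL failures with HTTP 503 plus Retry-After. One way a caller could honor that response is sketched below. This is an assumption about consumer behavior, not the retry policy planned for common-module: `HttpResult`, `Send`, `Sleep`, `retryDelayMs`, and `sendWithRetry` are hypothetical names, and the attempt limit and fallback delay are illustrative values.

```typescript
// Hypothetical response shape; header names are assumed to be lower-cased already.
interface HttpResult {
  status: number;
  headers: Record<string, string>;
}

type Send = () => Promise<HttpResult>;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry-After can be delta-seconds or an HTTP-date. This sketch only reads
// delta-seconds and falls back to a fixed delay for anything else.
function retryDelayMs(headers: Record<string, string>, fallbackMs: number): number {
  const raw = headers["retry-after"];
  if (raw === undefined || raw.trim() === "") {
    return fallbackMs;
  }
  const seconds = Number(raw);
  return isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

// Resend while the server answers 503, up to maxAttempts total attempts.
// Any other status (success or failure) is returned to the caller untouched.
async function sendWithRetry(
  send: Send,
  maxAttempts: number = 3,
  sleep: Sleep = defaultSleep,
  fallbackMs: number = 500,
): Promise<HttpResult> {
  let result = await send();
  for (let attempt = 1; attempt < maxAttempts && result.status === 503; attempt++) {
    await sleep(retryDelayMs(result.headers, fallbackMs));
    result = await send();
  }
  return result;
}
```

Injecting `sleep` keeps the helper testable without real waiting. Bounding the attempts matters here: an unbounded retry loop during an Aurora failover would add load at the moment the service is least able to absorb it.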

The outcomes below are high-level only; verifiable definitions live in each sub-issue.

  1. /items page-load p95 drops materially against the Sentry baseline (today: 19.6 s).
  2. The N+1 fan-out is eliminated — the /items page issues at most one aggregate kanban request per load regardless of row count.
  3. The operations pod’s effective CPU is no longer the constraint — requests = limits on Fargate, with sizing per env and an HPA in place (gated until metrics-server is installed).
  4. The platform has working pod-level metrics: kubectl top pod returns numbers, and the HPA scales prod under business-hour load.
  5. Aurora has the observability tooling needed to spot the next regression — pg_stat_statements, slow-query log writing events, and a decision recorded on the instance class.
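
The first outcome compares a p95 against the 19.6 s baseline. For readers checking exported page-load samples by hand, a minimal sketch of a p95 calculation follows. It uses the nearest-rank method, which is an assumption: Sentry may compute its percentiles differently, so small differences from the dashboard figure are expected. The `percentile` function and the sample values are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile needs at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = samples.slice().sort((a, b) => a - b); // numeric sort on a copy
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Baseline taken from this document; the samples are made-up illustration data.
const baselineP95Seconds = 19.6;
const pageLoadSeconds = [1.2, 0.9, 3.4, 2.2, 1.1, 0.8, 4.0, 1.5, 2.9, 1.0];
const currentP95Seconds = percentile(pageLoadSeconds, 95);
const improved = currentP95Seconds < baselineP95Seconds;
```

Whether a given reduction counts as "material" is defined in the sub-issues, not by this comparison.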

In-repo (this directory):