Service Monitoring

Service monitoring ensures that the Arda backend components are available and responding correctly, and alerts the engineering team when intervention is needed. This initial version focuses on reactive monitoring: detecting failures and triggering corrective actions. Preventive monitoring will extend to the full infrastructure in a future phase.

System Under Monitoring

The monitored system consists of:

DNS resolution chain
API Gateway
VPC Link
Network Load Balancer (NLB)
nginx Ingress Controller (Kubernetes)
Component pods (Kubernetes)
Component dependencies (PostgreSQL database, third-party services)

Monitoring Signals and Alert Priority

API Gateway (P0)

The API Gateway is the primary monitoring target.

Signal	Condition	Alert
HTTP 5xx responses	Count exceeds threshold (TBD)	P0 alert
HTTP 4xx responses	Any count	No alert (user errors)
Request count	Significant deviation from expected pattern	Alert (pattern TBD)
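
The 5xx and 4xx rules in the table can be sketched as a small decision function. This is only an illustration: the names `GatewayWindow`, `gateway_alert`, and `ASSUMED_5XX_THRESHOLD` are hypothetical, the threshold value is a placeholder because the real one is still TBD, and the request-count deviation rule is left out because its pattern is not yet defined.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder only: the actual 5xx threshold is still TBD.
ASSUMED_5XX_THRESHOLD = 10


@dataclass(frozen=True)
class GatewayWindow:
    """Response counts observed at the API Gateway in one evaluation window."""
    count_5xx: int
    count_4xx: int


def gateway_alert(window: GatewayWindow,
                  threshold_5xx: int = ASSUMED_5XX_THRESHOLD) -> Optional[str]:
    """Return "P0" when the 5xx count exceeds the threshold, otherwise None.

    4xx responses are treated as user errors and never raise an alert,
    so count_4xx is deliberately ignored.
    """
    if window.count_5xx > threshold_5xx:
        return "P0"
    return None
```

With the placeholder threshold, a window with 11 server errors returns "P0", while a window with exactly 10 server errors and any number of 4xx responses returns None.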

Network Load Balancer (P1)

Signal	Condition	Alert
Healthy Ingress Controller count	Fewer healthy than unhealthy instances	P1 alert
Healthy Ingress Controller count	Zero healthy instances	P1 alert
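
A minimal sketch of the two NLB rules, assuming the healthy and unhealthy target counts are already available as integers (for example, read from the load balancer's metrics). The function name `nlb_alert` is hypothetical.

```python
from typing import Optional


def nlb_alert(healthy: int, unhealthy: int) -> Optional[str]:
    """Return "P1" if either NLB rule from the table fires, otherwise None."""
    # Rule 2: no healthy Ingress Controller instance at all.
    # Checked explicitly because 0 healthy / 0 unhealthy would not
    # satisfy the "fewer healthy than unhealthy" comparison.
    if healthy == 0:
        return "P1"
    # Rule 1: fewer healthy than unhealthy instances.
    if healthy < unhealthy:
        return "P1"
    return None
```

The explicit zero check shows why the second row is not redundant: when no targets are registered at all, only that rule raises the alert.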

API Tests (P0)

A dedicated set of API tests runs on a schedule and exercises the full path from client to Component.

Failure	Alert
DNS resolution error	P0 alert
HTTP 404	P0 alert (misconfigured route)
Authentication failure	P0 alert (planned)
Other API errors	Handled by API Gateway monitoring
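
The mapping from test failure to alert could look like the following sketch. The enum `ProbeFailure`, the function `api_test_alert`, and the `auth_alerts_enabled` flag are hypothetical names; the flag defaults to off because the authentication alert is only planned.

```python
from enum import Enum
from typing import Optional


class ProbeFailure(Enum):
    """Failure categories a scheduled API test can report."""
    DNS_RESOLUTION = "dns_resolution"
    HTTP_404 = "http_404"
    AUTHENTICATION = "authentication"
    OTHER = "other"


def api_test_alert(failure: ProbeFailure,
                   auth_alerts_enabled: bool = False) -> Optional[str]:
    """Return "P0" for failures the API tests alert on, otherwise None."""
    if failure in (ProbeFailure.DNS_RESOLUTION, ProbeFailure.HTTP_404):
        return "P0"
    if failure is ProbeFailure.AUTHENTICATION:
        # Planned alert: only raised once it has been switched on.
        return "P0" if auth_alerts_enabled else None
    # Other API errors are left to API Gateway monitoring.
    return None
```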

Technologies

Component	Technology
Monitoring	Amazon CloudWatch
Alerting	Dedicated Slack channel
Scheduled API tests	Bruno running as a Kubernetes cron job

Watching the Watchers

The absence of an expected signal is itself a failure condition: the monitoring system raises an alert when it does not observe the expected activity from the scheduled API tests.
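
This is the usual "dead man's switch" idea, sketched below under assumptions: the timestamp of the last observed test run is available from somewhere, and the tolerated silence (`max_silence`) is a value chosen by the team. Both the function name `api_tests_silent` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def api_tests_silent(last_seen: Optional[datetime],
                     now: datetime,
                     max_silence: timedelta) -> bool:
    """Return True when the scheduled API tests have been silent too long.

    last_seen is the time of the most recent observed test run,
    or None if no run has ever been observed.
    """
    if last_seen is None:
        # Never having seen a run counts as silence.
        return True
    return now - last_seen > max_silence
```

A monitor evaluating this check periodically would alert whenever it returns True, independently of whether the tests themselves passed or failed.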

Health Check Pattern

Each Component exposes a health endpoint. The Bruno-based API tests call the health endpoint of every Component on a schedule, verifying end-to-end availability of the complete ingress pipeline.
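
The actual checks are written as Bruno tests; the Python below is only a stand-in that illustrates the pattern. It assumes each health endpoint answers HTTP 200 when healthy, which the text does not state, and the names `http_status` and `check_components` are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as exceptions; report their status code.
        return err.code


def check_components(endpoints: Dict[str, str],
                     fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each Component name to True if its health endpoint returned 200."""
    results: Dict[str, bool] = {}
    for name, url in endpoints.items():
        try:
            results[name] = fetch(url) == 200
        except OSError:
            # DNS failures, refused connections, and timeouts are all OSError.
            results[name] = False
    return results
```

Passing `fetch` as a parameter keeps the check testable without network access; a scheduled job would call `check_components` with the real endpoint URLs and the default `http_status`.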