Service Monitoring

Service monitoring ensures that the Arda backend components are available and responding correctly, and alerts the engineering team when intervention is needed. This initial version focuses on reactive monitoring — detecting failures and triggering corrective actions. Preventive monitoring will extend to the full infrastructure in a future phase.

The monitored system consists of:

  • DNS resolution chain
  • API Gateway
  • VPC Link
  • Network Load Balancer (NLB)
  • nginx Ingress Controller (Kubernetes)
  • Component pods (Kubernetes)
  • Component dependencies (PostgreSQL database, third-party services)

The API Gateway is the primary monitoring target.

| Signal | Condition | Alert |
| --- | --- | --- |
| HTTP 5xx responses | Count exceeds threshold (TBD) | P0 alert |
| HTTP 4xx responses | Any count | No alert (user errors) |
| Request count | Significant deviation from expected pattern | Alert (pattern TBD) |

| Signal | Condition | Alert |
| --- | --- | --- |
| Healthy Ingress Controller count | Fewer healthy than unhealthy instances | P1 alert |
| Healthy Ingress Controller count | Zero healthy instances | P1 alert |
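
The alert rules in the two tables above can be read as small decision functions. The following Python sketch is illustrative only: the threshold value is a hypothetical placeholder (the source leaves it TBD), the function and class names are invented for this example, and the request-count deviation rule is omitted because its pattern is not yet defined.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder: the source leaves the 5xx threshold as TBD.
ASSUMED_5XX_THRESHOLD = 10


@dataclass(frozen=True)
class GatewaySample:
    """Response counts observed at the API Gateway in one evaluation window."""
    count_5xx: int
    count_4xx: int


def gateway_alert(sample: GatewaySample,
                  threshold_5xx: int = ASSUMED_5XX_THRESHOLD) -> Optional[str]:
    """Return "P0" when the 5xx count exceeds the threshold, else None."""
    if sample.count_5xx > threshold_5xx:
        return "P0"
    # 4xx responses are treated as user errors and never alert, whatever the count.
    return None


def ingress_alert(healthy: int, unhealthy: int) -> Optional[str]:
    """Return "P1" when no instance is healthy or healthy ones are outnumbered."""
    if healthy == 0 or healthy < unhealthy:
        return "P1"
    return None
```

For example, `ingress_alert(1, 2)` yields a P1 alert, while `ingress_alert(2, 2)` yields none because healthy instances are not outnumbered.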

A dedicated set of API tests runs on a schedule and exercises the full path from client to Component.

| Failure | Alert |
| --- | --- |
| DNS resolution error | P0 alert |
| HTTP 404 | P0 alert (misconfigured route) |
| Authentication failure | P0 alert (planned) |
| Other API errors | Handled by API Gateway monitoring |

| Component | Technology |
| --- | --- |
| Monitoring | Amazon CloudWatch |
| Alerting | Dedicated Slack channel |
| Scheduled API tests | Bruno running as a Kubernetes cron job |

The monitoring system generates alerts when it does not observe the expected activity from scheduled API tests (absence of expected signals is itself a failure condition).
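
Treating a missing signal as a failure is often described as a heartbeat or dead-man's-switch check. The Python sketch below shows the idea under stated assumptions: the test interval and grace period are hypothetical values (the source does not give the schedule), and `heartbeat_missing` is a name invented for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical values: the source does not state the schedule or the tolerance.
ASSUMED_TEST_INTERVAL = timedelta(minutes=5)
ASSUMED_GRACE = timedelta(minutes=2)


def heartbeat_missing(last_seen: Optional[datetime],
                      now: datetime,
                      interval: timedelta = ASSUMED_TEST_INTERVAL,
                      grace: timedelta = ASSUMED_GRACE) -> bool:
    """Return True when no test activity was seen within interval + grace."""
    if last_seen is None:
        # No activity has ever been recorded: treat as a failure condition.
        return True
    return now - last_seen > interval + grace


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Last test run ten minutes ago, so the heartbeat is overdue.
    print(heartbeat_missing(now - timedelta(minutes=10), now))
```

The grace period keeps a slightly late run from raising an alert; only a run that is overdue by more than the allowed slack counts as missing.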

Each Component exposes a health endpoint. The Bruno-based API tests call the health endpoint of every component on a schedule, verifying end-to-end availability of the complete ingress pipeline.
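
The source runs these checks with Bruno. The Python sketch below is not the Bruno collection, only an illustration of how a health sweep could map each failure to the alerts listed for the scheduled API tests. The endpoint URLs, the function names, and the choice of HTTP 401/403 as the signal for an authentication failure are all assumptions made for this example.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url; raise socket.gaierror on DNS failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses still carry a status code we want to classify.
        return exc.code
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.gaierror):
            raise exc.reason from exc
        raise


def check_component(url: str,
                    fetch: Callable[[str], int] = fetch_status) -> Optional[str]:
    """Return the alert for one health endpoint, or None if none applies here."""
    try:
        status = fetch(url)
    except socket.gaierror:
        return "P0: DNS resolution error"
    except OSError:
        # Other transport errors are outside this table; this sketch skips them.
        return None
    if status == 404:
        return "P0: misconfigured route"
    if status in (401, 403):  # assumed mapping for "authentication failure"
        return "P0: authentication failure (planned)"
    # Remaining API errors are left to API Gateway monitoring.
    return None


def sweep(endpoints: Dict[str, str],
          fetch: Callable[[str], int] = fetch_status) -> Dict[str, str]:
    """Check every component's health endpoint and collect raised alerts."""
    alerts = {}
    for name, url in endpoints.items():
        alert = check_component(url, fetch)
        if alert is not None:
            alerts[name] = alert
    return alerts
```

Passing `fetch` as a parameter lets the classification logic be exercised with a stub instead of real network calls.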