Service Monitoring
Service monitoring ensures that the Arda backend components are available and responding correctly, and alerts the engineering team when intervention is needed. This initial version focuses on reactive monitoring — detecting failures and triggering corrective actions. Preventive monitoring will extend to the full infrastructure in a future phase.
System Under Monitoring
The monitored system consists of:
- DNS resolution chain
- API Gateway
- VPC Link
- Network Load Balancer (NLB)
- nginx Ingress Controller (Kubernetes)
- Component pods (Kubernetes)
- Component dependencies (PostgreSQL database, third-party services)
Monitoring Signals and Alert Priority
API Gateway (P0)
The API Gateway is the primary monitoring target.
| Signal | Condition | Alert |
|---|---|---|
| HTTP 5xx responses | Count exceeds threshold (TBD) | P0 alert |
| HTTP 4xx responses | Any count | No alert (user errors) |
| Request count | Significant deviation from expected pattern | Alert (pattern TBD) |
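As a rough illustration of the rules in the table above, the following Python sketch evaluates one window of API Gateway counts. The names `GatewayWindow`, `gateway_alert`, and `FIVE_XX_THRESHOLD` are hypothetical, and the threshold value is a placeholder because the real one is still TBD. The request-count rule is left out because its expected pattern has not been defined.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder: the real 5xx threshold is still TBD.
FIVE_XX_THRESHOLD = 5


@dataclass
class GatewayWindow:
    """Counts observed at the API Gateway over one evaluation window."""
    count_5xx: int
    count_4xx: int
    request_count: int


def gateway_alert(window: GatewayWindow,
                  threshold: int = FIVE_XX_THRESHOLD) -> Optional[str]:
    """Return "P0" when 5xx responses exceed the threshold, else None.

    4xx responses are treated as user errors and never alert on their own.
    Request-count deviation is not evaluated here (pattern TBD).
    """
    if window.count_5xx > threshold:
        return "P0"
    return None
```

In this sketch a window with many 4xx responses but few 5xx responses produces no alert, which matches the "user errors" row of the table.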
Network Load Balancer (P1)
| Signal | Condition | Alert |
|---|---|---|
| Healthy Ingress Controller count | Fewer healthy than unhealthy instances | P1 alert |
| Healthy Ingress Controller count | Zero healthy instances | P1 alert |
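The two NLB conditions above can be expressed as a small check over the healthy and unhealthy target counts. This is a minimal sketch; the function name `nlb_alert` and the returned reason strings are illustrative assumptions, not part of the monitoring setup described here.

```python
from typing import Optional, Tuple


def nlb_alert(healthy: int, unhealthy: int) -> Optional[Tuple[str, str]]:
    """Return (priority, reason) for a P1 condition, or None when healthy.

    Both conditions from the table map to P1; the zero-healthy case is
    checked first so that its more specific reason is reported.
    """
    if healthy == 0:
        return ("P1", "zero healthy instances")
    if healthy < unhealthy:
        return ("P1", "fewer healthy than unhealthy instances")
    return None
```

Reporting the reason alongside the priority keeps the two rows distinguishable in the alert text even though both carry the same priority.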
API Tests (P0)
A dedicated set of API tests runs on a schedule and exercises the full path from client to Component.
| Failure | Alert |
|---|---|
| DNS resolution error | P0 alert |
| HTTP 404 | P0 alert (misconfigured route) |
| Authentication failure | P0 alert (planned) |
| Other API errors | Handled by API Gateway monitoring |
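The failure table above amounts to a mapping from test outcome to alert priority. The sketch below shows one way to encode it. The names `ApiTestFailure`, `classify`, and `api_test_alert` are hypothetical, and treating HTTP 401 and 403 as authentication failures is an assumption, since the document does not say how authentication failures are detected. Because the authentication alert is only planned, it is disabled by default.

```python
from enum import Enum
from typing import Optional


class ApiTestFailure(Enum):
    DNS_RESOLUTION = "dns_resolution"
    HTTP_404 = "http_404"
    AUTHENTICATION = "authentication"
    OTHER = "other"


def classify(status: Optional[int], dns_error: bool = False) -> Optional[ApiTestFailure]:
    """Map one test outcome to a failure kind, or None on success."""
    if dns_error:
        return ApiTestFailure.DNS_RESOLUTION
    if status == 404:
        return ApiTestFailure.HTTP_404
    if status in (401, 403):  # assumption: these codes signal auth failure
        return ApiTestFailure.AUTHENTICATION
    if status is not None and 200 <= status < 400:
        return None
    return ApiTestFailure.OTHER


def api_test_alert(failure: Optional[ApiTestFailure],
                   auth_alerts_enabled: bool = False) -> Optional[str]:
    """Return "P0" for failures the API tests own, else None."""
    if failure in (ApiTestFailure.DNS_RESOLUTION, ApiTestFailure.HTTP_404):
        return "P0"
    if failure is ApiTestFailure.AUTHENTICATION and auth_alerts_enabled:
        return "P0"  # planned alert, off by default in this sketch
    # Other API errors are left to API Gateway monitoring.
    return None
```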
Technologies
| Component | Technology |
|---|---|
| Monitoring | Amazon CloudWatch |
| Alerting | Dedicated Slack channel |
| Scheduled API tests | Bruno running as a Kubernetes cron job |
Watching the Watchers
The monitoring system raises an alert when it does not observe the expected activity from the scheduled API tests: the absence of an expected signal is itself a failure condition.
Health Check Pattern
Each Component exposes a health endpoint. The Bruno-based API tests call the health endpoint of every component on a schedule, verifying end-to-end availability of the complete ingress pipeline.
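The actual checks are written as Bruno tests. Purely to illustrate the pattern, the Python stand-in below probes a set of health URLs and reports which Components answered. The names `http_probe` and `check_components`, the timeout, and the rule that any 2xx status means healthy are assumptions made for this sketch.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts,
        # and the HTTPError raised for 4xx/5xx responses.
        return False


def check_components(health_urls: Dict[str, str],
                     probe: Callable[[str], bool] = http_probe) -> Dict[str, bool]:
    """Probe every Component's health endpoint; map name to healthy flag."""
    return {name: probe(url) for name, url in health_urls.items()}
```

Passing the probe as a parameter lets the logic be exercised without network access, for example with a stub that returns fixed results.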
Copyright: © Arda Systems 2025-2026, All rights reserved