Service Monitoring

This article describes how we will monitor the availability of the components that make up the backend of our applications.

The goal of the monitoring system is to ensure that the backend is available and serves requests from the front-end
or from other external clients, and to alert the engineering team when an intervention is needed.

This initial version sets up the structure and focuses on reactive monitoring, detecting failures and triggering
corrective actions. In a later phase, monitoring will extend to the whole infrastructure and will include preventive
monitoring.

This version of the document describes the signals being watched; experience will tell which signal sequences
actually require an intervention.

The system under monitoring consists of an API Gateway, a network load balancer, an ingress controller,
multiple components, and the resources those components rely on, such as a database or a third-party service.

The simplified diagram shows the request flow from an external client.

Monitoring and alerting focus mainly on the API Gateway and rely heavily on AWS and Kubernetes to keep the
components available.

The alerting rules described in the document are an initial set and will be improved with experience.
They don’t include Kubernetes or other resources yet.

API Gateway¶

The API Gateway maps requests made to the unified API of Arda onto requests to individual components.

The API Gateway monitors responses with an HTTP status of 5xx, indicating that the request failed to reach the business
logic in the component. It alerts when the number of such responses reaches a threshold (TBD).

The API Gateway monitors responses with an HTTP status of 4xx, indicating that the request triggered a business
validation error; because they represent user errors, they will not trigger an alert.
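The 5xx and 4xx rules above can be sketched as a single decision function. This is only an illustration under stated assumptions: the function name, the shape of `status_counts`, and the `threshold_5xx` parameter are hypothetical, and the actual threshold is still TBD.

```python
from __future__ import annotations


def should_alert_on_status(status_counts: dict[int, int], threshold_5xx: int) -> bool:
    """Decide whether the response statuses seen in one period warrant an alert.

    status_counts maps an HTTP status code to the number of responses seen
    with that status (hypothetical input shape).
    threshold_5xx is a placeholder; the real value has not been decided yet.
    """
    # Only 5xx responses count towards the alert.
    server_errors = sum(
        count for status, count in status_counts.items() if 500 <= status <= 599
    )
    # 4xx responses are monitored but deliberately ignored here,
    # because they represent user errors.
    return server_errors >= threshold_5xx
```

With a placeholder threshold of 5, a period with three 502 and two 503 responses would alert, while a period with only 4xx responses would not, however many there are.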

The API Gateway monitors request counts. A deviation from the expected pattern will trigger an alert, as it might be a
symptom of an upstream problem. The pattern has to be established first.
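One possible way to express "deviation from the expected pattern" is a band around the historical mean. The sketch below is an assumption, not the chosen method: the function name, the use of a standard-deviation band, and the `tolerance` default are all hypothetical, and the real rule depends on the pattern that still has to be established.

```python
from __future__ import annotations

from statistics import mean, stdev


def deviates_from_pattern(
    history: list[float], current: float, tolerance: float = 3.0
) -> bool:
    """Return True when `current` falls outside the band expected from `history`.

    history holds request counts observed for comparable periods in the past
    (for example, the same hour on previous days). tolerance is the band
    half-width in standard deviations; 3.0 is an arbitrary placeholder.
    """
    if len(history) < 2:
        # No pattern has been established yet, so nothing can deviate from it.
        return False
    centre = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any different value is a deviation.
        return current != centre
    return abs(current - centre) > tolerance * spread
```

A band like this flags both unusually low traffic (clients cannot reach the gateway) and unusually high traffic, which matches the idea that either can be a symptom of an upstream problem.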

Network load balancer¶

The network load balancer maps requests to the cluster in which the components are deployed and connects to a set of
ingress controller instances in the targeted cluster.

The network load balancer monitors the count of healthy and unhealthy instances it can reach. It alerts when
there are fewer healthy than unhealthy instances or when there are no healthy instances.
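The load balancer rule above translates directly into a small predicate. The function and parameter names are hypothetical; in practice the two counts would come from the load balancer's health metrics.

```python
def nlb_should_alert(healthy: int, unhealthy: int) -> bool:
    """Alert when no instance is healthy, or when healthy instances are outnumbered."""
    if healthy == 0:
        # Covers the case where nothing is reachable at all,
        # including the situation where both counts are zero.
        return True
    return healthy < unhealthy
```

Note that equal counts (for example two healthy and two unhealthy instances) do not alert under this rule, because the text requires strictly fewer healthy than unhealthy instances.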

Creating traffic¶

A dedicated set of API Tests runs on a schedule and ensures that the complete chain from the client to the components is working.

Alerts will be raised for failures to resolve DNS or to connect to the API Gateway, as well as for 404 responses,
which indicate a misconfigured route; other errors are already handled by the API Gateway itself.
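The classification of test failures described above can be sketched as follows. The tests themselves are planned in Bruno; Python's standard library stands in here purely for illustration, and the names `alert_reason` and `probe` as well as the returned strings are hypothetical.

```python
from __future__ import annotations

import socket
import urllib.error
import urllib.request


def alert_reason(error: Exception | None) -> str | None:
    """Map the outcome of one test request to an alert reason, or None for no alert."""
    if error is None:
        return None  # the request succeeded
    if isinstance(error, urllib.error.HTTPError):
        # The gateway answered. Only 404 is alerted on at this layer;
        # other HTTP errors are already covered by the gateway's own monitoring.
        return "route not found (404)" if error.code == 404 else None
    if isinstance(error, urllib.error.URLError):
        # urllib wraps the low-level cause in `reason`.
        if isinstance(error.reason, socket.gaierror):
            return "DNS resolution failed"
        return "cannot connect to the API Gateway"
    if isinstance(error, OSError):
        # For example a timeout raised outside of URLError.
        return "cannot connect to the API Gateway"
    return None


def probe(url: str, timeout: float = 5.0) -> str | None:
    """Send one GET request to `url` and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return None
    except (urllib.error.URLError, OSError) as exc:
        return alert_reason(exc)
```

The order of the checks matters: `HTTPError` is a subclass of `URLError`, so it has to be tested first, otherwise every HTTP error would be reported as a connection failure.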

This layer will be expanded to also check for authentication issues.

Watching the Watchers¶

The monitoring system will alert when it doesn’t see the expected activities from the API tests.

Technologies¶

Monitoring will be implemented with Amazon CloudWatch.

Alerts will be posted to a dedicated Slack channel.

API Tests will be implemented with Bruno, will call one of the health check endpoints of every component, and will run
as a cron job in Kubernetes.
