To monitor and understand the performance of your applications and infrastructure effectively, you need a well-designed metrics system. Here are the key requirements and components for building a reliable one.
Requirements for a Metrics System
- Multidimensional Data Model: The metrics system should support a multidimensional data model that can be sliced and diced along different dimensions as defined by the service (e.g., instance, service, endpoint, method).
- Operational Simplicity: The system should be easy to operate and maintain, minimizing overhead and complexity.
- Scalable Data Collection: The system must support scalable data collection and offer a decentralized architecture, allowing independent teams to set up their own monitoring servers.
- Powerful Query Language: A powerful query language should be available to leverage the data model for alerting and graphing, enabling precise insights into system performance.
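The multidimensional data model above can be sketched in a few lines of Python: each series is identified by a metric name plus a set of key=value labels, and queries can slice along any label dimension. This is purely illustrative (not a real time-series database or PromQL); the metric and label names are hypothetical.

```python
from collections import defaultdict

class MetricStore:
    """Toy multidimensional metric store: series keyed by name + labels."""
    def __init__(self):
        # (name, frozenset of (label, value) pairs) -> accumulated value
        self.series = defaultdict(float)

    def record(self, name, labels, value):
        self.series[(name, frozenset(labels.items()))] += value

    def query(self, name, **matchers):
        """Return all series for `name` whose labels match the matchers."""
        results = {}
        for (n, labelset), value in self.series.items():
            labels = dict(labelset)
            if n == name and all(labels.get(k) == v for k, v in matchers.items()):
                results[tuple(sorted(labels.items()))] = value
        return results

store = MetricStore()
store.record("http_requests_total",
             {"service": "api", "endpoint": "/users", "method": "GET"}, 3)
store.record("http_requests_total",
             {"service": "api", "endpoint": "/users", "method": "POST"}, 1)

# Slice along the `method` dimension:
get_only = store.query("http_requests_total", method="GET")
```

A real query language then builds aggregation, rates, and alerting rules on top of exactly this kind of label-based selection.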
Client Libraries
Client libraries play an essential role in the metrics system:
- They handle details like thread safety, bookkeeping, and producing the Prometheus text exposition format in response to HTTP requests.
- Since metrics-based monitoring doesn’t track individual events, client library memory usage doesn’t increase with more events. Instead, memory usage depends on the number of metrics being tracked.
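A minimal sketch of what a client-library counter does under the hood: thread-safe increments, and a fixed memory footprint per metric no matter how many events are counted. The names and the exposition output here are simplified illustrations, not the actual library implementation.

```python
import threading

class Counter:
    """Sketch of a client-library counter: thread-safe, with constant
    memory per metric regardless of how many times it is incremented."""
    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._value = 0.0
        self._lock = threading.Lock()  # thread safety handled for you

    def inc(self, amount=1.0):
        with self._lock:
            self._value += amount

    def expose(self):
        # Simplified text exposition format: HELP, TYPE, then the sample.
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {self._value}\n")

requests = Counter("hello_requests_total", "Total hello requests.")
for _ in range(5):
    requests.inc()
```

Note that five increments still produce a single stored number, which is why memory usage scales with the number of metrics, not the number of events.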
Instrumentation
To effectively monitor different types of services, appropriate instrumentation methods must be used. Here are three common types of services and how they should be instrumented:
Online-Serving Systems
For online-serving systems, such as web services, the RED Method is used. This method involves tracking:
- Requests: The count of incoming requests.
- Errors: The count of failed requests.
- Duration: The latency or response time of requests.
For example, a cache might track these metrics for both overall performance and for cache misses that need to be recalculated or fetched from a backend.
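The RED method can be sketched as a small wrapper around a request handler: count every request, count failures, and accumulate latency. The handler and metric names below are hypothetical placeholders.

```python
import time
import threading

class REDMetrics:
    """Sketch: track Requests, Errors, and Duration for a service."""
    def __init__(self):
        self._lock = threading.Lock()
        self.requests = 0
        self.errors = 0
        self.total_duration = 0.0

    def observe(self, duration, failed=False):
        with self._lock:
            self.requests += 1
            self.total_duration += duration
            if failed:
                self.errors += 1

metrics = REDMetrics()

def handle_request(fail=False):
    start = time.perf_counter()
    try:
        if fail:
            raise ValueError("simulated failure")
        # ... real request handling would go here ...
    except ValueError:
        metrics.observe(time.perf_counter() - start, failed=True)
        return "error"
    metrics.observe(time.perf_counter() - start)
    return "ok"

handle_request()
handle_request(fail=True)
```

From these three raw series you can derive the usual dashboard panels: request rate, error ratio, and average latency (total duration divided by request count).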
Offline-Serving Systems
Offline-serving systems, such as log processors, usually batch up work and consist of multiple stages in a pipeline with queues in between. These systems run continuously, which distinguishes them from batch jobs. The USE Method is used for these types of systems:
- Utilization: How full the system is, such as how much work is currently in progress relative to its capacity.
- Saturation: The amount of queued work waiting to be processed.
- Errors: Any errors encountered during processing.
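For a pipeline stage with a queue in front of it, the USE metrics fall out directly from the stage's state. This is a hedged sketch under the assumption that each stage has a fixed worker capacity; the class and field names are hypothetical.

```python
from collections import deque

class PipelineStage:
    """Sketch of USE instrumentation for one stage of an offline pipeline."""
    def __init__(self, capacity):
        self.capacity = capacity   # max items that can be in progress at once
        self.in_progress = 0       # items currently being worked on
        self.queue = deque()       # items waiting to be processed
        self.errors = 0            # processing errors encountered

    def utilization(self):
        # How much of the stage's capacity is in use.
        return self.in_progress / self.capacity

    def saturation(self):
        # Amount of queued work waiting behind the stage.
        return len(self.queue)

stage = PipelineStage(capacity=4)
stage.queue.extend(["job1", "job2", "job3"])
stage.in_progress = 2
```

Exporting these three values per stage makes it easy to spot which stage of the pipeline is the bottleneck: it is the one whose saturation grows while its utilization sits at 100%.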
Batch Jobs
Batch jobs are processes that run at scheduled intervals. The key metrics for batch jobs include:
- Run Time: How long it took for the job to complete.
- Stage Duration: How long each stage of the job took to complete.
- Success Time: The time at which the job last succeeded.
Alerts can be set for when the job hasn’t succeeded within a certain time frame.
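The three batch-job metrics and the staleness alert above can be sketched as follows. The timestamps and the 26-hour window for a daily job are illustrative assumptions, not prescribed values.

```python
import time

class BatchJobMetrics:
    """Sketch: record run time, per-stage durations, and last success time."""
    def __init__(self):
        self.last_success = 0.0       # unix timestamp of the last successful run
        self.last_run_seconds = 0.0   # how long the last run took
        self.stage_seconds = {}       # duration of each stage of the job

    def record_success(self, run_seconds, stage_seconds, now=None):
        self.last_run_seconds = run_seconds
        self.stage_seconds = dict(stage_seconds)
        self.last_success = now if now is not None else time.time()

def should_alert(metrics, max_age_seconds, now=None):
    """Fire when the job hasn't succeeded within the allowed window."""
    now = now if now is not None else time.time()
    return now - metrics.last_success > max_age_seconds

job = BatchJobMetrics()
job.record_success(run_seconds=42.0,
                   stage_seconds={"extract": 10.0, "load": 32.0},
                   now=1_000_000.0)

# Daily job: alert if there has been no success for 26 hours.
alert_now = should_alert(job, max_age_seconds=26 * 3600, now=1_000_000.0 + 3600)
```

Alerting on the time of last success (rather than on individual failures) is robust: it catches jobs that crash, hang, or silently stop being scheduled at all.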
Idempotency for Batch Jobs: Idempotency is an important property for batch jobs: performing an operation more than once has the same effect as performing it once. This allows a failed or partially completed run to be safely retried without unintended side effects such as duplicated records.
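A minimal sketch of an idempotent batch step: each record is written keyed by its ID (an upsert) rather than appended, so re-running the same batch after a failure leaves the store in the same state. The record shape and store are hypothetical.

```python
def process_batch(records, store):
    """Idempotent batch step (sketch): upserts by record ID, so running
    the same batch twice has the same effect as running it once."""
    for record in records:
        store[record["id"]] = record["value"]  # upsert, never append
    return store

batch = [{"id": "a", "value": 1}, {"id": "b", "value": 2}]
store = {}
process_batch(batch, store)
state_after_one_run = dict(store)

process_batch(batch, store)  # retry after a simulated mid-run failure
assert store == state_after_one_run  # same effect as running once
```

Contrast this with an append-based design, where a retried run would double-count every record.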