Trade-offs Made in o11y Systems for High-Volume Traffic

Introduction

Phillip Carter’s blog post “The Observability CAP Theorem” examines the challenging balancing act of building and managing observability systems for high-traffic organizations, something that resonates with me. It explores the trade-offs among key properties of observability (cost, data detail, retention, and query efficiency) and how organizations have to prioritize based on their specific needs.

One big takeaway is that no system can excel at everything. For example, making data queries fast and efficient might mean sacrificing how much data you can store long-term or how detailed that data is.

Another point that stood out is how observability priorities change over time. What works today might need to evolve as your company grows, regulations change, or new tools emerge. With that in mind, I wanted to write down some thoughts and takeaways from the post, both as a reflection and as a guide for future decisions.

Properties of Observability

  • Quick and efficient real-time querying of your data
  • Sufficient historical data availability per query, ranging from days to years
  • Comprehensive access to relevant data for a specific context
  • Cost-effective operations that align with budgetary expectations

Trade-offs in Observability Systems

Due to the challenges of managing large volumes of data, observability systems often make trade-offs to optimize for specific use cases or properties. These trade-offs include:

  • Comprehensive Metric Storage vs. Investigation Efficiency: Observability systems may prioritize the ability to cheaply store and observe metrics for extensive datasets, but this can come at the cost of making in-depth investigations into root causes more challenging.

  • Extensive Log Retention vs. Query Costs: Systems may enable the retention of all log data, even in a user-managed cloud environment, but this often involves incurring costs for each query or analysis performed.

  • Fast Trace Queries vs. Sampling Trade-offs: To enable fast querying and analysis of trace data, systems may rely on effective sampling techniques, potentially reducing the completeness of the data available for analysis (a minimal sampling sketch follows this list).

  • Unified Data Ingestion vs. User Experience Fragmentation: Systems that allow all types of data to be sent to a single tool may result in inconsistent user experiences, depending on the type of data being analyzed.

  • Uniform Analysis Across Data Sources vs. Specialized Capabilities: Tools that offer consistent analysis and visualization for all data types may sacrifice advanced features optimized for specific kinds of telemetry data.
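
To make that sampling trade-off concrete, here is a minimal sketch of head-based trace sampling. It is not from Carter's post; `should_sample` and `SAMPLE_RATE` are names I made up for illustration. Hashing the trace ID instead of rolling a random number means every service that sees the same trace makes the same keep/drop decision, so the traces you do keep stay complete end to end.

```python
import hashlib

# Assumed tunable: keep roughly 1 in 100 traces.
SAMPLE_RATE = 0.01

def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Head-based sampling: decide once per trace, deterministically."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# At a 1% rate, ~99% of traces are never stored, which is exactly the
# completeness the bullet above says you give up.
kept = sum(should_sample(f"trace-{i}") for i in range(100_000))
print(f"kept {kept} of 100000 traces")
```

The numbers show the trade-off directly: queries over the kept ~1% are fast and cheap, but a rare outlier that lands in the dropped 99% is simply gone. Tail-based sampling, which decides after a trace completes, trades buffering cost for better coverage of such rare events.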

Prioritizing Observability Properties for Companies Generating High-Volume Traffic

For organizations in which unsampled, high-volume data can be stored cost-effectively in long-term archives, priorities shift to emphasize real-time operational efficiency. The updated order of importance is as follows:

  1. Fast Queries for Incident Response

    • Quick access to actionable insights during incidents is critical for minimizing downtime and addressing urgent issues.
    • Engineers value systems that enable rapid querying over recent, relevant data, as this directly impacts the ability to resolve problems efficiently.
  2. Efficient Sampling and Aggregation

    • Effective techniques like sampling, filtering, and aggregation allow teams to maintain representativeness while reducing data volume and cost (see the rollup sketch after this list).
    • This ensures that a smaller, relevant subset of data is readily available in fast-querying stores, while the rest can be archived for later use if needed.
  3. Cost-Effective Compliance Data Retention

    • Compliance data can be stored in low-cost archival systems (e.g., object storage like AWS S3 or similar solutions), minimizing its impact on operational observability costs (see the lifecycle sketch after this list).
    • Retrieval and processing can be deferred to meet regulatory requirements without affecting real-time observability needs.
  4. Cost-Conscious Operations

    • Observability costs must align with organizational budgets. Prioritizing cost-effective tools and strategies ensures observability delivers value without overextending financial resources.
    • Open-source solutions or tiered pricing models for commercial tools can help manage costs effectively.
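
To make item 2 concrete, here is a minimal rollup sketch (mine, not from the post). The event shape and field names are assumptions; the idea is that per-minute aggregates answer the common operational questions cheaply while the raw events can be shipped to cold storage for the rare deep investigation.

```python
from collections import defaultdict

# Assumed raw event shape: (unix_ts_seconds, endpoint, latency_ms).
events = [
    (1700000005, "/checkout", 120.0),
    (1700000017, "/checkout", 95.0),
    (1700000031, "/search", 40.0),
    (1700000065, "/checkout", 300.0),
]

def rollup_per_minute(raw):
    """Aggregate raw events into per-minute (count, sum, max) per endpoint."""
    agg = defaultdict(lambda: [0, 0.0, 0.0])  # count, latency sum, latency max
    for ts, endpoint, latency in raw:
        key = (ts // 60 * 60, endpoint)  # truncate the timestamp to the minute
        bucket = agg[key]
        bucket[0] += 1
        bucket[1] += latency
        bucket[2] = max(bucket[2], latency)
    return dict(agg)

for (minute, endpoint), (count, total, peak) in sorted(rollup_per_minute(events).items()):
    print(f"{minute} {endpoint}: n={count} avg={total / count:.1f}ms max={peak:.1f}ms")
```

A few numbers per endpoint per minute replace an unbounded stream of events, which is the volume reduction item 2 is after; the cost is that any question the aggregate cannot answer (say, latency broken down by customer) requires a slow trip to the archive.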
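
And for item 3, a minimal boto3 sketch of the archival pattern. The bucket name, prefix, and retention windows are illustrative assumptions, not recommendations; the call itself is the standard S3 lifecycle API.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; the 30-day transition and 7-year
# expiration are placeholder retention choices, not policy advice.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-compliance-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Filter": {"Prefix": "audit/"},
                "Status": "Enabled",
                # After 30 days, move objects to a low-cost archival tier.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # Delete once the assumed 7-year retention window passes.
                "Expiration": {"Days": 7 * 365},
            }
        ]
    },
)
```

Restores from an archival tier are slow and billed separately, which matches the deferral described above: compliance reads are rare and can tolerate the delay, so they should not share a storage tier with incident-response queries.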

Conclusion

Observability systems ingest many different types of data streams at unpredictable volumes. The trade-offs these systems make are rooted in technical and organizational realities, and their ultimate impact depends on an organization’s unique priorities, such as tolerance for incident duration, audit requirements, tool standardization, and budget constraints. These priorities can shift over time due to internal dynamics or external pressures.
