Recently, the observability team was tasked with a latency reporting project. The key features included:
- Daily and monthly latency aggregations
- Support for various quantiles
- A reasonable data delay SLO (24 hours)
and more. For the purposes of this blog post, we will focus on the data-processing aspect. We needed a system that could read latency data from a Kafka stream, partition the data by various attributes such as endpoint, and precompute hourly, daily, and weekly quantiles (e.g., p50, p99).
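To make the quantile requirement concrete, here is a minimal, self-contained sketch of the nearest-rank quantile method. This is illustrative only: the class and method names are hypothetical, and the real pipeline computes these values inside Flink windows rather than over an in-memory array.

```java
import java.util.Arrays;

// Illustrative sketch only: the actual pipeline computes quantiles inside
// Flink windowed state, not over a plain array.
public class QuantileSketch {
    // Returns the q-th quantile (0 < q <= 1) of the given latencies,
    // using the nearest-rank method on a sorted copy of the input.
    static long quantile(long[] latencies, double q) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // Ten latency samples (ms) from one hypothetical window.
        long[] window = {120, 80, 95, 300, 110, 105, 90, 250, 100, 85};
        System.out.println("p50=" + quantile(window, 0.50)); // p50=100
        System.out.println("p99=" + quantile(window, 0.99)); // p99=300
    }
}
```

Note that exact quantiles require keeping every sample; at high throughput, approximate sketches (e.g., t-digest) are a common trade-off.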
Why did we choose Flink?
We chose Flink because of several key features:
- Stateful processing: It allows us to maintain state across events.
- Built-in windowing support: Flink offers windowing functions (e.g., tumbling, sliding) that divide the data stream into finite subsets (windows) based on time or count.
- Scalability: It can handle high-throughput data streams, such as those from Kafka.
- Fault tolerance: Through state snapshots and checkpointing, Flink ensures resilience.
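The windowing point above is worth unpacking. A tumbling window assigns each event to exactly one fixed-size, non-overlapping bucket, and that assignment boils down to simple modular arithmetic. The sketch below shows the idea outside of Flink; the names are illustrative, and Flink performs this bookkeeping for you internally.

```java
// Illustrative sketch of tumbling-window assignment: each timestamp maps to
// exactly one fixed-size, non-overlapping window. Flink handles this
// internally; these names are hypothetical.
public class TumblingWindowDemo {
    // Start (epoch ms) of the window that a given timestamp falls into.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    public static void main(String[] args) {
        long oneMinuteMs = 60_000L;
        // Events at 125s and 135s land in the same 1-minute window...
        System.out.println(windowStart(125_000L, oneMinuteMs)); // 120000
        System.out.println(windowStart(135_000L, oneMinuteMs)); // 120000
        // ...while an event at 185s lands in the next window.
        System.out.println(windowStart(185_000L, oneMinuteMs)); // 180000
    }
}
```

Sliding windows differ in that windows overlap, so a single event can belong to several windows at once.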
Flink Playground
Naturally, besides going through the Flink documentation [insert link], I decided to go to my usual playground of choice: Docker. Check out the code and docker-compose.yml for more information. Below is the basic architecture.
So what impressed me?
Flink’s brilliance lies in its ability to effortlessly handle the complexity of distributed stream processing with concise, expressive code. Take this one line, for example:
```java
// Reconstructed sketch of the snippet described below; the key selector,
// function, and type names are taken from the surrounding text and may
// differ from the original code.
latencyStream
    .keyBy(event -> event.getTimestamp())
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .process(new LatencyAggregateProcessFunction())
    .returns(new TypeHint<LatencyAggregate>() {});
```
Flink simplifies real-time data processing by abstracting away many low-level concerns while giving developers precise control over time, state, and computation. In this single line, Flink allows us to:
- Partition the stream: The keyBy function logically partitions the incoming latencyStream based on timestamps, so each partition processes its events independently, enabling scalability in distributed environments.
- Windowing: Using TumblingProcessingTimeWindows.of(Time.minutes(1)), Flink groups the data into 1-minute windows based on event processing time, making it easy to aggregate data over defined time intervals.
- Custom processing: The process function applies a user-defined LatencyAggregateProcessFunction for calculating latency metrics, allowing custom logic to be executed for each window.
- Type safety and efficiency: Using Flink's TypeHint, we ensure type safety and help optimize serialization and deserialization in distributed environments, which enhances performance.
This one-liner hides a ton of complexity, from fault tolerance to scaling, and Flink's ability to marry simplicity with powerful distributed processing is where it shines. In just a few lines of code, you have a resilient, scalable, and efficient stream processing pipeline!