
What is Flink and why should we care about it?

Recently, the observability team was tasked with a latency reporting project. The key features included:

  • Daily and monthly latency aggregations
  • Support for various quantiles
  • A reasonable data delay SLO (24 hours)

and more. For the purpose of this blog, we will focus on the data processing aspect. We needed a system that could read latency data from a Kafka stream, partition the data by various attributes such as endpoints, and precompute hourly, daily, and weekly quantiles (e.g., p50, p99).
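
To make the later code sketches concrete, here is a minimal sketch of the data model, written in Kotlin like the rest of the pipeline. The field names are assumptions for illustration; the real schema lives in the project code.

    // Hypothetical shape of one latency record on the Kafka topic.
    data class LatencyEvent(
        val endpoint: String, // attribute we partition by (e.g., API endpoint)
        val timestamp: Long,  // epoch millis when the request completed
        val latencyMs: Long   // observed request latency in milliseconds
    )

    // One row per window: the precomputed quantiles we want to store.
    data class AggregateResult(
        val key: Long,        // key of the partition the window belongs to
        val windowEnd: Long,  // epoch millis marking the end of the window
        val p50: Long,
        val p99: Long
    )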

We chose Flink because of several key features:

  • Stateful processing: It allows us to maintain state across events.
  • Built-in windowing support: Flink offers windowing functions (e.g., tumbling, sliding) that divide the data stream into finite subsets (windows) based on time or count.
  • Scalability: It can handle high-throughput data streams, such as those from Kafka.
  • Fault tolerance: Through state snapshots and checkpointing, Flink ensures resilience (see the job-setup sketch after this list).

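Notably, checkpointing and parallelism amount to a few lines of job setup. Below is a minimal sketch; the 60-second interval and parallelism of 4 are illustrative values, not our production configuration.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

    fun main() {
        val env = StreamExecutionEnvironment.getExecutionEnvironment()
        // Fault tolerance: snapshot operator state every 60 s so the job can
        // resume from the last successful checkpoint after a failure.
        env.enableCheckpointing(60_000)
        // Scalability: split the stream across parallel subtasks.
        env.setParallelism(4)
        // ... build the Kafka source and windowed pipeline here ...
        env.execute("latency-aggregation")
    }
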
Naturally, besides going through the Flink documentation (https://nightlies.apache.org/flink/flink-docs-stable/), I decided to go to my usual playground of choice: Docker. Check out the code and docker-compose.yml for more information. Below is the basic architecture.

graph TD;
    %% The Python app produces random latency data and sends it to the Kafka broker
    P["Python App (produces latency data)"] --> A["Kafka Broker (latency topic)"]
    %% The latency messages on the broker are consumed by the Flink pipeline for processing
    A --> B["Flink Pipeline"]
    %% The Flink pipeline splits the data stream into a 1-minute tumbling window for short-term aggregation
    B --> C["1-Minute Tumbling Window"]
    %% ...and into a 1-hour tumbling window for long-term aggregation
    B --> D["1-Hour Tumbling Window"]
    %% Once each window closes, its data is processed to calculate P50 and P99 latency metrics
    C --> E["P50/P99 Aggregation"]
    D --> F["P50/P99 Aggregation"]
    %% The aggregated results from both windows are sent to a storage system for further use
    E --> G["Storage System (P50/P99 results)"]
    F --> G
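
The pipeline's entry point is the Kafka source. Continuing from the env created above, consuming the latency topic might look like the following sketch; the broker address and consumer group are hypothetical values that would come from docker-compose.yml, and parsing the raw JSON strings into the stream of LatencyEvent objects used below is omitted.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.connector.kafka.source.KafkaSource
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer

    // Read raw string records from the "latency" topic.
    val source = KafkaSource.builder<String>()
        .setBootstrapServers("kafka:9092")    // hypothetical broker address
        .setTopics("latency")
        .setGroupId("latency-aggregator")     // hypothetical consumer group
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(SimpleStringSchema())
        .build()

    // The windows below use processing time, so no event-time watermarks are needed.
    val rawStream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-latency-source")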

So what impressed me?

Flink’s brilliance lies in its ability to effortlessly handle the complexity of distributed stream processing with concise, expressive code. Take this single statement, for example:

    val oneMinuteWindows = latencyStream
        .keyBy({ it.timestamp / 60000 }, TypeInformation.of(object : TypeHint<Long>() {}))
        .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
        .process(LatencyAggregateProcessFunction())
        .returns(TypeInformation.of(object : TypeHint<AggregateResult>() {}))

Flink simplifies real-time data processing by abstracting away many low-level concerns while giving developers precise control over time, state, and computation. In this single statement, Flink allows us to:

  1. Partition the stream: The keyBy function logically partitions the incoming latencyStream by the minute bucket of each event's timestamp (timestamp / 60000), so each partition processes its events independently, enabling scalability in distributed environments.
  2. Windowing: Using TumblingProcessingTimeWindows.of(Time.minutes(1)), Flink groups the data into 1-minute windows based on event processing time, making it easy to aggregate data over defined time intervals.
  3. Custom processing: The process function applies a user-defined LatencyAggregateProcessFunction for calculating latency metrics, allowing custom logic to be executed for each window (a sketch of this function follows the list).
  4. Type safety and efficiency: Using Flink's TypeHint, we ensure type safety and help optimize serialization and deserialization in distributed environments, which enhances performance.
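
As promised above, here is a plausible sketch of LatencyAggregateProcessFunction; the real implementation lives in the repo. This version computes naive nearest-rank quantiles by sorting the window's elements, which is fine for small windows but would call for a quantile sketch (e.g., t-digest) at higher volumes.

    import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector

    class LatencyAggregateProcessFunction :
        ProcessWindowFunction<LatencyEvent, AggregateResult, Long, TimeWindow>() {

        override fun process(
            key: Long,
            context: Context,
            elements: Iterable<LatencyEvent>,
            out: Collector<AggregateResult>
        ) {
            // Flink buffers every event of this key/window pair in state and
            // hands them to us in one batch when the window fires.
            val sorted = elements.map { it.latencyMs }.sorted()
            if (sorted.isEmpty()) return

            // Nearest-rank quantile over the sorted latencies.
            fun quantile(q: Double): Long = sorted[((sorted.size - 1) * q).toInt()]

            out.collect(
                AggregateResult(
                    key = key,
                    windowEnd = context.window().end,
                    p50 = quantile(0.50),
                    p99 = quantile(0.99)
                )
            )
        }
    }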

This snippet hides a ton of complexity, from fault tolerance to scaling, and Flink's ability to marry simplicity with powerful distributed processing is where it shines. In just a few lines of code, you have a resilient, scalable, and efficient stream processing pipeline!
