
System Design Series: Apache Flink from 10,000 Feet, and Building a Flink-powered Recommendation Engine

by admin
April 29, 2026
in Artificial Intelligence


For a while now, Apache Flink has been on my "things I actually need to understand properly" list. I'd seen it mentioned alongside Kafka, heard it come up in conversations about real-time pipelines, and sort of understood the use case. But I'd never actually sat down and learned it properly.

If you feel the same way, you're in good company. There's good reason to learn about Flink: it's one of the most popular tools in software engineering right now. Netflix uses it for near-real-time anomaly detection in their streaming infrastructure. Alibaba reportedly runs one of the largest Flink deployments in the world, processing hundreds of billions of events per day across tens of thousands of machines. Uber built their analytical platform around it. Flink has become the backbone of how some of the most data-intensive companies in the world process data as it happens. So if Flink has been on your list too, this is a good time to actually understand it.

So I dove in. And I was honestly surprised, not just by what Flink is, but by why it exists and how it's built. The story of Flink is really the story of a much deeper idea: how to make sense of high-scale, continuously streaming data. The problem statement is actually quite simple: how do you build real-world, practical answers from a massive scale of continuous data? This post is my attempt to explain that idea from the ground up, and to show you where Flink fits into it.

Let’s dive in.

Before We Start

Two concepts come up frequently in this post that are worth making sure we're on the same page about before we go further.

What's a stream? A stream is a continuous, potentially never-ending sequence of records arriving over time. Think about a user browsing a website: every page view, every click, every scroll is an event being produced. One after another, in real time. There's no natural "end" to this. As long as the user is active, events keep coming. That's a stream.

Image by Author

What's batch processing? Batch processing means taking a finite, bounded collection of records and processing it all at once. Instead of reacting to each event as it arrives, you collect events for a period of time, say an hour, and then run a computation over all of them together. The computation has a clear start and a clear end.

Image by Author
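To make the two modes concrete, here's a tiny plain-Python sketch (an illustration only, not Flink code): the same three events consumed as a stream, one result after every event, versus as a batch, one result at the end.

```python
# Illustration only: the same events, processed two ways.
events = [
    {"user_id": "u-8821", "event_type": "view"},
    {"user_id": "u-1042", "event_type": "purchase"},
    {"user_id": "u-8821", "event_type": "click"},
]

def process_stream(event_iter):
    """Stream processing: react to each event as it arrives."""
    counts = {}
    for event in event_iter:  # potentially never-ending
        counts[event["event_type"]] = counts.get(event["event_type"], 0) + 1
        yield dict(counts)    # a fresh result after every event

def process_batch(all_events):
    """Batch processing: collect everything first, compute once at the end."""
    counts = {}
    for event in all_events:  # bounded: a clear start and a clear end
        counts[event["event_type"]] = counts.get(event["event_type"], 0) + 1
    return counts             # a single result

print(list(process_stream(iter(events)))[-1])  # {'view': 1, 'purchase': 1, 'click': 1}
print(process_batch(events))                   # {'view': 1, 'purchase': 1, 'click': 1}
```

Same numbers either way; the difference is *when* you get them. The stream version has an answer after every event, the batch version only after the collection is complete.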

Both are legitimate ways to process data. The tension between them is what Flink was built to resolve, and we'll get there.

Back to the Problem: How We Actually Produce Data

Let me make this concrete with an example we'll use throughout this post.

Imagine you're building a recommendation engine, the kind that shows users "you may also like these" based on what they've been viewing. To do this well, your system needs to know things like:
What has this user been clicking on in the last few minutes?
What items are trending right now across all users?
Which products did this user view but not purchase in their last session?

Now, where does that data come from? Every time a user opens a product page, you record an event. Every click, every purchase, every search: your application is continuously writing records that look roughly like this:

{ "user_id": "u-8821", "item_id": "p-443", "event_type": "view", "timestamp": "2024-03-10T14:32:01Z" }
{ "user_id": "u-1042", "item_id": "p-117", "event_type": "purchase", "timestamp": "2024-03-10T14:32:03Z" }
{ "user_id": "u-8821", "item_id": "p-501", "event_type": "click", "timestamp": "2024-03-10T14:32:07Z" }

One record every few seconds for every user, across millions of concurrent users, continuously. That's your data. Not a file. Not a table that refreshes once a day. A stream: an ongoing, never-ending sequence of events that describes what your users are doing right now.

Again, this is what that stream looks like:

Image by Author

And yet the dominant paradigm for years was to take that stream and... ignore the fact that it was a stream. Dump the events into files every hour. Wait for the batch job to run. Then serve recommendations based on what users were doing last hour.

Image by Author

Why? Because batch processing is conceptually simple. You know exactly what data you have. You can reason about the computation clearly: it starts, it runs, it finishes. Systems like Hadoop and MapReduce (you don't need to know these in depth for this post) were built around this model and scaled to enormous data sizes. They worked.

But there's a fundamental cost: latency. If your batch job runs every hour, then in the worst case a user's behavior right now won't influence their recommendations for up to an hour. For a recommendation engine, that means a user who just showed strong interest in hiking gear gets shown laptop accessories, because the system hasn't caught up yet. The user searched for a hiking rucksack, and you need to show them tents and hiking poles on the next page load, not one hour later.

For fraud detection, hourly latency means fraudulent transactions go undetected for an hour. For a live dashboard, it means your "real-time" metrics can be up to 59 minutes stale. The cost of batch is that events happen in real time, but your system only learns about them on a schedule.

So as data volumes grew and latency requirements tightened, engineers started building streaming systems alongside their batch systems: systems that could process each event as it arrived, in milliseconds. Apache Storm was an early leader here. Amazon Kinesis. LinkedIn's Samza.

Image by Author

But building a new streaming system while maintaining an existing batch system isn't so simple. Now you have two systems to maintain. Your streaming pipeline computed approximate, real-time results. Your batch pipeline ran overnight and produced accurate, complete results. You had to write the same business logic twice, once for each system, in different frameworks, in different languages, kept in sync manually. When the batch job and the streaming job disagreed on a number (and they always disagreed eventually), you had to figure out which one was wrong.

Your recommendation engine in this new world now looks like this: a streaming component that updates recommendations in near-real-time based on recent events, and a batch component that rebuilds the full recommendation model every night based on historical data.

Two codebases. Two deployment pipelines. Two sets of bugs. One serving layer trying to reconcile them. (This two-headed setup became known as the Lambda Architecture.)

The Key Insight: Batch Is Just a Special Case of Streaming

Here's the idea at the heart of Flink, and it's quite simple:

A bounded data set is just a special case of an unbounded data stream that happens to end.

Your historical database of five years of user events: that's a stream that started five years ago and stopped today. Your log files from last month: that's a stream with a beginning and an end. The difference between "batch data" and "streaming data" isn't a fundamental difference about the nature of the data. At the end of the day, it's just JSON events describing what the user searched for and clicked on. The question is whether the stream is still flowing or has stopped.

Going back to our recommendation engine: the "historical data" you process in your nightly batch job and the "real-time events" you process in your streaming pipeline are both just records in the same sequence of user events. The only difference is when you read them. The nightly batch job reads records from 6 months ago. The streaming pipeline reads records from 6 seconds ago. Same data, different time window.

If you build a system that processes streams natively, and handles both infinite streams and finite ones, you don't need separate systems. You don't need to maintain two codebases. You have one engine, one set of logic, and you point it at whatever slice of the stream you need.

That's what Flink tries to do.

So What Is Apache Flink?

Apache Flink is a distributed stream processing framework. It takes a potentially unbounded stream of records (or a bounded batch of records, same thing), processes it in parallel across a cluster of machines, and produces results continuously as data flows through.

Image by Author

Internally, Flink jobs are written in code and converted to a DAG. For example, this is what the code for a Flink job might look like (it's not important to understand all the details; this is just to give a rough idea):

// ── 1. SOURCES ──────────────────────────────────────────────
searches = readFromKafka("search-events")
clicks   = readFromKafka("click-events")


// ── 2. PER-USER ACTIVITY (windowed aggregation) ─────────────
// group events by user, compute rolling features over last 30 min
userActivity = (searches + clicks)
    .keyBy(userId)
    .window(slidingWindow(size=30min, slide=1min))
    .aggregate(activityAggregator)
// → { userId, recentQueries, recentClicks, categories, ... }


// ── 3. USER EMBEDDING (call user-tower model) ───────────────
// turn the activity features into a vector
userState = userActivity.asyncMap(callUserTowerModel)
// → { userId, embedding[128], features }


// ── 4. CANDIDATE GENERATION (2 sources, then merge) ─────────
annCandidates      = userState.asyncMap(vectorAnnLookup)  // ~500 items
trendingCandidates = userState.asyncMap(trendingLookup)   // ~200 items

allCandidates = (annCandidates + trendingCandidates)
    .keyBy(userId)
    .window(2sec)
    .reduce(mergeAndDedupe)
// → { userId, candidates: ~1000 itemIds }


// ── 5. FETCH ITEM FEATURES (batched lookup) ─────────────────
scoringInputs = allCandidates
    .joinWith(userState, on=userId)
    .asyncMap(fetchItemFeatures)
// → { userId, userFeatures, [(itemId, itemFeatures) × ~1000] }


// ── 6. RANKING (call ranking model) ─────────────────────────
ranked = scoringInputs.asyncMap(callRankingModel)
// → { userId, top 100 (itemId, score) pairs }


// ── 7. SINK ─────────────────────────────────────────────────
ranked.writeTo(redis)

Internally, Flink breaks this code down into a graph of physical tasks to be done, and splits those tasks into smaller sets of parallel "subtasks":

Image by Author

Flink pushes tasks to worker nodes. Each worker runs its assigned tasks continuously, sends periodic heartbeats back to Flink, and reports if a task fails so Flink can restart it.

Image by Author

Let's break down the core concepts of Flink.

Core Concepts

Streams and Operators

Let me start from the simplest possible picture and build up.

Every Flink program is a dataflow graph: a set of operators connected by data streams. Don't worry if this sounds abstract right now; we'll build the picture piece by piece and it will click quickly.

Sources produce records (reading from Kafka, a file, a database).

Operators transform them.

Sinks consume the output (writing to a database, another Kafka topic, a dashboard).

An operator is a unit of processing logic. For our recommendation engine, an operator might filter out bot traffic, or enrich an event with product metadata, or count how many times each product was viewed. Each operator receives records from one or more input streams, does something to them, and emits records to one or more output streams.

A stream is the sequence of records flowing between operators. In our case, a stream of user events: view events, click events, purchase events, one after another as they happen.

This is the basic shape of any Flink job.
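The source → operator → sink shape can be sketched in plain Python with generators (helper names are made up for illustration; a real Flink job would express this through its DataStream API):

```python
# A toy dataflow: source -> operator -> operator -> sink.

def source():
    """Source: produce records (hardcoded here; in Flink, Kafka/files/etc.)."""
    yield {"user_id": "u-8821", "item_id": "p-443", "event_type": "view", "is_bot": False}
    yield {"user_id": "bot-1",  "item_id": "p-443", "event_type": "view", "is_bot": True}
    yield {"user_id": "u-1042", "item_id": "p-117", "event_type": "purchase", "is_bot": False}

def filter_bots(records):
    """Operator: drop bot traffic."""
    return (r for r in records if not r["is_bot"])

def enrich(records):
    """Operator: attach product metadata from a lookup table."""
    metadata = {"p-443": "hiking", "p-117": "camping"}
    for r in records:
        yield {**r, "category": metadata.get(r["item_id"], "unknown")}

sink = []  # Sink: in Flink this would be a database, a Kafka topic, a dashboard
for record in enrich(filter_bots(source())):
    sink.append(record)

print(len(sink))  # 2: the bot event was filtered out
```

The pipeline is just function composition over a stream of records, which is exactly the mental model for a Flink dataflow graph.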

Parallelism

A single machine can process events fast, but if you're handling millions of users, a single machine isn't enough. Flink solves this by running every operator in parallel: each operator is split into multiple subtasks that run concurrently on different machines in your cluster.

If you have a Filter operator with parallelism 4, there are 4 instances running concurrently, each processing a different portion of the stream. Add more machines, get more subtasks, handle higher volumes. This is how Flink scales to billions of events per day.

For our recommendation engine, this means the window aggregation for 10 million users isn't running sequentially on one machine; it's split across dozens of workers.
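How does an event find the right subtask? The usual trick, and what keyBy does conceptually, is hash partitioning: hash the key, take it modulo the parallelism, so every event for the same user lands on the same subtask. A sketch (production systems use a stable hash; Python's built-in `hash` is only stable within one process):

```python
# keyBy-style routing with parallelism 4: events with the same user_id
# always route to the same subtask, so per-user state stays local to it.
PARALLELISM = 4

def subtask_for(user_id: str) -> int:
    # Stand-in for a stable hash partitioner (e.g. murmur in real systems).
    return hash(user_id) % PARALLELISM

events = ["u-8821", "u-1042", "u-8821", "u-7310", "u-8821"]
routing = [(uid, subtask_for(uid)) for uid in events]

# All three u-8821 events route to one and the same subtask index.
targets = {s for uid, s in routing if uid == "u-8821"}
print(len(targets))  # 1
```

This is why keyed state works at all: the subtask that owns user u-8821's events is the only one that ever needs u-8821's state.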

State

Going back to our recommendation engine: when a user views a product, that single event by itself tells you almost nothing. You need context. What else has this user been viewing in the past couple of minutes? Have they been browsing products in the same category? Did they almost purchase something similar last session? To answer these questions, your system needs memory: it needs to remember what happened before.

In the early days of stream processing, most systems were stateless. Each event was processed in isolation: the operator saw the event, transformed it, moved on. No memory of what came before. This worked fine for simple pipelines, like filtering out bot traffic or enriching events with metadata from a lookup table. But it was fundamentally too limited for anything that required reasoning about patterns over time.

Think about what our recommendation engine actually needs to do. For every incoming event, it needs to ask: "What has user u-8821 done in the last 10 minutes?" To answer that question, someone needs to be keeping a running list of user u-8821's recent events. And user u-1042's recent events. And all the other users'. That's state: data that accumulates and evolves as records flow through the operator, rather than being derived fresh from each individual record.

Flink makes state a first-class concept. An operator can declare state explicitly: a counter, a hash map keyed by user ID, a sorted list of recent events. Flink gives you that state as a managed object you can read and write during processing. For our recommendation engine, the state might be a hash map from user ID to "list of product IDs viewed in the last 10 minutes." Every time a new view event arrives, you look up the user in the map, append the item, and trim events older than 10 minutes.

But managing state in a distributed system is genuinely hard. What happens when the machine running your operator crashes? That in-memory hash map is gone. Flink handles this: it periodically snapshots all operator state to durable storage, so on recovery it can restore everything to where it was before the failure. And it ensures that state updates are applied exactly once: even if a machine crashes and the same events are replayed during recovery, your counts won't be doubled.

We'll go deep on how Flink achieves exactly-once guarantees in a future architecture post. For now, just know that Flink gives you state that feels as reliable as writing to a database, with the performance of an in-memory hash map.
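Here's what that per-user state could look like as plain Python (a sketch of the idea, not Flink's managed-state API): a map from user to their recent views, trimmed to the last 10 minutes on every incoming event.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60  # remember the last 10 minutes per user

# state: user_id -> deque of (timestamp, item_id), oldest first
recent_views = defaultdict(deque)

def on_view_event(user_id: str, item_id: str, ts: float) -> list:
    """Update state for one event and return the user's recent items."""
    views = recent_views[user_id]
    views.append((ts, item_id))
    # trim entries older than the window
    while views and views[0][0] < ts - WINDOW_SECONDS:
        views.popleft()
    return [item for _, item in views]

on_view_event("u-8821", "p-443", ts=0)
on_view_event("u-8821", "p-501", ts=300)
print(on_view_event("u-8821", "p-622", ts=700))  # ['p-501', 'p-622']
```

The view at ts=0 has aged out of the 10-minute window by ts=700, so only the two recent items remain. The hard part Flink adds on top is making this dictionary survive machine crashes, which is where snapshots come in.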

Windows

We've got a stream of user events, operators running in parallel, and state accumulating per user. Now here's a problem that comes up almost immediately in any real aggregation.

Let's say you want to compute "the 10 most viewed products in the last 5 minutes" to power a "trending now" section of your site. You have an operator that's counting views per product. But your stream is infinite. When do you emit a result? You can't wait until "all the events" arrive; they never stop arriving.

You need a way to slice the infinite stream into finite pieces and compute over each piece. That's a window.

A window is a bounded chunk of your stream. You define it, Flink groups the events into that chunk, and when the chunk is "full," it runs your aggregation and emits a result. Flink has several window types: tumbling windows, sliding windows, session windows, and more. It's not crucial to understand the differences between each window type, but the gist of windows is that they look at records over some time interval.
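The simplest variant, a tumbling window, just buckets each event by which fixed-size interval its timestamp falls into. A plain-Python sketch (illustration only, not Flink's window API):

```python
from collections import Counter, defaultdict

WINDOW = 300  # 5-minute tumbling windows, in seconds

# (timestamp_seconds, item_id) view events
events = [(10, "p-443"), (70, "p-443"), (200, "p-117"),
          (310, "p-443"), (420, "p-501"), (450, "p-501")]

# Assign each event to the window containing it, then count per window.
windows = defaultdict(Counter)
for ts, item in events:
    window_start = (ts // WINDOW) * WINDOW  # 0, 300, 600, ...
    windows[window_start][item] += 1

for start, counts in sorted(windows.items()):
    item, n = counts.most_common(1)[0]
    print(f"window [{start}, {start + WINDOW}): top item {item} with {n} views")
```

Each window is finite, so the aggregation has a well-defined answer: p-443 trends in the first 5 minutes, p-501 in the next. Sliding and session windows change only how the buckets are drawn, not this basic idea.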

Tidbits from the Original Paper

I spent some time reading the 2015 Apache Flink paper, "Apache Flink: Stream and Batch Processing in a Single Engine" by Carbone, Katsifodimos, Ewen, Markl, Haridi, and Tzoumas. A few things from the paper add useful color to what we covered above:

On Fault Tolerance and Exactly-Once Guarantees

The paper describes exactly-once semantics this way: "Flink offers strict exactly-once-processing consistency guarantees for stateful operators through a combination of distributed snapshots and partial re-execution upon recovery." The key phrase there is partial re-execution: when a machine fails, Flink doesn't restart the entire job from the beginning. It rolls back all operators to their last successful snapshot, then replays only the input from that point forward. The maximum amount of reprocessing is bounded by the gap between two consecutive checkpoints, which is a tunable parameter.

The mechanism that makes this work without pausing the computation is called Asynchronous Barrier Snapshotting (ABS), and it's genuinely clever. We'll cover it in full detail in the next post. But the headline is: Flink injects special "barrier" markers into the data stream, which flow through the operators like regular records. When an operator receives a barrier, it snapshots its state to durable storage and forwards the barrier downstream, all while continuing to process records. No pause, no freeze, no missed events.
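A heavily simplified single-operator simulation of that idea (the real ABS protocol also has to align barriers across multiple input channels and write snapshots asynchronously; this sketch ignores all of that): a barrier is just another item in the stream that triggers a state copy.

```python
# Minimal sketch of barrier snapshotting for one operator: barriers flow
# through the stream like ordinary records; on seeing one, the operator
# copies its state aside and keeps processing without pausing the stream.
BARRIER = object()

def run_operator(stream):
    state = {"count": 0}
    snapshots = []                         # stands in for durable storage
    for item in stream:
        if item is BARRIER:
            snapshots.append(dict(state))  # snapshot, then continue
        else:
            state["count"] += 1            # normal per-record processing
    return state, snapshots

state, snapshots = run_operator(["e1", "e2", BARRIER, "e3", BARRIER, "e4"])
print(state)      # {'count': 4}
print(snapshots)  # [{'count': 2}, {'count': 3}]
```

On failure, recovery would restore the latest snapshot (count 3 here) and replay only the records after that barrier, which is exactly the "partial re-execution" the paper describes.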

On Unified Batch and Stream Processing

One of the clearest statements in the paper is this: "A bounded data set is a special case of an unbounded data stream." The authors are making a philosophical claim, not just a technical one. And they back it up: "Batch computations are executed by the same runtime as streaming computations. The runtime executable may be parameterized with blocked data streams to break up large computations into isolated stages that are scheduled successively."

In plain terms: there is no separate batch engine in Flink. Batch jobs run on the very same distributed dataflow runtime that processes your Kafka streams. The only difference is that batch jobs use "blocked" data exchange between stages: the upstream operator finishes fully before the downstream one starts. Everything else, the operator model, the state management, the serialization, is identical.

Going back to our recommendation engine: this means the job that counts real-time view trends and the job that processes 6 months of historical events for model retraining can share the same operators, the same cluster, and the same codebase. The paper's promise is that the Lambda Architecture, with its two systems and two codebases, is simply not necessary.

Wrapping Up

Let's quickly do a TL;DR:

Data is produced as continuous streams, but we've historically forced it into batches, creating latency and the operational pain of maintaining two systems.

Flink is built on the insight that batch is just a special case of streaming, and unifies both in a single engine.

The core building blocks are: operators (processing logic), streams (data in motion), state (memory that persists across records), and windows (bounded slices of a stream for computation).

Fault tolerance with exactly-once guarantees is built in.

Ideally, I would have liked to go in depth on each of these topics (and there's a lot of depth in them), but this post has already become quite long, so I'll defer that to future Sanil for now. You can also follow me on LinkedIn for more byte-sized posts and to see what I'm learning about right now.

We talked a lot about Apache Kafka (given it's the backbone of most data architectures), but did you ever wonder how Apache Kafka works and what its architecture is like? I was surprised to learn how simple Kafka really is under the hood. I wrote a whole blog post about it here:
System Design Series: Apache Kafka from 10,000 Feet
Let's look at what Kafka is, how it works, and when we should use it! (medium.com)

If you're looking for something more in-depth, I'd recommend checking out one of my most popular posts on Temporal, a workflow orchestration tool, with in-depth explanations of how events are scheduled, started, and completed:
System Design Series: A Step-by-Step Breakdown of Temporal's Internal Architecture
A step-by-step deep dive into Temporal's architecture, covering workflows, tasks, shards, partitions, and how Temporal... (medium.com)
