7.6.22

There is only stream processing.

This post is about what to call stream processing. Strangely enough, it has little to do with terminology, and almost everything to do with perspective.

If you are pressed for time (ha!), the conclusion is that "stream processing vs batch processing" is a false dichotomy. Batch processing is merely an implementation of stream processing. Every time people use these as opposites, they are making the wrong abstraction.

Can a reasonable definition be general?

A reasonable definition of a data stream is a sequence of data chunks, where the length of the sequence is either "large" or unbounded.

This is a pretty general definition, especially if we leave open what "large" means. Examples include:

  • anything transmitted over TCP, say a file or HTTP response
  • a Twitter feed
  • system or application logs, say metadata from any user's visit to this blog
  • the URLs of websites you visit
  • location data acquired from your mobile phone
  • personal details of children who start school every year

It also stands to reason that stream processing is processing of a data stream. A data stream is data that is separated in time.
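To fix ideas, here is a minimal sketch in Python (the function names and the 64 KB chunk size are arbitrary, not anything canonical): a data stream is just something you can iterate over, and the consumer looks the same whether the chunks arrive every millisecond or every year, and whether the sequence is bounded or not.

```python
from typing import Iterable, Iterator

def chunks_from_file(path: str) -> Iterator[bytes]:
    """A bounded stream: a file read in fixed-size chunks."""
    with open(path, "rb") as f:
        while chunk := f.read(64 * 1024):
            yield chunk

def total_bytes(stream: Iterable[bytes]) -> int:
    """The consumer does not care about the cadence or the length of the stream."""
    total = 0
    for chunk in stream:
        total += len(chunk)
    return total
```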

When a definition is that general, engineers get suspicious. Should it not make a difference whether a chunk of data arrives every year or every millisecond? Every conversation about stream processing happens in a context where we know a lot more about the temporal nature of the sequence.

The same engineers, when dealing with distributed systems, have no problem referring to large data sets as a whole, even though they fully know that the data may be distributed over many machines. We may well call a large data set data that is separated in space. A data stream could also be both, separated in time and space (maybe we could call that a "distributed data stream").

At the warehouse

Where am I going with this? Let's quickly establish that "batch processing" is but a form of stream processing under the definition above.

What do people call batch processing? The etymology of this goes back to punch cards, but it is not about those.

Batch processing is frequently found in data warehousing, in the form of the extract-transform-load (ETL) process. This is a setup that has been around since the 1970s, and that does not make it bad. What is essential is that data is periodically ingested (extract), say in the form of large files. It is then turned (transform) into a uniform representation that is suitable for various kinds of querying (load).

Accumulating data and processing it periodically: we have seen this before. Does the data fit the general definition of a data stream? Surely, since there is a notion of new data coming in.
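As a rough sketch (the file location, polling interval and trivial transform are made up), the classic warehouse loop is already a stream consumer: extract yields chunks, and transform/load process them one chunk at a time.

```python
import glob
import time
from typing import Iterator, List

def transform(text: str) -> List[str]:
    """Transform: normalize raw file contents into rows (trivially, here)."""
    return text.splitlines()

def load(rows: List[str]) -> None:
    """Load: stand-in for writing into the warehouse."""
    print(f"loaded {len(rows)} rows")

def batches(pattern: str, poll_seconds: float) -> Iterator[List[str]]:
    """Extract: each wake-up yields whatever new files have landed since last time."""
    seen = set()
    while True:
        new = sorted(p for p in glob.glob(pattern) if p not in seen)
        if new:
            seen.update(new)
            yield new                  # one "batch" = one chunk of an unbounded stream
        time.sleep(poll_seconds)

def run_etl(pattern: str = "/data/incoming/*.csv") -> None:
    for batch in batches(pattern, poll_seconds=3600):
        for path in batch:
            with open(path) as f:
                load(transform(f.read()))
```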

What could be alternative terminology? Tyler Akidau uses the word "batch engine" in this post on stream processing. So the good old periodic processing could be called "stream processing done with a batch engine."

I promised that this is not only about terminology. When did people first feel the need to distinguish batch processing from stream processing?

Data stream management systems

The people who systematically needed to distinguish processing of data streams from simply large data sets were the database community. Dear academics, I don't mean to hurt any feelings, but I will just count all papers on "data stream processing" or "complex event processing" as database community.

The database researchers implemented efficient evaluation of a (reasonably) standard, high-level, declarative language: relational queries (SQL). Efficient evaluation and performance meant making the best use of available, bounded resources. As part of this journey, architectures appeared that look very much like streams of partial results (such as Volcano, or the iterator model).
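Here is a minimal sketch of that iterator-style evaluation (the operator names are mine, just to fix ideas): each operator consumes a stream of tuples from its child and produces one itself, so partial results flow through the plan one tuple at a time.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[str, ...]

def scan(table: List[Row]) -> Iterator[Row]:
    """Leaf operator: a table scan."""
    yield from table

def select(pred: Callable[[Row], bool], child: Iterable[Row]) -> Iterator[Row]:
    """WHERE / sigma: filter the child's stream of tuples."""
    return (r for r in child if pred(r))

def project(cols: List[int], child: Iterable[Row]) -> Iterator[Row]:
    """SELECT / pi: keep only some columns of each tuple."""
    return (tuple(r[c] for c in cols) for r in child)

# A query plan is just composed iterators; tuples are pulled through one at a time.
employees = [("ada", "B1"), ("grace", "B2")]
plan = project([0], select(lambda r: r[1] == "B2", scan(employees)))
print(list(plan))   # [('grace',)]
```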

Take a look at the relational query (SELECT becomes $\pi$, WHERE becomes $\sigma$, JOIN becomes $\Join$). What would happen if we flipped the direction of the arrows? Instead of bounded data like the "Employee" and "Building" tables, we could have continuous streams that go through the machinery.
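To make the symbols concrete, such a query plan corresponds to a relational-algebra expression of roughly this shape (the column names and the predicate here are placeholders, not necessarily the ones in the diagram):

$$\pi_{\text{name}}\bigl(\sigma_{\text{city} = \text{'NYC'}}(\text{Employee} \Join \text{Building})\bigr)$$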

Exercise: draw the diagram above, with all arrows reversed, and think about how this could be a useful stream processing system. Maybe there is a setup procedure that has to happen in each building before an engineer starts working from there, and they would like to know about new arrivals and departures.

Data stream management systems (DSMS) became a thing 20 years ago, when implementors realized that most of their machinery would continue to work, mutatis mutandis, when tuples come in continuously.

  • Michael Stonebraker, of PostgreSQL fame, built a system called Aurora and founded a company, StreamBase Systems
  • At Stanford they wrote about the Stanford Stream Data Management System and Continuous Query Language (CQL)
  • (I'm not going to do a survey here)

Academic authors will delineate their field using descriptions like the following:

  • "high" rate of updates
  • requirement to provide results in "real time" (bounded time)
  • limited resources (e.g. memory)

We are now getting closer to the more specific meaning people attach to "stream processing:" not only do we receive the data in chunks at different times, we also need to produce results in a "short" amount of time, or with "bounded" machine resources.

In order to understand why database folks went streaming, one should know that evaluating queries in a DBMS with high performance is also a form of stream processing. When a user types a SQL query at the prompt (SELECT ... FROM ..., the all-caps giving that distinctive pre-history feel), the result should come in as quickly as possible, and there are surely some limits on the machine resources. So generalizing query evaluation to continuously arriving data really was a logical next step to overcome limitations.

Interlude: Windowing and micro-batching

If you did the exercise, you may wonder how a join is supposed to work when we take a streaming perspective. There are multiple answers.

A particularly nice and useful scenario is when the join can be treated as a simple lookup, for example when each row in a data stream is enriched with something that is looked up via a service.
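A tiny sketch of that enrichment pattern (the dictionary here stands in for whatever service or key-value store actually does the lookup; the event shapes are made up):

```python
from typing import Dict, Iterable, Iterator, Tuple

def enrich(visits: Iterable[Tuple[str, str]],
           buildings: Dict[str, str]) -> Iterator[Tuple[str, str, str]]:
    """Join-as-lookup: attach the building's city to each visit event."""
    for employee, building in visits:
        city = buildings.get(building, "unknown")   # in practice: a service call
        yield employee, building, city

visits = [("ada", "B1"), ("grace", "B2")]
print(list(enrich(visits, {"B1": "Zurich", "B2": "NYC"})))
```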

Another scenario is the windowed stream join: here we consider bounded segments (windows) of two data streams and join what we find in them. Usually this requires some guarantees, since the streams are a priori not synchronous: either they are synchronous enough, or one uses some amount of buffering and periodically processes what is in the buffer.

Wait - did I just write "periodic processing"? That is right, so it looks like when stream processing is hardcore enough, it contains periodic processing again. This is where people will usually say things like "micro-batch." There are simply scenarios in stream processing that cannot be done without periodic processing (everything that involves windows in processing time).
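Here is a minimal sketch of one such micro-batch (the event shapes are made up): buffer whatever arrived on each stream during the last processing-time window, join the two buffers, and start over with the next window.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

def windowed_join(window_left: List[Tuple[str, str]],
                  window_right: List[Tuple[str, str]]) -> Iterator[Tuple[str, str, str]]:
    """Join one window's worth of (key, value) pairs from each stream."""
    right_by_key: Dict[str, List[str]] = defaultdict(list)
    for key, value in window_right:
        right_by_key[key].append(value)
    for key, left_value in window_left:
        for right_value in right_by_key.get(key, []):
            yield key, left_value, right_value

# One micro-batch: whatever arrived on each stream during the last window.
arrivals   = [("B1", "ada arrived"), ("B2", "grace arrived")]
departures = [("B1", "bob left")]
print(list(windowed_join(arrivals, departures)))
```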

A horizontal argument

Now, the words "limited machine resources" meant something different 20 years ago. We can (and do) build systems that involve many machines and communication between them, sharding, or "horizontal scaling". From the good old MapReduce to NoSQL and Craig Chambers' FlumeJava (which lives happily on as Apache Beam) to Spark and Cloud Dataflow, there is a series of systems and APIs that deal with stream processing in a distributed manner.

Tying the knot

The irony of talking about "batch processing" today is that periodic processing of large files also involves distributed stream processing underneath. When a large input data set is processed with MapReduce, it is distributed across a set of workers that, in the map phase, locally produce partial results. The shuffle phase then takes care of getting all partial results to the right place for the reduce phase. The "separation in space" is also a "separation in time": a particular partition can be "done" before others, which means that the results form a result stream.
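A toy word count makes the point (the phase names follow the usual MapReduce vocabulary; everything else is made up): each phase is written as a consumer of the previous phase's stream of partial results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Each worker maps its partition locally, emitting partial results."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Route partial results to the right place, grouped by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups: Dict[str, List[int]]) -> Iterator[Tuple[str, int]]:
    """Combine the partial results per key."""
    for key, values in groups.items():
        yield key, sum(values)

lines = ["there is only stream processing", "stream processing is processing"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
```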

Depending on the level at which one is discussing a system, the performance expectations we have, and the trade-offs, one may be able to ignore the spatial and temporal separation. It seems that recognizing the temporal separation always brings advantages.
