Introduction

Streaming data is a big deal in big data these days. As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption.

With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way. Expanded from Tyler Akidau’s popular blog posts Streaming 101 and Streaming 102, this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You’ll also dive deep into watermarks and exactly-once processing with coauthors Slava Chernyak and Reuven Lax.

1. Streaming 101

Why streaming is awesome, death to Lambda Architecture, data processing patterns.

2. The What, Where, When, and How of Data Processing

The basics, explained in prose, animation, limerick, and interpretive dance.

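Here are some examples using Apache Beam in Java. They are simplified pseudocode: IO.read(...) and the trigger helpers (AtWatermark(), AtPeriod(), AtCount(), Sequence(), Repeat()) are shorthand rather than exact Beam SDK calls. Each snippet is a separate variant of the same pipeline, so scores is redeclared each time.

    // Parse raw input into keyed scores, then sum per key over the whole data set.
    // Element types are assumed here: String lines in, KV<String, Integer> scores out.
    PCollection<String> raw = IO.read(...);
    PCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn()));
    PCollection<KV<String, Integer>> scores = input
        .apply(Sum.integersPerKey());

    // The same sum, computed within fixed two-minute windows.
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
        .apply(Sum.integersPerKey());

    // Explicit trigger: emit a window's result when the watermark passes its end.
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                     .triggering(AtWatermark()))
        .apply(Sum.integersPerKey());

    // Manually specified early and late firings: fire once per minute until the
    // watermark passes the end of the window, then fire for every late element.
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                     .triggering(Sequence(
                         Repeat(AtPeriod(Duration.standardMinutes(1)))
                             .OrFinally(AtWatermark()),
                         Repeat(AtCount(1)))))
        .apply(Sum.integersPerKey());

    // Early and late firings again, expressed through the early/late helpers.
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                     .triggering(
                         AtWatermark()
                             .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                             .withLateFirings(AtCount(1))))
        .apply(Sum.integersPerKey());
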
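
To make the windowed sum concrete, here is a minimal sketch in plain Java with no Beam dependency. It only illustrates what "sum per key within fixed two-minute event-time windows" computes over a small, bounded list; it ignores triggers, watermarks, and late data entirely. The names FixedWindowSum, Event, windowStart, and sumPerWindow are hypothetical and do not come from Beam or from the book.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FixedWindowSum {

    /** Hypothetical input element: a keyed score stamped with its event time. */
    static final class Event {
        final String key;
        final int score;
        final long eventTimeMillis;

        Event(String key, int score, long eventTimeMillis) {
            this.key = key;
            this.score = score;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /** Start of the fixed window containing the timestamp; windows are aligned to time 0. */
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return Math.floorDiv(eventTimeMillis, windowSizeMillis) * windowSizeMillis;
    }

    /** Sums scores per key within each fixed event-time window, keyed by window start. */
    static Map<Long, Map<String, Integer>> sumPerWindow(List<Event> events, long windowSizeMillis) {
        Map<Long, Map<String, Integer>> sums = new TreeMap<>();
        for (Event e : events) {
            long start = windowStart(e.eventTimeMillis, windowSizeMillis);
            sums.computeIfAbsent(start, unused -> new TreeMap<>())
                .merge(e.key, e.score, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long twoMinutes = 2 * 60 * 1000L;
        List<Event> events = Arrays.asList(
            new Event("red", 5, 30_000L),    // 00:00:30, window [00:00, 00:02)
            new Event("red", 7, 90_000L),    // 00:01:30, same window
            new Event("blue", 3, 100_000L),  // 00:01:40, same window
            new Event("red", 4, 150_000L));  // 00:02:30, window [00:02, 00:04)
        System.out.println(sumPerWindow(events, twoMinutes));
        // prints {0={blue=3, red=12}, 120000={red=4}}
    }
}
```

The window assignment depends only on each element's event time, which is why the result is the same regardless of the order in which the elements arrive. What the trigger variants above add is control over when, in processing time, each window's partial or final sum is emitted.
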
3. Watermarks

Progress and completeness in unbounded data sets, how watermarks are established and propagated, real-world examples.
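
As a rough illustration of what "established and propagated" can mean, the sketch below uses one common heuristic, chosen here as an assumption rather than as the book's method: the watermark trails the largest event time seen so far by a fixed out-of-orderness bound, and a stage with several inputs takes the minimum of their watermarks. The class HeuristicWatermark and all of its members are hypothetical names in plain Java.

```java
public class HeuristicWatermark {
    private final long maxOutOfOrdernessMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    HeuristicWatermark(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    /** Records an observed event timestamp; older timestamps never move the watermark back. */
    void observe(long eventTimeMillis) {
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
    }

    /** Current estimate: events with timestamps below this are no longer expected. */
    long current() {
        if (maxEventTimeMillis == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing observed yet
        }
        return maxEventTimeMillis - maxOutOfOrdernessMillis;
    }

    /** An element is late if it arrives with an event time behind the watermark. */
    boolean isLate(long eventTimeMillis) {
        return eventTimeMillis < current();
    }

    /** Propagation: a stage is only as far along as its slowest input. */
    static long outputWatermark(long... inputWatermarks) {
        long min = Long.MAX_VALUE; // returned as-is when there are no inputs
        for (long w : inputWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }

    public static void main(String[] args) {
        HeuristicWatermark wm = new HeuristicWatermark(5_000L);
        wm.observe(10_000L);
        wm.observe(8_000L);   // out of order; the watermark does not regress
        wm.observe(20_000L);
        System.out.println(wm.current());                           // prints 15000
        System.out.println(wm.isLate(12_000L));                     // prints true
        System.out.println(outputWatermark(wm.current(), 9_000L));  // prints 9000
    }
}
```

A heuristic like this trades completeness for latency: a larger bound yields fewer late elements but delays results, while a smaller bound does the opposite.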

4. Advanced Windowing

Practical considerations for going beyond the basics: processing-time windows, sessions, and the importance of custom windowing.

5. Exactly-Once and Side Effects

The illusionary art of creating perfection out of imperfection; the mythical beasts known as idempotence and determinism; how Apache Flink, Apache Spark, and Google Cloud Dataflow work their magic.

6. Streams and Tables

The basis for life, the universe, and everything, assuming a data-processingly skewed definition of the above, or: that feeling when you realize everything you do was invented by the database community decades ago.

7. The Practicalities of Persistent State

For those "functional programming be damned, just give me a Turing machine" kinds of days.

8. Streaming SQL

Contorting relational algebra for fun and profit. Also, time-varying relations will change your life.

9. Streaming Joins

A brief moment of panic upon realizing that all joins are streaming at the core, followed by a prolonged sense of relief upon realizing it makes them all the easier to understand.

10. The Evolution of Large-Scale Data Processing

An opinionated history of stream processing in the MapReduce lineage of systems, with a healthy dose of source material citations for drawing your own conclusions.
