crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Thoughts on Cloud Dataflow
Date Sun, 25 Jan 2015 20:51:10 GMT
You all may have seen the blog post I wrote about the Spark-based
implementation of Google's Cloud Dataflow (CDF) API:

http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/

There are two very natural questions I'll try to answer in the rest of this
email:

1) Why did I do this?
2) What does it mean for the Crunch project?

On the first question, I personally enjoy reading and thinking about APIs
for data processing pipelines. It's a weird hobby, but it's mine, and when
Google came to me and said that I could get an early look at the successor
to FlumeJava, I was more than happy to help them out and see how hard it
would be to run what they had on top of Spark.

There are two ideas in CDF that I really like: the PTransform abstraction
and the unification of arbitrary batch and stream processing into a single
data model and API. I'll explain them in terms of the Spark/Crunch/Scalding
(SCS) programming models, which are all essentially the same: an object that
represents a distributed collection that can be transformed into other
distributed collections by function application.

1) PTransforms: This is a pattern I see in SCS projects all of the time
(writing this in Scala to save on keystrokes):

def count[S](input: PCollection[S]): PTable[S, Long] = {
  input.map(s => (s, 1L)).groupByKey().sum()
}

org.apache.crunch.lib is essentially a collection of functions that look
like this: they perform a complex transformation via a series of primitive
transforms (parallelDo, groupByKey, union, and combineValues). This makes it easy for
developers to write pipelines using higher-level concepts, but the SCS
planner/executors only represent pipelines in terms of the primitive
transforms, so we lose this higher-level information during pipeline
execution, which can make it difficult to figure out the mapping between
the logical pipeline and the actual MR/Spark jobs that are executing the
code when it comes time to diagnose errors and/or performance bottlenecks.

In CDF, you can capture the higher-level structure of your pipelines via
PTransforms, which look like this:

https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/PTransform

So if I have a PTransform that is made up of other PTransforms (which is
how the entire CDF transform library is structured), the CDF
planner/executor knows about that hierarchical structure of the pipeline,
and you can examine the runtime performance of your pipeline in terms of
those higher-level concepts. It's sort of like when you're using a JVM
profiler like YourKit and you can navigate through the function call
hierarchy to find bottlenecks. It's a very cool idea and I am insanely
jealous I didn't think of it first. :)
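
To make the idea concrete, here is a minimal sketch of a transform that remembers its own sub-transforms. It is a toy in-memory model over Seq, not the CDF or Crunch API: Transform, Primitive, Composite, describe, and CountExample are all hypothetical names invented for illustration.

```scala
// Toy model of a named, nestable transform over an in-memory Seq.
// Every name here is hypothetical; this is not the CDF or Crunch API.
trait Transform[A, B] {
  def name: String
  def children: List[Transform[_, _]] = Nil
  def apply(in: Seq[A]): Seq[B]

  // Render the transform hierarchy as indented lines, outermost first.
  def describe(indent: Int = 0): List[String] =
    (" " * indent + name) :: children.flatMap(c => c.describe(indent + 2))
}

// A primitive transform wraps one function and has no children.
final case class Primitive[A, B](name: String, f: Seq[A] => Seq[B])
    extends Transform[A, B] {
  def apply(in: Seq[A]): Seq[B] = f(in)
}

// A composite chains two transforms and keeps them as children, so the
// higher-level structure is still visible after the pipeline is built.
final case class Composite[A, B, C](
    name: String,
    first: Transform[A, B],
    second: Transform[B, C]
) extends Transform[A, C] {
  override def children: List[Transform[_, _]] = List(first, second)
  def apply(in: Seq[A]): Seq[C] = second(first(in))
}

object CountExample {
  // The count() function from earlier, rebuilt as a named composite.
  val pairWithOne: Transform[String, (String, Long)] =
    Primitive[String, (String, Long)]("PairWithOne", in => in.map(s => (s, 1L)))

  val sumPerKey: Transform[(String, Long), (String, Long)] =
    Primitive[(String, Long), (String, Long)](
      "SumPerKey",
      in => in.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq
    )

  val count: Transform[String, (String, Long)] =
    Composite("Count", pairWithOne, sumPerKey)
}
```

Calling CountExample.count.describe() returns the name "Count" followed by its two indented children. That nested listing is the kind of information a planner could attach runtime metrics to, instead of reporting only the flattened primitives.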

2) The unification of batch and streaming: this is more interesting, b/c
CDF is closer to Jay Kreps' kappa architecture than Nathan Marz's lambda
architecture, in the sense that CDF streaming is capable of doing
everything CDF batch can do and offers the same fault tolerance guarantees
(including exactly-once element processing). However, the batch engine is
still used for pure batch pipelines b/c it is faster and/or less resource
intensive than the streaming engine.

CDF streaming is built on top of MillWheel, which doesn't have an exact
open-source analogue at this point, and is capable of doing some things
that I don't think can be done in Spark Streaming, Storm or Samza b/c
MillWheel assumes you have access to a KV store like Bigtable/HBase for
maintaining per-key state and can make a distinction between when an event
happened and when the event arrived for processing by the streaming engine.
In particular, I believe it's possible for CDF to create real-time sessions
where the sessionization criterion involves a pre-defined gap between events
for the same key (i.e., end one session and start another if more than N
seconds go by and I don't see any new events for a given key). The
distinction between event-time and arrival-time also has some implications
for handling late-arriving events that require CDF to have some semantics
around windowing that don't appear to exist in the other stream processing
engines. I'm not sure of the best strategy here, so I was thinking of
talking to folks who work on the various streaming engines to get their
take on how to support this functionality.
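
The gap-based session rule can be sketched over a finite, in-memory set of events. This is only an illustration of the rule itself, assuming each event carries its own event time; it is not a streaming implementation and says nothing about how MillWheel or CDF do it internally. Event, Session, and Sessionize are hypothetical names.

```scala
// An event with a key and the time it happened (event time, in seconds).
// Arrival order is simply the order of the input Seq.
final case class Event(key: String, eventTime: Long)

// A run of events for one key with no gap larger than the limit.
final case class Session(key: String, start: Long, end: Long, count: Int)

object Sessionize {
  // Group events by key, sort each group by event time, and start a new
  // session whenever more than maxGapSeconds separates neighboring events.
  def sessions(events: Seq[Event], maxGapSeconds: Long): Map[String, List[Session]] =
    events.groupBy(_.key).map { case (key, evs) =>
      val sortedTimes = evs.map(_.eventTime).sorted
      val built = sortedTimes.foldLeft(List.empty[Session]) {
        // Within the gap: extend the most recent session.
        case (current :: done, t) if t - current.end <= maxGapSeconds =>
          current.copy(end = t, count = current.count + 1) :: done
        // First event for the key, or gap exceeded: open a new session.
        case (acc, t) =>
          Session(key, t, t, 1) :: acc
      }
      key -> built.reverse
    }
}
```

Because the function sorts by eventTime, an event that shows up late in the input still lands in the session it belongs to. A streaming engine has to get the same answer without seeing all of the input up front, which is where per-key state and the late-data semantics come in.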

In terms of what CDF means for Crunch: in the short term, I don't think it
means anything yet. There are still lots of MapReduce-based data pipelines
in the world that should be written in Crunch instead, esp. as people start
moving to the next generation of execution engines like Spark. Spark isn't
quite to the point where it can reliably handle multi-terabyte data
pipelines the way MapReduce can, but it's clearly getting better quickly,
and the volume and complexity it can handle is going up. I think that
Crunch is the best way to help developers make the transition from MR to
Spark as smoothly as possible.

At some point, we could consider making the current version of Crunch 1.0,
and working on a unified batch/streaming execution engine along the lines
of the CDF API as Crunch 2.0. I believe that the unified batch/streaming
pipeline model is the next really interesting challenge in pipelines, and
the Crunch community would be my favorite place to work on it.

Thanks for putting up with the long email,
Josh
