crunch-dev mailing list archives

From Josh Wills
Subject using Avro for all shuffles + MSCR fusion
Date Mon, 18 Jun 2012 17:13:15 GMT
Hey folks,

I am meeting w/Robert this afternoon to discuss this further, but I
thought it would be good to get these thoughts out on the list. I'd
like to propose that we switch Crunch to always use Avro serialization
for the shuffle, for two primary reasons:

1) It will make MSCR fusion significantly easier to implement. If I
have 2 JobPrototypes that are going over the same input but are
grouping the data in different ways, say as (K1, V1), (K2, V2), etc.,
I can perform MSCR fusion easily in Avro by having them all write out
KEY = Pair(Integer, Union(K1, K2, ...)), VALUE = Union(V1, V2, ...),
where the Integer in the Pair indicates which MR job each record
belongs to, and we can handle the data extraction and routing inside
of Crunch. Doing the same trick with Writables would mean having a
BytesWritable act as the Union in the expression.
2) As per Stefan's emails, I believe it will significantly improve the
performance of all MR jobs. I realized that a major reason that
Avro-based MapReduce jobs significantly outperform Writable-based ones
is that we always use the Avro-specific comparator for sorting the
keys, which does not require any data deserialization during the
shuffle.
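
To make the tagging in (1) concrete, here is a minimal stand-alone sketch
in plain Java. It does not use Avro or any Crunch API: TaggedKey, Shuffle,
and run are hypothetical names made up for illustration, a TreeMap stands
in for the shuffle, and Object stands in for the Avro Union. It only shows
the routing idea, assuming two groupings over one shared input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FusedShuffleSketch {

    /** Hypothetical shuffle key: Pair(jobIndex, key), where key plays the role of Union(K1, K2, ...). */
    static final class TaggedKey implements Comparable<TaggedKey> {
        final int jobIndex;
        final Object key;

        TaggedKey(int jobIndex, Object key) {
            this.jobIndex = jobIndex;
            this.key = key;
        }

        @Override
        @SuppressWarnings("unchecked")
        public int compareTo(TaggedKey other) {
            // Sort by job index first so each job's keys stay contiguous,
            // then by the job's own key (keys within one job share a type).
            int byJob = Integer.compare(jobIndex, other.jobIndex);
            return byJob != 0 ? byJob : ((Comparable<Object>) key).compareTo(other.key);
        }
    }

    /** Stand-in for the shuffle: groups values by tagged key, in sorted key order. */
    static final class Shuffle {
        private final TreeMap<TaggedKey, List<Object>> groups = new TreeMap<>();

        void emit(int jobIndex, Object key, Object value) {
            groups.computeIfAbsent(new TaggedKey(jobIndex, key), k -> new ArrayList<>()).add(value);
        }

        /** Reduce side: return only the groups for one job, with the tag stripped off. */
        Map<Object, List<Object>> groupsFor(int jobIndex) {
            Map<Object, List<Object>> out = new LinkedHashMap<>();
            for (Map.Entry<TaggedKey, List<Object>> e : groups.entrySet()) {
                if (e.getKey().jobIndex == jobIndex) {
                    out.put(e.getKey().key, e.getValue());
                }
            }
            return out;
        }
    }

    /** One pass over a shared input feeding two groupings (words are assumed non-empty). */
    static Shuffle run(List<String> words) {
        Shuffle shuffle = new Shuffle();
        for (String w : words) {
            shuffle.emit(0, w.length(), w);         // job 0: (Integer, String), group words by length
            shuffle.emit(1, w.substring(0, 1), 1);  // job 1: (String, Integer), count by first letter
        }
        return shuffle;
    }

    public static void main(String[] args) {
        Shuffle s = run(Arrays.asList("avro", "crunch", "apple"));
        System.out.println(s.groupsFor(0)); // prints {4=[avro], 5=[apple], 6=[crunch]}
        System.out.println(s.groupsFor(1)); // prints {a=[1, 1], c=[1]}
    }
}
```

Both groupings travel through a single shuffle here, and the integer tag
is all the reduce side needs to route each group back to the right job.
In the real proposal the key and value unions would be Avro schemas, so
the sort could also use Avro's comparator rather than compareTo.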

I added support for using Avro with arbitrary Writable types in this commit:

so I think that we can do this in a way that won't impact anyone who
was using a custom Writable already, although we should of course add
a field to GroupingOptions that disables this conversion (as well as
MSCR fusion) if the developer so desires.

I think that all of the core Crunch developers are Avro advocates, but
I wanted to open this up for discussion in case I got that wrong.


Director of Data Science
Twitter: @josh_wills
