crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: Crunch performance & cluster configuration
Date Tue, 11 Aug 2015 02:06:27 GMT
Hey Everett,

My two cents (keeping in mind that I write very few incredibly complex
pipelines) is that the best way to improve MR performance is to generate as
many answers as you can per shuffle operation-- i.e., from a performance
perspective, you're better off cogrouping N data sources that need to be
joined on the same key and processing all of their child joins as part of a
single MR job than you are doing pairwise joins between the individual
sources and running (N choose 2) MR jobs independently of one another.
Whenever I'm confronted w/a bunch of outputs I need to generate, all of my
design/data modeling thought goes into creating them in as few jobs as

With that aside, it's probably the case that if spending all of your time
serializing/deserializing data or doing IO, a bit of profiling and managing
things like calls to o.a.h.conf.Configuration.get(), the type of
intermediate data serialization you're doing, and making sure that you're
using RawComparator implementation everywhere you can. One of the reasons
we try to push Avro everywhere we can in Crunch is the great out-of-the-box
support for doing shuffles w/o having to deserialize keys. Twitter gave a
great preso recently about their experiences on top of Scalding/Cascading
that should also be helpful for Crunch:


On Mon, Aug 10, 2015 at 1:49 PM, Everett Anderson <> wrote:

> Hi,
> We've written a large processing pipeline in Crunch, which has been great
> because it's testable and the code is rather clear.
> When using the MapReduce runner, we end up with around 350 executed MR
> applications for one month of input data. We're doing a lot of joins, so we
> expect many applications.
> I'm trying to figure out our strategy and cluster configurations for
> scaling to more data on AWS EMR.
> We've set our bytes per reduce target low enough that we usually have more
> Map and Reduce tasks than machines, but not by much, and no given shard or
> application seems to be a long pole.
> I've noticed that
> 1) Most individual Map or Reduce jobs are short-lived, commonly 1-2
> minutes with our one month input data set.
> 2) Adding EMR Task instances (which don't participate in HDFS so must
> send/receive everything over the network) does not help us scale -- their
> CPU utilization is terrible.
> 3) Adding Core instances does seem to help reduce runtime, though their
> CPU utilization starts going down.
> This makes me suspect that our main bottleneck will be in either disk or
> network I/O in shuffles.
> Does anyone have pointers for evaluating or tweaking performance in a
> many-MR application Crunch pipeline like this? Given Crunch makes it so
> easy to write these, I suspect others would hit the same issues.
> Would switching from MapReduce to Spark likely be a big win? My uninformed
> impression is that Spark might require fewer disk operations, though I
> don't see how it could avoid more cross-machine shuffles given our joins.
> Thanks,
> Everett
> *DISCLAIMER:* The contents of this email, including any attachments, may
> contain information that is confidential, proprietary in nature, protected
> health information (PHI), or otherwise protected by law from disclosure,
> and is solely for the use of the intended recipient(s). If you are not the
> intended recipient, you are hereby notified that any use, disclosure or
> copying of this email, including any attachments, is unauthorized and
> strictly prohibited. If you have received this email in error, please
> notify the sender of this email. Please delete this and all copies of this
> email from your system. Any opinions either expressed or implied in this
> email and all attachments, are those of its author only, and do not
> necessarily reflect those of Nuna Health, Inc.

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>

View raw message