crunch-user mailing list archives

From Everett Anderson
Subject Crunch performance & cluster configuration
Date Mon, 10 Aug 2015 20:49:34 GMT

We've written a large processing pipeline in Crunch, which has been great
because it's testable and the code is rather clear.

When using the MapReduce runner, we end up with around 350 executed MR
applications for one month of input data. We're doing a lot of joins, so we
expect many applications.

I'm trying to figure out our strategy and cluster configurations for
scaling to more data on AWS EMR.

We've set our bytes per reduce target low enough that we usually have more
Map and Reduce tasks than machines, but not by much, and no given shard or
application seems to be a long pole.
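
A rough sketch of the sizing arithmetic behind that setting, assuming the
reduce count is approximately the shuffle input size divided by the bytes per
reduce target (Crunch's planner may compute it differently). The function
names and all numbers here are hypothetical, chosen only for illustration:

```python
def estimated_reducers(shuffle_bytes: int, bytes_per_reduce: int) -> int:
    """Estimate reduce tasks as ceil(shuffle_bytes / bytes_per_reduce), minimum 1.

    Assumption: one reducer per `bytes_per_reduce` bytes of shuffle input.
    """
    if bytes_per_reduce <= 0:
        raise ValueError("bytes_per_reduce must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-shuffle_bytes // bytes_per_reduce))


def tasks_per_machine(num_tasks: int, num_machines: int) -> float:
    """Average number of tasks each machine would have to run."""
    return num_tasks / num_machines


# Hypothetical example: 40 GiB of shuffle input, 256 MiB target, 100 machines.
reducers = estimated_reducers(40 * 1024**3, 256 * 1024**2)
ratio = tasks_per_machine(reducers, 100)
print(f"{reducers} reducers, {ratio:.1f} tasks per machine")
```

A ratio only slightly above 1 corresponds to the situation described above:
more tasks than machines, but not by much.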

I've noticed that

1) Most individual Map or Reduce jobs are short-lived, commonly 1-2 minutes
with our one month input data set.

2) Adding EMR Task instances (which don't participate in HDFS so must
send/receive everything over the network) does not help us scale -- their
CPU utilization is terrible.

3) Adding Core instances does seem to help reduce runtime, though their CPU
utilization starts going down.

This makes me suspect that our main bottleneck will be in either disk or
network I/O in shuffles.

Does anyone have pointers for evaluating or tuning performance in a Crunch
pipeline that runs this many MR applications? Given that Crunch makes it so
easy to write these, I suspect others will hit the same issues.

Would switching from MapReduce to Spark likely be a big win? My uninformed
impression is that Spark might require fewer disk operations, though I
don't see how it could avoid more cross-machine shuffles given our joins.

