hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-5830) Reuse output collectors across maps running on the same jvm
Date Thu, 14 May 2009 07:18:45 GMT
Reuse output collectors across maps running on the same jvm
-----------------------------------------------------------

                 Key: HADOOP-5830
                 URL: https://issues.apache.org/jira/browse/HADOOP-5830
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Arun C Murthy


We have evidence that cutting the shuffle-crossbar between maps and reduces (m * r) leads
to perfomant applications since:
# It cuts down the number of connections necessary to shuffle and hence reduces load on the
serving-side (TaskTracker) and improves latency (terasort, HADOOP-1338, HADOOP-5223)
# Reduces seeks required for the TaskTracker to serve the map-outputs

So far we've had to manually tune applications to cut down the shuffle- crossbar by having
fatter maps with custom input formats etc. For e.g. we saw a significant improvement while
running the petasort when we went from ~800,000 maps to 80,00 maps (1.5G to 15G per map) i.e.
from 48+ hours to 16 hours,  

The downsides are:
# The burden falls on the application-writer to tune this with custom input-formats etc.
# The naive method of using a higher min.split.size leads to considerable non-local i/o on
the maps.

Given these, the proposal is to keep the 'output collector' open across jvm reuse for maps,
there-by enabling 'combiners' across map-tasks. This would have the happy-effect of fixing
both the above. The downsides are that it will add latency to jobs (since map-outputs cannot
be shuffled till a few maps on the same jvm are done, then followed by a final sort/merge/combine)
and the failure cases get a bit more complicated.

Thoughts? Lets discuss...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message