hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-910) Reduces can do merges for the on-disk map output files in parallel with their copying
Date Fri, 15 Feb 2008 06:15:08 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569185#action_12569185 ]

Amar Kamat commented on HADOOP-910:

I think the performance degradation with {{mapred.tasktracker.map/reduce.tasks.maximum = 4}}
is transient. I ran a job with the same settings mentioned earlier, but with {{mapred.tasktracker.map/reduce.tasks.maximum
= 4}}, and the results are as follows:
||#||trunk + patch||trunk||
|1|1 hr 2 min|1 hr 4 min|
|2|1 hr 2 min|1 hr 5 min|
Although the dataset was different, the timings are pretty good.
Just to be sure that the patch doesn't degrade the performance, I ran a job with the following
config:
Number of nodes : 200
Java heap size : 1024mb
io.sort.factor : 100
The results are as follows.
With {{mapred.tasktracker.map/reduce.tasks.maximum=2}}
||#||trunk + patch||trunk||
|1|1 hr 6 min|1 hr 7 min|
|2|1 hr 11 min|1 hr 8 min|
|3|1 hr 7 min|1 hr 9 min|
With {{mapred.tasktracker.map/reduce.tasks.maximum=4}}
||#||trunk + patch||trunk||
|1|1 hr 14 min|1 hr 16 min|
|2|1 hr 16 min|1 hr 16 min|
|3|1 hr 18 min|1 hr 17 min|
This is expected, because an on-disk merge rarely occurs with {{io.sort.factor = 100}}, i.e. the
performance is not degraded when there are no (or very few) on-disk merges.
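A rough way to see why: a reduce only accumulates an intermediate on-disk merge once {{io.sort.factor}} segments have piled up on disk, so with a factor of 100 and fewer fetched segments than that, no intermediate merge ever fires. A minimal sketch of this counting argument (the trigger policy here is a simplification for illustration, not the actual Hadoop code):

```java
// Hypothetical sketch: count how many intermediate on-disk merges a reduce
// would perform, assuming a merge of ioSortFactor segments fires as soon as
// that many segments have accumulated on disk (a simplified policy).
final class MergeEstimator {
    static int intermediateMerges(int numMapOutputs, int ioSortFactor) {
        int onDisk = 0, merges = 0;
        for (int i = 0; i < numMapOutputs; i++) {
            onDisk++;                              // one more segment copied to disk
            if (onDisk >= ioSortFactor) {
                // merge ioSortFactor segments into one bigger segment
                onDisk = onDisk - ioSortFactor + 1;
                merges++;
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        // 80 map outputs per reduce, io.sort.factor = 100: no on-disk merge fires
        System.out.println(intermediateMerges(80, 100)); // 0
        // same job at io.sort.factor = 10: several intermediate merges
        System.out.println(intermediateMerges(80, 10));  // 8
    }
}
```

With a large enough {{io.sort.factor}} the patch's merge path is simply never exercised, which is why the two columns in the tables above stay so close.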
Hardware configuration:
Processor: 4x HT 1.8GHz Intel Xeon
RAM: 8GB
Disk: 4 disks of 250GB each

> Reduces can do merges for the on-disk map output files in parallel with their copying
> -------------------------------------------------------------------------------------
>                 Key: HADOOP-910
>                 URL: https://issues.apache.org/jira/browse/HADOOP-910
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>         Attachments: HADOOP-910-review.patch
> Proposal to extend the parallel in-memory-merge/copying, that is being done as part of
HADOOP-830, to the on-disk files.
> Today, the Reduces dump the map output files to disk and the final merge happens only
after all the map outputs have been collected. It might make sense to parallelize this part.
That is, whenever a Reduce has collected io.sort.factor number of segments on disk, it initiates
a merge of those and creates one big segment. If the rate of copying is faster than the merge,
we can probably have multiple threads doing parallel merges of independent sets of io.sort.factor
number of segments. If the rate of copying is not as fast as merge, we stand to gain a lot
- at the end of copying of all the map outputs, we will be left with a small number of segments
for the final merge (which hopefully will feed the reduce directly (via the RawKeyValueIterator)
without having to hit the disk for writing additional output segments).
> If the disk bandwidth is higher than the network bandwidth, we have a good story, I guess,
to do such a thing.
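The proposal quoted above can be sketched as a background merger thread that folds every {{io.sort.factor}} on-disk segments into one bigger segment while the copier keeps fetching map outputs. This is a hypothetical simplification of the idea, not the attached patch; the queue, segment names, and sentinel are all illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the proposal, not the attached patch: a copier feeds
// fetched map outputs into a queue, and a background merger thread folds every
// ioSortFactor on-disk segments into one, in parallel with the copying.
final class OnDiskMerger {
    static final String DONE = "__copying-done__"; // illustrative sentinel

    // Returns how many segments the final merge would still have to handle.
    static int simulate(int numMapOutputs, int ioSortFactor) throws InterruptedException {
        BlockingQueue<String> copied = new LinkedBlockingQueue<>();
        List<String> onDisk = new ArrayList<>(); // mutated only by the merger thread

        Thread merger = new Thread(() -> {
            try {
                while (true) {
                    String seg = copied.take();
                    if (seg.equals(DONE)) return;
                    onDisk.add(seg);
                    if (onDisk.size() >= ioSortFactor) {
                        // merge ioSortFactor segments into one big segment
                        onDisk.subList(0, ioSortFactor).clear();
                        onDisk.add("merged-segment");
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.start();

        for (int i = 0; i < numMapOutputs; i++) {
            copied.put("map-output-" + i);         // the copier fetching map outputs
        }
        copied.put(DONE);
        merger.join();                             // join() makes reading onDisk safe
        return onDisk.size();
    }

    public static void main(String[] args) throws InterruptedException {
        // 95 map outputs at io.sort.factor = 10: the final merge sees only a
        // handful of segments instead of all 95.
        System.out.println("segments left for final merge: " + simulate(95, 10));
    }
}
```

The payoff matches the description: by the time copying finishes, only a few (possibly re-merged) segments remain, so the final merge can feed the reduce directly via the {{RawKeyValueIterator}} without writing further intermediate output.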

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
