hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dick King (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-363) When combiners exist, postpone mappers' spills of map output to disk until combiners are unsuccessful.
Date Sat, 15 Jul 2006 17:01:14 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-363?page=comments#action_12421313 ] 
Dick King commented on HADOOP-363:

I have to leave soon, but I'll write a quick comment.

By including a combiner at all you're saying that the cost of one extra deserialization and
serialization [to do a little combine internally rather than a massive one in a reducer] is
cheaper than the cost of shuffling an extra datum in the big shuffle.  Note that as the output
buffer fills with stable items the benefits of an early combine decreases but so do the costs.
 Only the costs of the comparisons remains high as the buffer fills.

Having said that, the point of some records' values getting large as many values get folded
in under one key is salient. However, this problem isn't necxessarily mitigated by changing
the buffer residue trigger point.  Still, I suppose I can support making the test relatively
modest simply because it's a conservative thing to do, but this should be a configuration
for those who have different ideas for a specific job.


> When combiners exist, postpone mappers' spills of map output to disk until combiners
are unsuccessful.
> ------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-363
>                 URL: http://issues.apache.org/jira/browse/HADOOP-363
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Dick King
> When a map/reduce job is set up with a combiner, the mapper tasks each build up an in-heap
collection of 100K key/value pairs -- and then apply the combiner to reduce that to whatever
it becomes by applying the combiner to sets with like keys before spilling to disk to send
it to the reducers.
> Typically running the combiner consumes a lot less resources than shipping the data,
especially since the data end up in a reducer where probably the same code will be run anyway.
> I would like to see this changed so that when the combiner shrinks the 100K key/value
pairs to less than, say, 90K, we just keep running the mapper and combiner alternately until
we get enough distinct keys to make this unlikely to be worthwhile [or until we run out of
input, of course].
> This has two costs: the whole internal buffer has to be re-sorted so we can apply the
combiner even though as few as 10K new elements have been added, and in some cases we'll call
the combiner on many singletons.  
> The first of these costs can be avoided by doing a mini-sort in the new pairs section
and doing a merge to develop the combiner sets and the new sorted retained elements section.
> The second of these costs can be avoided by detecting what would otherwise be singleton
combiner calls and not making them, which is a good idea in itself even if we don't decide
to do this reform.
> The two techniques combine well; recycled elements of the buffer need not be combined
if there's no new element with the same key.
> -dk

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message