hadoop-common-dev mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-363) When combiners exist, postpone mappers' spills of map output to disk until combiners are unsuccessful.
Date Mon, 17 Jul 2006 18:11:34 GMT
I'd be interested in actual data showing that this matters.

On Jul 15, 2006, at 10:01 AM, Dick King (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-363?page=comments#action_12421313 ]
> Dick King commented on HADOOP-363:
> ----------------------------------
> I have to leave soon, but I'll write a quick comment.
> By including a combiner at all you're saying that the cost of one
> extra deserialization and serialization [to do a little combine
> internally rather than a massive one in a reducer] is cheaper than
> the cost of shuffling an extra datum in the big shuffle.  Note that
> as the output buffer fills with stable items, the benefits of an
> early combine decrease, but so do the costs.  Only the cost of the
> comparisons remains high as the buffer fills.
> Having said that, the point that some records' values grow large
> as many values get folded in under one key is salient.  However,
> this problem isn't necessarily mitigated by changing the buffer
> residue trigger point.  Still, I suppose I can support making the
> test relatively modest simply because it's a conservative thing to
> do, but this should be configurable for those who have different
> ideas for a specific job.
> -dk
>> When combiners exist, postpone mappers' spills of map output to  
>> disk until combiners are unsuccessful.
>> ------------------------------------------------------------------------
>>                 Key: HADOOP-363
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-363
>>             Project: Hadoop
>>          Issue Type: Improvement
>>          Components: mapred
>>            Reporter: Dick King
>> When a map/reduce job is set up with a combiner, each mapper task
>> builds up an in-heap collection of 100K key/value pairs, then
>> applies the combiner to sets of pairs with like keys to reduce the
>> collection to whatever it becomes, before spilling to disk to send
>> it to the reducers.
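>> As a rough sketch of that loop [all names here are invented for
>> illustration; this is not the actual mapred buffer code]:
>>
>>   import java.util.*;
>>
>>   class CurrentSpillSketch {
>>     static final int BUFFER_LIMIT = 100000;       // ~100K key/value pairs
>>     private final List<Map.Entry<String, Long>> buffer = new ArrayList<>();
>>
>>     void collect(String key, long value) {
>>       buffer.add(new AbstractMap.SimpleEntry<>(key, value));
>>       if (buffer.size() >= BUFFER_LIMIT) {
>>         buffer.sort(Map.Entry.comparingByKey());  // equal keys become adjacent
>>         combineRuns(buffer);                      // one combiner call per key
>>         spillToDisk(buffer);                      // ship the combined pairs
>>         buffer.clear();
>>       }
>>     }
>>
>>     static void combineRuns(List<Map.Entry<String, Long>> sorted) {
>>       // fold each run of equal keys with the job's combiner
>>     }
>>
>>     static void spillToDisk(List<Map.Entry<String, Long>> combined) {
>>       // write a spill file for the reducers to fetch
>>     }
>>   }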
>> Typically, running the combiner consumes far fewer resources than
>> shipping the data, especially since the data end up in a reducer
>> where the same code will probably be run anyway.
>> I would like to see this changed so that when the combiner shrinks
>> the 100K key/value pairs to fewer than, say, 90K, we just keep
>> running the mapper and combiner alternately until we accumulate
>> enough distinct keys to make this unlikely to be worthwhile [or
>> until we run out of input, of course].
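>> In sketch form, the changed spill decision might look like this
>> [the threshold value and all names are invented; the real knob
>> would come from the job configuration]:
>>
>>   class AdaptiveSpillSketch {
>>     static final double RETAIN_THRESHOLD = 0.90;  // per-job configurable
>>
>>     // After a sort-and-combine pass, decide whether to keep mapping
>>     // into the same buffer or to spill now.
>>     static boolean keepBuffering(int sizeBeforeCombine, int sizeAfterCombine) {
>>       // If the combiner folded away enough pairs, stay in memory;
>>       // otherwise the keys are mostly distinct and spilling wins.
>>       return sizeAfterCombine < RETAIN_THRESHOLD * sizeBeforeCombine;
>>     }
>>   }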
>> This has two costs: the whole internal buffer has to be re-sorted  
>> so we can apply the combiner even though as few as 10K new  
>> elements have been added, and in some cases we'll call the  
>> combiner on many singletons.
>> The first of these costs can be avoided by doing a mini-sort on
>> the new pairs section alone and then merging it with the already-
>> sorted retained section to develop the combiner sets and the new
>> sorted retained section.
>> The second of these costs can be avoided by detecting what would  
>> otherwise be singleton combiner calls and not making them, which  
>> is a good idea in itself even if we don't decide to do this reform.
>> The two techniques combine well; recycled elements of the buffer  
>> need not be combined if there's no new element with the same key.
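>> A rough sketch of the two techniques together [names invented, with
>> a toy combiner that sums long values standing in for the user's
>> combiner]:
>>
>>   import java.util.*;
>>
>>   class MergeCombineSketch {
>>     // Stand-in for the user's combiner: fold one key's values into one pair.
>>     static Map.Entry<String, Long> combine(String key, List<Long> values) {
>>       long sum = 0;
>>       for (long v : values) sum += v;
>>       return new AbstractMap.SimpleEntry<>(key, sum);
>>     }
>>
>>     // Linear merge of the already-sorted retained section with the
>>     // mini-sorted new section; the combiner runs only on true groups.
>>     static List<Map.Entry<String, Long>> mergeAndCombine(
>>         List<Map.Entry<String, Long>> retained,
>>         List<Map.Entry<String, Long>> fresh) {
>>       List<Map.Entry<String, Long>> out = new ArrayList<>();
>>       int i = 0, j = 0;
>>       while (i < retained.size() || j < fresh.size()) {
>>         // Take the smaller head key, then gather every pair bearing it.
>>         String key;
>>         if (j >= fresh.size() || (i < retained.size()
>>             && retained.get(i).getKey().compareTo(fresh.get(j).getKey()) <= 0)) {
>>           key = retained.get(i).getKey();
>>         } else {
>>           key = fresh.get(j).getKey();
>>         }
>>         List<Long> values = new ArrayList<>();
>>         while (i < retained.size() && retained.get(i).getKey().equals(key))
>>           values.add(retained.get(i++).getValue());
>>         while (j < fresh.size() && fresh.get(j).getKey().equals(key))
>>           values.add(fresh.get(j++).getValue());
>>         if (values.size() == 1)
>>           out.add(new AbstractMap.SimpleEntry<>(key, values.get(0)));  // singleton: skip the combiner
>>         else
>>           out.add(combine(key, values));
>>       }
>>       return out;
>>     }
>>   }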
>> -dk
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the
> administrators: http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
