hadoop-common-dev mailing list archives

From "Runping Qi" <runp...@yahoo-inc.com>
Subject RE: [jira] Created: (HADOOP-363) When combiners exist, postpone mappers' spills of map output to disk until combiners are unsuccessful.
Date Fri, 14 Jul 2006 17:57:37 GMT

That is a good idea. There may be an even simpler improvement.
Basically, in the flush method of the CombiningCollector class, you can
selectively decide to run reduce for some keys and not for others. A
selection criterion can be based on the sizes of the value arrays.
The goal is to reduce the number of values in the hash table to a certain
fraction (say 30%) of the 100K limit while calling the reduce function for
as few keys as possible. One way to achieve this is to first scan the table
to build a histogram of the value-array sizes. Then determine the largest
size threshold such that the total number of values across keys whose
value-array sizes are at or below the threshold stays under a given budget.
We then apply reduce to all the keys whose value-array sizes are greater
than the threshold and keep the rest in the table.
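As a rough illustration of that threshold selection (a hypothetical helper, not actual Hadoop code), assuming the histogram maps value-array size to the count of keys having that size:

```java
import java.util.Map;
import java.util.TreeMap;

public class ThresholdPicker {
    /**
     * Illustrative sketch only. Given a histogram mapping value-array size
     * to the number of keys with that size, return the largest threshold T
     * such that keys whose value arrays have size <= T hold at most
     * maxRetainedValues values in total. Keys at or below T are retained in
     * the table; keys above T get the reduce/combine call.
     */
    static int pickThreshold(TreeMap<Integer, Integer> histogram,
                             long maxRetainedValues) {
        long retained = 0;
        int threshold = 0;
        for (Map.Entry<Integer, Integer> e : histogram.entrySet()) {
            long values = (long) e.getKey() * e.getValue();
            if (retained + values > maxRetainedValues) {
                break;  // admitting this size class would exceed the budget
            }
            retained += values;
            threshold = e.getKey();
        }
        return threshold;
    }

    public static void main(String[] args) {
        // 50 keys with 1 value, 20 keys with 2 values, 5 keys with 100 values
        TreeMap<Integer, Integer> hist = new TreeMap<>();
        hist.put(1, 50);
        hist.put(2, 20);
        hist.put(100, 5);
        // Budget of 100 retained values: sizes 1 (50 values) and 2 (40 values)
        // fit, size 100 (500 values) does not, so the threshold is 2.
        System.out.println(pickThreshold(hist, 100));  // prints 2
    }
}
```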

Runping


> -----Original Message-----
> From: Dick King (JIRA) [mailto:jira@apache.org]
> Sent: Friday, July 14, 2006 10:15 AM
> To: hadoop-dev@lucene.apache.org
> Subject: [jira] Created: (HADOOP-363) When combiners exist, postpone
> mappers' spills of map output to disk until combiners are unsuccessful.
> 
> When combiners exist, postpone mappers' spills of map output to disk until
> combiners are unsuccessful.
> ------------------------------------------------------------------------------------------------------
> 
>                  Key: HADOOP-363
>                  URL: http://issues.apache.org/jira/browse/HADOOP-363
>              Project: Hadoop
>           Issue Type: Improvement
>           Components: mapred
>             Reporter: Dick King
> 
> 
> When a map/reduce job is set up with a combiner, each mapper task builds
> up an in-heap collection of 100K key/value pairs -- and then applies the
> combiner to sets of pairs with like keys to shrink the collection before
> spilling it to disk for shipment to the reducers.
> 
> Typically, running the combiner consumes far fewer resources than shipping
> the data, especially since the data end up in a reducer where the same
> code will probably be run anyway.
> 
> I would like to see this changed so that when the combiner shrinks the
> 100K key/value pairs to less than, say, 90K, we just keep running the
> mapper and combiner alternately until we get enough distinct keys to make
> this unlikely to be worthwhile [or until we run out of input, of course].
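[The control flow proposed here could be sketched as a spill policy; the 100K limit and the 90K shrink target come from the text above, but the names and structure are illustrative, not the actual Hadoop mapper code:]

```java
public class SpillPolicy {
    static final int BUFFER_LIMIT = 100_000;   // in-heap key/value pairs (from the issue)
    static final double SHRINK_GOAL = 0.90;    // "less than, say, 90K" after combining

    /**
     * Illustrative sketch. After a combine pass leaves pairsLeft pairs in
     * the buffer, decide whether to keep mapping (the buffer shrank enough)
     * or to spill to disk (mostly distinct keys, so combining again is
     * unlikely to pay off).
     */
    static boolean keepMapping(int pairsLeft) {
        return pairsLeft < (int) (BUFFER_LIMIT * SHRINK_GOAL);
    }

    public static void main(String[] args) {
        System.out.println(keepMapping(40_000));  // heavy duplication: true, keep mapping
        System.out.println(keepMapping(95_000));  // mostly distinct keys: false, spill
    }
}
```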
> 
> This has two costs: the whole internal buffer has to be re-sorted so we
> can apply the combiner even though as few as 10K new elements have been
> added, and in some cases we'll call the combiner on many singletons.
> 
> The first of these costs can be avoided by doing a mini-sort in the new
> pairs section and doing a merge to develop the combiner sets and the new
> sorted retained elements section.
> 
> The second of these costs can be avoided by detecting what would otherwise
> be singleton combiner calls and not making them, which is a good idea in
> itself even if we don't decide to do this reform.
> 
> The two techniques combine well; recycled elements of the buffer need not
> be combined if there's no new element with the same key.
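[The merge described above might look roughly like this sketch, using integer keys for simplicity (real map output keys are Hadoop WritableComparables, and all names here are hypothetical): mini-sort the new section, merge it with the already-sorted retained section, and note the combiner is needed only for keys occurring more than once.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeCombineSketch {
    /**
     * Illustrative sketch. Merge a sorted retained section with newly added
     * keys (mini-sorted here), returning the keys for which a combiner call
     * is actually needed, i.e. keys occurring more than once in the merged
     * sequence. Singletons skip the combiner entirely.
     */
    static List<Integer> keysNeedingCombine(int[] retainedSorted, int[] newKeys) {
        int[] fresh = newKeys.clone();
        Arrays.sort(fresh);                       // the "mini-sort" of the new pairs section
        List<Integer> needCombine = new ArrayList<>();
        int i = 0, j = 0;
        Integer prev = null;
        boolean dup = false;
        while (i < retainedSorted.length || j < fresh.length) {
            // Standard two-way merge: take the smaller head of the two runs.
            int k = (j >= fresh.length
                     || (i < retainedSorted.length && retainedSorted[i] <= fresh[j]))
                    ? retainedSorted[i++] : fresh[j++];
            if (prev != null && prev == k) {
                dup = true;                       // key repeats: combiner will be needed
            } else {
                if (prev != null && dup) needCombine.add(prev);
                prev = k;
                dup = false;
            }
        }
        if (prev != null && dup) needCombine.add(prev);
        return needCombine;
    }

    public static void main(String[] args) {
        int[] retained = {1, 3, 5, 5};
        int[] added = {2, 3, 3, 7};
        // 3 appears in both sections and 5 repeats within the retained
        // section; 1, 2, and 7 are singletons and skip the combiner.
        System.out.println(keysNeedingCombine(retained, added));  // prints [3, 5]
    }
}
```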
> 
> -dk
> 
> 
> 


