hadoop-common-user mailing list archives

From "W.P. McNeill" <bill...@gmail.com>
Subject Re: Reducer granularity and starvation
Date Thu, 19 May 2011 01:04:35 GMT
Here's a consequence I see of having the Values be much larger than the
Keys: there's not much point in my adding a combiner.

My mapper emits pairs of the form:

<Key, Value>

where the size of a Value is much greater than the size of a Key.  The
reducer then processes input of the form:

<Key, Iterator<Value>>

The reducer then looks at the set of Values corresponding to a Key and
separates them into two bins.  I don't think this is particularly
CPU-intensive; however, the reducer needs access to the entire set of
Values.  The set can't be boiled down into some smaller sufficient statistic
the way, say, in a word count program we can combine the counts for a word
from different documents into a single number.  As a result, the only
combiner strategy I can see is to have the mapper emit a Value as a single
item list:

<Key, [Value]>

Have a combiner combine the lists:

<Key, [Value, Value...]>

and then the reducer would work on lists of lists:

<Key, Iterator<[Value, Value...]>>

This would save on redundant Key I/O, but since Values are so much bigger
than Keys, I don't think the savings would matter.
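To make the list-wrapping concrete, here is a minimal sketch in plain Java
(no Hadoop API; the class and method names are mine, not anything from the
framework). The mapper side wraps each Value in a one-element list, the
combiner concatenates the lists for a Key, and the reducer flattens them
back before binning:

```java
import java.util.*;

// Hypothetical sketch of the singleton-list combiner scheme; not Hadoop code.
public class SingletonListCombiner {

    // Mapper side: emit each Value wrapped as a one-element list.
    static <V> List<V> mapEmit(V value) {
        return Collections.singletonList(value);
    }

    // Combiner: concatenate all lists seen for one Key into a single list.
    // This is the only work the combiner can do, since the Values themselves
    // can't be reduced to a smaller sufficient statistic.
    static <V> List<V> combine(Iterable<List<V>> lists) {
        List<V> merged = new ArrayList<>();
        for (List<V> list : lists) {
            merged.addAll(list);
        }
        return merged;
    }

    // Reducer: flatten the (possibly pre-combined) lists back into plain
    // Values; the reducer then separates these into its two bins.
    static <V> List<V> flatten(Iterable<List<V>> lists) {
        return combine(lists); // same concatenation, done reducer-side
    }
}
```

Note that in Hadoop a combiner's output types must match the mapper's output
types, which this wrapping satisfies. The only bytes the combiner removes are
the duplicate Keys in front of each singleton list, which is exactly why it
buys so little when the Values dominate the record size.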
