It sounds like your key distribution is being reflected in the size of your
reduce tasks, thus
making some of them take much longer than the rest.
There are three solutions to this:
a) downsample. Particularly for statistical computations, once you have
seen a thousand instances, you have seen them all. This is particularly
effective for self-join problems, where the reduce skew can be the input
skew squared. Downsampling can be done in a combiner. Remember to record
the number of records you drop so you can rescale your estimates later.
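In pure Python, the downsampling idea might be sketched like this (the function name, the cap of 1000, and the returned dropped-count are illustrative choices, not a Hadoop API):

```python
import random

# Sketch of a downsampling combiner: for each key, keep at most `cap`
# values via reservoir sampling, and report how many values were dropped
# so downstream statistics can be corrected for the sampling.
def downsample_combiner(key, values, cap=1000, rng=random.Random(0)):
    sample = []
    seen = 0
    for v in values:
        seen += 1
        if len(sample) < cap:
            sample.append(v)
        else:
            # Replace an existing element with probability cap/seen,
            # which keeps the sample uniform over everything seen so far.
            j = rng.randrange(seen)
            if j < cap:
                sample[j] = v
    dropped = seen - len(sample)
    return key, sample, dropped
```

The dropped count is what lets you scale counts or frequencies back up after reducing on the sample.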
b) combine. If you are counting or doing some other sort of associative
operation, then combiners will distribute your reduce load more widely.
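For the counting case, a combiner is just a map-side partial reduce. A minimal sketch (plain Python standing in for the Hadoop combiner interface):

```python
from collections import Counter

# Sketch of a word-count combiner: sum counts map-side so each reducer
# receives one partial sum per (map task, key) instead of one record
# per occurrence. This works because addition is associative.
def combiner(pairs):
    partial = Counter()
    for key, count in pairs:
        partial[key] += count
    return list(partial.items())
```

Any associative, commutative operation (sum, max, min) can be combined this way; averages need to be carried as (sum, count) pairs instead.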
c) subreduce. This is related to combining, but you have more control (and
less bandwidth reduction). Simply append a random number in the range
[1..n] to each hot key. This splits that key's load across n reduce
partitions. Reduce the pieces, then merge the partial results in a second
pass.
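The salted-key trick might look like this in miniature (two in-memory passes standing in for two MapReduce jobs; the function names are illustrative):

```python
import random
from collections import defaultdict

# First pass: append a random salt in [1..n] to each key, so a hot key
# is spread across up to n reduce partitions instead of one.
def first_pass(pairs, n, seed=0):
    rng = random.Random(seed)
    partial = defaultdict(int)
    for key, count in pairs:
        salted = (key, rng.randint(1, n))
        partial[salted] += count      # reduce on (key, salt)
    return partial

# Second pass: strip the salt and merge the n partial results per key.
def second_pass(partial):
    totals = defaultdict(int)
    for (key, _salt), count in partial.items():
        totals[key] += count
    return dict(totals)
```

This only works when the reduce operation can be applied to pieces and then merged, which is the same associativity requirement the combiner has.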
On Sat, Dec 11, 2010 at 3:38 AM, Rob Stewart <robstewart57@googlemail.com> wrote:
>  I know for a fact that my key distribution is quite radically skewed
> (some keys with *many* values, most keys with few).
>
