hadoop-common-user mailing list archives

From "Owen O'Malley" <...@yahoo-inc.com>
Subject Re: Sorting values of a key in reduce phase
Date Wed, 08 Aug 2007 22:23:02 GMT

On Aug 6, 2007, at 10:12 PM, novice user wrote:

> In reduce phase, with outputValueGroupingComparator, we can sort all keys
> and then group values of a particular key together and send it to reduce()
> method. Is there a way to sort values of a particular key efficiently before
> it reaches to reduce method?

There are two comparators that are used for sorting for precisely  
this purpose. In particular:

JobConf.getOutputKeyComparator()
JobConf.getOutputValueGroupingComparator()

The first controls the sort order, and the second controls which
keys are grouped into a single call to reduce.

Therefore, if your data has primary key K1 and secondary key K2:

class MyKey implements WritableComparable {
   K1 primary;
   K2 secondary;
   ...
}

you make the map output key MyKey, and the OutputKeyComparator uses
both primary and secondary to determine the order. The
OutputValueGroupingComparator would just compare the primary keys for
equality. So if your data looked like:

K1(1), K2(1), V1
K1(1), K2(2), V2
K1(2), K2(1), V3
K1(2), K2(2), V4

the records would be sorted as above, but the reduce would see two
calls: once with K1(1) and values V1 and V2, and once with K1(2) and
values V3 and V4.
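
To make the sort-then-group behaviour above concrete, here is a rough
plain-Java sketch. It does not use any Hadoop classes; it only imitates
what the framework does with the two comparators. The names
SecondarySortSketch, Rec, SORT, GROUP and reduceCalls are made up for
this illustration, and MyKey is simplified to two int fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: a stand-in for the framework's sort and grouping steps.
class SecondarySortSketch {

    // Hypothetical composite key: primary (K1) plus secondary (K2).
    static final class MyKey {
        final int primary;
        final int secondary;
        MyKey(int primary, int secondary) {
            this.primary = primary;
            this.secondary = secondary;
        }
    }

    // One map output record: composite key plus value.
    static final class Rec {
        final MyKey key;
        final String value;
        Rec(MyKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Plays the role of the OutputKeyComparator: primary first, then secondary.
    static final Comparator<MyKey> SORT =
        Comparator.comparingInt((MyKey k) -> k.primary)
                  .thenComparingInt(k -> k.secondary);

    // Plays the role of the OutputValueGroupingComparator: primary only.
    static final Comparator<MyKey> GROUP =
        Comparator.comparingInt((MyKey k) -> k.primary);

    // Sorts the records with SORT, then starts a new "reduce call" each time
    // GROUP reports that the key differs from the previous record's key.
    static List<List<String>> reduceCalls(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> SORT.compare(a.key, b.key));

        List<List<String>> calls = new ArrayList<>();
        MyKey previous = null;
        for (Rec r : sorted) {
            if (previous == null || GROUP.compare(previous, r.key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(r.value);
            previous = r.key;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Same four records as above, deliberately out of order.
        List<Rec> input = Arrays.asList(
            new Rec(new MyKey(2, 2), "V4"),
            new Rec(new MyKey(1, 2), "V2"),
            new Rec(new MyKey(2, 1), "V3"),
            new Rec(new MyKey(1, 1), "V1"));
        System.out.println(reduceCalls(input)); // prints [[V1, V2], [V3, V4]]
    }
}
```

Each inner list corresponds to one call to reduce, and the values inside
it arrive in secondary-key order because the sort comparator looked at
both fields while the grouping comparator looked only at the primary.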

-- Owen

PS. The OutputValueGroupingComparator is a bad name. It should be  
OutputKeyGroupingComparator or something.
