hadoop-common-user mailing list archives

From "Qiong Zhang" <jam...@yahoo-inc.com>
Subject RE: sort by value
Date Fri, 08 Feb 2008 00:38:49 GMT
Thank you all for the replies.

It looks like the class KeyFieldBasedPartitioner in
org.apache.hadoop.mapred.lib can be used in Hadoop Streaming to sort on
both the key (as a primary key) and the value (as a secondary key) without
duplicating data.

It would be useful to have the same functionality in the native Java API.
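
In the meantime, something along these lines seems workable with the native
API: put both fields in the map output key, sort on the full composite key,
but partition and group on the first field only.  The following is a rough,
untested sketch against the old org.apache.hadoop.mapred interfaces; the
class names and the tab-separated composite key are my own assumptions for
illustration, not anything shipped with Hadoop.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SecondarySortSetup {

  /** Send every record with the same first (tab-separated) key field to the
      same reducer, whatever the second field is. */
  public static class FirstFieldPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String primary = key.toString().split("\t", 2)[0];
      return (primary.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Group reduce input on the first key field only, so one reduce() call
      sees all secondary values for a primary key, already sorted by the
      full composite key. */
  public static class FirstFieldGroupingComparator extends WritableComparator {
    public FirstFieldGroupingComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      String ka = a.toString().split("\t", 2)[0];
      String kb = b.toString().split("\t", 2)[0];
      return ka.compareTo(kb);
    }
  }

  public static void configure(JobConf conf) {
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    // The default Text comparator already sorts "primary<TAB>secondary" keys
    // by primary and then secondary, so only the partitioning and grouping
    // need to be overridden here.
    conf.setPartitionerClass(FirstFieldPartitioner.class);
    conf.setOutputValueGroupingComparator(FirstFieldGroupingComparator.class);
  }
}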

James
-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, February 06, 2008 1:53 PM
To: core-user@hadoop.apache.org
Subject: Re: sort by value




On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <jssarma@facebook.com> wrote:

> 
>> But it actually adds duplicate data (i.e., the value column which
>> needs sorting) to the key.
> 
> Why? You can always take it out of the value to remove the redundancy.
> 

Actually, you can't in most cases.

Suppose you have input data like this:

   a, b_1
   a, b_2
   a, b_1

And then the mapper produces data like this for each input record:

   a, b_1, 1
   a, *, 1
   a, b_2, 1
   a, *, 1
   a, b_1, 1
   a, *, 1

If you use the first two fields as the key so that you can sort the records
nicely, you get the following inputs to the reducer:

   <a, *>, [3, 2, 1]

You now don't know which value each count belongs to, except for the first
one.  If you replicate the second field in the value output of the map, then
you get this:

   <a, *>, [[*, 3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

   a, b_1, 2/3
   a, b_2, 1/3
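
For concreteness, here is a rough, untested sketch of map and reduce along
those lines with the old org.apache.hadoop.mapred API.  The class names, the
tab-separated key and value layout, and the use of "*" as a total marker that
sorts ahead of the real values are assumptions for illustration only; it also
assumes the job partitions and groups on the first key field alone.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FractionCount {

  /** For an input line "a, b_1", emit the composite keys "a<TAB>b_1" and
      "a<TAB>*", each with the second field replicated into the value. */
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split(",\\s*", 2);
      String primary = fields[0], secondary = fields[1];
      out.collect(new Text(primary + "\t" + secondary),
                  new Text(secondary + "\t1"));
      out.collect(new Text(primary + "\t*"), new Text("*\t1"));
    }
  }

  /** One reduce() call sees all values for a primary key, sorted so the "*"
      totals arrive first; it then emits count/total for each secondary
      value. */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String primary = key.toString().split("\t", 2)[0];
      long total = 0, count = 0;
      String current = null;
      while (values.hasNext()) {
        String[] v = values.next().toString().split("\t", 2);
        String secondary = v[0];
        long n = Long.parseLong(v[1]);
        if ("*".equals(secondary)) {             // the grand total comes first
          total += n;
        } else if (secondary.equals(current)) {  // same secondary value
          count += n;
        } else {                                 // new secondary value
          if (current != null) {
            out.collect(new Text(primary + ", " + current),
                        new Text(count + "/" + total));
          }
          current = secondary;
          count = n;
        }
      }
      if (current != null) {
        out.collect(new Text(primary + ", " + current),
                    new Text(count + "/" + total));
      }
    }
  }
}

Run against the three input records above, this gives a, b_1 -> 2/3 and
a, b_2 -> 1/3.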

