hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: how to get all different values for each key
Date Wed, 03 Aug 2011 12:53:46 GMT
Secondary sort is the way to go: it is easier to dedup a sorted input set.
You can also filter duplicates in the map and combine phases, to the
extent that is safe (bounded sets, etc.), to speed up the process and
reduce data transfer.
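
A rough sketch of both ideas, in plain Java without any Hadoop classes so
it runs on its own. All names here (DedupSketch, dedupSorted, BoundedFilter)
are made up for illustration and are not Hadoop APIs. The first method
assumes the values for a key already arrive sorted, as a secondary sort
would deliver them, so duplicates are adjacent and only the previous value
has to be remembered. The second is a best-effort map-side filter with a
bounded set; it may let some duplicates through, which is fine because the
reduce side still removes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; plain Java stand-ins for the Hadoop types.
class DedupSketch {

    // Reduce side: assumes the values for one key arrive in sorted order,
    // so equal values are adjacent and no per-key set is needed.
    static List<Long> dedupSorted(Iterable<Long> sortedValues) {
        List<Long> unique = new ArrayList<>();
        Long previous = null;
        for (Long v : sortedValues) {
            if (previous == null || !previous.equals(v)) {
                unique.add(v);
                previous = v;
            }
        }
        return unique;
    }

    // Map side: best-effort duplicate filter whose memory use is bounded.
    static class BoundedFilter {
        private final Set<String> seen = new HashSet<>();
        private final int maxEntries;

        BoundedFilter(int maxEntries) {
            this.maxEntries = maxEntries;
        }

        // Returns true if the (key, value) pair should be emitted.
        boolean shouldEmit(long key, long value) {
            if (seen.size() >= maxEntries) {
                // Forget everything once the bound is reached; duplicates
                // seen again later pass through and are left to the reducer.
                seen.clear();
            }
            return seen.add(key + ":" + value);
        }
    }

    public static void main(String[] args) {
        // Values for key 1 from the example in this thread, already sorted.
        System.out.println(dedupSorted(Arrays.asList(2L, 3L, 3L, 4L))); // prints [2, 3, 4]
    }
}
```

In a real job the first loop would sit in the reducer and the filter in the
mapper (or an equivalent combiner); the bound is something to tune against
the task heap size.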

On Wed, Aug 3, 2011 at 4:07 PM, Jianxin Wang <wangjx798@gmail.com> wrote:
> Thanks, Matthew!
>
>     How about using SecondarySort to get <key,values>, so that the values are
> sorted for every key,
> and then traversing the sorted values to get all the unique values?
>
>     I am not sure which way is more efficient. I suspect HashSet is a
> complicated data structure.
>
> 2011/8/3 Matthew John <tmatthewjohn1988@gmail.com>
>
>> Hey,
>>
>> I feel HashSet is a good method to dedup. To increase the overall
>> efficiency, you could also look into a Combiner running the same Reducer
>> code. That would ensure less data in the sort-shuffle phase.
>>
>> Regards,
>> Matthew
>>
>> On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <wangjx798@gmail.com> wrote:
>>
>> > Hi Harsh,
>> >     After the map, I can get all values for one key, but I want to dedup
>> > these values and get only the unique ones. For now I just do it like the image.
>> >
>> >     I think the following code is not efficient (it uses a HashSet to dedup).
>> > Thanks :)
>> >
>> > private static class MyReducer extends
>> >     Reducer<LongWritable,LongWritable,LongWritable,LongsWritable>
>> > {
>> >   // reused across reduce() calls
>> >   HashSet<Long> uids=new HashSet<Long>();
>> >   LongsWritable unique_uids=new LongsWritable();
>> >
>> >   public void reduce(LongWritable key,Iterable<LongWritable> values,Context context)
>> >       throws IOException,InterruptedException
>> >   {
>> >     // collect the distinct values for this key
>> >     uids.clear();
>> >     for(LongWritable v:values)
>> >     {
>> >       uids.add(v.get());
>> >     }
>> >     // copy the set into an array for the output value
>> >     int size=uids.size();
>> >     long[] l=new long[size];
>> >     int i=0;
>> >     for(long uid:uids)
>> >     {
>> >       l[i]=uid;
>> >       i++;
>> >     }
>> >     unique_uids.Set(l);
>> >     context.write(key,unique_uids);
>> >   }
>> > }
>> >
>> >
>> > 2011/8/3 Harsh J <harsh@cloudera.com>
>> >
>> >> Use MapReduce :)
>> >>
>> >> If map output: (key, value)
>> >> Then reduce input becomes: (key, [iterator of values across all maps
>> >> with (key, value)])
>> >>
>> >> I believe this is very similar to the wordcount example, but minus the
>> >> summing. For a given key, you get all the values that carry that key
>> >> in the reducer. Have you tried to run a simple program to achieve this
>> >> before asking? Or is something specifically not working?
>> >>
>> >> On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <wangjx798@gmail.com> wrote:
>> >> > Hi,
>> >> >    I have many <key,value> pairs now, and want to get all the different
>> >> > values for each key. Which way is efficient for this work?
>> >> >
>> >> >   such as input: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
>> >> >   output: <1,2/3/4> <2,1/2>
>> >> >
>> >> >   Thanks!
>> >> >
>> >> > walter
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >>
>> >
>> >
>>
>



-- 
Harsh J
