hadoop-common-user mailing list archives

From Matthew John <tmatthewjohn1...@gmail.com>
Subject Re: how to get all different values for each key
Date Wed, 03 Aug 2011 10:26:48 GMT
Hey,

I think a HashSet is a reasonable way to dedup. To improve overall efficiency
you could also look into a Combiner that applies the same dedup logic as the
Reducer. That would reduce the amount of data in the sort-shuffle phase.

Regards,
Matthew

On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <wangjx798@gmail.com> wrote:

> Hi Harsh,
>     After the map phase I can get all the values for one key, but I want to
> dedup these values and keep only the unique ones. For now I do it as shown
> in the image.
>
>     I think the following code is not efficient (it uses a HashSet to dedup).
> Thanks :)
>
> private static class MyReducer extends
>         Reducer<LongWritable, LongWritable, LongWritable, LongsWritable>
> {
>     // Reused across reduce() calls to avoid reallocating per key.
>     HashSet<Long> uids = new HashSet<Long>();
>     LongsWritable unique_uids = new LongsWritable();
>
>     public void reduce(LongWritable key, Iterable<LongWritable> values,
>             Context context) throws IOException, InterruptedException
>     {
>         // Collect the distinct values seen for this key.
>         uids.clear();
>         for (LongWritable v : values)
>         {
>             uids.add(v.get());
>         }
>
>         // Copy the distinct values into an array for the output writable.
>         int size = uids.size();
>         long[] l = new long[size];
>         int i = 0;
>         for (long uid : uids)
>         {
>             l[i] = uid;
>             i++;
>         }
>         unique_uids.Set(l);
>         context.write(key, unique_uids);
>     }
> }
>
>
> 2011/8/3 Harsh J <harsh@cloudera.com>
>
>> Use MapReduce :)
>>
>> If map output: (key, value)
>> Then reduce input becomes: (key, [iterator of values across all maps
>> with (key, value)])
>>
>> I believe this is very similar to the wordcount example, but minus the
>> summing. For a given key, you get all the values that carry that key
>> in the reducer. Have you tried to run a simple program to achieve this
>> before asking? Or is something specifically not working?
>>
>> On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <wangjx798@gmail.com> wrote:
>> > Hi,
>> >    I have many <key,value> pairs and want to get all the distinct values
>> > for each key. Which way is efficient for this?
>> >
>> >   For example, input: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
>> >   output: <1,2/3/4> <2,1/2>
>> >
>> >   Thanks!
>> >
>> > walter
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>
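
The combiner suggestion in the reply at the top of this thread can be sketched in plain Java, without Hadoop, to show why it shrinks the shuffle. This is a minimal simulation under stated assumptions, not code from the thread: the class DedupSketch, the Pair type, and the helper names are hypothetical, and the split of the sample input across two map tasks is made up for illustration. One point worth keeping in mind: in Hadoop a combiner has to emit the same key/value types as the map output, so a combiner for the reducer quoted above would emit (LongWritable, LongWritable) pairs rather than a LongsWritable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class DedupSketch {
    /** Hypothetical stand-in for one (LongWritable, LongWritable) map output record. */
    static final class Pair {
        final long key;
        final long value;
        Pair(long key, long value) { this.key = key; this.value = value; }
    }

    /** Groups records by key and keeps each distinct value once (sorted for stable output). */
    static Map<Long, Set<Long>> distinctByKey(List<Pair> pairs) {
        Map<Long, Set<Long>> grouped = new TreeMap<>();
        for (Pair p : pairs) {
            grouped.computeIfAbsent(p.key, k -> new TreeSet<>()).add(p.value);
        }
        return grouped;
    }

    /** Combiner step: dedup one map task's output, emitting records of the same type. */
    static List<Pair> combine(List<Pair> mapOutput) {
        List<Pair> out = new ArrayList<>();
        for (Map.Entry<Long, Set<Long>> e : distinctByKey(mapOutput).entrySet()) {
            for (long v : e.getValue()) {
                out.add(new Pair(e.getKey(), v));
            }
        }
        return out;
    }

    /** Reducer step: dedup again, since the same value can arrive from different map tasks. */
    static Map<Long, Set<Long>> reduce(List<Pair> shuffled) {
        return distinctByKey(shuffled);
    }

    /** Builds a record list from a flat key, value, key, value, ... argument list. */
    static List<Pair> pairs(long... kv) {
        List<Pair> out = new ArrayList<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            out.add(new Pair(kv[i], kv[i + 1]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input from the thread, split across two hypothetical map tasks.
        List<Pair> map1 = pairs(1, 2, 1, 3, 1, 4, 1, 3);
        List<Pair> map2 = pairs(2, 1, 2, 2);

        List<Pair> shuffled = new ArrayList<>(combine(map1));
        shuffled.addAll(combine(map2));

        System.out.println(shuffled.size() + " records shuffled instead of 6");
        System.out.println(reduce(shuffled)); // {1=[2, 3, 4], 2=[1, 2]}
    }
}
```

The reducer still has to dedup, because a combiner only sees the output of a single map task and Hadoop does not guarantee how many times, if at all, it runs. The saving depends on how many duplicates land in the same map task.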
