hadoop-common-user mailing list archives

From Jianxin Wang <wangjx...@gmail.com>
Subject Re: how to get all different values for each key
Date Wed, 03 Aug 2011 10:37:08 GMT
Thanks, Matthew!

How about using SecondarySort to get <key, values> with the values sorted
for every key, and then traversing the sorted values to collect the unique
ones?

I am not sure which way is more efficient. I suspect HashSet is a
relatively complicated (and therefore costly) data structure.

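To make the sorted-values idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the values for one key already arrive in sorted order, as a secondary sort would deliver them; the class and method names are made up for illustration. Because equal values are adjacent in sorted input, each value only needs to be compared with the previous one, so no HashSet is required.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedDedupSketch {
    // Returns the distinct values of an already-sorted sequence by
    // comparing each value with the last one that was kept.
    // Assumption: the caller guarantees the input is sorted.
    public static List<Long> uniqueFromSorted(Iterable<Long> sortedValues) {
        List<Long> unique = new ArrayList<>();
        boolean first = true;
        long previous = 0L;
        for (long v : sortedValues) {
            if (first || v != previous) {
                unique.add(v);
                previous = v;
                first = false;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Long> sorted = Arrays.asList(2L, 3L, 3L, 4L);
        System.out.println(uniqueFromSorted(sorted)); // prints [2, 3, 4]
    }
}
```

Whether this beats the HashSet version in a real job depends on the cost of the extra sorting in the shuffle, so it would need to be measured.
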
2011/8/3 Matthew John <tmatthewjohn1988@gmail.com>

> Hey,
>
> I feel HashSet is a good way to dedup. To increase the overall efficiency,
> you could also look into running the same Reducer code as a Combiner. That
> would ensure less data in the sort-shuffle phase.
>
> Regards,
> Matthew
>
> On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <wangjx798@gmail.com> wrote:
>
> > Hi Harsh,
> >     After the map phase I can get all values for one key, but I want to
> > dedup these values and keep only the unique ones. For now I just do it as
> > shown in the image.
> >
> >     I think the following code (which uses a HashSet to dedup) is not
> > efficient.
> > Thanks :)
> >
> > private static class MyReducer extends
> >         Reducer<LongWritable, LongWritable, LongWritable, LongsWritable> {
> >     // Reused across reduce() calls to avoid reallocating per key.
> >     HashSet<Long> uids = new HashSet<Long>();
> >     LongsWritable unique_uids = new LongsWritable();
> >
> >     @Override
> >     public void reduce(LongWritable key, Iterable<LongWritable> values,
> >             Context context) throws IOException, InterruptedException {
> >         uids.clear();
> >         // Collect the distinct values for this key.
> >         for (LongWritable v : values) {
> >             uids.add(v.get());
> >         }
> >         // Copy the distinct values into an array for the output writable.
> >         long[] l = new long[uids.size()];
> >         int i = 0;
> >         for (long uid : uids) {
> >             l[i++] = uid;
> >         }
> >         unique_uids.Set(l);
> >         context.write(key, unique_uids);
> >     }
> > }
> >
> >
> > 2011/8/3 Harsh J <harsh@cloudera.com>
> >
> >> Use MapReduce :)
> >>
> >> If map output: (key, value)
> >> Then reduce input becomes: (key, [iterator of values across all maps
> >> with (key, value)])
> >>
> >> I believe this is very similar to the wordcount example, but minus the
> >> summing. For a given key, you get all the values that carry that key
> >> in the reducer. Have you tried to run a simple program to achieve this
> >> before asking? Or is something specifically not working?
> >>
> >> On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <wangjx798@gmail.com> wrote:
> >> > Hi,
> >> >    I have many <key,value> pairs now and want to get all the different
> >> > values for each key. Which way is efficient for this work?
> >> >
> >> >   For example, input: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
> >> >   output: <1,2/3/4> <2,1/2>
> >> >
> >> >   Thanks!
> >> >
> >> > walter
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
> >
> >
>
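
The example from the original question can be checked outside Hadoop with a small plain-Java sketch. It only imitates the grouping that the shuffle performs and is not MapReduce code; the class and method names are made up for illustration, and sorted collections are used purely so the printed output is deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class DistinctValuesSketch {
    // Groups (key, value) pairs by key and keeps each value once per key.
    // Each inner array is one pair: {key, value}.
    public static Map<Long, TreeSet<Long>> distinctValuesPerKey(long[][] pairs) {
        Map<Long, TreeSet<Long>> grouped = new TreeMap<>();
        for (long[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Input from the question: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
        long[][] pairs = {{1, 2}, {1, 3}, {1, 4}, {1, 3}, {2, 1}, {2, 2}};
        System.out.println(distinctValuesPerKey(pairs)); // prints {1=[2, 3, 4], 2=[1, 2]}
    }
}
```

This matches the expected output <1,2/3/4> <2,1/2> from the question.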
