hadoop-mapreduce-user mailing list archives

From Geoffry Roberts <geoffry.robe...@gmail.com>
Subject Re: Mystery, A Tale of Two Reducers
Date Fri, 17 Jun 2011 21:14:27 GMT
This is for the edification of the group.

The clone solution worked.  Here's how I handled it.

Second Reducer (redux) :

protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
        throws IOException, InterruptedException {
    List<Text> list = new ArrayList<Text>();
    for (Text visitor : visitors) {
        // Copy each value; Hadoop reuses the same Text instance on every iteration.
        list.add(new Text(visitor));
    }
    for (Text visitor : list) {
        ctx.write(key, visitor);
    }
}
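
For anyone who wants to see the effect without a cluster, here is a minimal sketch in plain Java. It is not Hadoop code: MutableValue, ReusingIterable and ReuseDemo are hypothetical names made up for illustration, standing in for Text and for the values iterable the framework passes to reduce(). It only assumes what Harsh describes below, namely that the iterator hands back one shared object and overwrites its contents on each step.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a mutable Writable such as Text (not a Hadoop class).
final class MutableValue {
    private String value = "";

    MutableValue() {}

    // Copy constructor, analogous in spirit to new Text(otherText).
    MutableValue(MutableValue other) { this.value = other.value; }

    void set(String v) { this.value = v; }

    @Override public String toString() { return value; }
}

// Hypothetical iterable that returns the SAME object from every next() call,
// overwriting its contents each time.
final class ReusingIterable implements Iterable<MutableValue> {
    private final String[] raw;
    private final MutableValue shared = new MutableValue();

    ReusingIterable(String... raw) { this.raw = raw; }

    @Override public Iterator<MutableValue> iterator() {
        return new Iterator<MutableValue>() {
            private int i = 0;

            @Override public boolean hasNext() { return i < raw.length; }

            @Override public MutableValue next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(raw[i++]);   // mutate in place
                return shared;          // always the same reference
            }
        };
    }
}

class ReuseDemo {
    // Buffers references only: every list slot points at the one shared object.
    static List<String> byReference(Iterable<MutableValue> values) {
        List<MutableValue> list = new ArrayList<>();
        for (MutableValue v : values) list.add(v);
        return render(list);
    }

    // Buffers a fresh copy per element, as the fixed reducer does.
    static List<String> byCopy(Iterable<MutableValue> values) {
        List<MutableValue> list = new ArrayList<>();
        for (MutableValue v : values) list.add(new MutableValue(v));
        return render(list);
    }

    private static List<String> render(List<MutableValue> list) {
        List<String> out = new ArrayList<>();
        for (MutableValue v : list) out.add(v.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(byReference(new ReusingIterable("42340108", "42340106", "42340113")));
        // prints [42340113, 42340113, 42340113]
        System.out.println(byCopy(new ReusingIterable("42340108", "42340106", "42340113")));
        // prints [42340108, 42340106, 42340113]
    }
}
```

The first call reproduces the duplicates from my second reducer, and the second matches the output of the first reducer. The ArrayList itself behaves normally; it was simply holding three references to one object.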

Life is good again.

On 17 June 2011 13:38, Harsh J <harsh@cloudera.com> wrote:

> Geoffry,
>
> The problem here is that the Reducer in Hadoop reuses the same
> container object to pass on all keys and values. What your second
> reducer's list really holds are references to that one object, so
> by the time you write them out, every entry refers to the last
> value read and you get a mess of duplicates.
>
> The solution, when you want to persist a particular key or value
> object, is to copy (clone) it into the list, e.g. with
> new Text(value), so that the list stores real, new objects and not
> multiple references to the same object.
>
> On Sat, Jun 18, 2011 at 2:00 AM, Geoffry Roberts
> <geoffry.roberts@gmail.com> wrote:
> > All,
> >
> > I have come across a situation that I don't understand.
> >
> > First Reducer:
> >
> > Behold the first of two reducers.  A fragment of its output follows.
> > Simple, no?  It doesn't do anything.  I've highlighted two records from
> > the output.  Keep them in mind.  Now let's look at the second reducer.
> >
> > protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
> >         throws IOException, InterruptedException {
> >     for (Text visitor : visitors) {
> >         ctx.write(key, visitor);
> >     }
> > }
> >
> > 2005-09-16=33614    42340108    more==>
> > 2005-09-16=33614    42340106    more==>
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=44135    42324490    more==>
> > 2005-09-16=44135    42339700    more==>
> > ...
> > 2005-09-16=44135    42324489    more==>
> >
> >
> > Second Reducer:
> >
> > This is a variation on the reducer from above.  A fragment of its output
> > follows.  The difference is that I add all visitors to a list, then
> > iterate through the list to produce my output.  Remember the two
> > highlighted records from above?  They are now showing up in the output
> > as duplicates and the other records appear to be missing.  Why?  I have
> > never seen an ArrayList behave like this.  It must have something to do
> > with Hadoop.
> >
> > I have reasons for using the list.  One such reason is that I must have
> > a full count of all visitors before I can do my output, but I'll spare
> > you the rest.
> >
> > To my mind, this second reducer should output the same as the first.
> >
> > protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
> >         throws IOException, InterruptedException {
> >     List<Text> list = new ArrayList<Text>();
> >     for (Text visitor : visitors) {
> >         list.add(visitor);
> >     }
> >     for (Text visitor : list) {
> >         ctx.write(key, visitor);
> >     }
> > }
> >
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=44135    42324489    more==>
> > 2005-09-16=44135    42324489    more==>
> >
> > Thanks in advance
> >
> > --
> > Geoffry Roberts
> >
> >
>
>
>
> --
> Harsh J
>



-- 
Geoffry Roberts
