Subject: Re: Mystery, A Tale of Two Reducers
From: Geoffry Roberts <geoffry.roberts@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Fri, 17 Jun 2011 14:14:27 -0700

This is for the edification of the group.

The clone solution worked. Here's how I handled it.

Second Reducer (redux):

protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
        throws IOException, InterruptedException {
    List<Text> list = new ArrayList<Text>();
    for (Text visitor : visitors) {
        list.add(new Text(visitor)); // Create a new visitor.
    }
    for (Text visitor : list) {
        ctx.write(key, visitor);
    }
}

Life is good again.

On 17 June 2011 13:38, Harsh J <harsh@cloudera.com> wrote:
> Geoffry,
>
> The problem here is that the Reducer in Hadoop reuses the same
> container object to pass on all values and keys. What your second
> reducer's code is really holding are references to that one object,
> so by the time you write them out they all refer to the value from
> the last iteration -- hence the duplicates.
>
> The solution, when you want to persist a particular key or value
> object, is to clone it into the list, so that the list stores real,
> new objects and not multiple references to the same object.
>
> On Sat, Jun 18, 2011 at 2:00 AM, Geoffry Roberts
> <geoffry.roberts@gmail.com> wrote:
> > All,
> >
> > I have come across a situation that I don't understand.
> >
> > First Reducer:
> >
> > Behold the first of two reducers. A fragment of its output follows.
> > Simple, no? It doesn't do anything. I've highlighted two records from
> > the output. Keep them in mind. Now let's look at the second reducer.
> >
> > protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
> >         throws IOException, InterruptedException {
> >     for (Text visitor : visitors) {
> >         ctx.write(key, visitor);
> >     }
> > }
> >
> > 2005-09-16=33614    42340108    more==>
> > 2005-09-16=33614    42340106    more==>
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=44135    42324490    more==>
> > 2005-09-16=44135    42339700    more==>
> > ...
> > 2005-09-16=44135    42324489    more==>
> >
> >
> > Second Reducer:
> >
> > This is a variation on the reducer above. A fragment of its output
> > follows. The difference is that I add all visitors to a list, then
> > iterate through the list to produce my output. Remember the two
> > highlighted records from above? They now show up in the output as
> > duplicates, and the other records appear to be missing. Why? I have
> > never seen an ArrayList behave like this. It must have something to
> > do with Hadoop.
> >
> > I have reasons for using the list. One is that I must have a full
> > count of all visitors before I can produce my output, but I'll spare
> > you the details.
> >
> > To my mind, this second reducer should output the same as the first.
> >
> > protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
> >         throws IOException, InterruptedException {
> >     List<Text> list = new ArrayList<Text>();
> >     for (Text visitor : visitors) {
> >         list.add(visitor);
> >     }
> >     for (Text visitor : list) {
> >         ctx.write(key, visitor);
> >     }
> > }
> >
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=33614    42340113    more==>
> > 2005-09-16=44135    42324489    more==>
> > 2005-09-16=44135    42324489    more==>
> >
> > Thanks in advance
> >
> > --
> > Geoffry Roberts
>
>
> --
> Harsh J

--
Geoffry Roberts
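For anyone who wants to see the aliasing effect Harsh describes without a Hadoop cluster, the sketch below reproduces it in plain Java. `MutableHolder` is a hypothetical stand-in for Hadoop's `Text`, and `reusedValues` mimics the framework's reducer iterator handing back one reused object mutated in place; neither is part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for a mutable Writable like Text.
    static class MutableHolder {
        String value;
        MutableHolder(String v) { value = v; }
        // Copy constructor, analogous to new Text(visitor).
        MutableHolder(MutableHolder other) { value = other.value; }
    }

    // Yields the SAME holder object for every element, mutated in place --
    // imitating how the reducer's value iterator reuses its container.
    static Iterable<MutableHolder> reusedValues(final String... values) {
        return () -> new Iterator<MutableHolder>() {
            final MutableHolder shared = new MutableHolder("");
            int i = 0;
            public boolean hasNext() { return i < values.length; }
            public MutableHolder next() { shared.value = values[i++]; return shared; }
        };
    }

    // Collect the values into a list, either aliasing the shared object
    // (copy == false, the buggy pattern) or copying each one (the fix).
    static List<String> collect(boolean copy) {
        List<MutableHolder> held = new ArrayList<>();
        for (MutableHolder v : reusedValues("42340108", "42340106", "42340113")) {
            held.add(copy ? new MutableHolder(v) : v);
        }
        List<String> result = new ArrayList<>();
        for (MutableHolder h : held) result.add(h.value);
        return result;
    }

    public static void main(String[] args) {
        // Aliased: every entry reflects the last value seen.
        System.out.println(collect(false)); // [42340113, 42340113, 42340113]
        // Copied: each entry keeps its own value.
        System.out.println(collect(true));  // [42340108, 42340106, 42340113]
    }
}
```

The aliased list prints three copies of the last value, exactly the duplicate pattern in the second reducer's output, while the copying version preserves all three values.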