Subject: Re: manipulating key in combine phase
From: Devin Suiter RDX <dsuiter@rdx.com>
To: user@hadoop.apache.org
Date: Mon, 13 Jan 2014 14:45:06 -0500

I believe the combine process is after that step, so, no.

What comes out of a mapper is a set of records {k1, v1} {k1, v2} {k1, v(n)} {k2, v1} {k2, v2} {k2, v(n)}, and the sort/shuffle then aggregates those into arrays like {k1, {v1, v2, v(n)}}, {k2, {v1, v2, v(n)}}, on which the reducer performs its logic for each unique key.

What comes out of a combiner is {k1, {v1, v2, v(n)}}, {k2, {v1, v2, v(n)}}, the same {k, v} map that the reducer builds, and then the reducer does the logic on the value set for each unique key.

If you change the key in the combiner, you aren't working with the same set, so you've essentially used your combiner as another mapper. But your method signature won't be right.
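
To make the signature point concrete, here is a minimal sketch of a key-preserving combiner, assuming Text keys and IntWritable counts as in the classic word count; the class name and types are illustrative, not taken from this thread:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative sketch: a combiner is just a Reducer whose input and output
    // key/value types both match the map output types, so the framework can
    // apply it zero or more times.
    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable sum = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
          total += v.get();
        }
        sum.set(total);
        // Emit the same key that came in; only the value list is collapsed.
        context.write(key, sum);
      }
    }

It would be registered with job.setCombinerClass(SumCombiner.class); the important property is that the output types match the map output types and the keys pass through unchanged.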

The combiner is designed solely to reduce network traffic from mappers to reducers; since there are usually more mappers than reducers, it reduces bottlenecking at switches.

If you want to change the key after you've set it, I feel like you should use ChainMapper and/or write custom input/output format classes if you need to.
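
If ChainMapper is the route you take, a rough driver sketch (new mapreduce API) might look like the following; ParseMapper and SplitKeyMapper are hypothetical stand-ins for whatever map stages you chain:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

    public class SplitKeyDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split wide keys");
        job.setJarByClass(SplitKeyDriver.class);

        // First map stage: parse the raw input (ParseMapper is hypothetical).
        ChainMapper.addMapper(job, ParseMapper.class,
            LongWritable.class, Text.class,   // input key/value types
            Text.class, Text.class,           // output key/value types
            new Configuration(false));

        // Second map stage: rewrite/split keys before the shuffle
        // (SplitKeyMapper is hypothetical; a sketch of it appears at the end
        // of this message).
        ChainMapper.addMapper(job, SplitKeyMapper.class,
            Text.class, Text.class,
            Text.class, Text.class,
            new Configuration(false));

        // Reducer, input/output formats and paths would be set here as usual.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

This way the key rewrite happens entirely on the map side, before the sort and before any combining.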

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Mon, Jan 13, 2014 at 12:39 PM, Amit Sela <amits@infolinks.com> wrote:

> More than a solution, I'd like to know if a combiner is allowed to change
> the key? Will it interfere with the mappers' sort/merge?
>
> On Mon, Jan 13, 2014 at 3:06 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:
>
>> Amit,
>>
>> Have you explored the ChainMapper class?
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>> On Sun, Jan 12, 2014 at 7:28 PM, John Lilley <john.lilley@redpoint.net> wrote:
>>
>>> Isn't this what you'd normally do in the Mapper?
>>>
>>> My understanding of the combiner is that it is like a "mapper-side
>>> pre-reducer" and operates on blocks of data that have already been sorted
>>> by key, so mucking with the keys doesn't *seem* like a good idea.
>>>
>>> john
>>>
>>> *From:* Amit Sela [mailto:amits@infolinks.com]
>>> *Sent:* Sunday, January 12, 2014 9:26 AM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* manipulating key in combine phase
>>>
>>> Hi all,
>>>
>>> I was wondering if it is possible to manipulate the key during combine:
>>>
>>> Say I have a mapreduce job where the key has many qualifiers.
>>>
>>> I would like to "split" the key into two (or more) keys if it has more
>>> than, say, 100 qualifiers.
>>>
>>> In the combiner class I would do something like:
>>>
>>> int count = 0;
>>>
>>> for (Writable value : values) {
>>>   if (++count >= 100) {
>>>     context.write(newKey, value);
>>>   } else {
>>>     context.write(key, value);
>>>   }
>>> }
>>>
>>> where newKey is something like key+randomUUID
>>>
>>> I know that the combiner can be called "zero, once or more..." and I'm
>>> getting strange results (same key written more than once), so I would be
>>> glad to get some deeper insight into how the combiner works.
>>>
>>> Thanks,
>>>
>>> Amit.
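
Picking up John's point that this is normally the mapper's job: the combiner can be run zero or more times over map output that has already been sorted by key, so a combiner that emits new keys can break the merge's sorted-order assumption, which is one plausible explanation for the same key showing up more than once. Below is a rough sketch of doing the split in the map phase instead, assuming Text/Text input such as KeyValueTextInputFormat produces; the class name and the per-task rollover policy are illustrative, not taken from the thread:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SplitKeyMapper extends Mapper<Text, Text, Text, Text> {
      private static final int MAX_PER_KEY = 100;
      // Per-key value counts and the current "overflow" suffix, local to this map task.
      private final Map<String, Integer> counts = new HashMap<String, Integer>();
      private final Map<String, String> suffixed = new HashMap<String, String>();

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        String k = key.toString();
        int n = counts.containsKey(k) ? counts.get(k) : 0;
        counts.put(k, n + 1);
        if (n > 0 && n % MAX_PER_KEY == 0) {
          // After every MAX_PER_KEY values for this key, roll over to a fresh
          // suffixed key, so no single reduce group receives more than
          // MAX_PER_KEY values from this map task.
          suffixed.put(k, k + "-" + UUID.randomUUID());
        }
        String outKey = suffixed.containsKey(k) ? suffixed.get(k) : k;
        context.write(new Text(outKey), value);
      }
    }

Because the new key is fixed before the map output is sorted, the sort/merge and any key-preserving combiner behave normally afterwards.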