Subject: Re: manipulating key in combine phase
From: Devin Suiter RDX <dsuiter@rdx.com>
To: user@hadoop.apache.org
Date: Mon, 13 Jan 2014 14:45:06 -0500

I believe the combine process is after that step, so, no.

What comes out of a mapper is a set of records {k1, v1} {k1, v2} {k1, v(n)} {k2, v1} {k2, v2} {k2, v(n)}, and the sort/shuffle then aggregates those into arrays like {k1, {v1, v2, v(n)}}, {k2, {v1, v2, v(n)}}, on which the reducer performs its logic for each unique key.

What comes out of a combiner is {k1, {v1, v2, v(n)}}, {k2, {v1, v2, v(n)}}, the same {k, v} map that the reducer builds, and then the reducer does the logic on the value set for each unique key.

If you change the key in the combiner, you aren't working with the same set, so you've essentially used your combiner as another mapper. But your method signature won't be right.
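
To make the signature point concrete, here is a minimal sketch of a key-preserving combiner, assuming Text keys and IntWritable counts as in the classic word count; the class name and types are illustrative, not taken from this thread:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative sketch: a combiner is just a Reducer whose input and output
    // key/value types both match the map output types, so the framework can
    // apply it zero or more times.
    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable sum = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
          total += v.get();
        }
        sum.set(total);
        // Emit the same key that came in; only the value list is collapsed.
        context.write(key, sum);
      }
    }

It would be registered with job.setCombinerClass(SumCombiner.class); the important property is that the output types match the map output types and the keys pass through unchanged.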

The combiner is designed solely to reduce network traffic from mappers to reducers; since there are usually more mappers than reducers, it reduces bottlenecking at switches.

If you want to change the key after you've set it, I feel like you should use ChainMapper and/or write custom input/output format classes if you need to.
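
If ChainMapper is the route you take, a rough driver sketch (new mapreduce API) might look like the following; ParseMapper and SplitKeyMapper are hypothetical stand-ins for whatever map stages you chain:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

    public class SplitKeyDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split wide keys");
        job.setJarByClass(SplitKeyDriver.class);

        // First map stage: parse the raw input (ParseMapper is hypothetical).
        ChainMapper.addMapper(job, ParseMapper.class,
            LongWritable.class, Text.class,   // input key/value types
            Text.class, Text.class,           // output key/value types
            new Configuration(false));

        // Second map stage: rewrite/split keys before the shuffle
        // (SplitKeyMapper is hypothetical; a sketch of it appears at the end
        // of this message).
        ChainMapper.addMapper(job, SplitKeyMapper.class,
            Text.class, Text.class,
            Text.class, Text.class,
            new Configuration(false));

        // Reducer, input/output formats and paths would be set here as usual.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

This way the key rewrite happens entirely on the map side, before the sort and before any combining.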

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Mon, Jan 13, 2014 at 12:39 PM, Amit Sela <amits@infolinks.com> wrote:

> More than a solution, I'd like to know if a combiner is allowed to change
> the key? Will it interfere with the mappers' sort/merge?
>
> On Mon, Jan 13, 2014 at 3:06 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:
>
>> Amit,
>>
>> Have you explored the ChainMapper class?
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>> On Sun, Jan 12, 2014 at 7:28 PM, John Lilley <john.lilley@redpoint.net> wrote:
>>
>>> Isn't this what you'd normally do in the Mapper?
>>>
>>> My understanding of the combiner is that it is like a "mapper-side
>>> pre-reducer" and operates on blocks of data that have already been sorted
>>> by key, so mucking with the keys doesn't *seem* like a good idea.
>>>
>>> john
>>>
>>> *From:* Amit Sela [mailto:amits@infolinks.com]
>>> *Sent:* Sunday, January 12, 2014 9:26 AM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* manipulating key in combine phase
>>>
>>> Hi all,
>>>
>>> I was wondering if it is possible to manipulate the key during combine:
>>>
>>> Say I have a mapreduce job where the key has many qualifiers.
>>>
>>> I would like to "split" the key into two (or more) keys if it has more
>>> than, say, 100 qualifiers.
>>>
>>> In the combiner class I would do something like:
>>>
>>> int count = 0;
>>>
>>> for (Writable value : values) {
>>>   if (++count >= 100) {
>>>     context.write(newKey, value);
>>>   } else {
>>>     context.write(key, value);
>>>   }
>>> }
>>>
>>> where newKey is something like key+randomUUID
>>>
>>> I know that the combiner can be called "zero, once or more..." and I'm
>>> getting strange results (same key written more than once), so I would be
>>> glad to get some deeper insight into how the combiner works.
>>>
>>> Thanks,
>>>
>>> Amit.
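
Picking up John's point that this is normally the mapper's job: the combiner can be run zero or more times over map output that has already been sorted by key, so a combiner that emits new keys can break the merge's sorted-order assumption, which is one plausible explanation for the same key showing up more than once. Below is a rough sketch of doing the split in the map phase instead, assuming Text/Text input such as KeyValueTextInputFormat produces; the class name and the per-task rollover policy are illustrative, not taken from the thread:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SplitKeyMapper extends Mapper<Text, Text, Text, Text> {
      private static final int MAX_PER_KEY = 100;
      // Per-key value counts and the current "overflow" suffix, local to this map task.
      private final Map<String, Integer> counts = new HashMap<String, Integer>();
      private final Map<String, String> suffixed = new HashMap<String, String>();

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        String k = key.toString();
        int n = counts.containsKey(k) ? counts.get(k) : 0;
        counts.put(k, n + 1);
        if (n > 0 && n % MAX_PER_KEY == 0) {
          // After every MAX_PER_KEY values for this key, roll over to a fresh
          // suffixed key, so no single reduce group receives more than
          // MAX_PER_KEY values from this map task.
          suffixed.put(k, k + "-" + UUID.randomUUID());
        }
        String outKey = suffixed.containsKey(k) ? suffixed.get(k) : k;
        context.write(new Text(outKey), value);
      }
    }

Because the new key is fixed before the map output is sorted, the sort/merge and any key-preserving combiner behave normally afterwards.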