Date: Mon, 9 Jul 2012 20:33:52 -0700
Subject: Re: Basic question on how reducer works
From: Karthik Kambatla <kasha@cloudera.com>
To: mapreduce-user@hadoop.apache.org, Grandl Robert

The partitioner is configurable. The default partitioner, from what I
remember, computes the partition as the key's hashcode modulo the number of
reducers/partitions. For random input it is balanced, but some cases can
have a very skewed key distribution. Also, as you have pointed out, the
number of values per key can vary. Together, both of them determine the
"weight" of each partition, as you call it.

Karthik

On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert wrote:

> Thanks Arun.
>
> So just for my clarification: the map will create partitions according to
> the number of reducers, such that each reducer gets almost the same number
> of keys in its partition. However, each key can have a different number of
> values, so the "weight" of each partition will depend on that. Also, when
> a new <key, value> is added, a hash is computed to find the corresponding
> partition?
>
> Robert
>
> ------------------------------
> *From:* Arun C Murthy
> *To:* mapreduce-user@hadoop.apache.org
> *Sent:* Monday, July 9, 2012 4:33 PM
> *Subject:* Re: Basic question on how reducer works
>
> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>
> Thanks a lot guys for the answers.
>
> Still, I am not able to find exactly the code for the following things:
>
> 1. How a reducer reads only its partition from a map's output. I looked
> into ReduceTask#getMapOutput, which does the actual read in
> ReduceTask#shuffleInMemory, but I don't see where it specifies which
> partition (reduce ID) to read.
>
> Look at TaskTracker.MapOutputServlet.
>
> 2. I still don't understand very well in which part of the code
> (MapTask.java) the intermediate data is written to which partition. So
> MapOutputBuffer is the one that actually writes the data to the buffer and
> spills after the buffer is full. Could you please elaborate a bit on how
> the data is written to which partition?
>
> Essentially, you can think of the partition-id as the 'primary key' and
> the actual 'key' in the map-output <key, value> as the 'secondary key'.
>
> hth,
> Arun
>
> Thanks,
> Robert
>
> ------------------------------
> *From:* Arun C Murthy
> *To:* mapreduce-user@hadoop.apache.org
> *Sent:* Monday, July 9, 2012 9:24 AM
> *Subject:* Re: Basic question on how reducer works
>
> Robert,
>
> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>
> Hi,
>
> I have some questions related to basic functionality in Hadoop.
>
> 1. When a Mapper processes the intermediate output data, how does it know
> how many partitions to create (how many reducers there will be) and how
> much data goes into each partition for each reducer?
>
> 2. When a JobTracker assigns a task to a reducer, it will also specify the
> locations of the intermediate output data it should retrieve, right? But
> how will a reducer know, at each remote location with intermediate output,
> which portion it has to retrieve?
>
> To add to Harsh's comment.
> Essentially, the TT *knows* where the output of a given
> map-id/reduce-id pair is present via an output-file/index-file
> combination.
>
> Arun
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
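The default-partitioner behavior described above (key hashcode modulo the number of reducers) can be sketched roughly as follows. This is a standalone illustration, not Hadoop's actual HashPartitioner class; the class name here is made up for the example:

```java
// Sketch of Hadoop's default partitioning rule: a record's partition is
// its key's hashCode modulo the number of reduce tasks.
public class HashPartitionSketch {
    // Mask off the sign bit so the modulo result is always non-negative,
    // even for keys with a negative hashCode.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key lands in the same partition,
        // no matter how many values it carries -- which is why a skewed
        // key distribution produces "heavy" partitions.
        System.out.println(getPartition("apple", 4));
        System.out.println(getPartition("apple", 4));
    }
}
```

Note that only keys are hashed; the values attached to a key never influence the partition choice, which is exactly why partition "weight" can vary even when keys are evenly spread.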
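The two points Arun makes -- partition-id as "primary key" in the map-output sort order, and an index file that tells the TT where each partition's slice of the output file starts -- can be sketched together. This is a simplified in-memory model with hypothetical names, not Hadoop's actual spill or IndexRecord format:

```java
import java.util.*;

// Sketch: map-output records are ordered by (partition-id, key), and an
// index maps each partition to where its run begins in the sorted output,
// so a reducer can fetch only its own slice.
public class SpillSketch {
    record Rec(int partition, String key, String value) {}

    // Sort records by partition first ("primary key"), then by key
    // ("secondary key"), and record each partition's start position.
    static Map<Integer, Integer> buildIndex(List<Rec> recs) {
        recs.sort(Comparator.comparingInt(Rec::partition)
                            .thenComparing(Rec::key));
        Map<Integer, Integer> index = new TreeMap<>();
        for (int i = 0; i < recs.size(); i++) {
            index.putIfAbsent(recs.get(i).partition(), i);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>(List.of(
                new Rec(1, "b", "v1"), new Rec(0, "a", "v2"),
                new Rec(1, "a", "v3"), new Rec(0, "c", "v4")));
        // After sorting, partition 0's records are contiguous, then
        // partition 1's; the index gives each run's start position.
        System.out.println(buildIndex(recs));
    }
}
```

In real Hadoop the index entries are byte offsets and lengths into the spill file rather than list positions, but the principle is the same: sorting by partition makes each reducer's data contiguous, and the index makes it addressable without scanning.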