Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of bbeaudreault@hubspot.com
 designates 74.125.149.205 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <869970D71E26D7498BDAC4E1CA92226B658BE365@MBX021-E3-NJ-2.exch021.domain.local>
References: <BLU0-SMTP237EBEEFF2EF6B02089D99C8FA80@phx.gbl>
 <CDC1071F.793A%sanjay.subramanian@wizecommerce.com>
 <1369191774.59557.YahooMailNeo@web190702.mail.sg3.yahoo.com>
 <1370603836.37168.YahooMailNeo@web190704.mail.sg3.yahoo.com>
 <869970D71E26D7498BDAC4E1CA92226B658BE365@MBX021-E3-NJ-2.exch021.domain.local>
From: Bryan Beaudreault <bbeaudreault@hubspot.com>
Date: Fri, 7 Jun 2013 11:58:29 -0400
Message-ID: 
 <CANZDn9sm=zb=frmscmh0YMPS68NPhpbSrwi9guDhQNsptXdLvw@mail.gmail.com>
Subject: Re: Why/When partitioner is used.
To: "hbase-user@hadoop.apache.org" <user@hadoop.apache.org>
Cc: Sai Sai <saigraph@yahoo.in>
Content-Type: multipart/alternative; boundary=20cf3071ca88b8dabd04de92815a

--20cf3071ca88b8dabd04de92815a
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

There are practical applications for defining your own partitioner as well:

1) Controlling database concurrency.  For instance, let's say you have a
distributed datastore like HBase or even your own MySQL sharding scheme.
With the default HashPartitioner, keys are distributed more or less
randomly across your reducers.  If your reduce code does database saves
or gets, this could cause periods where all reducers are hitting a single
database.  This may be more concurrency than your database can handle, so
you could use a partitioner to send all keys you know would hit Shard A to
reducers 1,2,3, and all that would hit Shard B to reducers 4,5,6.

2) I've also used partitioners when I want to do cross-key operations
such as deduping or counting.  You can further combine the custom
partitioner with your own comparator and grouping comparator to do
many advanced operations, depending on the application you are working on.
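
As a rough illustration of point 1, here is a minimal sketch in plain Java
with no Hadoop dependency.  The class name, the shardOf rule and the
assumption that the reducer count is a multiple of the shard count are all
hypothetical; in a real job the same logic would go in getPartition() of a
class extending org.apache.hadoop.mapreduce.Partitioner, and shardOf would
have to match how your datastore actually places keys.

```java
// Sketch only: plain Java, hypothetical names, no Hadoop dependency.
final class ShardAwarePartitioner {

    private ShardAwarePartitioner() {}

    // Hypothetical sharding rule; replace with your datastore's own.
    public static int shardOf(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    // Spreads the keys of one shard over that shard's block of reducers.
    // Dividing by numShards first keeps the slot from simply mirroring
    // the shard choice made in shardOf.
    public static int slotInShard(String key, int numShards, int perShard) {
        return Math.floorMod(key.hashCode() / numShards, perShard);
    }

    // Plays the role of Partitioner.getPartition: returns a reducer index
    // in [0, numReducers).  Assumes numReducers is a multiple of
    // numShards, so shard s owns the contiguous block of
    // numReducers / numShards reducers starting at
    // s * (numReducers / numShards).
    public static int getPartition(String key, int numShards,
                                   int numReducers) {
        return shardOf(key, numShards) * (numReducers / numShards)
                + slotInShard(key, numShards, numReducers / numShards);
    }
}
```

With 2 shards and 6 reducers, every key on shard 0 goes to reducers 0-2
and every key on shard 1 to reducers 3-5 (counting from 0), so no shard is
written to by more than three reducers.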

Since a single Reducer instance is used to reduce() all tuples in a
partition, being able to control exactly which records make it onto a
partition is a hugely valuable tool.
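
The point about controlling which records share a partition can be
sketched the same way.  Assume, purely for illustration, composite keys of
the form naturalKey#suffix (for example user1#1370603836).  Partitioning
on the part before the '#' sends every record for one natural key to the
same reducer, which is what cross-key work such as deduping relies on.
The key format and the class name are assumptions; the hash expression is
the one Hadoop's default HashPartitioner uses.

```java
// Sketch only: plain Java, hypothetical key format "naturalKey#suffix".
final class NaturalKeyPartitioner {

    private NaturalKeyPartitioner() {}

    // Everything before the first '#'; the whole key if there is no '#'.
    public static String naturalKey(String compositeKey) {
        return compositeKey.split("#", 2)[0];
    }

    // Plays the role of Partitioner.getPartition: hash only the natural
    // key, so keys that differ only in their suffix share a partition.
    public static int getPartition(String compositeKey, int numReducers) {
        return (naturalKey(compositeKey).hashCode() & Integer.MAX_VALUE)
                % numReducers;
    }
}
```

A grouping comparator that likewise compares only the natural key would
then hand all of those records to a single reduce() call.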


On Fri, Jun 7, 2013 at 10:03 AM, John Lilley <john.lilley@redpoint.net>
wrote:

> There are kind of two parts to this.  The semantics of MapReduce promise
> that all tuples sharing the same key value are sent to the same reducer,
> so that you can write useful MR applications that do things like "count
> words" or "summarize by date".  In order to accomplish that, the shuffle
> phase of MR performs a partitioning by key to move tuples sharing the
> same key to the same node where they can be processed together.  You can
> think of key-partitioning as a strategy that assists in parallel
> distributed sorting.
>
> john
>
> From: Sai Sai [mailto:saigraph@yahoo.in]
> Sent: Friday, June 07, 2013 5:17 AM
> To: user@hadoop.apache.org
> Subject: Re: Why/When partitioner is used.
>
> I always get confused why we should partition and what is the use of it.
>
> Why would one want to send all the keys starting with A to Reducer1 and B
> to R2 and so on...
>
> Is it just to parallelize the reduce process.
>
> Please help.
>
> Thanks
>
> Sai
>

--20cf3071ca88b8dabd04de92815a--