From: java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: why hadoop does not provide a round robin partitioner
Date: Thu, 20 Sep 2012 15:01:39 -0400

Hi,

During my development of ETLs on the Hadoop platform, there is one question I want to ask: why doesn't Hadoop provide a round-robin partitioner?

From my experience, it is a very powerful option for the case of a small, limited set of distinct key values, and it helps balance the ETL resources. Here is what I want to say:

1) Sometimes you will have an ETL with a small number of keys, for example when the data is partitioned by date or by hour. So in every ETL load, I will have a very limited count of unique key values (maybe 10 if I load 10 days of data, or 24 if I load one day's data and use the hour as the key).
2) The HashPartitioner is good, given that it spreads the partition numbers more or less randomly, if you have a large number of distinct keys.
3) A lot of the time I have enough spare reducers, but because the hashCode() method happens to map several keys into the same partition, all the data for those keys goes to the same reducer process, which is not very efficient, as some spare reducers just happen to get nothing to do.
4) Of course I can implement my own partitioner to control this, but I would think it should not be too hard to implement a round-robin partitioner for the general case, which would distribute the different keys evenly across the available reducers (see the sketch below). Of course, as the count of distinct keys grows, the performance of this partitioner would decrease badly. But if we know the count of distinct keys is small enough, using this kind of partitioner would be a good option, right?
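Just to make it concrete, here is a rough sketch of the kind of partitioner I mean. The class name is just made up by me, and it assumes the map output key is the hour of day (0-23) as an IntWritable with a Text value, like the hourly load example in point 1:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical sketch: deals the 24 hour keys out in order across the reducers,
    // so with, say, 12 reducers every reducer gets exactly 2 hours, instead of
    // whatever hashCode() % 12 happens to give.
    public class HourRoundRobinPartitioner extends Partitioner<IntWritable, Text> {

        @Override
        public int getPartition(IntWritable hourKey, Text value, int numPartitions) {
            // The partition depends only on the key itself, so every map task sends
            // the same hour to the same reducer and the reduce-side grouping stays correct.
            return hourKey.get() % numPartitions;
        }
    }

It would be wired in with job.setPartitionerClass(HourRoundRobinPartitioner.class). For any other small key set that is known in advance, I think the same idea works, as long as every map task uses the same deterministic key-to-index mapping (for example, a fixed key list passed through the job configuration), so that one key never ends up going to two different reducers.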

Thanks

Yong