Subject: Re: why hadoop does not provide a round robin partitioner
From: Bertrand Dechoux <dechouxb@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 20 Sep 2012 22:21:43 +0200

I am not sure what you mean. I assume that by round robin you want the first key/value to go to the first reducer, the second to the second, and so on, modulo the number of reducers. I don't think you will have access to the rank of the values. You could keep state inside your partitioner, but I don't think you have any guarantee that the same instance of your partitioner will always be used. Anyway, if map1 emits key1 and key3 while map2 emits key1, key2 and key3, how would you ensure that all records for the same key are sent to the same reducer?
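To make that concrete, here is a minimal sketch of such a stateful partitioner against the org.apache.hadoop.mapreduce API (the class name and the Text/IntWritable types are just for illustration). It compiles, but the comments point out why it cannot be correct for keyed data:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical stateful "round robin" partitioner. Each map task builds
// its own instance, so 'next' is a per-task counter, not a global rank.
// If map1 and map2 both emit key1, their counters are independent and
// key1 can land on two different reducers, so the values for one key no
// longer all meet in a single reduce call. The result also depends on
// record order, so two runs of the same job can partition differently.
public class RoundRobinPartitioner extends Partitioner<Text, IntWritable> {

    private int next = 0;

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int partition = next % numPartitions;
        next++;
        return partition;
    }
}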
If I understand correctly, you are also saying that, given that you know your data, the provided hash function does not distribute it uniformly enough. The answer to that is to implement a better hash function. You could build it generically if you could provide the partitioner with statistics about its inputs, but that would not be within Hadoop's scope; you should look at Hive/Pig or something equivalent. For the small known set of keys you describe, though, a hand-written deterministic partitioner is only a few lines, as in the sketch below.
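A minimal sketch for the hour-of-day example from your mail, assuming the key is the hour written as text ("0" to "23") -- the class name and the parsing are illustrative, not an existing Hadoop class:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Deterministic partitioner for a small known key set (hours 0-23).
// The partition depends only on the key, so every record with the same
// hour reaches the same reducer, and with 24 reducers no two hours
// collide the way a generic hashCode() modulo numPartitions can.
public class HourPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int hour = Integer.parseInt(key.toString()); // assumes keys "0".."23"
        return hour % numPartitions;
    }
}

You would register it with job.setPartitionerClass(HourPartitioner.class) and run with 24 reducers, so each hour gets a reducer of its own while the key-to-reducer mapping stays deterministic.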

Regards

Bertrand

On Thu, Sep 20, 2012 at 9:01 PM, java8964 java8964 <java8964@hotmail.com> wrote:
> Hi,
>
> During my development of ETLs on the Hadoop platform, there is one
> question I want to ask: why doesn't Hadoop provide a round robin
> partitioner?
>
> From my experience, it is a very powerful option when there is a small,
> limited set of distinct key values, and it balances the ETL resources.
> Here is what I want to say:
>
> 1) Sometimes you will have an ETL with a small number of keys, for
> example data partitioned by date or by hour. So in every ETL load I will
> have a very limited count of unique key values (maybe 10 if I load 10
> days of data, or 24 if I load one day's data and use the hour as the key).
> 2) The HashPartitioner is good, given that it spreads the partition
> numbers out well when you have a large number of distinct keys.
> 3) A lot of the time I have enough spare reducers, but because the
> hashCode() method happens to map several keys to one partition, all the
> data for those keys goes to the same reducer process. That is not very
> efficient, as some spare reducers just happen to get nothing to do.
> 4) Of course I can implement my own partitioner to control this, but I
> figure it should not be too hard to implement a round robin partitioner
> for the general case, which would distribute the distinct keys equally
> across the available reducers. Of course, as the count of distinct keys
> grows, the performance of this partitioner degrades badly. But if we
> know the count of distinct keys is small enough, using this kind of
> partitioner would be a good option, right?
>
> Thanks
>
> Yong

--
Bertrand Dechoux