hadoop-mapreduce-user mailing list archives

From java8964 java8964 <java8...@hotmail.com>
Subject why hadoop does not provide a round robin partitioner
Date Thu, 20 Sep 2012 19:01:39 GMT

During my development of ETLs on the Hadoop platform, there is one question I want to ask: why
doesn't Hadoop provide a round robin partitioner?

From my experience, it is a very powerful option when the keys have only a small number of
distinct values, and it helps balance the ETL resources. Here is what I mean:

1) Sometimes you will have an ETL with a small number of keys, for example when the data is
partitioned by date or by hour. So in every ETL load I will have a very limited count of unique
key values (maybe 10 if I load 10 days of data, or 24 if I load one day of data and use the hour
as the key).

2) The HashPartitioner is good if you have a large number of distinct keys, since it derives
the partition number from the key's hashCode(), which spreads many keys roughly evenly.

3) A lot of times I have enough spare reducers, but because hashCode() happens to map several
keys to the same partition, all the data for those keys goes to the same reducer process. This
is not very efficient, as some spare reducers end up with nothing to do.

4) Of course I can implement my own partitioner to control this, but I think it should not be
too hard to implement a round robin partitioner for the general case, one that distributes the
distinct keys equally across the available reducers. Of course, as the count of distinct keys
grows, the performance of this partitioner will degrade badly. But if we know the count of
distinct keys is small enough, using this kind of partitioner would be a good option, right?
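
To make the idea concrete, here is a minimal sketch in plain Java, not a Hadoop class. It
assumes the small set of distinct keys is known up front (for example the 24 hour values), so
that every map task can build the same key-to-slot table. The class name
RoundRobinKeyPartitioner and the String keys are hypothetical, and hashPartition only mirrors
the HashPartitioner formula using String.hashCode() as a stand-in for the real key type. In an
actual job the same logic would sit inside a subclass of org.apache.hadoop.mapreduce.Partitioner.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: assigns a small, known set of keys to partitions in round robin order.
public class RoundRobinKeyPartitioner {
    private final Map<String, Integer> slotByKey = new HashMap<>();

    // Assumption: the distinct keys are known before the job starts.
    public RoundRobinKeyPartitioner(Iterable<String> knownKeys) {
        // Sort the keys so every task derives the same key -> slot table.
        TreeSet<String> sorted = new TreeSet<>();
        for (String k : knownKeys) {
            sorted.add(k);
        }
        int next = 0;
        for (String k : sorted) {
            slotByKey.put(k, next++);
        }
    }

    // Known keys get consecutive partitions; unknown keys fall back to hashing.
    public int getPartition(String key, int numPartitions) {
        Integer slot = slotByKey.get(key);
        if (slot == null) {
            return hashPartition(key, numPartitions);
        }
        return slot % numPartitions;
    }

    // Same formula as HashPartitioner, applied here to String.hashCode().
    public static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        List<String> hours = new ArrayList<>();
        for (int h = 0; h < 24; h++) {
            hours.add((h < 10 ? "0" : "") + h);
        }
        RoundRobinKeyPartitioner p = new RoundRobinKeyPartitioner(hours);
        int reducers = 24;
        // Print both assignments side by side to compare how the keys spread out.
        for (String hour : hours) {
            System.out.println(hour + " hash=" + hashPartition(hour, reducers)
                    + " roundRobin=" + p.getPartition(hour, reducers));
        }
    }
}
```

The table is fixed in advance rather than filled in as keys arrive, because each map task has
its own partitioner instance and all of them must send a given key to the same reducer. With at
least as many reducers as keys, each known key gets a reducer of its own.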