hadoop-mapreduce-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: why hadoop does not provide a round robin partitioner
Date Thu, 20 Sep 2012 20:21:43 GMT
I am not sure what you mean.

I assume that by round robin you mean the first key/value goes to the first
reducer, the second to the second, and so on, modulo the number of reducers. I don't
think you will have access to the rank of the values. You could keep
state in your partitioner, but I don't think you have any guarantee that
the same instance of your partitioner will always be used. In any case, if
map1 emits key1 and key3 while map2 emits key1, key2 and key3, how would
you ensure that every record for the same key is sent to the same reducer?
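For context, Hadoop's default HashPartitioner derives the partition purely from the key, which is what makes partitioning deterministic across mapper instances with no shared state. A minimal standalone sketch of that well-known formula (not the actual Hadoop class, just its logic, with illustrative date keys):

```java
// Standalone sketch of Hadoop's HashPartitioner logic:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
// The bitmask keeps the result non-negative even when hashCode() is negative.
public class HashPartitionDemo {
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With a handful of date-like keys, several keys may land on the
        // same partition purely by hash accident, leaving reducers idle.
        String[] keys = {"2012-09-18", "2012-09-19", "2012-09-20"};
        for (String k : keys) {
            System.out.println(k + " -> reducer " + getPartition(k, 10));
        }
    }
}
```

Because the mapping depends only on the key (no counters, no instance state), every mapper computes the same answer, which is exactly the property a stateful round-robin counter would lose.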

If I am understanding correctly, you are saying that, given you know your
data, the provided hash function does not distribute it uniformly enough.
The answer is to implement a better hash function. You could
build it generically if you can provide the partitioner with statistics about
its inputs, but that would be outside Hadoop's scope. You should look at
Hive/Pig or something equivalent.
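When the key set really is small and known up front (the date/hour case from the question), a deterministic explicit assignment gives the round-robin effect without any per-instance state. A hypothetical standalone sketch; in a real job this logic would live in a class extending org.apache.hadoop.mapreduce.Partitioner and overriding getPartition(KEY, VALUE, int), with the known keys passed via the job Configuration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: assign each known distinct key its own partition
// in order, instead of hashing. Deterministic, so every mapper instance
// agrees on the key -> reducer mapping.
public class ExplicitKeyPartitioner {
    private final Map<String, Integer> assignment = new HashMap<>();

    public ExplicitKeyPartitioner(String[] knownKeys) {
        for (int i = 0; i < knownKeys.length; i++) {
            // Round-robin over the agreed-upon key list (modulo applied later).
            assignment.put(knownKeys[i], i);
        }
    }

    public int getPartition(String key, int numReduceTasks) {
        Integer p = assignment.get(key);
        if (p != null) {
            return p % numReduceTasks;
        }
        // Unknown keys fall back to the standard hash-based scheme.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

The key point is that the "round robin" is over the statically known key list, not over the runtime arrival order of records, so it stays consistent across all partitioner instances.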



On Thu, Sep 20, 2012 at 9:01 PM, java8964 java8964 <java8964@hotmail.com> wrote:

>  Hi,
> During my development of ETLs on the Hadoop platform, there is one question I
> want to ask: why doesn't Hadoop provide a round-robin partitioner?
> From my experience, it is a very powerful option for the case of a small,
> limited set of distinct key values, and it balances the ETL resources. Here is what I want to say:
> 1) Sometimes you will have an ETL with a small number of keys, for example,
> data partitioned by date or by hour. So in every ETL load I
> will have a very limited count of unique key values (maybe 10 if I load 10
> days of data, or 24 if I load one day's data and use the hour as the key).
> 2) The HashPartitioner is good, given that it generates effectively random
> partition numbers, if you have a large number of distinct keys.
> 3) A lot of the time I have enough spare reducers, but because the
> hashCode() method happens to send several keys to the same partition, all
> the data for those keys will go to the same reducer process, which is not
> very efficient, as some spare reducers just happen to get
> nothing to do.
> 4) Of course I can implement my own partitioner to control this, but I
> wonder whether it would really be too hard to implement a round-robin
> partitioner for the general case, which would distribute the distinct
> keys evenly across the available reducers. Of course, as the distinct count of
> keys grows, the performance of this partitioner degrades badly. But if we
> know the count of distinct keys is small enough, using this kind of
> partitioner would be a good option, right?
> Thanks
> Yong

Bertrand Dechoux
