hadoop-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: why hadoop does not provide a round robin partitioner
Date Thu, 20 Sep 2012 22:19:46 GMT
The simplest solution for the situation as stated is to use an identity
hash function.  Of course, you can't split things any finer than the number
of keys with this approach.
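A minimal sketch of what that could look like with the new MapReduce API, assuming the keys are IntWritable (the class name and the modulo scheme are illustrative, not something Hadoop ships):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Uses the key's own value as its "hash", so consecutive keys land on
// consecutive reducers.  With N distinct keys you can never use more
// than N reducers effectively.
public class IdentityHashPartitioner<V> extends Partitioner<IntWritable, V> {
  @Override
  public int getPartition(IntWritable key, V value, int numPartitions) {
    return (key.get() & Integer.MAX_VALUE) % numPartitions;
  }
}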

If you can process different time periods independently, you may be able to
add a small number of bits to your key to get lots of bins which will then
be split relatively evenly.  If you can do this, however, you probably can
use a combiner and get better results.
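As an illustration of that key-salting idea (the names, the 4-bit salt, and the record layout are assumptions, not from the original question), a mapper could append a small random suffix to each time-period key so that one hot period spreads over several bins, with the per-bin results merged afterwards:

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltedPeriodMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final int SALT_BITS = 4;               // 2^4 = 16 bins per period
  private static final LongWritable ONE = new LongWritable(1);
  private final Random random = new Random();
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String period = line.toString().substring(0, 7);    // e.g. "2012-09" (assumed layout)
    int salt = random.nextInt(1 << SALT_BITS);
    outKey.set(period + "#" + salt);                     // salted key => more, smaller bins
    context.write(outKey, ONE);
  }
}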

On Thu, Sep 20, 2012 at 3:21 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:

> If I am understanding correctly, you are saying that, given what you know
> about your data, the provided hash function does not distribute it uniformly
> enough. The answer to that is to implement a better hash function. You could
> build it generically if you can provide the partitioner with stats about
> its inputs, but that would not be within Hadoop's scope. You should look at
> Hive/Pig or something equivalent.
>
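One way to read "provide the partitioner with stats about its inputs" (purely a sketch; the property name skew.hot.keys and the routing scheme are invented here) is a partitioner that takes a precomputed list of hot keys from the job Configuration, gives each of them its own reducer, and hashes the long tail over the remaining partitions:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StatsAwarePartitioner<V> extends Partitioner<Text, V> implements Configurable {
  private Configuration conf;
  private List<String> hotKeys;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    hotKeys = Arrays.asList(conf.getStrings("skew.hot.keys", new String[0]));
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, V value, int numPartitions) {
    int hot = Math.min(hotKeys.size(), numPartitions - 1);
    int idx = hotKeys.indexOf(key.toString());
    if (idx >= 0 && idx < hot) {
      return idx;                                        // dedicated reducer per hot key
    }
    // everything else (and any overflow hot keys) is hashed over the rest
    return hot + (key.toString().hashCode() & Integer.MAX_VALUE) % (numPartitions - hot);
  }
}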
