hadoop-hdfs-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryan Beaudreault <bbeaudrea...@hubspot.com>
Subject Re: Why/When partitioner is used.
Date Fri, 07 Jun 2013 15:58:29 GMT
There are practical applications for defining your own partitioner as well:

1) Controlling database concurrency.  For instance, lets say you have a
distributed datastore like HBase or even your own mysql sharding scheme.
 Using the default HashPartitioner, keys will get for the most part
randomly distributed across your reducers.  If your reduce code does
database saves or gets, this could cause periods where all reducers are
hitting a single database.  This may be more concurrency than your database
can handle, so you could use a partitioner to send all keys you know would
hit Shard A to reducers 1,2,3, and and all that would hit Shard B to
reducers 4,5,6.

2) I've also used partitioners when I want to do some cross-key operations
such as deduping, counting, or otherwise.  You can further combine the
custom partitioner with your own custom comparator and grouping comparator
to do many advanced operations based the application you are working on.

Since a single Reducer instance is used to reduce() all tuples in a
partition, being able to control exactly which records make it onto a
partition is a hugely valuable tool.

On Fri, Jun 7, 2013 at 10:03 AM, John Lilley <john.lilley@redpoint.net>wrote:

>  There are kind of two parts to this.  The semantics of MapReduce promise
> that all tuples sharing the same key value are sent to the same reducer, so
> that you can write useful MR applications that do things like “count words”
> or “summarize by date”.  In order to accomplish that, the shuffle phase of
> MR performs a partitioning by key to move tuples sharing the same key to
> the same node where they can be processed together.  You can think of
> key-partitioning as a strategy that assists in parallel distributed sorting.
> ****
> john****
> ** **
> *From:* Sai Sai [mailto:saigraph@yahoo.in]
> *Sent:* Friday, June 07, 2013 5:17 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Why/When partitioner is used.****
> ** **
> I always get confused why we should partition and what is the use of it.**
> **
> Why would one want to send all the keys starting with A to Reducer1 and B
> to R2 and so on...****
> Is it just to parallelize the reduce process.****
> Please help.****
> Thanks****
> Sai****

View raw message