Return-Path: X-Original-To: apmail-hadoop-common-user-archive@www.apache.org Delivered-To: apmail-hadoop-common-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 05235C92D for ; Fri, 7 Jun 2013 15:59:29 +0000 (UTC) Received: (qmail 8798 invoked by uid 500); 7 Jun 2013 15:59:22 -0000 Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org Received: (qmail 8637 invoked by uid 500); 7 Jun 2013 15:59:21 -0000 Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hadoop.apache.org Delivered-To: mailing list user@hadoop.apache.org Received: (qmail 8624 invoked by uid 99); 7 Jun 2013 15:59:21 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 07 Jun 2013 15:59:21 +0000 X-ASF-Spam-Status: No, hits=-0.1 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_MED,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of bbeaudreault@hubspot.com designates 74.125.149.205 as permitted sender) Received: from [74.125.149.205] (HELO na3sys009aog111.obsmtp.com) (74.125.149.205) by apache.org (qpsmtpd/0.29) with SMTP; Fri, 07 Jun 2013 15:59:14 +0000 Received: from mail-ve0-f176.google.com ([209.85.128.176]) (using TLSv1) by na3sys009aob111.postini.com ([74.125.148.12]) with SMTP ID DSNKUbIDOwdAzc53Gp+l+TH48qcWTAP274zR@postini.com; Fri, 07 Jun 2013 08:58:53 PDT Received: by mail-ve0-f176.google.com with SMTP id c13so3174609vea.35 for ; Fri, 07 Jun 2013 08:58:49 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:x-gm-message-state; bh=bWJCY7XuljS/ys1KF4NxdtWnc8HnWRMwhILnEAPtJ6k=; b=WZ4wc/YA6Fd6CmQfhVtqIqw4k1saqwJd9aF/tTPEEdKIJgPUKGm4hapW2sATmucbTa wCEt/L++Qabtv+3Z5wI6F0rVxizQWjR26R8GkGAwIgucy6ckeGtFvCXw7owZ7tZvq9+q pi+KX8zMZ8lqAqPjPI5k+S4pmbWdiLCy79tQqva22TF21L53Q/nejg57WpaQ0ejwli1D hRpoFEj9DxAsAjcZG1o9jcaMsvd1rOf9ybqdeyJ8y01pDEcCv2mPCOpwDP7WC1UtzluT swox3XjxUAelW10mhWM/WeLaji/dy1LfEGKeTSLJdjahqn5uDhKbofw/lyHalVT61gyF LKvg== X-Received: by 10.52.65.38 with SMTP id u6mr20227181vds.59.1370620729913; Fri, 07 Jun 2013 08:58:49 -0700 (PDT) X-Received: by 10.52.65.38 with SMTP id u6mr20227178vds.59.1370620729790; Fri, 07 Jun 2013 08:58:49 -0700 (PDT) MIME-Version: 1.0 Received: by 10.220.93.65 with HTTP; Fri, 7 Jun 2013 08:58:29 -0700 (PDT) In-Reply-To: <869970D71E26D7498BDAC4E1CA92226B658BE365@MBX021-E3-NJ-2.exch021.domain.local> References: <1369191774.59557.YahooMailNeo@web190702.mail.sg3.yahoo.com> <1370603836.37168.YahooMailNeo@web190704.mail.sg3.yahoo.com> <869970D71E26D7498BDAC4E1CA92226B658BE365@MBX021-E3-NJ-2.exch021.domain.local> From: Bryan Beaudreault Date: Fri, 7 Jun 2013 11:58:29 -0400 Message-ID: Subject: Re: Why/When partitioner is used. To: "hbase-user@hadoop.apache.org" Cc: Sai Sai Content-Type: multipart/alternative; boundary=20cf3071ca88b8dabd04de92815a X-Gm-Message-State: ALoCoQkX8rT1HMxjmV8N+NIhKNo+U4zvBIqYR9hC4XgLXiD+qt06RAt3Vr38QWmLTbOXxLCWdaeq1d6YH8jthoVztZ6jiitxvqLmTjzSYAig56X7XMBX4Ua2hbSONStONVtChNF9VpzDNXdhGSEDcSdz4q0JXAKwyj3KN+LSetUx1rVRMdFN7kU= X-Virus-Checked: Checked by ClamAV on apache.org --20cf3071ca88b8dabd04de92815a Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable There are practical applications for defining your own partitioner as well: 1) Controlling database concurrency. For instance, lets say you have a distributed datastore like HBase or even your own mysql sharding scheme. Using the default HashPartitioner, keys will get for the most part randomly distributed across your reducers. If your reduce code does database saves or gets, this could cause periods where all reducers are hitting a single database. This may be more concurrency than your database can handle, so you could use a partitioner to send all keys you know would hit Shard A to reducers 1,2,3, and and all that would hit Shard B to reducers 4,5,6. 2) I've also used partitioners when I want to do some cross-key operations such as deduping, counting, or otherwise. You can further combine the custom partitioner with your own custom comparator and grouping comparator to do many advanced operations based the application you are working on. Since a single Reducer instance is used to reduce() all tuples in a partition, being able to control exactly which records make it onto a partition is a hugely valuable tool. On Fri, Jun 7, 2013 at 10:03 AM, John Lilley wrot= e: > There are kind of two parts to this. The semantics of MapReduce promise > that all tuples sharing the same key value are sent to the same reducer, = so > that you can write useful MR applications that do things like =93count wo= rds=94 > or =93summarize by date=94. In order to accomplish that, the shuffle pha= se of > MR performs a partitioning by key to move tuples sharing the same key to > the same node where they can be processed together. You can think of > key-partitioning as a strategy that assists in parallel distributed sorti= ng. > **** > > john**** > > ** ** > > *From:* Sai Sai [mailto:saigraph@yahoo.in] > *Sent:* Friday, June 07, 2013 5:17 AM > *To:* user@hadoop.apache.org > *Subject:* Re: Why/When partitioner is used.**** > > ** ** > > I always get confused why we should partition and what is the use of it.*= * > ** > > Why would one want to send all the keys starting with A to Reducer1 and B > to R2 and so on...**** > > Is it just to parallelize the reduce process.**** > > Please help.**** > > Thanks**** > > Sai**** > --20cf3071ca88b8dabd04de92815a Content-Type: text/html; charset=windows-1252 Content-Transfer-Encoding: quoted-printable
There are practical applications for defining your own par= titioner as well:

1) Controlling database concurre= ncy. =A0For instance, lets say you have a distributed datastore like HBase = or even your own mysql sharding scheme. =A0Using the default HashPartitione= r, keys will get for the most part randomly distributed across your reducer= s. =A0If your reduce code does database saves or gets, this could cause per= iods where all reducers are hitting a single database. =A0This may be more = concurrency than your database can handle, so you could use a partitioner t= o send all keys you know would hit Shard A to reducers 1,2,3, and and all t= hat would hit Shard B to reducers 4,5,6.

2) I've also used partitioners when I w= ant to do some cross-key operations such as deduping, counting, or otherwis= e. =A0You can further combine the custom partitioner with your own custom c= omparator and grouping comparator to do many advanced operations based the = application you are working on.

Since a single Reducer instance is used to = reduce() all tuples in a partition, being able to control exactly which rec= ords make it onto a partition is a hugely valuable tool.


On Fri, Jun 7, 2013 at 10:03 AM, John Li= lley <john.lilley@redpoint.net> wrote:

There are kind of two par= ts to this.=A0 The semantics of MapReduce promise that all tuples sharing t= he same key value are sent to the same reducer, so that you can write useful MR applications that do things like =93count words=94 or = =93summarize by date=94.=A0 In order to accomplish that, the shuffle phase = of MR performs a partitioning by key to move tuples sharing the same key to= the same node where they can be processed together.=A0 You can think of key-partitioning as a strategy that assists = in parallel distributed sorting.

john=

=A0<= /p>

From: Sai Sai = [mailto:saigraph@yah= oo.in]
Sent: Friday, June 07, 2013 5:17 AM
To: user= @hadoop.apache.org
Subject: Re: Why/When partitioner is used.

=A0

I always get = confused why we should partition and what is the use of it.

Why would one= want to send all the keys starting with A to Reducer1 and B to R2 and so o= n...

Is it just to= parallelize the reduce process.

Please help.<= u>

Thanks=

Sai=


--20cf3071ca88b8dabd04de92815a--