From: java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: why hadoop does not provide a round robin partitioner
Date: Thu, 20 Sep 2012 15:01:39 -0400

Hi,

During my development of ETLs on the Hadoop platform, there is one question I want to ask: why doesn't Hadoop provide a round-robin partitioner?

From my experience, it is a very powerful option for the case of a small, limited set of distinct key values, and it helps balance the ETL resources. Here is what I want to say:

1) Sometimes you will have an ETL with a small number of keys, for example when the data is partitioned by date or by hour. So in every ETL load, I will have a very limited count of unique key values (maybe 10 if I load 10 days of data, or 24 if I load one day's data and use the hour as the key).
2) The HashPartitioner is good, given that it spreads the partition numbers more or less randomly, if you have a large number of distinct keys.
3) A lot of the time I have enough spare reducers, but because the hashCode() method happens to map several keys into the same partition, all the data for those keys goes to the same reducer process, which is not very efficient, as some spare reducers just happen to get nothing to do.
4) Of course I can implement my own partitioner to control this, but I would think it should not be too hard to implement a round-robin partitioner for the general case, which would distribute the different keys evenly across the available reducers (see the sketch below). Of course, as the count of distinct keys grows, the performance of this partitioner would decrease badly. But if we know the count of distinct keys is small enough, using this kind of partitioner would be a good option, right?
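Just to make it concrete, here is a rough sketch of the kind of partitioner I mean. The class name is just made up by me, and it assumes the map output key is the hour of day (0-23) as an IntWritable with a Text value, like the hourly load example in point 1:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical sketch: deals the 24 hour keys out in order across the reducers,
    // so with, say, 12 reducers every reducer gets exactly 2 hours, instead of
    // whatever hashCode() % 12 happens to give.
    public class HourRoundRobinPartitioner extends Partitioner<IntWritable, Text> {

        @Override
        public int getPartition(IntWritable hourKey, Text value, int numPartitions) {
            // The partition depends only on the key itself, so every map task sends
            // the same hour to the same reducer and the reduce-side grouping stays correct.
            return hourKey.get() % numPartitions;
        }
    }

It would be wired in with job.setPartitionerClass(HourRoundRobinPartitioner.class). For any other small key set that is known in advance, I think the same idea works, as long as every map task uses the same deterministic key-to-index mapping (for example, a fixed key list passed through the job configuration), so that one key never ends up going to two different reducers.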

Thanks

Yong