Subject: Re: why hadoop does not provide a round robin partitioner
From: Bertrand Dechoux <dechouxb@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 20 Sep 2012 22:21:43 +0200

I am not sure what you mean. I assume that by round robin you want the first key/value to go to the first reducer, the second to the second, and so on, modulo the number of reducers. I don't think you will have access to the rank of the values. You could keep state inside your partitioner, but I don't think you have any guarantee that the same instance of your partitioner will always be used. Anyway, if map1 emits key1 and key3 while map2 emits key1, key2 and key3, how would you ensure that all records for the same key are sent to the same reducer?
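To make that concrete, here is a minimal sketch of such a stateful partitioner against the org.apache.hadoop.mapreduce API (the class name and the Text/IntWritable types are just for illustration). It compiles, but the comments point out why it cannot be correct for keyed data:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical stateful "round robin" partitioner. Each map task builds
// its own instance, so 'next' is a per-task counter, not a global rank.
// If map1 and map2 both emit key1, their counters are independent and
// key1 can land on two different reducers, so the values for one key no
// longer all meet in a single reduce call. The result also depends on
// record order, so two runs of the same job can partition differently.
public class RoundRobinPartitioner extends Partitioner<Text, IntWritable> {

    private int next = 0;

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int partition = next % numPartitions;
        next++;
        return partition;
    }
}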
If I understand correctly, you are also saying that, given that you know your data, the provided hash function does not distribute it uniformly enough. The answer to that is to implement a better hash function. You could build it generically if you could provide the partitioner with statistics about its inputs, but that would not be within Hadoop's scope; you should look at Hive/Pig or something equivalent. For the small known set of keys you describe, though, a hand-written deterministic partitioner is only a few lines, as in the sketch below.
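A minimal sketch for the hour-of-day example from your mail, assuming the key is the hour written as text ("0" to "23") -- the class name and the parsing are illustrative, not an existing Hadoop class:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Deterministic partitioner for a small known key set (hours 0-23).
// The partition depends only on the key, so every record with the same
// hour reaches the same reducer, and with 24 reducers no two hours
// collide the way a generic hashCode() modulo numPartitions can.
public class HourPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int hour = Integer.parseInt(key.toString()); // assumes keys "0".."23"
        return hour % numPartitions;
    }
}

You would register it with job.setPartitionerClass(HourPartitioner.class) and run with 24 reducers, so each hour gets a reducer of its own while the key-to-reducer mapping stays deterministic.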

Regards

Bertrand

On Thu, Sep 20, 2012 at 9:01 PM, java8964 java8964 <java8964@hotmail.com> wrote:
> Hi,
>
> During my development of ETLs on the Hadoop platform, there is one
> question I want to ask: why doesn't Hadoop provide a round robin
> partitioner?
>
> From my experience, it is a very powerful option when there is a small,
> limited set of distinct key values, and it balances the ETL resources.
> Here is what I want to say:
>
> 1) Sometimes you will have an ETL with a small number of keys, for
> example data partitioned by date or by hour. So in every ETL load I will
> have a very limited count of unique key values (maybe 10 if I load 10
> days of data, or 24 if I load one day's data and use the hour as the key).
> 2) The HashPartitioner is good, given that it spreads the partition
> numbers out well when you have a large number of distinct keys.
> 3) A lot of the time I have enough spare reducers, but because the
> hashCode() method happens to map several keys to one partition, all the
> data for those keys goes to the same reducer process. That is not very
> efficient, as some spare reducers just happen to get nothing to do.
> 4) Of course I can implement my own partitioner to control this, but I
> figure it should not be too hard to implement a round robin partitioner
> for the general case, which would distribute the distinct keys equally
> across the available reducers. Of course, as the count of distinct keys
> grows, the performance of this partitioner degrades badly. But if we
> know the count of distinct keys is small enough, using this kind of
> partitioner would be a good option, right?
>
> Thanks
>
> Yong

--
Bertrand Dechoux