From: "Runping Qi (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Wed, 18 Apr 2007 11:00:37 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-485) allow a different comparator for grouping keys in calls to reduce
Message-ID: <23082769.1176919237691.JavaMail.jira@brutus>
In-Reply-To: <18308270.1156578742395.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489837 ]

Runping Qi commented on HADOOP-485:
-----------------------------------

You want to control the partitioning as well. For example, let's say we have the following map output:

PK1, SK11, V1
PK1, SK12, V2
PK2, SK21, V3
PK2, SK22, V4

where PKi are primary keys and SKij are secondary keys. What you want is for all tuples with the same primary key to go to the same reducer. Within a reducer, the tuples are sorted by (primary key, secondary key). When they are passed to reduce calls, you want the primary key to be the key of the reduce call, and all values with the same primary key to be passed to that call, sorted by their secondary keys.

> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values to the reduce be sorted in a particular order, but extending the key with the additional fields causes them to be handled by different calls to reduce. (The user then collects the values until they detect a "real" key change and then processes them.)
> It would be much easier if the framework let you define a second comparator that did the grouping of values for reduces. So your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of the key is just for sorting because the reduce will only see the first representative of each equivalence class.
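For readers following the thread, the behaviour requested above (partition on the primary key, sort on the full key, group reduce calls on the primary key) can be sketched in plain Java. This is only a simulation under stated assumptions: it does not use any Hadoop API, and the names SecondarySortSketch, Tuple, partition, SORT, GROUP and reduceCalls are hypothetical, chosen for illustration. It also passes the bare primary key to each reduce call, as the comment asks, rather than the first full key of the group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of partition / sort / group; not Hadoop code.
final class SecondarySortSketch {

    /** One map-output tuple: primary key, secondary key, value. */
    static final class Tuple {
        final String pk, sk, value;
        Tuple(String pk, String sk, String value) {
            this.pk = pk; this.sk = sk; this.value = value;
        }
    }

    /** Partitioning looks at the primary key only, so equal PKs share a reducer. */
    static int partition(Tuple t, int numReducers) {
        return (t.pk.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sort comparator: orders by primary key, then by secondary key. */
    static final Comparator<Tuple> SORT = (a, b) -> {
        int c = a.pk.compareTo(b.pk);
        return c != 0 ? c : a.sk.compareTo(b.sk);
    };

    /** Grouping comparator: tuples with equal primary keys form one reduce call. */
    static final Comparator<Tuple> GROUP = (a, b) -> a.pk.compareTo(b.pk);

    /** Returns a textual description of each reduce call, reducer by reducer. */
    static List<String> reduceCalls(List<Tuple> mapOutput, int numReducers) {
        List<List<Tuple>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) partitions.add(new ArrayList<>());
        for (Tuple t : mapOutput) partitions.get(partition(t, numReducers)).add(t);

        List<String> calls = new ArrayList<>();
        for (List<Tuple> part : partitions) {
            part.sort(SORT);
            int i = 0;
            while (i < part.size()) {
                Tuple first = part.get(i);
                List<String> values = new ArrayList<>();
                // Consume the run of tuples the grouping comparator treats as equal.
                while (i < part.size() && GROUP.compare(first, part.get(i)) == 0) {
                    values.add(part.get(i).value);
                    i++;
                }
                calls.add("reduce(" + first.pk + ", " + values + ")");
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Tuple> mapOutput = Arrays.asList(
            new Tuple("PK2", "SK22", "V4"),
            new Tuple("PK1", "SK12", "V2"),
            new Tuple("PK2", "SK21", "V3"),
            new Tuple("PK1", "SK11", "V1"));
        // With a single reducer this prints:
        //   reduce(PK1, [V1, V2])
        //   reduce(PK2, [V3, V4])
        for (String call : reduceCalls(mapOutput, 1)) System.out.println(call);
    }
}
```

The point of keeping three separate functions is that each answers a different question: which reducer a tuple goes to, in what order a reducer sees its tuples, and where one reduce call ends and the next begins. With more than one reducer the grouping is the same, but the order in which the calls are listed depends on the hash of each primary key.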