Date: Fri, 3 Aug 2012 07:35:02 +0000 (UTC)
From: "Rahul Sharma (JIRA)"
To: crunch-dev@incubator.apache.org
Message-ID: <490840429.9203.1343979302364.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1853483155.85795.1342870774642.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Updated] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

     [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Sharma updated CRUNCH-23:
-------------------------------

    Attachment: 0001-CRUNCH-23-fix-sorting.patch

Josh, I have implemented the same solution, but I feel that the data distribution is not perfect there. The solution is based on reservoir sampling.
So the keys used in the Partitioner are selected from a subset of the data, and how evenly the data is distributed depends solely on how representative that sample is.

> PCollection#sort doesn't do a full sort on values
> -------------------------------------------------
>
>                 Key: CRUNCH-23
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-23
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>            Assignee: Rahul Sharma
>         Attachments: 0001-CRUNCH-23-fix-sorting.patch, CRUNCH-23-used-TotalOrderpartioner-for-sorting-keys.patch, SortTest.java
>
>
> When a PCollection is sorted (using PCollection#sort), the sorting that is performed is only per reducer, and not an absolute sort over all values. This means that the values are not in sorted order if they are iterated over on a materialized collection. It also means that the sorted files that are output from a sort operation can not be simply concatenated to come to a single sorted file.
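To make the sampling concern in the comment above concrete, here is a minimal standalone sketch in plain Java. It is not the code from the attached patch and does not use the Crunch or Hadoop APIs; the class name SampledSplitPoints and all method names are made up for illustration. It takes a reservoir sample of the keys, derives split points from the sorted sample, and assigns each key to a partition by range. The partitions are always correctly ordered relative to each other, but how evenly they fill depends on how well the sample reflects the real key distribution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not taken from the CRUNCH-23 patch.
public class SampledSplitPoints {

    /** Reservoir sampling (Algorithm R): a uniform sample of up to k keys from one pass. */
    public static List<Integer> reservoirSample(Iterable<Integer> keys, int k, Random rng) {
        List<Integer> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (Integer key : keys) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(key);
            } else {
                // Pick a slot in [0, seen); the new key is kept with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, key);
                }
            }
        }
        return reservoir;
    }

    /** Picks numPartitions - 1 evenly spaced split points from the sorted sample. */
    public static List<Integer> splitPoints(List<Integer> sample, int numPartitions) {
        if (sample.isEmpty()) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<Integer> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            splits.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return splits;
    }

    /** Partition index = number of split points that are <= key (splits must be sorted). */
    public static int partitionFor(int key, List<Integer> splits) {
        int p = 0;
        while (p < splits.size() && key >= splits.get(p)) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add(rng.nextInt(1_000_000));
        }
        int numPartitions = 4;
        List<Integer> splits = splitPoints(reservoirSample(keys, 100, rng), numPartitions);
        int[] counts = new int[numPartitions];
        for (int key : keys) {
            counts[partitionFor(key, splits)]++;
        }
        // Partition sizes are only roughly equal; a larger sample usually evens them out.
        System.out.println("splits=" + splits + " counts=" + Arrays.toString(counts));
    }
}
```

Because partitionFor never decreases as the key grows, every key in partition i is less than or equal to every key in partition i + 1, so the per-partition sorted outputs can be concatenated in order. The sample size (100 here) is an arbitrary choice for the sketch; a skewed or too-small sample would leave the ordering intact but the partitions unbalanced.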