Date: Wed, 11 Nov 2009 08:28:39 +0000 (UTC)
From: "Daniel Dai (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: pig-dev@hadoop.apache.org
Subject: [jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
Message-ID: <1392912799.1257928119751.JavaMail.jira@brutus>
In-Reply-To: <176669728.1256162219484.JavaMail.jira@brutus>
Mailing-List: contact pig-dev-help@hadoop.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

     [ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1038:
----------------------------

    Status: Patch Available  (was: Open)

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch
>
>
> If nested foreach plan contains sort/distinct, it is possible to use hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct D" to a special version of distinct, which does not do the sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
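
Editor's illustration of the idea described in the issue, as a minimal Python sketch rather than Pig's or Hadoop's actual implementation. It assumes each record is a (key, value) pair standing in for ($0, $1), and it simulates the shuffle with an in-memory sort on the composite (key, value). All function names here are hypothetical. The point is that once values reach the reducer already ordered within each group, the nested "order" in Eg1 needs no extra work, and the nested "distinct" in Eg2 reduces to a streaming comparison with the previous value.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Simulate a shuffle that sorts on (group key, secondary key).

    Hypothetical stand-in for Hadoop's secondary sort: records are
    (key, value) tuples, and each group's values are yielded in sorted order.
    """
    ordered = sorted(records, key=itemgetter(0, 1))
    for key, rows in groupby(ordered, key=itemgetter(0)):
        # The inner generator must be consumed before advancing to the next group.
        yield key, (value for _, value in rows)


def nested_order(records):
    """Eg1: 'order A by $1' is dropped because values already arrive sorted."""
    return [(key, list(values))
            for key, values in shuffle_with_secondary_sort(records)]


def distinct_presorted(values):
    """Distinct over sorted input: emit a value only if it differs from the last one."""
    previous = object()  # sentinel that compares unequal to every real value
    for value in values:
        if value != previous:
            yield value
            previous = value


def nested_distinct(records):
    """Eg2: 'D = A.$1; E = distinct D' without a per-group sort or hash set."""
    return [(key, list(distinct_presorted(values)))
            for key, values in shuffle_with_secondary_sort(records)]


# Made-up sample data for illustration only.
sample = [("a", 3), ("b", 1), ("a", 1), ("a", 3), ("b", 2)]

if __name__ == "__main__":
    print(nested_order(sample))
    print(nested_distinct(sample))
```

The trade-off sketched here is that the sort cost moves into the shuffle, which sorts anyway, so the reducer can process each group in a single pass with constant extra memory instead of building a sorted or de-duplicated bag per group.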