From: "Pi Song (JIRA)"
Date: Wed, 18 Jun 2008 05:20:45 -0700 (PDT)
To: pig-dev@incubator.apache.org
Subject: [jira] Commented: (PIG-273) Need to optimize the ways splits are handled, both in the top level plan and in nested plans.
[ https://issues.apache.org/jira/browse/PIG-273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605923#action_12605923 ]

Pi Song commented on PIG-273:
-----------------------------

I assume this is the next step after the pipeline rework. We can keep inner plans as DAGs so that we still maintain all the knowledge needed for optimization. (One might say we could store inner plans separately and merge them, but I think that is more difficult.) Ideas on how to optimize will come soon!

> Need to optimize the ways splits are handled, both in the top level plan and in nested plans.
> ---------------------------------------------------------------------------------------------
>
>                 Key: PIG-273
>                 URL: https://issues.apache.org/jira/browse/PIG-273
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>
> Currently, in the new pipeline rework (see PIG-157), splits in the data flow are not handled efficiently.
>
> In the top-level plan, a split causes all of the output data to be written to HDFS and then reread by each leg of the split. This forces both a read/write and a new map/reduce pass when it is not always necessary. For example, consider:
>
> A = load 'myfile';
> split A into B if $0 < 100, C if $0 >= 100;
> B1 = group B by $0;
> ...
> C1 = group C by $1;
> ...
>
> In this case A will be loaded and then immediately stored again. Then one plan will be executed that handles the B* part of the script, and another that handles the C* part of the script.
>
> In nested plans, each projection of the generate is computed separately, even if the projections share common steps in the plan.
> For example:
>
> B = group A by $0;
> C = foreach B {
>     C1 = distinct $1;
>     C2 = filter C1 by $1 > 0;
>     generate group, COUNT(C1), COUNT(C2);
> }
>
> This will currently be executed with two nested plans, distinct->COUNT(C1) and distinct->filter->COUNT(C2). The same distinct will be computed twice. Ideally we would like to compute the distinct once and then split its output.
>
> I suspect that optimizing the inner plan is more important, because there are more situations where this occurs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
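Treating the inner plan as a DAG makes the sharing concrete: if each operator node is evaluated at most once and its output is reused by every consumer, the duplicated distinct disappears. Below is a minimal, illustrative Python sketch of that idea; the node names and structure are assumptions for illustration only, not Pig internals.

```python
# Sketch of shared-node evaluation in a DAG-shaped inner plan.
# Node/operator names here are hypothetical, not Pig's actual classes.

class Node:
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def evaluate(self, cache, counter):
        # Memoize: each node's fn runs at most once per evaluation pass,
        # so a node with two consumers is still computed only once.
        if self not in cache:
            counter[self] = counter.get(self, 0) + 1
            args = [inp.evaluate(cache, counter) for inp in self.inputs]
            cache[self] = self.fn(*args)
        return cache[self]

# Rough analogue of: C1 = distinct $1; C2 = filter C1 by $1 > 0;
#                    generate COUNT(C1), COUNT(C2);
bag = Node(lambda: [1, 1, 2, 0, -3, 2])            # one grouped bag
distinct = Node(lambda b: sorted(set(b)), bag)      # the shared step
filtered = Node(lambda b: [x for x in b if x > 0], distinct)
count_c1 = Node(len, distinct)
count_c2 = Node(len, filtered)

cache, counter = {}, {}
results = (count_c1.evaluate(cache, counter),
           count_c2.evaluate(cache, counter))
print(results)            # (4, 2): distinct bag is [-3, 0, 1, 2]
print(counter[distinct])  # 1: the shared distinct ran only once
```

Without the cache, `count_c1` and `count_c2` would each pull the distinct independently, which mirrors the two-nested-plans behavior described in the issue.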