From: "Ashish Thusoo (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Wed, 10 Sep 2008 09:16:44 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized
Message-ID: <1080987620.1221063404558.JavaMail.jira@brutus>
In-Reply-To: <167188638.1221004064193.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629845#action_12629845 ]

Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

I should be done reviewing this in a couple of hours. A few minor comments, though:

1. In the tests, we should drop the destination tables that we create. At some point we want to ensure that the cleanup code for a test is isolated within the test. (This is minor; I am OK with it as is for now.)
2. The check to disallow different distincts: can that be moved up, potentially even before we generate the groupbyPlan? There is no point going through all of the processing if we can disallow it right up front.
3. A comment describing the algorithm somewhere would also be great.

> [Hive] multi group by statement is not optimized
> ------------------------------------------------
>
>                 Key: HADOOP-4139
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4139
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> A simple multi-group by statement is not optimized. A simple statement like:
> FROM SRC
> INSERT OVERWRITE TABLE DEST1 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key
> INSERT OVERWRITE TABLE DEST2 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key;
> results in making 2 copies of the data (SRC). Instead, the data can be first partially aggregated on the distinct value and then aggregated.
> The first step can be common to all group bys.
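
For readers who want the idea in the issue description spelled out, here is a rough in-memory sketch in Python of the two steps it describes: partially aggregate once on the distinct value, then finish the aggregation for each destination. It also runs the "different distincts" check before any data is touched, as suggested in comment 2. This is only an illustration and not the code in patch1; the function names (check_same_distinct, partial_aggregate, count_distinct_by_key, multi_group_by) and the representation of a query as a (destination, expression text) pair are all hypothetical.

```python
from collections import defaultdict


def check_same_distinct(queries):
    """Reject the statement up front if the group-bys use different distinct expressions.

    `queries` is a list of (destination, distinct_expression_text) pairs;
    this representation is an assumption made for the sketch.
    """
    expressions = {expr for _, expr in queries}
    if len(expressions) > 1:
        raise ValueError("different distinct expressions are not supported: %s"
                         % sorted(expressions))


def partial_aggregate(rows, distinct_fn):
    """Step 1 (shared by every group-by): keep one (key, distinct value) pair each."""
    return {(key, distinct_fn(value)) for key, value in rows}


def count_distinct_by_key(pairs):
    """Step 2 (per destination): count the distinct values remaining for each key."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)


def multi_group_by(rows, queries, distinct_fn):
    """Run the validity check first, read the source once, then aggregate per destination."""
    check_same_distinct(queries)                  # fail before any processing
    pairs = partial_aggregate(rows, distinct_fn)  # single pass over SRC
    return {dest: count_distinct_by_key(pairs) for dest, _ in queries}


if __name__ == "__main__":
    src = [("1", "val_a"), ("1", "val_a"), ("1", "val_b"), ("2", "val_c")]
    queries = [("DEST1", "SUBSTR(SRC.value,4)"), ("DEST2", "SUBSTR(SRC.value,4)")]
    # SUBSTR(value, 4) is 1-based in HiveQL, hence value[3:] in Python.
    print(multi_group_by(src, queries, lambda value: value[3:]))
```

In this sketch the source rows are read exactly once no matter how many INSERT clauses the statement has. The second step only sees the already-deduplicated pairs, which is the part that can be repeated per destination.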