Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <1080987620.1221063404558.JavaMail.jira@brutus>
Date: Wed, 10 Sep 2008 09:16:44 -0700 (PDT)
From: "Ashish Thusoo (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4139) [Hive] multi group by statement is
 not optimized
In-Reply-To: <167188638.1221004064193.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629845#action_12629845 ] 

Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

I should be done reviewing this in a couple of hours...

A few minor comments though:

1. In the tests we should drop the destination tables that the tests create. At some point we want to ensure that the cleanup code for a test is isolated within that test. (This is minor - I am ok with it as is for now.)
2. The check to disallow different distincts - can that be moved up, potentially even before we generate the groupbyPlan? There is no point going through the entire processing if we can disallow it right up front.
3. Also, a comment describing the algorithm somewhere would be great.
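
To illustrate point 2, here is a rough sketch of what "disallow it up front" could look like. This is not the Hive code: SemanticError, check_common_distinct, compile_multi_group_by and the dict layout of a clause are all hypothetical names made up for the example. The only idea it shows is that the validation runs before any plan is generated.

```python
# Hypothetical sketch: none of these names come from the Hive code base.
class SemanticError(Exception):
    """Raised when a query is rejected before any plan is generated."""


def check_common_distinct(clauses):
    """Return the single distinct expression shared by all clauses (or None).

    Raises SemanticError if the clauses use different distinct expressions.
    Each clause is assumed to be a dict with an optional "distinct" entry.
    """
    distincts = {c["distinct"] for c in clauses if c.get("distinct") is not None}
    if len(distincts) > 1:
        raise SemanticError(
            "multi group by with different distincts: " + ", ".join(sorted(distincts))
        )
    return next(iter(distincts), None)


def compile_multi_group_by(clauses, generate_plan):
    """Validate first, then hand the clauses to the (assumed) plan generator."""
    check_common_distinct(clauses)  # fail fast: no plan work on a bad query
    return generate_plan(clauses)
```

With this ordering, a query that mixes distinct expressions is rejected without ever invoking the plan generator.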


> [Hive] multi group by statement is not optimized
> ------------------------------------------------
>
>                 Key: HADOOP-4139
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4139
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> A simple multi-group by statement is not optimized. A simple statement like:
> FROM SRC
> INSERT OVERWRITE TABLE DEST1 SELECT SRC.key, count(distinct  SUBSTR(SRC.value,4)) GROUP BY SRC.key
> INSERT OVERWRITE TABLE DEST2 SELECT SRC.key, count(distinct  SUBSTR(SRC.value,4)) GROUP BY SRC.key;
> results in two copies of the data (SRC) being made. Instead, the data can first be partially aggregated on the distinct value and then fully aggregated.
> The first step can be common to all the group bys.
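
One possible reading of the two-step idea above, sketched in Python rather than as a Hive plan. The function names and the sample rows are assumptions for illustration only, and the attached patch may do this differently. Step 1 collapses SRC once to unique (key, distinct value) pairs; step 2 counts pairs per key, and both destinations reuse the step 1 output instead of each reading SRC.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Step 1 (shared): reduce SRC to unique (key, distinct value) pairs.

    SUBSTR(value, 4) is 1-based in the query, so it maps to value[3:] here.
    """
    return {(key, value[3:]) for key, value in rows}


def count_distinct_per_key(pairs):
    """Step 2 (per group by): each unique pair contributes one distinct value."""
    counts = defaultdict(int)
    for key, _distinct_value in pairs:
        counts[key] += 1
    return dict(counts)


# Made-up sample data standing in for SRC(key, value).
src = [("a", "val_1"), ("a", "val_1"), ("a", "val_2"), ("b", "val_1")]

pairs = partial_aggregate(src)          # SRC is scanned once
dest1 = count_distinct_per_key(pairs)   # first INSERT ... GROUP BY SRC.key
dest2 = count_distinct_per_key(pairs)   # second INSERT reuses the same pairs
```

The point of the sketch is only that the deduplication on the distinct value happens once and is shared by every group by that follows.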
