hive-dev mailing list archives

From "Namit Jain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3276) optimize union sub-queries
Date Mon, 06 Aug 2012 06:50:05 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428996#comment-13428996 ]

Namit Jain commented on HIVE-3276:
----------------------------------

Looked at it in more detail.

It might be cleaner to add an optimization step for this that changes the operator tree.
The current approach is a simpler way to get the code out quickly, but it may not be a good
idea in the long run.
We have tried to keep all the optimizations pluggable; this helps with roll-outs, fixing bugs
slowly, etc.
With the current approach, it is very difficult to make this pluggable. Again, it is possible
to check the new conf in GenMRUnion1, but that looks like a hacky approach.
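
A minimal sketch of what "pluggable" could mean here, written in Python rather than Hive's Java. Every name in it (`union_remove`, `STEPS`, `optimize`, the `example.optimize.union` key, the operator labels) is hypothetical and not taken from the Hive code base. The idea it illustrates: each optimization is a separate step that rewrites the operator tree and runs only when its own conf flag is set, so it can be rolled out or switched off without touching task generation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an operator tree: operator names in pipeline order.
OperatorTree = List[str]
Step = Callable[[OperatorTree], OperatorTree]


def union_remove(tree: OperatorTree) -> OperatorTree:
    """Hypothetical rewrite: drop a UNION that feeds SELECT * and a file sink."""
    tail = ["UNION", "SELECT_STAR", "FILESINK"]
    if tree[-3:] == tail:
        # Each sub-query writes its own output instead of going through the union.
        return tree[:-3] + ["FILESINK_PER_SUBQUERY"]
    return tree


# Each step is registered next to the (made-up) conf key that enables it.
STEPS: List[Tuple[str, Step]] = [
    ("example.optimize.union", union_remove),
]


def optimize(tree: OperatorTree, conf: Dict[str, bool]) -> OperatorTree:
    """Run only the steps whose flag is set; disabled steps leave the tree alone."""
    for key, step in STEPS:
        if conf.get(key, False):
            tree = step(tree)
    return tree


plan = ["SUBQ1", "SUBQ2", "UNION", "SELECT_STAR", "FILESINK"]
enabled = optimize(plan, {"example.optimize.union": True})
disabled = optimize(plan, {})
```

With the flag off, `optimize` returns the tree unchanged, which is the property that makes a gradual roll-out possible.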
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in
> at least one of the sub-queries.
> For example, a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html
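
The difference between the two plans in the description can be sketched as a small Python model. This is an illustration only: `plan_tasks`, `count_mr_jobs`, and the task labels are hypothetical, and the model makes the simplifying assumption that every sub-query needs exactly one map-reduce job.

```python
from typing import List


def plan_tasks(subqueries: List[str], use_move: bool) -> List[str]:
    """Return task labels for `insert ... select * from (q1 union all q2 ...) u`.

    Simplifying assumption: every sub-query needs exactly one map-reduce job.
    """
    tasks = [f"MR({q})" for q in subqueries]
    if use_move:
        # Proposed plan: no union job; sub-query outputs are moved to the final directory.
        tasks.append("MOVE(" + ", ".join(subqueries) + ")")
    else:
        # Current plan: one extra map-reduce job that performs the union.
        tasks.append("MR(union)")
    return tasks


def count_mr_jobs(tasks: List[str]) -> int:
    """Count only the map-reduce jobs; a move task is not one."""
    return sum(1 for t in tasks if t.startswith("MR("))


current = plan_tasks(["subq1", "subq2"], use_move=False)
proposed = plan_tasks(["subq1", "subq2"], use_move=True)
```

Under that assumption the current plan has three map-reduce jobs and the proposed plan has two plus a move, and passing a longer list of sub-queries covers the case of more than two union branches.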


        
