hive-dev mailing list archives

From "Edward Capriolo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3027) The optimizer architecture of Hive is terrible, need code refactoring
Date Sat, 26 May 2012 00:49:23 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283856#comment-13283856 ]

Edward Capriolo commented on HIVE-3027:
---------------------------------------

Patches welcome. I am sure that if you refactor the code and make it better, no one will be averse to it.
                
> The optimizer architecture of Hive is terrible, need code refactoring
> ---------------------------------------------------------------------
>
>                 Key: HIVE-3027
>                 URL: https://issues.apache.org/jira/browse/HIVE-3027
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.4.0, 0.4.1, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.8.0, 0.8.1
>            Reporter: anders
>              Labels: architecture, optimizer, ysmart
>
> Now I want to add complete cost-based optimization to Hive, but when I began the work I
> found it very difficult to do with the current Hive optimization framework. In the current
> code, all optimizations are done after the DAG of operators has been generated. It is an
> awful design and it makes me mad. For example, the map-side optimization scans the whole
> operator DAG, tries to find the operators that can be replaced by a map-side operation, and
> then replaces them (a simplified sketch of this pattern follows the quoted description).
> How terrible and stupid the code is!!! That terrible code runs to about 1000 lines, and it
> only implements the map-side optimization!!!
> In my opinion, optimization shouldn't be done in a separate step; each optimization should
> be done at the appropriate time. For example, join reordering should be done when we parse
> the input query, and we can generate Map-Reduce operators or only a Map operator for each
> join according to the cost estimate. In the same process we can merge joins and
> aggregations, and we should push predicates down at the proper time and generate the proper
> data structures, to ensure that the cost-estimation module can fetch the corresponding
> predicate of each base table when estimating JOIN cost (a second sketch below illustrates
> this). How concise and graceful the code would be if we did the optimization this way!!!
> But now, in order to comply with the optimizer framework of Hive, I have to write lots of
> ugly code with amazing redundancy, and the code is very, very difficult to debug!!!! There
> is now a patch for a cost-based JOIN reorder and merge optimizer called YSMART; I glanced
> at it. It uses 6000+ lines of code and is difficult to read!! And its optimization is
> incomplete.
> The optimizer architecture of Hive is terrible. What can I do now?
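
For readers unfamiliar with the design being criticized, here is a minimal Java sketch of
that post-hoc pattern: the plan is first materialized as an operator DAG, and a separate
optimizer pass then walks the finished DAG looking for joins it can splice out and replace
with map-side joins. Operator, JoinOperator, MapJoinOperator and the smallSideFitsInMemory
flag are simplified names invented for this sketch, not Hive's actual optimizer classes.

    // Simplified illustration of the post-hoc rewrite pattern: the plan already
    // exists as an operator DAG, and an optimizer pass walks the whole DAG
    // afterwards, pattern-matching for joins it can turn into map-side joins.
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    abstract class Operator {
        final List<Operator> children = new ArrayList<>();
    }

    class JoinOperator extends Operator {
        boolean smallSideFitsInMemory; // assumed to be filled in from table statistics
    }

    class MapJoinOperator extends Operator {
        MapJoinOperator(JoinOperator source) {
            children.addAll(source.children); // keep the subtree, swap the operator
        }
    }

    class MapJoinRewriter {
        /** Scan the finished DAG and splice in map-side joins where they apply. */
        void transform(Operator root) {
            Deque<Operator> stack = new ArrayDeque<>();
            Set<Operator> visited = new HashSet<>();
            stack.push(root);
            while (!stack.isEmpty()) {
                Operator op = stack.pop();
                if (!visited.add(op)) {
                    continue; // this node's children were already rewritten
                }
                for (int i = 0; i < op.children.size(); i++) {
                    Operator child = op.children.get(i);
                    if (child instanceof JoinOperator
                            && ((JoinOperator) child).smallSideFitsInMemory) {
                        // Pattern found: replace the join node in place.
                        op.children.set(i, new MapJoinOperator((JoinOperator) child));
                    }
                    stack.push(op.children.get(i));
                }
            }
        }
    }

The point is only the shape: each rule rediscovers its context from the finished plan
instead of receiving it while the plan is being built.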
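
And a second sketch, equally simplified and hypothetical, of the direction the report argues
for: keep each base table's pushed-down predicate (or at least its selectivity) attached to
the table scan, so a cost model can consult it while join order and join kind are being
decided, rather than mining it out of a finished DAG. TableScan, the selectivity field, and
the MAP_JOIN_ROW_LIMIT threshold are assumptions made for the example.

    // Hypothetical sketch of cost-aware join planning where the pushed-down
    // predicate's selectivity is stored with each base-table scan.
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class TableScan {
        final String table;
        final long rowCount;               // from table statistics
        final double predicateSelectivity; // fraction of rows surviving the pushed-down filter

        TableScan(String table, long rowCount, double predicateSelectivity) {
            this.table = table;
            this.rowCount = rowCount;
            this.predicateSelectivity = predicateSelectivity;
        }

        long filteredRows() {
            return Math.round(rowCount * predicateSelectivity);
        }
    }

    class JoinPlanner {
        private static final long MAP_JOIN_ROW_LIMIT = 1_000_000L; // assumed memory budget

        /** Join the smallest filtered inputs first, using per-table predicate selectivity. */
        List<TableScan> reorder(List<TableScan> scans) {
            List<TableScan> ordered = new ArrayList<>(scans);
            ordered.sort(Comparator.comparingLong(TableScan::filteredRows));
            return ordered;
        }

        /** Choose a map-only join when the filtered build side fits in memory. */
        String chooseJoinKind(TableScan buildSide) {
            return buildSide.filteredRows() <= MAP_JOIN_ROW_LIMIT
                    ? "map-join" : "shuffle-join";
        }
    }

Because the selectivity rides along with the scan, the planner never has to walk a completed
operator DAG to recover which predicate belongs to which table.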

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
