hive-issues mailing list archives

From "Yongzhi Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-12189) The list in pushdownPreds of ppd.ExprWalkerInfo should not be allowed to grow very large
Date Mon, 19 Oct 2015 13:15:05 GMT

    [ https://issues.apache.org/jira/browse/HIVE-12189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963281#comment-14963281 ]

Yongzhi Chen commented on HIVE-12189:
-------------------------------------

I ran some comparison tests:
1. Backporting HIVE-11652 to Hive 1.1 does not make the compile much faster; it still
took 102 seconds. So HIVE-11652 alone cannot make egw.startWalking as fast as in Hive 2.0. I
need to find more JIRAs to backport.
2. Applying only the patch for this JIRA (HIVE-12189), the compile time for the query drops to
6.2 seconds on Hive 1.1.
3. For Hive 2.0, this patch drops the compile time from 6.6 seconds to 2.3 seconds.


> The list in pushdownPreds of ppd.ExprWalkerInfo should not be allowed to grow very large
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-12189
>                 URL: https://issues.apache.org/jira/browse/HIVE-12189
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 1.1.0, 2.0.0
>            Reporter: Yongzhi Chen
>            Assignee: Yongzhi Chen
>         Attachments: HIVE-12189.1.patch
>
>
> Some queries are very slow in compile time, for example following query
> {noformat}
> select * from tt1 nf
> join tt2 a1 on (nf.col1 = a1.col1 and nf.hdp_databaseid = a1.hdp_databaseid)
> join tt3 a2 on (a2.col2 = a1.col2 and a2.col3 = nf.col3 and a2.hdp_databaseid = nf.hdp_databaseid)
> join tt4 a3 on (a3.col4 = a2.col4 and a3.col3 = a2.col3)
> join tt5 a4 on (a4.col4 = a2.col4 and a4.col5 = a2.col5 and a4.col3 = a2.col3 and a4.hdp_databaseid = nf.hdp_databaseid)
> join tt6 a5 on (a5.col3 = a2.col3 and a5.col2 = a2.col2 and a5.hdp_databaseid = nf.hdp_databaseid)
> JOIN tt7 a6 ON (a2.col3 = a6.col3 and a2.col2 = a6.col2 and a6.hdp_databaseid = nf.hdp_databaseid)
> JOIN tt8 a7 ON (a2.col3 = a7.col3 and a2.col2 = a7.col2 and a7.hdp_databaseid = nf.hdp_databaseid)
> where nf.hdp_databaseid = 102 limit 10;
> {noformat}
> takes around 120 seconds to compile in Hive 1.1 when
> hive.mapred.mode=strict;
> hive.optimize.ppd=true;
> and Hive is not in test mode.
> All the above tables are partitioned by one column, and all of them are empty. If the
tables are not empty, the compile is reportedly so slow that Hive looks like it is hanging.
> In Hive 2.0 the compile is much faster: explain takes 6.6 seconds. But that is still a
lot of time. One of the problems that slows PPD down is that the list in pushdownPreds can grow
very large, which gives extractPushdownPreds bad performance:
> {noformat}
> public static ExprWalkerInfo extractPushdownPreds(OpWalkerInfo opContext,
>     Operator<? extends OperatorDesc> op, List<ExprNodeDesc> preds)
> {noformat}
> While running the query above, at the following breakpoint preds has a size of 12051, and
most entries of the list are duplicates: GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), GenericUDFOPEqual(Column[hdp_databaseid],
Const int 102), GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), ....
> The following code in extractPushdownPreds clones all the nodes in preds and walks them.
Hive 2.0 is faster because HIVE-11652 (and other JIRAs) makes startWalking much faster, but
we still clone thousands of nodes with the same expression. Should we store so many identical
predicates in the list, or is just one enough?
> {noformat}
>     List<Node> startNodes = new ArrayList<Node>();
>     List<ExprNodeDesc> clonedPreds = new ArrayList<ExprNodeDesc>();
>     for (ExprNodeDesc node : preds) {
>       ExprNodeDesc clone = node.clone();
>       clonedPreds.add(clone);
>       exprContext.getNewToOldExprMap().put(clone, node);
>     }
>     startNodes.addAll(clonedPreds);
>     egw.startWalking(startNodes, null);
> {noformat}
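To illustrate the idea, one could deduplicate preds by expression string before the clone loop, so the walker sees each distinct expression only once. This is a minimal, Hive-free sketch in plain Java, not the actual patch: the Expr class, its copy() method, and dedupAndClone are hypothetical stand-ins for ExprNodeDesc, ExprNodeDesc.clone(), and the loop in extractPushdownPreds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ExprNodeDesc: only the two methods the sketch needs.
class Expr {
    private final String exprString;
    Expr(String s) { this.exprString = s; }
    String getExprString() { return exprString; } // mirrors ExprNodeDesc.getExprString()
    Expr copy() { return new Expr(exprString); }  // mirrors ExprNodeDesc.clone()
}

public class DedupSketch {
    // Deduplicate predicates by their string form, then clone each distinct
    // one once, preserving the new-to-old mapping as extractPushdownPreds does.
    static List<Expr> dedupAndClone(List<Expr> preds, Map<Expr, Expr> newToOld) {
        Map<String, Expr> seen = new LinkedHashMap<>();
        for (Expr p : preds) {
            seen.putIfAbsent(p.getExprString(), p);
        }
        List<Expr> cloned = new ArrayList<>();
        for (Expr p : seen.values()) {
            Expr c = p.copy();
            cloned.add(c);
            newToOld.put(c, p);
        }
        return cloned;
    }

    public static void main(String[] args) {
        List<Expr> preds = new ArrayList<>();
        // Simulate the reported case: thousands of copies of the same predicate.
        for (int i = 0; i < 12051; i++) {
            preds.add(new Expr("(hdp_databaseid = 102)"));
        }
        preds.add(new Expr("(col1 = col2)"));
        Map<Expr, Expr> newToOld = new LinkedHashMap<>();
        List<Expr> cloned = dedupAndClone(preds, newToOld);
        System.out.println(cloned.size()); // 2 clones instead of 12052
    }
}
```

With this shape, startNodes would hold 2 nodes instead of 12052, so the walk touches each distinct expression once.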
> Should we change java/org/apache/hadoop/hive/ql/ppd/ExprWalkerInfo.java
> methods
> public void addFinalCandidate(String alias, ExprNodeDesc expr)
> and
> public void addPushDowns(String alias, List<ExprNodeDesc> pushDowns)
> to only add expressions that are not already in the pushdown list for the alias?
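The duplicate check suggested above can be sketched in plain Java. This is not the actual Hive patch: the class below and its String predicates are hypothetical stand-ins for ExprWalkerInfo and ExprNodeDesc, but the shape of addFinalCandidate/addPushDowns follows the methods quoted in the description, with an extra per-alias set for an O(1) "already added?" check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PushdownPredsSketch {
    // alias -> pushdown predicates (stand-in for the
    // Map<String, List<ExprNodeDesc>> pushdownPreds in ExprWalkerInfo)
    private final Map<String, List<String>> pushdownPreds = new HashMap<>();
    // alias -> expression strings already added, so duplicates are skipped
    private final Map<String, Set<String>> seen = new HashMap<>();

    public void addFinalCandidate(String alias, String expr) {
        // Set.add returns false when expr was already present for this alias.
        if (seen.computeIfAbsent(alias, a -> new HashSet<>()).add(expr)) {
            pushdownPreds.computeIfAbsent(alias, a -> new ArrayList<>()).add(expr);
        }
    }

    public void addPushDowns(String alias, List<String> pushDowns) {
        for (String e : pushDowns) {
            addFinalCandidate(alias, e);
        }
    }

    public List<String> get(String alias) {
        return pushdownPreds.getOrDefault(alias, Collections.emptyList());
    }

    public static void main(String[] args) {
        PushdownPredsSketch info = new PushdownPredsSketch();
        // The same predicate added many times is stored only once.
        for (int i = 0; i < 1000; i++) {
            info.addFinalCandidate("nf", "(hdp_databaseid = 102)");
        }
        info.addPushDowns("nf", Arrays.asList("(hdp_databaseid = 102)", "(col1 = col2)"));
        System.out.println(info.get("nf").size()); // 2, not 1002
    }
}
```

In the real class the key would have to come from ExprNodeDesc (e.g. its getExprString() form), since ExprNodeDesc instances are cloned and reference equality would not catch duplicates.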



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
