hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times
Date Tue, 16 Jun 2020 15:11:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=446575&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-446575
]

ASF GitHub Bot logged work on HIVE-23493:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Jun/20 15:10
            Start Date: 16/Jun/20 15:10
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #1124:
URL: https://github.com/apache/hive/pull/1124#discussion_r440927072



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveCardinalityPreservingJoinRule.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptCost;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.metadata.JaninoRelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveDefaultTezModelRelMetadataProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Rule to trigger {@link HiveCardinalityPreservingJoinOptimization} on top of the plan.
+ */
+public class HiveCardinalityPreservingJoinRule extends HiveFieldTrimmerRule {
+  private static final Logger LOG = LoggerFactory.getLogger(HiveCardinalityPreservingJoinRule.class);
+
+  private final double factor;
+
+  public HiveCardinalityPreservingJoinRule(double factor) {
+    super(false, "HiveCardinalityPreservingJoinRule");
+    this.factor = Math.max(factor, 0.0);
+  }
+
+  @Override
+  protected RelNode trim(RelOptRuleCall call, RelNode node) {
+    RelNode optimized = new HiveCardinalityPreservingJoinOptimization().trim(call.builder(),
node);
+    if (optimized == node) {
+      return node;
+    }
+
+    JaninoRelMetadataProvider original = RelMetadataQuery.THREAD_PROVIDERS.get();
+    RelMetadataQuery.THREAD_PROVIDERS.set(getJaninoRelMetadataProvider());
+    RelMetadataQuery metadataQuery = RelMetadataQuery.instance();
+
+    RelOptCost optimizedCost = metadataQuery.getCumulativeCost(optimized);
+    RelOptCost originalCost = metadataQuery.getCumulativeCost(node);
+    originalCost = originalCost.multiplyBy(factor);
+    LOG.debug("Original plan cost {} Optimized plan cost {}", originalCost, optimizedCost);

Review comment:
       nit. `Original plan cost: {} vs Optimized plan cost: {}` ? Or something like that so
we can read it more clearly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 446575)
    Time Spent: 1h  (was: 50m)

> Rewrite plan to join back tables with many projected columns joined multiple times
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-23493
>                 URL: https://issues.apache.org/jira/browse/HIVE-23493
>             Project: Hive
>          Issue Type: New Feature
>          Components: CBO
>            Reporter: Krisztian Kasa
>            Assignee: Krisztian Kasa
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-23493.1.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a CTE. Many
columns are projected out those tables and then grouped in the CTE.  The main query joins
multiple instances of the CTE and may project a subset of these.
> The optimization is to rewrite the CTE to include only key (PK, non null Unique Key)
columns and join the tables back to the resultset of the main query to fetch the rest of the
wide columns. This reduces the datasize of the joined back tables that is broadcast/shuffled
throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>        ,c_first_name customer_first_name
>        ,c_last_name customer_last_name
>        ,c_preferred_cust_flag customer_preferred_cust_flag
>        ,c_birth_country customer_birth_country
>        ,c_login customer_login
>        ,c_email_address customer_email_address
>        ,d_year dyear
>        ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
year_total
>        ,'s' sale_type
>  from customer
>      ,store_sales
>      ,date_dim
>  where c_customer_sk = ss_customer_sk
>    and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>          ,c_first_name
>          ,c_last_name
>          ,c_preferred_cust_flag
>          ,c_birth_country
>          ,c_login
>          ,c_email_address
>          ,d_year
>  union all
>  select c_customer_id customer_id
>        ,c_first_name customer_first_name
>        ,c_last_name customer_last_name
>        ,c_preferred_cust_flag customer_preferred_cust_flag
>        ,c_birth_country customer_birth_country
>        ,c_login customer_login
>        ,c_email_address customer_email_address
>        ,d_year dyear
>        ,sum((((cs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
) year_total
>        ,'c' sale_type
>  from customer
>      ,catalog_sales
>      ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>    and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>          ,c_first_name
>          ,c_last_name
>          ,c_preferred_cust_flag
>          ,c_birth_country
>          ,c_login
>          ,c_email_address
>          ,d_year
> union all
>  select c_customer_id customer_id
>        ,c_first_name customer_first_name
>        ,c_last_name customer_last_name
>        ,c_preferred_cust_flag customer_preferred_cust_flag
>        ,c_birth_country customer_birth_country
>        ,c_login customer_login
>        ,c_email_address customer_email_address
>        ,d_year dyear
>        ,sum((((ws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
) year_total
>        ,'w' sale_type
>  from customer
>      ,web_sales
>      ,date_dim
>  where c_customer_sk = ws_bill_customer_sk
>    and ws_sold_date_sk = d_date_sk
>  group by c_customer_id
>          ,c_first_name
>          ,c_last_name
>          ,c_preferred_cust_flag
>          ,c_birth_country
>          ,c_login
>          ,c_email_address
>          ,d_year
>          )
>   select  
>                   t_s_secyear.customer_id
>                  ,t_s_secyear.customer_first_name
>                  ,t_s_secyear.customer_last_name
>                  ,t_s_secyear.customer_birth_country
>  from year_total t_s_firstyear
>      ,year_total t_s_secyear
>      ,year_total t_c_firstyear
>      ,year_total t_c_secyear
>      ,year_total t_w_firstyear
>      ,year_total t_w_secyear
>  where t_s_secyear.customer_id = t_s_firstyear.customer_id
>    and t_s_firstyear.customer_id = t_c_secyear.customer_id
>    and t_s_firstyear.customer_id = t_c_firstyear.customer_id
>    and t_s_firstyear.customer_id = t_w_firstyear.customer_id
>    and t_s_firstyear.customer_id = t_w_secyear.customer_id
>    and t_s_firstyear.sale_type = 's'
>    and t_c_firstyear.sale_type = 'c'
>    and t_w_firstyear.sale_type = 'w'
>    and t_s_secyear.sale_type = 's'
>    and t_c_secyear.sale_type = 'c'
>    and t_w_secyear.sale_type = 'w'
>    and t_s_firstyear.dyear =  1999
>    and t_s_secyear.dyear = 1999+1
>    and t_c_firstyear.dyear =  1999
>    and t_c_secyear.dyear =  1999+1
>    and t_w_firstyear.dyear = 1999
>    and t_w_secyear.dyear = 1999+1
>    and t_s_firstyear.year_total > 0
>    and t_c_firstyear.year_total > 0
>    and t_w_firstyear.year_total > 0
>    and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / t_c_firstyear.year_total
else null end
>            > case when t_s_firstyear.year_total > 0 then t_s_secyear.year_total
/ t_s_firstyear.year_total else null end
>    and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / t_c_firstyear.year_total
else null end
>            > case when t_w_firstyear.year_total > 0 then t_w_secyear.year_total
/ t_w_firstyear.year_total else null end
>  order by t_s_secyear.customer_id
>          ,t_s_secyear.customer_first_name
>          ,t_s_secyear.customer_last_name
>          ,t_s_secyear.customer_birth_country
> limit 100;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message