hive-issues mailing list archives

From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]
Date Wed, 31 May 2017 02:39:04 GMT

     [ https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-11297:
------------------------------------
    Attachment: HIVE-11297.1.patch

[~csun]: I have updated the patch. In my environment, the [case "multiple sources, single key"|https://issues.apache.org/jira/browse/HIVE-16780]
in spark_dynamic_partition_pruning.q fails, so I could not generate a new spark_dynamic_partition_pruning.q.out.
Instead, I extracted the "multiple columns, single source" test case into a new qfile, "spark_dynamic_partition_pruning_combine.q".
I also added a configuration item, "hive.spark.dynamic.partition.pruning.combine"; when this
item is not enabled, the op trees for partition info are not combined.
{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
set hive.spark.dynamic.partition.pruning.combine=true;


-- SORT_QUERY_RESULTS
create table srcpart_date_hour as select ds as ds, ds as `date`, hr as hr, hr as hour from
srcpart group by ds, hr;
-- multiple columns single source
EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds
and srcpart.hr = srcpart_date_hour.hr) where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour
= 11;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds
and srcpart.hr = srcpart_date_hour.hr) where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour
= 11;
set hive.spark.dynamic.partition.pruning.combine=false;
EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds
and srcpart.hr = srcpart_date_hour.hr) where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour
= 11;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds
and srcpart.hr = srcpart_date_hour.hr) where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour
= 11;
{code}

I think we can work in parallel: you can review this patch while I continue to fix HIVE-16780.
Once HIVE-16780 is fixed in my environment, I can update spark_dynamic_partition_pruning.q.out
with the changes from HIVE-11297.

> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-11297
>                 URL: https://issues.apache.org/jira/browse/HIVE-11297
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: spark-branch
>            Reporter: Chao Sun
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-11297.1.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates partition
info for more than one partition column, multiple operator trees are created. They all start
from the same table scan op but have different Spark partition pruning sinks.
> As an optimization, we can combine these op trees so that the table scan does not have to
run multiple times.
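
The optimization described above can be pictured with a small Python sketch. It is only an
illustration under stated assumptions, not the code in the patch: `OpTree` and
`combine_op_trees` are hypothetical names, an operator tree is reduced to the name of its
table scan plus the partition columns its pruning sinks target, and trees are merged
whenever they start from the same scan.

```python
from dataclasses import dataclass, field


@dataclass
class OpTree:
    """Hypothetical stand-in for an operator tree: one table scan feeding pruning sinks."""
    table_scan: str                                     # name of the scanned (small) table
    pruning_sinks: list = field(default_factory=list)   # partition columns targeted by sinks


def combine_op_trees(trees):
    """Merge trees that share a table scan so each table is scanned only once.

    The input trees are left unmodified; new OpTree objects are returned in the
    order in which each table scan was first seen.
    """
    combined = {}
    for tree in trees:
        # Create the merged tree on first sight of this scan, then append the sinks.
        merged = combined.setdefault(tree.table_scan, OpTree(tree.table_scan))
        merged.pruning_sinks.extend(tree.pruning_sinks)
    return list(combined.values())


# Two trees that both start from the same scan, one per partition column.
before = [
    OpTree("srcpart_date_hour", ["ds"]),
    OpTree("srcpart_date_hour", ["hr"]),
]
after = combine_op_trees(before)
```

With the two trees from the multiple-columns, single-source query, one for ds and one for hr,
the sketch yields a single tree that scans srcpart_date_hour once and feeds both pruning sinks.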



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
