hive-dev mailing list archives

From "Damien Carol (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-7826) Dynamic partition pruning on Tez
Date Mon, 01 Sep 2014 08:52:21 GMT

     [ https://issues.apache.org/jira/browse/HIVE-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Damien Carol updated HIVE-7826:
-------------------------------
    Description: 
It's natural in a star schema to map one or more dimensions to partition columns. Time or
location are likely candidates. 

It can also be useful to compute the partitions one would like to scan via a subquery (where
p in select ... from ...).

The resulting joins in Hive require a full table scan of the large table, though, because partition
pruning takes place before the corresponding values are known.

On Tez it's relatively straightforward to send the values needed to prune to the application
master, where splits are generated and tasks are submitted. Using these values we can strip
out any unneeded partitions dynamically, while the query is running.
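
As a rough illustration of the effect (not Hive or Tez code), the sketch below assumes a hypothetical fact table partitioned by date and a list of join keys produced at run time by the other side of the join. The table layout, the file names and the `prune_partitions` helper are all made up for the example.

```python
from typing import Dict, Iterable, List, Set


def prune_partitions(
    partitions: Dict[str, List[str]], keys: Iterable[str]
) -> Dict[str, List[str]]:
    """Keep only the partitions whose partition value appears in `keys`."""
    wanted: Set[str] = set(keys)  # duplicates in the key stream collapse here
    return {value: files for value, files in partitions.items() if value in wanted}


# Hypothetical fact table: partition value -> files in that partition.
fact_partitions = {
    "2014-08-30": ["part-0", "part-1"],
    "2014-08-31": ["part-0"],
    "2014-09-01": ["part-0", "part-1", "part-2"],
}

# Hypothetical key values that only become known while the query is running.
dimension_keys = ["2014-09-01", "2014-09-01", "2014-08-31"]

pruned = prune_partitions(fact_partitions, dimension_keys)

# Only files from the surviving partitions would need splits.
splits = [
    f"{value}/{name}"
    for value, files in sorted(pruned.items())
    for name in files
]
```

Here the `2014-08-30` partition is never scanned, because no key from the other side of the join refers to it.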

The approach is straightforward:

- Insert synthetic conditions for each join representing "x in (keys of other side in join)"
- These conditions will be pushed as far down as possible
- If the condition hits a table scan and the column involved is a partition column:
   - Set up an operator to send key events to the AM
- else:
   - Remove synthetic predicate
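
The decision in the last two bullets could be sketched as follows. This is a simplified model under stated assumptions, not the code in the attached patches: `TableScan`, `SyntheticPredicate` and `resolve` are hypothetical names, and the predicate is assumed to have already been pushed down to the table scan.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TableScan:
    """Hypothetical stand-in for a table scan in the operator tree."""
    table: str
    partition_columns: Set[str]
    pruning_events: List[str] = field(default_factory=list)


@dataclass
class SyntheticPredicate:
    """Models 'column in (keys of other side in join)'."""
    column: str
    source: str  # label for where the keys come from


def resolve(pred: SyntheticPredicate, scan: TableScan) -> bool:
    """Decide what happens once a synthetic predicate reaches a table scan.

    Returns True if the predicate is kept as a source of key events,
    False if it is removed.
    """
    if pred.column in scan.partition_columns:
        # Partition column: record that key events should be sent for it.
        scan.pruning_events.append(f"{pred.column} in ({pred.source})")
        return True
    # Not a partition column: the synthetic predicate is simply dropped.
    return False


scan = TableScan(table="sales", partition_columns={"ds"})
kept = resolve(SyntheticPredicate("ds", "keys of date_dim"), scan)
dropped = resolve(SyntheticPredicate("item_id", "keys of item_dim"), scan)
```

In this model only the predicate on the partition column `ds` survives; the one on `item_id` leaves the plan unchanged.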

Add these properties:
||Property||Default Value||Comment||
|{{hive.tez.dynamic.partition.pruning}}|true| |
|{{hive.tez.dynamic.partition.pruning.max.event.size}}|1*1024*1024L| |

  was:
It's natural in a star schema to map one or more dimensions to partition columns. Time or
location are likely candidates. 

It can also be useful to compute the partitions one would like to scan via a subquery (where
p in select ... from ...).

The resulting joins in Hive require a full table scan of the large table, though, because partition
pruning takes place before the corresponding values are known.

On Tez it's relatively straightforward to send the values needed to prune to the application
master, where splits are generated and tasks are submitted. Using these values we can strip
out any unneeded partitions dynamically, while the query is running.

The approach is straightforward:

- Insert synthetic conditions for each join representing "x in (keys of other side in join)"
- These conditions will be pushed as far down as possible
- If the condition hits a table scan and the column involved is a partition column:
   - Set up an operator to send key events to the AM
- else:
   - Remove synthetic predicate




> Dynamic partition pruning on Tez
> --------------------------------
>
>                 Key: HIVE-7826
>                 URL: https://issues.apache.org/jira/browse/HIVE-7826
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gunther Hagleitner
>            Assignee: Gunther Hagleitner
>              Labels: TODOC14, tez
>         Attachments: HIVE-7826.1.patch, HIVE-7826.2.patch, HIVE-7826.3.patch, HIVE-7826.4.patch,
>                      HIVE-7826.5.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)