hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maciek Kocon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-9523) For partitioned tables same optimizations should be available as for bucketed tables and vice versa: ①[Sort Merge] PARTITION Map join and ②BUCKET pruning
Date Wed, 18 Mar 2015 16:59:38 GMT

     [ https://issues.apache.org/jira/browse/HIVE-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Maciek Kocon updated HIVE-9523:
-------------------------------
          Description: 
Logically and functionally bucketing and partitioning are quite similar - both provide mechanism
to segregate and separate the table's data based on its content. Thanks to that significant
further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
The difference seems to be imposed by design where the PARTITIONing is open/explicit while
BUCKETing is discrete/implicit.
Partitioning seems to be very common if not a standard feature in all current RDBMS while
BUCKETING seems to be HIVE specific only.
In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".

Regardless of the fact that these two are recognised as two separate features available in
Hive there should be nothing to prevent leveraging same existing query/join optimisations
across the two.


①[Sort Merge] PARTITION Map join
Enable Bucket Map Join or better, the Sort Merge Bucket Map Join equivalent optimisations
when PARTITIONING is used exclusively or in combination with BUCKETING.

For JOIN conditions where partitioning criteria are used respectively:
            ⋮ 
FROM TabA JOIN TabB
   ON TabA.partCol1 = TabB.partCol2
   AND TabA.partCol2 = TabB.partCol2

the optimizer could/should choose to treat it the same way as with bucketed tables: ⋮ 
FROM TabC
  JOIN TabD
     ON TabC.clusteredByCol1 = TabD.clusteredByCol2
   AND TabC.clusteredByCol2 = TabD.clusteredByCol2

and use either Bucket Map Join or better, the Sort Merge Bucket Map Join.

This is based on fact that same way as buckets translate to separate files, the partitions
essentially provide the same mapping.
When data locality is known the optimizer could focus only on joining corresponding partitions
rather than whole data sets.

②BUCKET pruning
Enable partition PRUNING equivalent optimisation for queries on BUCKETED tables

Simplest example is for queries like:
"SELECT … FROM x WHERE colA=123123"
to read only the relevant bucket file rather than all file-buckets that belong to a table.

  was:
For JOIN conditions where partitioning criteria are used respectively:
            ⋮ 
FROM TabA JOIN TabB
   ON TabA.partCol1 = TabB.partCol2
   AND TabA.partCol2 = TabB.partCol2

the optimizer could/should choose to treat it the same way as with bucketed tables: ⋮ 
FROM TabC
  JOIN TabD
     ON TabC.clusteredByCol1 = TabD.clusteredByCol2
   AND TabC.clusteredByCol2 = TabD.clusteredByCol2

and use either Bucket Map Join or better, the Sort Merge Bucket Map Join.

This is based on fact that same way as buckets translate to separate files, the partitions
essentially provide the same mapping.
When data locality is known the optimizer could focus only on joining corresponding partitions
rather than whole data sets.

#side notes:
⦿ Currently Table DDL Syntax where Partitioning and Bucketing defined at the same time is
allowed:
CREATE TABLE
 ⋮
PARTITIONED BY(…) CLUSTERED BY(…) INTO … BUCKETS;

But in this case optimizer never chooses to use Bucket Map Join or Sort Merge Bucket Map Join
which defeats the purpose of creating BUCKETed tables in such scenarios. Should that be raised
as a separate BUG?

⦿ Currently partitioning and bucketing are two separate things but serve same purpose -
shouldn't the concept be merged (explicit/implicit partitions?)

    Affects Version/s: 1.1.0
                       1.0.0
              Summary: For partitioned tables same optimizations should be available as for
bucketed tables and vice versa: ①[Sort Merge] PARTITION Map join and ②BUCKET pruning 
(was: when columns on which tables are partitioned are used in the join condition same join
optimizations as for bucketed tables should be applied)

> For partitioned tables same optimizations should be available as for bucketed tables
and vice versa: ①[Sort Merge] PARTITION Map join and ②BUCKET pruning
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-9523
>                 URL: https://issues.apache.org/jira/browse/HIVE-9523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer, Physical Optimizer, SQL
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0
>            Reporter: Maciek Kocon
>              Labels: gsoc2015
>
> Logically and functionally bucketing and partitioning are quite similar - both provide
mechanism to segregate and separate the table's data based on its content. Thanks to that
significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
> The difference seems to be imposed by design where the PARTITIONing is open/explicit
while BUCKETing is discrete/implicit.
> Partitioning seems to be very common if not a standard feature in all current RDBMS while
BUCKETING seems to be HIVE specific only.
> In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".
> Regardless of the fact that these two are recognised as two separate features available
in Hive there should be nothing to prevent leveraging same existing query/join optimisations
across the two.
> ①[Sort Merge] PARTITION Map join
> Enable Bucket Map Join or better, the Sort Merge Bucket Map Join equivalent optimisations
when PARTITIONING is used exclusively or in combination with BUCKETING.
> For JOIN conditions where partitioning criteria are used respectively:
>             ⋮ 
> FROM TabA JOIN TabB
>    ON TabA.partCol1 = TabB.partCol2
>    AND TabA.partCol2 = TabB.partCol2
> the optimizer could/should choose to treat it the same way as with bucketed tables: ⋮

> FROM TabC
>   JOIN TabD
>      ON TabC.clusteredByCol1 = TabD.clusteredByCol2
>    AND TabC.clusteredByCol2 = TabD.clusteredByCol2
> and use either Bucket Map Join or better, the Sort Merge Bucket Map Join.
> This is based on fact that same way as buckets translate to separate files, the partitions
essentially provide the same mapping.
> When data locality is known the optimizer could focus only on joining corresponding partitions
rather than whole data sets.
> ②BUCKET pruning
> Enable partition PRUNING equivalent optimisation for queries on BUCKETED tables
> Simplest example is for queries like:
> "SELECT … FROM x WHERE colA=123123"
> to read only the relevant bucket file rather than all file-buckets that belong to a table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message