hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maciek Kocon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-12336) Sort Merge Partition Map Join
Date Wed, 04 Nov 2015 15:49:27 GMT

     [ https://issues.apache.org/jira/browse/HIVE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Maciek Kocon updated HIVE-12336:
--------------------------------
    Description: 
Logically and functionally bucketing and partitioning are quite similar - both provide mechanism
to segregate and separate the table's data based on its content. Thanks to that significant
further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
The difference seems to be imposed by design where the PARTITIONing is open/explicit while
BUCKETing is discrete/implicit.
Partitioning seems to be very common if not a standard feature in all current RDBMS while
BUCKETING seems to be HIVE specific only.
In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".

Regardless of the fact that these two are recognised as two separate features available in
Hive there should be nothing to prevent leveraging same existing query/join optimisations
across the two.

PARTITION SORT MERGE MAPJOIN
Use the same type of optimization as in SORT MERGE BUCKETED MAP JOIN for partitioned tables.
The sort-merge join optimization could be performed when PARTITIONED tables being joined are
sorted and partitioned on the join columns.

The corresponding partitions are joined with each other at the mapper. If both A and B have
partitions set on their columns KEY, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a JOIN B b ON a.key = b.key
can be done on the mapper only. The mapper for the partition key='201512' for A will traverse
the corresponding partition for B. Traversing is possible if the corresponding partitions
are sorted on the same columns. This is dependent on (taken care by [HIVE-11525|https://issues.apache.org/jira/browse/HIVE-12337])

  was:
Logically and functionally bucketing and partitioning are quite similar - both provide mechanism
to segregate and separate the table's data based on its content. Thanks to that significant
further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
The difference seems to be imposed by design where the PARTITIONing is open/explicit while
BUCKETing is discrete/implicit.
Partitioning seems to be very common if not a standard feature in all current RDBMS while
BUCKETING seems to be HIVE specific only.
In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".

Regardless of the fact that these two are recognised as two separate features available in
Hive there should be nothing to prevent leveraging same existing query/join optimisations
across the two.

PARTITION MAPJOIN
Use the same type of optimization as in SORT MERGE BUCKETED MAP JOIN for partitioned tables.
The sort-merge join optimization could be performed when PARTITIONED tables being joined are
sorted and partitioned on the join columns.

The corresponding partitions are joined with each other at the mapper. If both A and B have
partitions set on their columns KEY, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a JOIN B b ON a.key = b.key
can be done on the mapper only. The mapper for the partition key='201512' for A will traverse
the corresponding partition for B. Traversing is possible if the corresponding partitions
are sorted on the same columns. This is dependent on (taken care by [HIVE-11525|https://issues.apache.org/jira/browse/HIVE-12337])


> Sort Merge Partition Map Join
> -----------------------------
>
>                 Key: HIVE-12336
>                 URL: https://issues.apache.org/jira/browse/HIVE-12336
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer, Physical Optimizer, SQL
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0
>            Reporter: Maciek Kocon
>              Labels: gsoc2015
>
> Logically and functionally bucketing and partitioning are quite similar - both provide
mechanism to segregate and separate the table's data based on its content. Thanks to that
significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
> The difference seems to be imposed by design where the PARTITIONing is open/explicit
while BUCKETing is discrete/implicit.
> Partitioning seems to be very common if not a standard feature in all current RDBMS while
BUCKETING seems to be HIVE specific only.
> In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".
> Regardless of the fact that these two are recognised as two separate features available
in Hive there should be nothing to prevent leveraging same existing query/join optimisations
across the two.
> PARTITION SORT MERGE MAPJOIN
> Use the same type of optimization as in SORT MERGE BUCKETED MAP JOIN for partitioned
tables.
> The sort-merge join optimization could be performed when PARTITIONED tables being joined
are sorted and partitioned on the join columns.
> The corresponding partitions are joined with each other at the mapper. If both A and
B have partitions set on their columns KEY, the following join
> SELECT /*+ MAPJOIN(b) */ a.key, a.value
> FROM A a JOIN B b ON a.key = b.key
> can be done on the mapper only. The mapper for the partition key='201512' for A will
traverse the corresponding partition for B. Traversing is possible if the corresponding partitions
are sorted on the same columns. This is dependent on (taken care by [HIVE-11525|https://issues.apache.org/jira/browse/HIVE-12337])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message