hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-870) semi joins
Date Wed, 04 Nov 2009 07:19:32 GMT

     [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ning Zhang updated HIVE-870:
----------------------------

    Attachment: Hive-870.patch

Uploading Hive-870.patch. This patch includes the following changes:

1) enhance the HiveQL syntax to support left semi join.
2) introduce a new left semi join type in the CommonJoinOperator. This join operator implements
early-exit whenever a match is found in the right-hand-side table of the left semi join.
3) At the map side, add a select operator to project the join keys only of the RHS table,
followed by a map-side partial group-by operator that eliminate duplicate keys. We only need
the key, the value is NULL. 
4) if the RHS is used as map-side join, only the selection operator is introduced. The map-side
groupby operator is not necessary.
5) some misc clean ups (e.g., allowing '--' comments appear in any place in the unit test
qfiles). A lot of unit test diffs are due to this change. All the unit tests for semi join
are in semijoin.q.

> semi joins
> ----------
>
>                 Key: HIVE-870
>                 URL: https://issues.apache.org/jira/browse/HIVE-870
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: Hive-870.patch
>
>
> Semi-join is an efficient way to unnest an IN/EXISTS subquery. For example,
> select * 
> from A
> where A.id IN 
>    (select id
>     from B
>     where B.date> '2009-10-01');
> returns from A whose ID is in the set of IDs found in B, whose date is greater than a
certain date. This query can be unnested using a INNER join or LEFT OUTER JOIN, but we need
to deduplicate the IDs returned by the subquery on table B. The semantics of LEFT SEMI JOIN
is that as long as there is ANY row in the right-hand table that matches the join key, the
left-hand table row will be emitted as a result w/o necessarily looking further in the right-hand
table for further matches. This is exactly the semantics of the IN subquery. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message