hive-dev mailing list archives

From "Lefty Leverenz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9277) Hybrid Hybrid Grace Hash Join
Date Fri, 23 Jan 2015 10:01:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289035#comment-14289035 ]

Lefty Leverenz commented on HIVE-9277:
--------------------------------------

[~wzheng] put the design doc on the wiki here:  [Hybrid Hybrid Grace Hash Join, v1.0 | https://cwiki.apache.org/confluence/display/Hive/Hybrid+Hybrid+Grace+Hash+Join,+v1.0].

_Review comment:_  The final graphic in "Recursive Hashing and Spilling" says ...

bq.  Now we probe using Matchfile 1 against HT 3. Matching values go into result. Non-matching
values go to Matchfile 4.

... but it shows non-matching values from HT 4, not HT 3, going to Matchfile 4.  A dashed line
from HT 3 to Matchfile 4 is missing.  And should the text say "probe using Matchfile 1 against
HT 3 and HT 4 (if it fits in memory)"?

> Hybrid Hybrid Grace Hash Join
> -----------------------------
>
>                 Key: HIVE-9277
>                 URL: https://issues.apache.org/jira/browse/HIVE-9277
>             Project: Hive
>          Issue Type: New Feature
>          Components: Physical Optimizer
>            Reporter: Wei Zheng
>            Assignee: Wei Zheng
>              Labels: join
>         Attachments: High-leveldesignforHybridHybridGraceHashJoinv1.0.pdf
>
>
> We are proposing an enhanced hash join algorithm called “hybrid hybrid grace hash join”.
> We can benefit from this feature as illustrated below:
> o The query will not fail even if the estimated memory requirement is slightly wrong
> o Expensive garbage collection overhead can be avoided when the hash table grows
> o Join execution can use a Map join operator even though the small table doesn't fit in
> memory, as spilling some data from the build and probe sides will still be cheaper than
> having to shuffle the large fact table
> The design was based on Hadoop’s parallel processing capability and the significant amount
> of memory available.
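
For readers new to the idea, the spilling behavior described above can be sketched roughly as follows. This is only an illustration of a generic hybrid hash join, not the algorithm from the design doc or Hive's implementation. The function name, the row-count memory budget, and the "largest partition first" spill policy are all assumptions made for the example, and plain Python lists stand in for spill files on disk. Recursive repartitioning of a spilled partition that still doesn't fit in memory is left out.

```python
from collections import defaultdict


def hybrid_hash_join(build_rows, probe_rows, num_partitions=4, max_rows_in_memory=4):
    """Join (key, value) rows; returns a list of (key, build_value, probe_value).

    Hypothetical sketch: max_rows_in_memory is a stand-in for a real memory budget.
    """
    in_memory = {p: defaultdict(list) for p in range(num_partitions)}  # partition -> hash table
    mem_counts = {p: 0 for p in range(num_partitions)}  # rows held per in-memory partition
    spilled_build = {}  # partition -> list of rows (stands in for a file on disk)
    rows_in_memory = 0

    # Build phase: hash-partition the small table, spilling when over budget.
    for key, value in build_rows:
        p = hash(key) % num_partitions
        if p in spilled_build:
            spilled_build[p].append((key, value))
            continue
        in_memory[p][key].append(value)
        mem_counts[p] += 1
        rows_in_memory += 1
        if rows_in_memory > max_rows_in_memory:
            # Assumed policy: evict the largest in-memory partition.
            victim = max(in_memory, key=lambda q: mem_counts[q])
            table = in_memory.pop(victim)
            spilled_build[victim] = [(k, v) for k, vs in table.items() for v in vs]
            rows_in_memory -= mem_counts.pop(victim)

    # Probe phase: join immediately when the partition is in memory,
    # otherwise set the probe row aside for the matching spilled partition.
    results = []
    spilled_probe = defaultdict(list)
    for key, value in probe_rows:
        p = hash(key) % num_partitions
        if p in spilled_build:
            spilled_probe[p].append((key, value))
        else:
            for build_value in in_memory[p].get(key, []):
                results.append((key, build_value, value))

    # Second pass: reload each spilled partition and join it with its probe rows.
    # Assumes each spilled partition now fits in memory (no recursion here).
    for p, rows in spilled_build.items():
        table = defaultdict(list)
        for key, value in rows:
            table[key].append(value)
        for key, value in spilled_probe.get(p, []):
            for build_value in table.get(key, []):
                results.append((key, build_value, value))
    return results
```

With a generous budget nothing spills and this behaves like an ordinary in-memory hash join. With a tight budget the same result rows come back, only some of them are produced in the second pass, which is the sense in which a wrong memory estimate slows the query down instead of failing it.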



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
