hive-dev mailing list archives

From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7616) pre-size mapjoin hashtable based on statistics
Date Wed, 06 Aug 2014 18:30:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087996#comment-14087996 ]

Mostafa Mokhtar commented on HIVE-7616:
---------------------------------------

This will work for most of the TPC-DS queries, since joins with the dimension tables are always
on key columns and there is a PK/FK relationship between the dimension tables and the fact
tables. Hence, in most cases the number of rows in the broadcast table will be equal to
the number of keys (one-to-many joins).

In MapJoins where the tables don't naturally have a PK/FK relationship (many-to-many joins), the number
of rows can be significantly higher than the number of keys.
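
To make the rows-versus-keys distinction concrete, here is a minimal sketch in plain Java. It is not Hive code: the class name KeyCardinality, the method distinctKeys, and the sample key values are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; none of these names come from Hive.
final class KeyCardinality {

    /** Returns the number of distinct join keys among the rows of the broadcast table. */
    static int distinctKeys(List<Integer> joinKeysPerRow) {
        return new HashSet<>(joinKeysPerRow).size();
    }

    public static void main(String[] args) {
        // PK/FK case (assumed data): every dimension row carries a unique key,
        // so the row count equals the key count.
        List<Integer> dimensionRows = Arrays.asList(1, 2, 3, 4);

        // Many-to-many case (assumed data): keys repeat across rows,
        // so the row count exceeds the key count.
        List<Integer> manyToManyRows = Arrays.asList(1, 1, 1, 2, 2, 2, 2, 2);

        System.out.println("PK/FK: rows=" + dimensionRows.size()
                + " keys=" + distinctKeys(dimensionRows));
        System.out.println("Many-to-many: rows=" + manyToManyRows.size()
                + " keys=" + distinctKeys(manyToManyRows));
    }
}
```

In the second case a size estimate taken from the row count would be several times larger than the number of keys the hash table actually has to hold.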

Can you add the following perf logging to track such potential issues:
1) Number of keys in the hash table after load vs. number of keys at init
2) Number of times expandAndRehash was called and the total amount of time spent there

Using these metrics we can track the performance and behavior of the hash table.
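
As a rough sketch of what such counters could look like, the toy open-addressing key table below records the two requested metrics. It is an assumption-laden illustration, not the Hive hash table: only the method name expandAndRehash comes from the comment above, while the class InstrumentedKeyTable, its fields, the 0.75 load factor, the minimum capacity of 16, and the report format are hypothetical.

```java
// Hypothetical sketch; not the actual Hive map-join hash table.
final class InstrumentedKeyTable {
    private static final double LOAD_FACTOR = 0.75; // assumed value

    private long[] slots;             // open addressing; 0 marks an empty slot, so keys must be non-zero
    private int keyCount;             // distinct keys currently stored
    private final int estimatedKeys;  // estimate supplied at init, e.g. derived from statistics
    private int expandCount;          // number of expandAndRehash calls
    private long expandNanos;         // total time spent in expandAndRehash

    InstrumentedKeyTable(int estimatedKeys) {
        this.estimatedKeys = estimatedKeys;
        int capacity = 16;
        // Pre-size so the estimated number of keys fits under the load factor.
        while (capacity * LOAD_FACTOR < estimatedKeys) {
            capacity <<= 1;
        }
        slots = new long[capacity];
    }

    /** Adds a non-zero key; duplicate keys are ignored. */
    void put(long key) {
        if (key == 0) {
            throw new IllegalArgumentException("0 is reserved as the empty marker");
        }
        if (insert(slots, key)) {
            keyCount++;
            if (keyCount > slots.length * LOAD_FACTOR) {
                expandAndRehash();
            }
        }
    }

    /** Linear-probing insert; returns false if the key was already present. */
    private static boolean insert(long[] table, long key) {
        int mask = table.length - 1; // table length is always a power of two
        int i = Long.hashCode(key) & mask;
        while (table[i] != 0) {
            if (table[i] == key) {
                return false;
            }
            i = (i + 1) & mask;
        }
        table[i] = key;
        return true;
    }

    /** Doubles the table and reinserts every key, timing the work. */
    private void expandAndRehash() {
        long start = System.nanoTime();
        long[] bigger = new long[slots.length * 2];
        for (long k : slots) {
            if (k != 0) {
                insert(bigger, k);
            }
        }
        slots = bigger;
        expandCount++;
        expandNanos += System.nanoTime() - start;
    }

    int keyCount() { return keyCount; }
    int estimatedKeys() { return estimatedKeys; }
    int expandCount() { return expandCount; }
    long expandNanos() { return expandNanos; }

    /** One-line summary suitable for a perf log entry. */
    String report() {
        return "keysAtInit=" + estimatedKeys + " keysAfterLoad=" + keyCount
                + " expandAndRehashCalls=" + expandCount
                + " expandAndRehashNanos=" + expandNanos;
    }
}
```

Logging report() once after the load finishes would show both an underestimate (many expandAndRehash calls) and an overestimate (keys after load far below keys at init).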


> pre-size mapjoin hashtable based on statistics
> ----------------------------------------------
>
>                 Key: HIVE-7616
>                 URL: https://issues.apache.org/jira/browse/HIVE-7616
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-7616.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
