drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5293) Poor performance of Hash Table due to same hash value as distribution below
Date Wed, 01 Mar 2017 00:22:45 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889186#comment-15889186

ASF GitHub Bot commented on DRILL-5293:

GitHub user Ben-Zvi reopened a pull request:


    DRILL-5293: Change the seed of the hash function for distribution

    The fix is to use a different _seed_ for the hash function in the cases the hash is computed
for distribution (i.e., when calling the method `getHashExpression(List<DistributionField>
fields, RelDataType rowType)` ).
       The implementation -- added a third parameter _seed_ to createCall() (on line 102 in
HashPrelUtil.java), which makes it use the "with seed" versions of the generated code functions
(before that it was using the versions that had a built-in seed of zero).
       Then made all the calling places pass in a seed -- zero in the case of a Hash Table
, or a large fixed prime ( 1301011 ) for distribution. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ben-Zvi/drill DRILL-5293

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #765
commit de2ba4c0dc31c2c1f36be4ea220986d7e8eedc88
Author: Boaz Ben-Zvi <boazben-zvi@bbenzvi-e754-mbp13.local>
Date:   2017-02-25T00:48:42Z

    DRILL-5293: Change seed for distribution hash function to differ from that of the hash


> Poor performance of Hash Table due to same hash value as distribution below
> ---------------------------------------------------------------------------
>                 Key: DRILL-5293
>                 URL: https://issues.apache.org/jira/browse/DRILL-5293
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>    Affects Versions: 1.8.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Boaz Ben-Zvi
> The computation of the hash value is basically the same whether for the Hash Table (used
by Hash Agg, and Hash Join), or for distribution of rows at the exchange. As a result, a specific
Hash Table (in a parallel minor fragment) gets only rows "filtered out" by the partition below
("upstream"), so the pattern of this filtering leads to a non uniform usage of the hash buckets
in the table.
>   Here is a simplified example: An exchange partitions into TWO (minor fragments), each
running a Hash Agg. So the partition sends rows of EVEN hash values to the first, and rows
of ODD hash values to the second. Now the first recomputes the _same_ hash value for its Hash
table -- and only the even buckets get used !!  (Or with a partition into EIGHT -- possibly
only one eighth of the buckets would be used !! ) 
>    This would lead to longer hash chains and thus a _poor performance_ !
> A possible solution -- add a distribution function distFunc (only for partitioning) that
takes the hash value and "scrambles" it so that the entropy in all the bits effects the low
bits of the output. This function should be applied (in HashPrelUtil) over the generated code
that produces the hash value, like:
>    distFunc( hash32(field1, hash32(field2, hash32(field3, 0))) );
> Tested with a huge hash aggregate (64 M rows) and a parallelism of 8 ( planner.width.max_per_node
= 8 ); minor fragments 0 and 4 used only 1/8 of their buckets, the others used 1/4 of their
buckets.  Maybe the reason for this variance is that distribution is using "hash32AsDouble"
and hash agg is using "hash32".  

This message was sent by Atlassian JIRA

View raw message