spark-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mgaido91 <>
Subject [GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...
Date Mon, 06 Aug 2018 16:28:53 GMT
GitHub user mgaido91 opened a pull request:

    [SPARK-24928][SQL] Optimize cross join according to stats

    ## What changes were proposed in this pull request?
    The cartesian product of 2 RDDs perform a nested loop. This means that the iterator for
the inner RDD is built as many times as the number of rows of the outer one. If the two RDDs
have a very different size, the performance difference can be huge.
    As there is no way to know which is the best RDD to choose as outer one (since we don't
know the sizes), this cannot be addressed at RDD level. Only a comment has been added to warn/help
the user to be careful about how they write their code.
    The PR proposed to add an optimizer rule which uses statistics collected on tables in
order to change the sides of the cartesian product so that the outer table is the smaller
    ## How was this patch tested?
    added test suite

You can merge this pull request into a Git repository by running:

    $ git pull SPARK-24928

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22008
commit 1caf2567694c56cca019e6608609b81ac70deefa
Author: Marco Gaido <marcogaido91@...>
Date:   2018-08-06T15:51:58Z

    [SPARK-24928][SQL] Optimize cross join according to stats



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message