spark-reviews mailing list archives

From yhuai <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-7871] [SQL] Improve the outputPartition...
Date Mon, 03 Aug 2015 06:16:07 GMT
GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/7886

    [SPARK-7871] [SQL] Improve the outputPartitioning for outer joins.

    https://issues.apache.org/jira/browse/SPARK-7871
    
    This PR adds the concept of `nullSafe` to `ClusteredDistribution` and `HashPartitioning`.
For a `ClusteredDistribution`, if `nullSafe` is false, the distribution does not require rows
whose clustering expressions contain nulls to be clustered together. For a `HashPartitioning`,
if `nullSafe` is false, the partitioning does not guarantee that rows whose clustering
expressions contain nulls are clustered together.
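
One way to read the two definitions above is as a compatibility check between what a child's partitioning guarantees and what an operator requires. The following is a minimal sketch in Python, not the Catalyst code touched by this PR: the class shapes, the `null_safe` field name, and the `satisfies` rule are assumptions modeled on the description, and key matching is simplified to exact equality.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusteredDistribution:
    """Toy stand-in for a required distribution (hypothetical shape)."""
    clustering: Tuple[str, ...]
    null_safe: bool  # True: rows with null clustering keys must be co-located too


@dataclass(frozen=True)
class HashPartitioning:
    """Toy stand-in for a guaranteed partitioning (hypothetical shape)."""
    expressions: Tuple[str, ...]
    null_safe: bool  # True: rows with null keys are guaranteed to be co-located


def satisfies(partitioning: HashPartitioning, required: ClusteredDistribution) -> bool:
    """Assumed rule: the keys must match, and a requirement to cluster
    null-keyed rows can only be met by a partitioning that guarantees it."""
    same_keys = partitioning.expressions == required.clustering
    null_ok = partitioning.null_safe or not required.null_safe
    return same_keys and null_ok


weak = HashPartitioning(("key",), null_safe=False)
strong = HashPartitioning(("key",), null_safe=True)

# A join-style requirement (null_safe=False) accepts either partitioning.
print(satisfies(weak, ClusteredDistribution(("key",), null_safe=False)))
# A strict requirement (null_safe=True) rejects the weaker guarantee.
print(satisfies(weak, ClusteredDistribution(("key",), null_safe=True)))
print(satisfies(strong, ClusteredDistribution(("key",), null_safe=True)))
```

Under this assumed rule, only the combination of a strict requirement with a non-null-safe partitioning would force an extra shuffle.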
    
    This concept can be used with equi-joins. A shuffled equi-join operator (`ShuffledHashJoin`,
`ShuffledHashOuterJoin`, or `SortMergeJoin`) can use `ClusteredDistribution`s with `nullSafe
= false`. With this concept, we can avoid shuffling data in some plans that contain outer joins.
For example, a query like `SELECT ... FROM A LEFT OUTER JOIN B ON (A.key = B.key) LEFT OUTER
JOIN C ON (B.key = C.key)` needs only three `Exchange` operators instead of four.
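
The reason a shuffled equi-join can tolerate `nullSafe = false` is that a null join key never compares equal to anything, so rows with null keys produce the same output wherever they land. The sketch below illustrates this in plain Python with toy data; it is not Spark code, and the helper names (`left_outer_join`, `partition`, `partitioned_join`) are hypothetical.

```python
from collections import defaultdict


def left_outer_join(left, right):
    """Left outer equi-join on the first field; None never matches, as in SQL."""
    index = defaultdict(list)
    for key, value in right:
        if key is not None:
            index[key].append(value)
    out = []
    for key, value in left:
        matches = index.get(key, []) if key is not None else []
        if matches:
            out.extend((key, value, m) for m in matches)
        else:
            out.append((key, value, None))  # unmatched left row is padded
    return out


def partition(rows, num_partitions, null_partition):
    """Hash non-null keys; place null-keyed rows wherever null_partition says."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        key = row[0]
        target = null_partition(i) if key is None else hash(key)
        parts[target % num_partitions].append(row)
    return parts


def partitioned_join(left, right, num_partitions=3):
    """Join partition by partition, with null-keyed rows scattered arbitrarily."""
    left_parts = partition(left, num_partitions, lambda i: i)
    right_parts = partition(right, num_partitions, lambda i: i + 1)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(left_outer_join(lp, rp))
    return out


a = [(1, "a1"), (2, "a2"), (None, "a3"), (None, "a4")]
b = [(1, "b1"), (1, "b2"), (None, "b3"), (3, "b4")]

# The per-partition result matches the single-node result up to row order.
same = sorted(partitioned_join(a, b), key=repr) == sorted(left_outer_join(a, b), key=repr)
print(same)
```

In the three-way query above, the first outer join can leave null values in `B.key` for unmatched rows of `A`. If its output is still treated as clustered on `B.key` with `nullSafe = false`, then presumably only `C` needs an `Exchange` for the second join, which would account for three operators rather than four.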
    
    BTW, this PR does not shuffle rows with null partition keys randomly (#7685 has that part;
we can add it later).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark nullSafe

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7886
    
----
commit 2bc9be3562101a11c5b0cf47994f203118ba7104
Author: Yin Huai <yhuai@databricks.com>
Date:   2015-08-03T06:07:48Z

    Add the concept of nullSafe to ClusteredDistribution and Partitioning.

----

