spark-reviews mailing list archives

From hvanhovell <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Date Fri, 17 Jul 2015 16:28:33 GMT
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/7379#issuecomment-122333357
  
    No problem.
    
    ### Supporting N-Ary Predicates.
    In order to make the range join work, we need the predicates to define a single interval for each side of the join. For instance, the clauses ```a.low < b.high && b.low < a.high``` imply that there are two intervals: [a.low, a.high] & [b.low, b.high]. An open interval, for instance ```a.low < b.high```, would also work.
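    To make the overlap condition concrete, here is a minimal Python sketch (illustrative only, not Spark code): the two-clause predicate above is exactly the standard open-interval overlap test.

    ```python
    def intervals_overlap(a_low, a_high, b_low, b_high):
        """True when the open intervals (a_low, a_high) and (b_low, b_high)
        intersect -- the same condition as a.low < b.high && b.low < a.high."""
        return a_low < b_high and b_low < a_high
    ```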
    
    When we use more than two clauses, we can potentially have multiple intervals. In your example, for instance, ```a.key1 < b.key1 and a.key2 > b.key2 and a.key3 >= b.key3``` would yield the following intervals: [a.key1, a.key2], [a.key1, a.key3], [b.key2, b.key1] & [b.key2, b.key3]. Creating a working index that can deal with the (partially) uncorrelated intervals will be quite a challenge (I haven't really looked into this yet). We could of course pick one pair of intervals to join on and use filtering to take care of the rest.
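    The "join on one pair, filter the rest" idea could look something like this Python sketch (row layout, column names, and the driving clause are all illustrative assumptions, not Spark internals): drive the join with ```a.key1 < b.key1``` via a sorted index, and apply the remaining clauses as a residual filter.

    ```python
    import bisect

    def range_join_one_pair(left, right, residual):
        # Drive the join with a.key1 < b.key1 using a sorted index on b.key1,
        # then apply the remaining clauses as a residual post-filter.
        # Rows are dicts; all names here are illustrative.
        right_sorted = sorted(right, key=lambda r: r["key1"])
        keys = [r["key1"] for r in right_sorted]
        out = []
        for a in left:
            # First position where b.key1 > a.key1.
            i = bisect.bisect_right(keys, a["key1"])
            for b in right_sorted[i:]:
                if residual(a, b):
                    out.append((a, b))
        return out
    ```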
    
    I think the Unary and Binary cases are the most common. Let's start there, and see if
there is demand for N-ary designs.
    
    ### Generalization
    If you consider the fact that we are joining intervals (ranges, if you will), range partitioning will not work because it assumes each interval falls entirely within a single partition, whereas an interval can span multiple partitions. When dealing with larger tables we would have to use a special interval-aware partitioning: this would create partitions for a number of fully covering, non-overlapping ranges, and would multicast each row to every partition its interval overlaps. The subsequent step would be using an index or doing a cartesian/BNL (block nested loop) join within each partition.
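    A toy Python sketch of that multicast step (boundaries and row layout are illustrative assumptions): split the key space into fixed, non-overlapping ranges and copy each row to every partition its interval overlaps.

    ```python
    def multicast_partitions(rows, boundaries):
        # boundaries: sorted cut points, e.g. [0, 10, 20, 30] defines the
        # partition ranges [0, 10), [10, 20), [20, 30).
        # Each row is an interval (low, high); a row is replicated to every
        # partition whose range it overlaps.
        parts = [[] for _ in range(len(boundaries) - 1)]
        for low, high in rows:
            for p, (lo, hi) in enumerate(zip(boundaries, boundaries[1:])):
                if low < hi and lo <= high:  # interval overlaps partition range
                    parts[p].append((low, high))
        return parts
    ```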
    
    Doing a cartesian join in a single partition performs horribly. I thought it wouldn't be a problem either, but this completely killed the performance of an analysis I was doing for a client (account balances at specific dates).
    
    I do see opportunities for code re-use, but this would require generalizing HashedRelation and the broadcast join family.



