spark-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mgaido91 <...@git.apache.org>
Subject [GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...
Date Tue, 07 Aug 2018 20:47:52 GMT
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208379788
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
---
    @@ -152,3 +153,45 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper
{
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * Swaps right and left logical plans of a join when left is bigger than right. This
is useful
    + * because underlying cartesian product performs a nested loop, thus if the outer table
is
    + * smaller there are less iterator initialization.
    --- End diff --
    
    This is indeed an interesting point. I am not sure how/if we can measure the cost in the
creation of the involved iterator and the cost of creating it.
    
    Anyway, actually this will optimize not only the initialization cost for the iterator,
but also the overall number of record read/processed. Let's take an example. Imagine that
we have a table A with 10M record and a table B with 100 records. The total number of record
retrieved is:
    
     - if A is the left table, we process: 10M (all the records from A) + 100 * 10M (all the
records from B for every record from A) = 101 * 10M
     - if B is the left table, we process: 100 (all the records from B) + 100 * 10M (all the
records from A for each record from B) = ~ 100 * 10M
    
    So in the second case we process size of A - size B less records (same applies to number
of bytes read).
    
    And there is another good point for the second option: ie. Spark is much better at computing/reading
10 times 10M records that 10M times 2 records as it can exploits its parallelism.
    
    That said, your comment still applies, ie. there may be cases in which one side is very
onerous despite is the one with less records involved. Do you have any suggestion about how
to estimate this? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Mime
View raw message