spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] AngersZhuuuu commented on issue #26437: [SPARK-29800][SQL] Plan non-correlated Exists 's subquery in PlanSubqueries
Date Fri, 22 Nov 2019 18:14:19 GMT
AngersZhuuuu commented on issue #26437: [SPARK-29800][SQL] Plan non-correlated Exists 's subquery
in PlanSubqueries
URL: https://github.com/apache/spark/pull/26437#issuecomment-557636283
 
 
   cc @cloud-fan 
   Looking at the execution alone, the non-correlated EXISTS subquery is computed very quickly,
and this change also removes one shuffle. I will try it in our environment with a real production case.
   **With this PR:**
   ```
   scala> (1 to 10000).toDF("id").createOrReplaceTempView("s1")
   scala> (0 to 50000).toDF("id").createOrReplaceTempView("s2")
   scala> (0 to 1000000).map(_ * 2).toDF("id").createOrReplaceTempView("s3")
   scala>       val df = sql(
        |         """
        |           | SELECT s1.id  FROM s1
        |           | WHERE EXISTS (SELECT * from s3)
        |         """.stripMargin)
   df: org.apache.spark.sql.DataFrame = [id: int]
   scala>       var start = System.currentTimeMillis()
   start: Long = 1574445595283
   scala>       df.show(5)
   +---+
   | id|
   +---+
   |  1|
   |  2|
   |  3|
   |  4|
   |  5|
   +---+
   only showing top 5 rows
   scala>       var end = System.currentTimeMillis()
   end: Long = 1574445609103
   scala>       println(s"duration = ${end - start}")
   duration = 13820
   ```
   
   ![image](https://user-images.githubusercontent.com/46485123/69449609-46a9a580-0d96-11ea-9755-847e4b75c99c.png)
   ![image](https://user-images.githubusercontent.com/46485123/69449578-32fe3f00-0d96-11ea-8126-1e06d0353851.png)
   
   **Without this PR (current master):**
   ```
   scala> (1 to 10000).toDF("id").createOrReplaceTempView("s1")
   scala> (0 to 50000).toDF("id").createOrReplaceTempView("s2")
   scala> (0 to 1000000).map(_ * 2).toDF("id").createOrReplaceTempView("s3")
   scala>       val df = sql(
        |         """
        |           | SELECT s1.id  FROM s1
        |           | WHERE EXISTS (SELECT * from s3)
        |         """.stripMargin)
   df: org.apache.spark.sql.DataFrame = [id: int]
   scala>       var start = System.currentTimeMillis()
   start: Long = 1574445708886
   scala>       df.show(5)
   +---+
   | id|
   +---+
   |  1|
   |  2|
   |  3|
   |  4|
   |  5|
   +---+
   only showing top 5 rows
   scala>       var end = System.currentTimeMillis()
   end: Long = 1574445730126
   scala>       println(s"duration = ${end - start}")
   duration = 21240
   ```
   
   ![image](https://user-images.githubusercontent.com/46485123/69449638-4f9a7700-0d96-11ea-96ce-61a3bab87a0e.png)
   ![image](https://user-images.githubusercontent.com/46485123/69449559-2a0d6d80-0d96-11ea-9b61-9c0e30310d71.png)
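
To put the two durations side by side, here is a minimal sketch of the arithmetic, plus a small timing helper. The helper `timed` is hypothetical and not part of this PR; it only wraps a block so that start and end are taken in one statement rather than on separate REPL lines, which can otherwise add REPL overhead to the measurement. The two figures are single runs from the sessions above, so the ratio is indicative only.

```scala
// Hypothetical helper (not part of the PR): runs a block once and
// returns its result together with the elapsed wall-clock time in ms.
def timed[A](body: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}

// Durations reported in the two REPL sessions above, in milliseconds.
val withPrMs = 13820L
val masterMs = 21240L

// Absolute saving, relative reduction, and speedup from those two single runs.
val savedMs = masterMs - withPrMs
val reductionPct = savedMs * 100.0 / masterMs
val speedup = masterMs.toDouble / withPrMs

println(f"saved = $savedMs ms, reduction = $reductionPct%.1f%%, speedup = $speedup%.2fx")

// In spark-shell, with `df` defined as above, the helper could be used as:
//   val (_, ms) = timed(df.show(5))
//   println(s"duration = $ms")
```

On these numbers the run with the PR is about 7.4 seconds faster, roughly a 35% reduction. Repeating each measurement several times and discarding the first (warm-up) run would give a more reliable comparison.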
   
