spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] apapi opened a new pull request #27077: [SPARK-30408][SQL] Should not remove orderBy in sortBy clause in Optimizer
Date Fri, 03 Jan 2020 02:18:22 GMT
apapi opened a new pull request #27077: [SPARK-30408][SQL] Should not remove orderBy in sortBy
clause in Optimizer
URL: https://github.com/apache/spark/pull/27077
 
 
   ### What changes were proposed in this pull request?
   Fix defect [SPARK-30408](https://issues.apache.org/jira/browse/SPARK-30408) in EliminateSorts:
 the rule removed an orderBy that was followed by a sortBy clause (sortWithinPartitions).
   Code to reproduce:
   ```
   import spark.implicits._  // needed for toDF; assumes a SparkSession named `spark`, as in spark-shell
   val dataset = Seq(("a", 1, 4), ("b", 2, 5), ("c", 3, 6)).toDF("a", "b", "c")
   val groupData = dataset.orderBy("b")
   val sortData = groupData.sortWithinPartitions("c")
   ```
   The content of groupData is:
   ```
   partition 0: 
       [a,1,4]
   partition 1: 
       [b,2,5]
   partition 2: 
       [c,3,6]
   ```
   The content of sortData is:
   ```
   partition 0: 
       [a,1,4]
   partition 1: 
       [b,2,5], 
       [c,3,6]
   ```
   The content of sortData is incorrect because EliminateSorts removed the orderBy.
   Each partition of groupData holds a single row, so sorting within partitions should leave it unchanged: sortData should be the same as groupData.
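
The difference between the two results can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the optimizer code touched by this PR: `orderByB` and `sortWithinPartitionsC` are hypothetical helpers that only model a global sort followed by range splitting and a per-partition sort, and the two-partition input layout is an assumption chosen to match the incorrect output shown above.

```scala
// Pure-Scala model of the example above; no Spark required.
// A dataset is modeled as a sequence of partitions, each a sequence of rows.
object SortSemanticsSketch {
  type Row = (String, Int, Int)            // columns a, b, c
  type Partitions = Vector[Vector[Row]]

  // Models orderBy("b"): sort all rows globally, then split them into
  // contiguous ranges. Hypothetical helper; Spark's range partitioning
  // samples the data instead of splitting evenly.
  def orderByB(data: Partitions, numPartitions: Int): Partitions = {
    val sorted = data.flatten.sortBy(_._2)
    val size = math.max(1, math.ceil(sorted.size.toDouble / numPartitions).toInt)
    sorted.grouped(size).toVector
  }

  // Models sortWithinPartitions("c"): sorts each partition, moves no rows.
  def sortWithinPartitionsC(data: Partitions): Partitions =
    data.map(_.sortBy(_._3))

  // Assumed input layout (matches the incorrect sortData output above).
  val input: Partitions = Vector(
    Vector(("a", 1, 4)),
    Vector(("b", 2, 5), ("c", 3, 6))
  )

  val groupData: Partitions = orderByB(input, numPartitions = 3)
  val expected: Partitions = sortWithinPartitionsC(groupData) // orderBy kept
  val buggy: Partitions = sortWithinPartitionsC(input)        // orderBy dropped

  def main(args: Array[String]): Unit = {
    println(s"orderBy kept:    $expected")
    println(s"orderBy dropped: $buggy")
  }
}
```

In this model, dropping the global sort leaves the rows in their original partitions, which is the layout reported for sortData before the fix.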
   
   ### Why are the changes needed?
   This PR fixes defect [SPARK-30408](https://issues.apache.org/jira/browse/SPARK-30408).
   Without this fix, the output of
       ```rdd.orderBy("b").sortWithinPartitions("c")```
   is the same as the output of
       ```rdd.sortWithinPartitions("c")```
   which is incorrect, because the global ordering on "b" is lost.
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   I added a unit test to ```EliminateSortsSuite``` to cover this case.
   

