spark-issues mailing list archives

From "Wenchen Fan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour
Date Tue, 13 Nov 2018 17:49:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685532#comment-16685532 ]

Wenchen Fan commented on SPARK-26024:
-------------------------------------

The range partitioner is usually not very accurate, for performance reasons. You are welcome
to send a patch to improve the doc. But I will not increase `sampleSizePerPartition` too much,
as it may hurt performance or even cause OOM. Why would you need a super-accurate range
partitioner for your (large) data set?
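
A minimal sketch of the trade-off, assuming Spark 2.3+, where the sampling knob is exposed as
the internal SQL conf {{spark.sql.execution.rangeExchange.sampleSizePerPartition}} (default 100):

{code}
// Hedged sketch: raising the per-partition sample size makes the sampled
// range boundaries track the true distribution more closely. Larger values
// mean a heavier, more memory-hungry sampling pass, hence the OOM caveat.
// `df` and `col` are as in the reproduction quoted below.
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 1000)

val counts = df.repartitionByRange(3, col("val"))
  .mapPartitions(part => Iterator(part.size))
  .collect()
// the per-partition counts should now be closer to equal, though still inexact
{code}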

> Dataset API: repartitionByRange(...) has inconsistent behaviour
> ---------------------------------------------------------------
>
>                 Key: SPARK-26024
>                 URL: https://issues.apache.org/jira/browse/SPARK-26024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2
>         Environment: Spark version 2.3.2
>            Reporter: Julien Peloton
>            Priority: Major
>              Labels: dataFrame, partitioning, repartition, spark-sql
>
> Hi,
> I recently played with the {{repartitionByRange}} method for DataFrames introduced in
> SPARK-22614. For DataFrames larger than the one tested in the code (which has only 10
> elements), the method returns inconsistent results.
> As a test showing the inconsistent behaviour, I start from the unit test used for
> {{repartitionByRange}} ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]),
> but I increase the size of the initial array to 1001 elements (0 to 1000 inclusive),
> repartition using 3 partitions, and count the number of elements per partition:
>  
> {code}
> import scala.util.Random
> import org.apache.spark.sql.functions.col
> import spark.implicits._  // `spark` is the active SparkSession, as in spark-shell
>
> // Shuffle the numbers 0 to 1000 (inclusive) and make a one-column DataFrame
> val df = Random.shuffle(0.to(1000)).toDF("val")
>
> // Repartition by range into 3 partitions, count the number of elements in
> // each partition, collect the counts, and repeat several times
> for (i <- 0 to 9) {
>   val counts = df.repartitionByRange(3, col("val"))
>     .mapPartitions(part => Iterator(part.size))
>     .collect()
>   println(counts.toList)
> }
> // -> the number of elements in each partition varies from run to run...
> {code}
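> A quick sanity check (a minimal sketch, reusing the same {{df}}) shows that only the
> per-partition split varies; the total is always 1001, since {{0.to(1000)}} is inclusive:
> {code}
> // the split varies from run to run, but the total element count is constant
> val total = df.repartitionByRange(3, col("val"))
>   .mapPartitions(part => Iterator(part.size))
>   .collect()
>   .sum
> assert(total == 1001)
> {code}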
> I do not know whether this is expected (I will dig further into the code), but it looks
> like a bug.
> Or am I just misinterpreting what {{repartitionByRange}} is for?
> Any ideas?
> Thanks!
> Julien


