spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject Increase partition count (repartition) without shuffle
Date Thu, 18 Jun 2015 21:26:00 GMT
Hi,

Is there a way to increase the number of partitions of an RDD without causing a shuffle? I've found
the JIRA issue https://issues.apache.org/jira/browse/SPARK-5997, but there is no implementation
yet.
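
To illustrate what I mean, here is a minimal sketch (assuming an existing SparkContext sc) of the two options the RDD API gives today:

// Minimal sketch, assuming an existing SparkContext sc.
val rdd = sc.parallelize(1 to 1000000, 300)       // 300 partitions

// coalesce(n, shuffle = false) can only merge partitions; asking for more
// than the current count leaves the partitioning unchanged.
val fewer = rdd.coalesce(100, shuffle = false)    // 100 partitions, no shuffle

// repartition(n) is coalesce(n, shuffle = true), so going up to 2000
// partitions works, but only at the cost of a full shuffle.
val more = rdd.repartition(2000)                  // 2000 partitions, shuffled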

Just in case it matters: I am reading data from ~300 big binary files, which results in 300 partitions.
I then need to sort my RDD, but the sort crashes with an OutOfMemory exception. If I change the number
of partitions to 2000, the sort works fine, but the repartition itself takes a lot of time due to the shuffle.
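
Roughly, the pipeline looks like the sketch below; the path and the parse() function are placeholders for my actual parsing code:

// Rough sketch of the pipeline; the path and parse() are placeholders.
val files = sc.binaryFiles("hdfs:///data/...")    // ~300 big files -> ~300 partitions

// parse() stands in for the application-specific code that turns each
// PortableDataStream into (key, value) records.
val records = files.flatMap { case (_, stream) => parse(stream) }

// Repartitioning to 2000 before the sort avoids the OutOfMemory error,
// but the extra shuffle is what makes this slow.
val sorted = records.repartition(2000).sortByKey()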

Best regards, Alexander
