spark-commits mailing list archives

Subject spark git commit: [SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing
Date Mon, 10 Oct 2016 15:57:38 GMT
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 d27df3579 -> d719e9a08

[SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing

## What changes were proposed in this pull request?
Currently the number of partition files is limited to 10000 (%05d format). If there are
more than 10000 part files, the logic breaks while recreating the RDD because it sorts
the part-file names as strings. More details can be found in the JIRA description.
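The failure mode is easy to reproduce in isolation. The sketch below is illustrative, not Spark's actual code: it uses a hypothetical 2-digit pad so the overflow shows up after only 100 files, but the mechanism is the same once a partition index outgrows whatever pad width the format uses.

```scala
object CheckpointSortSketch {
  // Hypothetical part-file naming with a 2-digit pad (Spark pads wider);
  // once an index needs more digits than the pad, lexicographic order
  // no longer matches numeric order.
  def fileName(i: Int): String = f"part-$i%02d"

  def main(args: Array[String]): Unit = {
    val names = (0 to 100).map(fileName)

    // String sort interleaves "part-100" between "part-10" and "part-11".
    val byString = names.sorted
    // Sorting on the parsed index (as in this fix) restores partition order.
    val byIndex  = names.sortBy(_.stripPrefix("part-").toInt)

    assert(byString.last == "part-99")   // wrong: part-100 is not last
    assert(byIndex.last  == "part-100")  // correct
  }
}
```

Under string ordering the RDD would see partition 100 in slot 11 and fail the fail-fast index check; sorting numerically keeps file order and partition index in sync.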

## How was this patch tested?
I tested this patch by checkpointing an RDD, manually renaming its part files to the old
format, and then accessing the RDD; it was successfully recreated from the old-format
files. I also verified loading a sample Parquet file, saving it in multiple formats
(CSV, JSON, Text, Parquet, ORC), and reading each back successfully from the saved
files. I couldn't launch the unit tests from my local box, so I will wait for the
Jenkins output.

Author: Dhruve Ashar <>

Closes #15370 from dhruve/bug/SPARK-17417.

(cherry picked from commit 4bafacaa5f50a3e986c14a38bc8df9bae303f3a0)
Signed-off-by: Tom Graves <>


Branch: refs/heads/branch-2.0
Commit: d719e9a080a909a6a56db938750d553668743f8f
Parents: d27df35
Author: Dhruve Ashar <>
Authored: Mon Oct 10 10:55:57 2016 -0500
Committer: Tom Graves <>
Committed: Mon Oct 10 10:57:30 2016 -0500

 .../main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala b/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
index fddb935..b73214f 100644
--- a/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
@@ -69,10 +69,10 @@ private[spark] class ReliableCheckpointRDD[T: ClassTag](
     val inputFiles = fs.listStatus(cpath)
-      .sortBy(_.toString)
+      .sortBy(_.getName.stripPrefix("part-").toInt)
     // Fail fast if input files are invalid
     inputFiles.zipWithIndex.foreach { case (path, i) =>
-      if (!path.toString.endsWith(ReliableCheckpointRDD.checkpointFileName(i))) {
+      if (path.getName != ReliableCheckpointRDD.checkpointFileName(i)) {
         throw new SparkException(s"Invalid checkpoint file: $path")

