spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
Date Tue, 03 Nov 2015 01:40:27 GMT


Apache Spark commented on SPARK-8582:

User 'zsxwing' has created a pull request for this issue:

> Optimize checkpointing to avoid computing an RDD twice
> ------------------------------------------------------
>                 Key: SPARK-8582
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Assignee: Shixiong Zhu
> In Spark, checkpointing allows the user to truncate the lineage of their RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented very efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through the RDD's partitions and write them to disk). See this line for more detail:
> Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.)
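
The difference between the two approaches described in the issue can be illustrated with a small, dependency-free sketch. This is not Spark code and not the implementation from the pull request. It is a plain-Python simulation that assumes a "partition" is a generator and a "checkpoint" is a local JSON-lines file. All names here (`compute_partition`, `checkpointing_iterator`, `run_naive`, `run_single_pass`, `read_checkpoint`) are hypothetical and chosen for illustration only.

```python
import json
import os
import tempfile


def compute_partition(counter):
    """Stand-in for an expensive partition computation (hypothetical).

    `counter` records how many times the partition is computed.
    """
    counter["computations"] += 1
    for i in range(5):
        yield i * i


def checkpointing_iterator(records, path):
    """Yield each record while also writing it to `path`.

    The checkpoint file is complete only if the consumer drains
    the iterator fully.
    """
    with open(path, "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
            yield record


def run_naive(counter, path):
    """Compute once for the action, then again to write the checkpoint."""
    total = sum(compute_partition(counter))          # the action
    with open(path, "w") as out:                     # the checkpoint pass
        for record in compute_partition(counter):
            out.write(json.dumps(record) + "\n")
    return total


def run_single_pass(counter, path):
    """Write the checkpoint while the action consumes the records."""
    return sum(checkpointing_iterator(compute_partition(counter), path))


def read_checkpoint(path):
    """Load the records written to a checkpoint file."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

In this sketch `run_naive` computes the partition twice, while `run_single_pass` computes it once and still produces the same checkpoint file. One caveat the sketch makes visible: writing during the action only yields a complete checkpoint when the action consumes every record, so an action that stops early would need separate handling.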

This message was sent by Atlassian JIRA
