spark-issues mailing list archives

From "Lev Katzav (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
Date Sat, 26 Aug 2017 05:00:08 GMT

     [ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lev Katzav updated SPARK-8582:
------------------------------
    Comment: was deleted

(was: Any update on this?
what are the plans for spark 2?

thanks)

> Optimize checkpointing to avoid computing an RDD twice
> ------------------------------------------------------
>
>                 Key: SPARK-8582
>                 URL: https://issues.apache.org/jira/browse/SPARK-8582
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Assignee: Shixiong Zhu
>
> In Spark, checkpointing allows the user to truncate the lineage of an RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented very efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through the RDD's partitions and write them to disk). See this line for more detail: https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.)
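For illustration, below is a minimal sketch of the write-through iterator idea described in the quoted issue. This is not Spark's actual implementation; the class name, the writer callback, and the output stream handling are assumptions. The point is that the partition iterator is wrapped so each element is written to the checkpoint output as a side effect of the action consuming it, removing the need for a second pass over the RDD.

    import java.io.DataOutputStream

    class CheckpointingIterator[T](
        underlying: Iterator[T],
        out: DataOutputStream)(
        writeElem: (DataOutputStream, T) => Unit)
      extends Iterator[T] {

      // Tracks whether the checkpoint stream has already been closed,
      // so exhaustion checks after the end of the partition are no-ops.
      private var closed = false

      // Report whether more elements remain; once the partition is
      // exhausted, close the checkpoint stream so the file is flushed.
      override def hasNext: Boolean = {
        val more = underlying.hasNext
        if (!more && !closed) {
          out.close()
          closed = true
        }
        more
      }

      // Return the next element, persisting it to the checkpoint stream
      // as a side effect of normal iteration by the action.
      override def next(): T = {
        val elem = underlying.next()
        writeElem(out, elem)
        elem
      }
    }

The workaround mentioned in the last quoted paragraph uses the existing public API: calling rdd.cache() before rdd.checkpoint() lets the separate checkpoint job read the cached partitions instead of recomputing the lineage, at the cost of holding the data in memory.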



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

