spark-issues mailing list archives

From "Hari Shreedharan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
Date Tue, 10 Mar 2015 23:46:38 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355965#comment-14355965 ]

Hari Shreedharan commented on SPARK-6222:
-----------------------------------------

Another option is to change the way we delete old block data: we delete the data only for
the lastProcessedBatch time. That should also fix this issue (and does not change much of
the checkpoint time logic); I am testing that out now. I still prefer the logic currently
in the PR, though, because it makes the checkpointing more deterministic: at checkpoint time "t",
the batch generated at time "t" has been processed, whereas currently at checkpoint time
"t" the batch may or may not have been processed, which is more non-deterministic than
I'd like.
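
For context, here is a minimal, self-contained Scala sketch of the two cleanup rules being compared. This is not the actual Spark WAL/ReceivedBlockTracker code; the BatchTime and WalCleanupSketch names are hypothetical, and it assumes one reading of the suggestion, namely that cleanup only removes block data for batches at or before lastProcessedBatch instead of everything older than a clock-derived threshold.

{code:scala}
// Hypothetical sketch only - not the actual Spark ReceivedBlockTracker code.
import scala.collection.mutable

case class BatchTime(millis: Long)

class WalCleanupSketch(var lastProcessedBatch: BatchTime) {

  // WAL block ids grouped by the batch time they belong to.
  private val blocksByBatch = mutable.SortedMap.empty[Long, Vector[String]]

  def addBlock(batch: BatchTime, blockId: String): Unit =
    blocksByBatch(batch.millis) =
      blocksByBatch.getOrElse(batch.millis, Vector.empty) :+ blockId

  // Cleanup keyed off a clock-derived threshold: may drop data for batches
  // that have not been processed yet, which is what can lose data on restart.
  def cleanupOlderThan(threshold: BatchTime): Unit = {
    val stale = blocksByBatch.keys.filter(_ < threshold.millis).toVector
    blocksByBatch --= stale
  }

  // Alternative from the comment (as read here): only drop data for batches
  // at or before lastProcessedBatch, so unprocessed blocks survive a restart.
  def cleanupProcessedBatches(): Unit = {
    val processed = blocksByBatch.keys.filter(_ <= lastProcessedBatch.millis).toVector
    blocksByBatch --= processed
  }
}
{code}

Under the second rule, a driver killed between batch generation and batch processing leaves the unprocessed blocks in the WAL, so they can still be replayed on recovery.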

> [STREAMING] All data may not be recovered from WAL when driver is killed
> ------------------------------------------------------------------------
>
>                 Key: SPARK-6222
>                 URL: https://issues.apache.org/jira/browse/SPARK-6222
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: Hari Shreedharan
>            Priority: Blocker
>         Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch
>
>
> When testing for our next release, our internal tests written by [~wypoon] caught a regression
> in Spark Streaming between 1.2.0 and 1.3.0. The test runs a FlumePolling stream to read data
> from Flume, then kills the Application Master. Once YARN restarts it, the test waits until
> no more data is to be written and verifies the original data against the data on HDFS. This was
> passing in 1.2.0, but is failing now.
> Since the test ties into Cloudera's internal infrastructure and build process, it cannot
> be directly run on an Apache build. But I have been working on isolating the commit that may
> have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]).
> I confirmed this several times using the test, and the failure is consistently reproducible.
> To re-confirm, I reverted just this one commit (and the Clock consolidation one, to avoid
> conflicts), and the issue was no longer reproducible.
> Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0.
> /cc [~tdas], [~pwendell]


