beam-commits mailing list archives

From "Amit Sela (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
Date Fri, 17 Mar 2017 22:21:41 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895861#comment-15895861
] 

Amit Sela edited comment on BEAM-1582 at 3/17/17 10:21 PM:
-----------------------------------------------------------

This could be related to SPARK-14701 and/or SPARK-14930, in which case the {{CheckpointMark}}
is not properly checkpointed.
If for some reason the runtime environment was so slow that execution failed to start before
the timeout was hit, graceful stop would force at least the first batch to finish. If that
first batch included the read from Kafka but failed to checkpoint the {{Reader}} mark,
resuming from checkpoint would read the entire Kafka backlog again, causing the failures
we see.
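
The suspected sequence can be sketched with a toy model. This is only an illustration, not
Beam or Spark code: the list stands in for a Kafka partition, the integer offset stands in
for the reader's checkpoint mark, and the names run_batch, run_with_resume and persist_mark
are hypothetical.

```python
# Toy model only: a list stands in for a Kafka partition, and an integer
# offset stands in for the reader's checkpoint mark. None of these names
# come from Beam or Spark.

def run_batch(log, checkpoint, persist_mark):
    """Read everything after the checkpointed offset and emit it.

    Returns the emitted records and the offset that a restart would see.
    """
    emitted = log[checkpoint:]
    new_mark = len(log)
    # If the mark is not persisted, a restart still sees the old offset.
    durable_mark = new_mark if persist_mark else checkpoint
    return emitted, durable_mark


def run_with_resume(log, persist_mark):
    """Run one batch, stop, then resume from whatever was checkpointed."""
    first, mark = run_batch(log, 0, persist_mark)
    second, _ = run_batch(log, mark, persist_mark)
    return first, second


backlog = ["k1", "k2", "k3", "k4"]

# Mark checkpointed: the resumed run has nothing left to read.
ok_first, ok_second = run_with_resume(backlog, persist_mark=True)

# Mark lost: the resumed run reads the whole backlog again.
bad_first, bad_second = run_with_resume(backlog, persist_mark=False)

print(ok_second)
print(bad_second)
```

If the hypothesis holds, the second case is what the test would observe: the resumed run
emits the same records as the first run, which would look like a second firing.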

I'll have a look at the execution times of the failed tests to figure out whether that seems
to be the case, and if so I will simply move this test to post-commit, because this issue
in Spark was only resolved in v2.0.


was (Author: amitsela):
This could be related to SPARK-14701 and/or SPARK-14930, in which case the last {{CheckpointMark}}
is not properly checkpointed.
If for some reason the runtime environment was so slow that execution failed to start before
the timeout was hit, graceful stop would force at least the first batch to finish. If that
first batch included the read from Kafka but failed to checkpoint the {{Reader}} mark,
resuming from checkpoint would read the entire Kafka backlog again, causing the failures
we see.

I'll have a look at the execution times of the failed tests to figure out whether that seems
to be the case, and if so I will simply move this test to post-commit, because this issue
in Spark was only resolved in v2.0.

> ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
> ------------------------------------------------------------------------------
>
>                 Key: BEAM-1582
>                 URL: https://issues.apache.org/jira/browse/BEAM-1582
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>             Fix For: First stable release
>
>
> See: https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/org.apache.beam$beam-runners-spark/2788/testReport/junit/org.apache.beam.runners.spark.translation.streaming/ResumeFromCheckpointStreamingTest/testWithResume/
> After some digging, it appears that a second firing occurs (though only one is expected),
> but it doesn't come from stale state (the state is empty before it fires).
> This might be a retry happening for some reason, which is OK in terms of fault-tolerance
> guarantees (at-least-once), but not so much in terms of flaky tests.
> I'm looking into this and hope to fix it ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
