flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Metzger <rmetz...@apache.org>
Subject Re: flink snapshotting fault-tolerance
Date Tue, 17 May 2016 12:49:31 GMT
Hi Stravos,

I haven't implemented our checkpointing mechanism and I didn't participate
in the design decisions while implementing it, so I can not compare it in
detail to other approaches.

>From a "does it work perspective": Checkpoints are only confirmed if all
parallel subtasks successfully created a valid snapshot of the state. So if
there is a failure in the checkpointing mechanism, no valid checkpoint will
be created. The system will recover from the last valid checkpoint.
There is a timeout for checkpoints. So if a barrier doesn't pass through
the system for a certain period of time, the checkpoint is cancelled. The
default timeout is 10 minutes.


On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos <
st.kontopoulos@gmail.com> wrote:

> Hi,
> I was looking into the flink snapshotting algorithm details also mentioned
> here:
> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
> https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
> http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html
> From other sources i understand that it assumes no failures to work for
> message delivery or for example a process hanging for ever:
> https://en.wikipedia.org/wiki/Snapshot_algorithm
> https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/
> So my understanding (maybe wrong) is that this is a solution which seems
> not to address the fault tolerance issue in a strong manner like for
> example if it was to use a 3pc protocol for local state propagation and
> global agreement. I know the latter is not efficient just mentioning it for
> comparison.
> How the algorithm behaves in practical terms under the presence of its own
> failures (this is a background process collecting partial states)? Are
> there timeouts for reaching a barrier?
> PS. have not looked deep into the code details yet, planning to.
> Best,
> Stavros

View raw message