flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stavros Kontopoulos <st.kontopou...@gmail.com>
Subject Re: flink snapshotting fault-tolerance
Date Thu, 19 May 2016 18:22:07 GMT
Hey thnx for the links. There are assumptions though like reliable
channels... since you rely on tcp in practice and if a checkpoint fails or
is very slow then you need to deal with it.... thats why i asked previously
what happens then..  3cp does not need assumptions i think, but engineering
is more practical (it should be) and a different story in general.
The [2] mentions also the assumptions...

Best,
Stavros

On Thu, May 19, 2016 at 9:14 PM, Paris Carbone <parisc@kth.se> wrote:

> Hi Stavros,
>
> Currently, rollback failure recovery in Flink works in the pipeline level,
> not in the task level (see Millwheel [1]). It further builds on repayable
> stream logs (i.e. Kafka), thus, there is no need for 3pc or backup in the
> pipeline sources. You can also check this presentation [2] which explains
> the basic concepts more in detail I hope. Mind that many upcoming
> optimisation opportunities are going to be addressed in the not so
> long-term Flink roadmap.
>
> Paris
>
> [1]
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
> [2]
> http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha
>
>
> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
>
>
> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
>
> On 19 May 2016, at 19:43, Stavros Kontopoulos <st.kontopoulos@gmail.com>
> wrote:
>
> Cool thnx. So if a checkpoint expires the pipeline will block or fail in
> total or only the specific task related to the operator (running along with
> the checkpoint task) or nothing happens?
>
> On Tue, May 17, 2016 at 3:49 PM, Robert Metzger <rmetzger@apache.org>
> wrote:
>
>> Hi Stravos,
>>
>> I haven't implemented our checkpointing mechanism and I didn't
>> participate in the design decisions while implementing it, so I can not
>> compare it in detail to other approaches.
>>
>> From a "does it work perspective": Checkpoints are only confirmed if all
>> parallel subtasks successfully created a valid snapshot of the state. So if
>> there is a failure in the checkpointing mechanism, no valid checkpoint will
>> be created. The system will recover from the last valid checkpoint.
>> There is a timeout for checkpoints. So if a barrier doesn't pass through
>> the system for a certain period of time, the checkpoint is cancelled. The
>> default timeout is 10 minutes.
>>
>> Regards,
>> Robert
>>
>>
>> On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos <
>> st.kontopoulos@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I was looking into the flink snapshotting algorithm details also
>>> mentioned here:
>>>
>>> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
>>>
>>> https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
>>>
>>> http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
>>>
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html
>>>
>>> From other sources i understand that it assumes no failures to work for
>>> message delivery or for example a process hanging for ever:
>>> https://en.wikipedia.org/wiki/Snapshot_algorithm
>>>
>>> https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/
>>>
>>> So my understanding (maybe wrong) is that this is a solution which seems
>>> not to address the fault tolerance issue in a strong manner like for
>>> example if it was to use a 3pc protocol for local state propagation and
>>> global agreement. I know the latter is not efficient just mentioning it for
>>> comparison.
>>>
>>> How the algorithm behaves in practical terms under the presence of its
>>> own failures (this is a background process collecting partial states)? Are
>>> there timeouts for reaching a barrier?
>>>
>>> PS. have not looked deep into the code details yet, planning to.
>>>
>>> Best,
>>> Stavros
>>>
>>>
>>
>
>

Mime
View raw message