flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stavros Kontopoulos <st.kontopou...@gmail.com>
Subject Re: flink snapshotting fault-tolerance
Date Thu, 19 May 2016 18:26:09 GMT
"Checkpoints are only confirmed if all parallel subtasks successfully
created a valid snapshot of the state." as stated by Robert. So to rephrase
my question... how confirmation that all snapshots are finished is done and
what happens if some task is very slow...or is blocked?
If you have N tasks confirmed and one missing what do you do? You start a
new checkpoint for that one? or a global new checkpoint for the rest of N
tasks as well?

On Thu, May 19, 2016 at 9:21 PM, Paris Carbone <parisc@kth.se> wrote:

>
> Regarding your last question,
> If a checkpoint expires it just gets invalidated and a new complete
> checkpoint will eventually occur that can be used for recovery. If I am
> wrong, or something has changed please correct me.
>
> Paris
>
> On 19 May 2016, at 20:14, Paris Carbone <parisc@kth.se> wrote:
>
> Hi Stavros,
>
> Currently, rollback failure recovery in Flink works in the pipeline level,
> not in the task level (see Millwheel [1]). It further builds on repayable
> stream logs (i.e. Kafka), thus, there is no need for 3pc or backup in the
> pipeline sources. You can also check this presentation [2] which explains
> the basic concepts more in detail I hope. Mind that many upcoming
> optimisation opportunities are going to be addressed in the not so
> long-term Flink roadmap.
>
> Paris
>
> [1]
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
> [2]
> http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha
>
>
> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
>
>
> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
>
> On 19 May 2016, at 19:43, Stavros Kontopoulos <st.kontopoulos@gmail.com>
> wrote:
>
> Cool thnx. So if a checkpoint expires the pipeline will block or fail in
> total or only the specific task related to the operator (running along with
> the checkpoint task) or nothing happens?
>
> On Tue, May 17, 2016 at 3:49 PM, Robert Metzger <rmetzger@apache.org>
> wrote:
>
>> Hi Stravos,
>>
>> I haven't implemented our checkpointing mechanism and I didn't
>> participate in the design decisions while implementing it, so I can not
>> compare it in detail to other approaches.
>>
>> From a "does it work perspective": Checkpoints are only confirmed if all
>> parallel subtasks successfully created a valid snapshot of the state. So if
>> there is a failure in the checkpointing mechanism, no valid checkpoint will
>> be created. The system will recover from the last valid checkpoint.
>> There is a timeout for checkpoints. So if a barrier doesn't pass through
>> the system for a certain period of time, the checkpoint is cancelled. The
>> default timeout is 10 minutes.
>>
>> Regards,
>> Robert
>>
>>
>> On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos <
>> st.kontopoulos@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I was looking into the flink snapshotting algorithm details also
>>> mentioned here:
>>>
>>> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
>>>
>>> https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
>>>
>>> http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
>>>
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html
>>>
>>> From other sources i understand that it assumes no failures to work for
>>> message delivery or for example a process hanging for ever:
>>> https://en.wikipedia.org/wiki/Snapshot_algorithm
>>>
>>> https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/
>>>
>>> So my understanding (maybe wrong) is that this is a solution which seems
>>> not to address the fault tolerance issue in a strong manner like for
>>> example if it was to use a 3pc protocol for local state propagation and
>>> global agreement. I know the latter is not efficient just mentioning it for
>>> comparison.
>>>
>>> How the algorithm behaves in practical terms under the presence of its
>>> own failures (this is a background process collecting partial states)? Are
>>> there timeouts for reaching a barrier?
>>>
>>> PS. have not looked deep into the code details yet, planning to.
>>>
>>> Best,
>>> Stavros
>>>
>>>
>>
>
>
>

Mime
View raw message