Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
From: Paris Carbone <parisc@kth.se>
To: Stavros Kontopoulos <st.kontopoulos@gmail.com>
CC: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: flink snapshotting fault-tolerance
Thread-Topic: flink snapshotting fault-tolerance
Date: Thu, 19 May 2016 18:48:52 +0000
Message-ID: <50FE65F1-1668-4E68-B07A-010C56333D5B@kth.se>
References: <CACTd3c9meNB7OtVX88emVfU7VzvCGaGMWCPoF7-+0SQeGVbVXQ@mail.gmail.com>
 <CAGr9p8BivffpVrFbqnMH+7zwHH-38wG5Eh5T-UZwGyFudbg1Qw@mail.gmail.com>
 <CACTd3c_sz18=S7wtbfctjO5i0s5=MqxJgJD29EDdurR5r9+6qQ@mail.gmail.com>
 <59949853-89C5-489C-8AFA-622B697F5892@kth.se>
 <8E662FE6-E75F-4676-87AF-442C657B0E05@tetrationanalytics.com>
 <CACTd3c_wvKum6ML4Uubd9orUjkkSNYRrFb31=SQgtsXEQmTcRA@mail.gmail.com>
In-Reply-To: <CACTd3c_wvKum6ML4Uubd9orUjkkSNYRrFb31=SQgtsXEQmTcRA@mail.gmail.com>
Accept-Language: en-US, sv-SE
Content-Language: en-US
Content-Type: multipart/alternative;
	boundary="_000_50FE65F116684E68B07A010C56333D5Bkthse_"
MIME-Version: 1.0
archived-at: Thu, 19 May 2016 18:49:02 -0000

--_000_50FE65F116684E68B07A010C56333D5Bkthse_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Sure. In practice you can set a threshold on retries, since an operator implementation could cause this indefinitely, or some other reason could make snapshotting infeasible altogether. If I recall correctly, that threshold exists in the Flink configuration.
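The retry threshold Paris describes, combined with Stavros's point (quoted below) that someone must be told when snapshotting keeps failing, can be sketched as a small failure counter. This is a minimal, hypothetical illustration, not Flink code: the class `CheckpointFailureTracker`, its methods, and the "fail after more than N consecutive failures" rule are assumptions. Later Flink releases expose a comparable setting, `CheckpointConfig#setTolerableCheckpointFailureNumber`; check the documentation of the version you run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not a Flink class): tracks consecutive checkpoint
// failures and raises an alert once more than `tolerableFailures` happen in a row.
final class CheckpointFailureTracker {
    private final int tolerableFailures;                // assumed knob: failures tolerated in a row
    private final Consumer<String> onThresholdExceeded; // assumed alerting hook (log, pager, ...)
    private int consecutiveFailures = 0;

    CheckpointFailureTracker(int tolerableFailures, Consumer<String> onThresholdExceeded) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
        this.onThresholdExceeded = onThresholdExceeded;
    }

    /** A checkpoint completed, so the failure streak is broken. */
    void onCheckpointCompleted(long checkpointId) {
        consecutiveFailures = 0;
    }

    /**
     * A checkpoint failed or expired. Returns true once the threshold is exceeded,
     * i.e. the job should be failed and a human notified instead of silently
     * waiting for the next checkpoint.
     */
    boolean onCheckpointFailed(long checkpointId, String reason) {
        consecutiveFailures++;
        if (consecutiveFailures <= tolerableFailures) {
            return false;
        }
        onThresholdExceeded.accept("checkpoint " + checkpointId + " failed (" + reason + "), "
                + consecutiveFailures + " consecutive failures");
        return true;
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }
}

class CheckpointFailureTrackerDemo {
    public static void main(String[] args) {
        List<String> alerts = new ArrayList<>();
        CheckpointFailureTracker tracker = new CheckpointFailureTracker(2, alerts::add);
        tracker.onCheckpointFailed(1, "timeout");   // 1 in a row: tolerated
        tracker.onCheckpointCompleted(2);           // streak reset
        tracker.onCheckpointFailed(3, "timeout");   // 1 in a row
        tracker.onCheckpointFailed(4, "timeout");   // 2 in a row: still tolerated
        boolean failJob = tracker.onCheckpointFailed(5, "timeout"); // 3 in a row: alert
        System.out.println(failJob + " " + alerts);
    }
}
```

Resetting on success means only an unbroken streak of failures trips the alert, which matches the "keeps failing (permanently)" case rather than occasional transient failures.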

On 19 May 2016, at 20:42, Stavros Kontopoulos <st.kontopoulos@gmail.com> wrote:

The problem here is different, though: if something keeps failing (permanently), in practice someone needs to be notified. If the user loses snapshotting, they must know.

On Thu, May 19, 2016 at 9:36 PM, Abhishek R. Singh <abhishsi@tetrationanalytics.com> wrote:
I was wondering how checkpoints can be asynchronous, given that your state is constantly mutating. Wouldn't you need versioned state, or immutable data structures?

-Abhishek-
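Abhishek's suggestion can be made concrete with a copy-on-write state holder: every update publishes a new immutable map, so taking a snapshot only means capturing the current reference, and the captured map can be persisted on another thread while processing continues. This is a minimal sketch under stated assumptions: `CopyOnWriteState` is hypothetical, it assumes a single writer thread, and it copies the whole map on each update purely for clarity; a practical backend would use a cheaper versioned or persistent data structure.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical state holder: every update publishes a NEW immutable map
// (copy-on-write), so a snapshot only needs to grab the current reference.
// Later updates never modify the captured map, so it can be persisted
// asynchronously while the operator keeps processing records.
final class CopyOnWriteState {
    private volatile Map<String, Long> current = Collections.emptyMap();

    // Processing thread (assumed to be the only writer): copy, modify, publish.
    void add(String key, long delta) {
        Map<String, Long> next = new HashMap<>(current);
        next.merge(key, delta, Long::sum);
        current = Collections.unmodifiableMap(next);
    }

    long get(String key) {
        return current.getOrDefault(key, 0L);
    }

    // Synchronous part: capture the reference (constant time).
    // Asynchronous part: "persist" it on another thread; a copy stands in
    // for serializing to durable storage.
    CompletableFuture<Map<String, Long>> snapshotAsync(ExecutorService io) {
        Map<String, Long> frozen = current;
        return CompletableFuture.supplyAsync(() -> new HashMap<>(frozen), io);
    }
}

class CopyOnWriteStateDemo {
    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        CopyOnWriteState state = new CopyOnWriteState();
        state.add("a", 1);
        CompletableFuture<Map<String, Long>> snapshot = state.snapshotAsync(io);
        state.add("a", 41); // processing continues while the snapshot is written
        System.out.println(snapshot.join() + " vs live " + state.get("a")); // {a=1} vs live 42
        io.shutdown();
    }
}
```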

On May 19, 2016, at 11:14 AM, Paris Carbone <parisc@kth.se> wrote:

Hi Stavros,

Currently, rollback failure recovery in Flink works at the pipeline level, not at the task level as in MillWheel [1]. It also builds on replayable stream logs (e.g. Kafka), so there is no need for 3PC or backups in the pipeline sources. You can also check this presentation [2], which I hope explains the basic concepts in more detail. Note that many upcoming optimisation opportunities will be addressed on the Flink roadmap in the not-too-distant future.

Paris

[1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
[2] http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha
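Paris's description of pipeline-level rollback over a replayable log can be illustrated with a single-process simulation: a checkpoint records the source's log offset together with the operator state, and on failure both are reset as one unit, after which the source re-reads the log. This is a hypothetical sketch (`ReplayPipeline` and its methods are invented for illustration; a `List` stands in for something like a Kafka partition), and it only demonstrates exactly-once effects on internal state, not on external sinks.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process simulation of pipeline-level rollback over a
// replayable log. A checkpoint stores the source offset TOGETHER with the
// operator state; on failure the whole pipeline is reset to that checkpoint
// and the source re-reads the log from the stored offset.
final class ReplayPipeline {
    static final class Checkpoint {
        final int sourceOffset;
        final Map<String, Integer> counts;

        Checkpoint(int sourceOffset, Map<String, Integer> counts) {
            this.sourceOffset = sourceOffset;
            this.counts = counts;
        }
    }

    private final List<String> log;                        // stands in for e.g. a Kafka partition
    private int offset = 0;                                // source position in the log
    private Map<String, Integer> counts = new HashMap<>(); // operator state: word counts
    private Checkpoint lastCompleted = new Checkpoint(0, Map.of());

    ReplayPipeline(List<String> log) {
        this.log = log;
    }

    // Processes up to n records; returns true while records remain.
    boolean process(int n) {
        for (int i = 0; i < n && offset < log.size(); i++) {
            counts.merge(log.get(offset++), 1, Integer::sum);
        }
        return offset < log.size();
    }

    // Source offset and operator state are captured as one consistent unit.
    void checkpoint() {
        lastCompleted = new Checkpoint(offset, Map.copyOf(counts));
    }

    // Simulated failure: progress since the last checkpoint is lost, and the
    // source position and the operator state roll back together.
    void failAndRecover() {
        offset = lastCompleted.sourceOffset;
        counts = new HashMap<>(lastCompleted.counts);
    }

    Map<String, Integer> counts() {
        return counts;
    }
}

class ReplayPipelineDemo {
    public static void main(String[] args) {
        ReplayPipeline p = new ReplayPipeline(List.of("a", "b", "a", "c", "a", "b"));
        p.process(2);                 // reads a, b
        p.checkpoint();               // offset 2, {a=1, b=1}
        p.process(2);                 // reads a, c, but no checkpoint covers them yet
        p.failAndRecover();           // back to offset 2, {a=1, b=1}
        p.process(Integer.MAX_VALUE); // replays a, c, then reads a, b
        System.out.println(p.counts()); // a=3, b=2, c=1: every record counted exactly once
    }
}
```

Because the log itself can be re-read, nothing beyond the offset has to be stored at the source, which is why no 3PC or source-side backup is needed in this scheme.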

On 19 May 2016, at 19:43, Stavros Kontopoulos <st.kontopoulos@gmail.com> wrote:

Cool, thanks. So if a checkpoint expires, will the pipeline block or fail entirely, will only the specific task related to the operator (running along with the checkpoint task) fail, or does nothing happen?

On Tue, May 17, 2016 at 3:49 PM, Robert Metzger <rmetzger@apache.org> wrote:
Hi Stavros,

I haven't implemented our checkpointing mechanism, and I didn't participate in the design decisions when it was implemented, so I cannot compare it in detail to other approaches.

From a "does it work" perspective: checkpoints are only confirmed if all parallel subtasks have successfully created a valid snapshot of their state. So if there is a failure in the checkpointing mechanism, no valid checkpoint will be created, and the system will recover from the last valid checkpoint.
There is a timeout for checkpoints: if a barrier doesn't pass through the system within a certain period of time, the checkpoint is cancelled. The default timeout is 10 minutes.

Regards,
Robert
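Robert's two rules (a checkpoint is confirmed only when every parallel subtask acknowledges a valid snapshot, and a pending checkpoint is cancelled after a timeout) can be sketched as a tiny coordinator. This is a hypothetical, heavily simplified illustration, not Flink's actual coordinator: the class and method names are invented, time is passed in explicitly so the timeout is deterministic, and an expired or declined checkpoint is simply discarded while processing continues. Whether a real job fails after such an expiry is governed by the failure threshold Paris mentions. In Flink the timeout itself is configured through `CheckpointConfig#setCheckpointTimeout`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical, heavily simplified coordinator (not Flink's implementation).
final class SimpleCheckpointCoordinator {
    private final int parallelism;    // subtasks that must all acknowledge
    private final long timeoutMillis; // e.g. 10 minutes, the default Robert quotes
    private final Map<Long, Set<Integer>> acks = new HashMap<>(); // pending id -> acked subtasks
    private final Map<Long, Long> startedAt = new HashMap<>();    // pending id -> trigger time
    private long nextId = 1;
    private Long lastCompleted = null; // latest checkpoint confirmed by every subtask

    SimpleCheckpointCoordinator(int parallelism, long timeoutMillis) {
        this.parallelism = parallelism;
        this.timeoutMillis = timeoutMillis;
    }

    // Starts a checkpoint; a real system would now inject barrier `id` at the sources.
    long trigger(long nowMillis) {
        long id = nextId++;
        acks.put(id, new HashSet<>());
        startedAt.put(id, nowMillis);
        return id;
    }

    // A subtask reports a valid snapshot. Returns true if this completes the checkpoint.
    boolean acknowledge(long id, int subtask, long nowMillis) {
        if (subtask < 0 || subtask >= parallelism) {
            throw new IllegalArgumentException("bad subtask index " + subtask);
        }
        expire(nowMillis);
        Set<Integer> acked = acks.get(id);
        if (acked == null) {
            return false; // unknown, expired, declined, or already complete
        }
        acked.add(subtask);
        if (acked.size() < parallelism) {
            return false;
        }
        acks.remove(id);
        startedAt.remove(id);
        if (lastCompleted == null || id > lastCompleted) {
            lastCompleted = id;
        }
        return true;
    }

    // A subtask failed to snapshot: the checkpoint as a whole is discarded.
    void decline(long id) {
        acks.remove(id);
        startedAt.remove(id);
    }

    // Cancels every pending checkpoint whose barriers did not get through in time.
    void expire(long nowMillis) {
        Iterator<Map.Entry<Long, Long>> it = startedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> e = it.next();
            if (nowMillis - e.getValue() > timeoutMillis) {
                acks.remove(e.getKey());
                it.remove();
            }
        }
    }

    boolean isPending(long id) {
        return acks.containsKey(id);
    }

    // The checkpoint recovery would restore from.
    Optional<Long> lastCompletedCheckpoint() {
        return Optional.ofNullable(lastCompleted);
    }
}

class SimpleCheckpointCoordinatorDemo {
    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        SimpleCheckpointCoordinator c = new SimpleCheckpointCoordinator(2, tenMinutes);
        long cp1 = c.trigger(0);
        c.acknowledge(cp1, 0, 1_000);
        c.acknowledge(cp1, 1, 2_000);         // both subtasks confirmed: cp1 completes
        long cp2 = c.trigger(60_000);
        c.acknowledge(cp2, 0, 61_000);        // subtask 1 never acknowledges
        c.expire(60_000 + tenMinutes + 1);    // past the timeout: cp2 is cancelled
        System.out.println(c.lastCompletedCheckpoint()); // Optional[1]
    }
}
```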


On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos <st.kontopoulos@gmail.com> wrote:
Hi,

I was looking into the details of the Flink snapshotting algorithm, also described here:
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html
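The barrier-based snapshotting described in the linked posts depends on barrier alignment at operators with several inputs: once a barrier arrives on one input channel, further records on that channel are held back until the same barrier has arrived on every input, and only then is the state snapshotted. The Java below is a minimal single-operator sketch under stated assumptions: `AligningOperator` is hypothetical, only one checkpoint is in flight (so barrier ids are omitted), the state is a running sum, and the snapshot is taken synchronously. It also shows why a coordinator-side timeout is needed: if a barrier never arrives on some channel, alignment never completes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical operator with several input channels, showing barrier alignment.
// Simplifications: one checkpoint in flight at a time (barrier ids omitted),
// a running sum as the operator state, and a synchronous snapshot.
final class AligningOperator {
    private final int numChannels;
    private final boolean[] blocked;                          // channel delivered the current barrier
    private final List<Queue<Long>> held = new ArrayList<>(); // post-barrier records per channel
    private int barriersSeen = 0;
    private long state = 0;
    private final List<Long> snapshots = new ArrayList<>();

    AligningOperator(int numChannels) {
        this.numChannels = numChannels;
        this.blocked = new boolean[numChannels];
        for (int c = 0; c < numChannels; c++) {
            held.add(new ArrayDeque<>());
        }
    }

    void onRecord(int channel, long value) {
        if (blocked[channel]) {
            held.get(channel).add(value); // belongs after the snapshot: hold it back
        } else {
            state += value;               // pre-barrier record: process normally
        }
    }

    void onBarrier(int channel) {
        if (blocked[channel]) {
            throw new IllegalStateException("sketch supports one checkpoint in flight only");
        }
        blocked[channel] = true;
        if (++barriersSeen < numChannels) {
            return; // keep waiting for the barrier on the remaining channels
        }
        // Aligned: state reflects exactly the pre-barrier records of every input.
        snapshots.add(state); // real system: persist state, then forward the barrier downstream
        barriersSeen = 0;
        for (int c = 0; c < numChannels; c++) {
            blocked[c] = false;
            Queue<Long> q = held.get(c);
            while (!q.isEmpty()) {
                state += q.poll();
            }
        }
    }

    long state() {
        return state;
    }

    List<Long> snapshots() {
        return snapshots;
    }
}

class AligningOperatorDemo {
    public static void main(String[] args) {
        AligningOperator op = new AligningOperator(2);
        op.onRecord(0, 1);
        op.onRecord(1, 10);
        op.onBarrier(0);     // channel 0 is now blocked
        op.onRecord(0, 100); // held back: it arrived after the barrier
        op.onRecord(1, 20);  // channel 1 has not seen the barrier: processed
        op.onBarrier(1);     // aligned: snapshot = 1 + 10 + 20 = 31
        System.out.println(op.snapshots() + " live=" + op.state()); // [31] live=131
    }
}
```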

From other sources I understand that it assumes no failures in message delivery and, for example, no process hanging forever:
https://en.wikipedia.org/wiki/Snapshot_algorithm
https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/

So my understanding (maybe wrong) is that this solution does not seem to address fault tolerance as strongly as it would if, for example, it used a 3PC protocol for local state propagation and global agreement. I know the latter is not efficient; I am just mentioning it for comparison.

How does the algorithm behave in practice when it fails itself (it is a background process collecting partial states)? Are there timeouts for reaching a barrier?

PS: I have not looked deeply into the code yet, but I plan to.

Best,
Stavros



--_000_50FE65F116684E68B07A010C56333D5Bkthse_--