From: Vishal Santoshi
Date: Thu, 5 Oct 2017 10:13:43 -0400
Subject: Re: Failing to recover once checkpoint fails
To: Fabian Hueske
Cc: user <user@flink.apache.org>, Stefan Richter

Also note that the ZooKeeper recovery data ( sadly on the same HDFS cluster ) showed the same behavior. It holds the pointers to the checkpoint ( I think that is what it does, keep metadata of where the checkpoint is, etc. ). It too decided to keep the recovery file from the failed state.

-rw-r--r--   3 root hadoop       7041 2017-10-04 13:55 /flink-recovery/prod/completedCheckpoint6c9096bb9ed4
-rw-r--r--   3 root hadoop       7044 2017-10-05 10:07 /flink-recovery/prod/completedCheckpoint7c5a19300092

This is getting a little interesting. What say you :)

On Thu, Oct 5, 2017 at 9:26 AM, Vishal Santoshi wrote:
> Another thing I noted was this:
>
> drwxr-xr-x   - root hadoop          0 2017-10-04 13:54 /flink-checkpoints/prod/c4af8dfa864e2f9a51764de9f0725b39/chk-44286
>
> drwxr-xr-x   - root hadoop          0 2017-10-05 09:15 /flink-checkpoints/prod/c4af8dfa864e2f9a51764de9f0725b39/chk-45428
>
> Generally, IMHO, what Flink does is replace the checkpoint directory with a new one; I see it happening now, every minute it replaces the old directory. In this job's case, however, it did not delete the 2017-10-04 13:54 directory, chk-44286. That was ( I think ) the last checkpoint successfully created before the NameNode had issues, but contrary to the usual behavior it was never deleted. It looks as if the job started with a blank slate. Does this strike a chord?
>
> On Thu, Oct 5, 2017 at 8:56 AM, Vishal Santoshi wrote:
>
>> Hello Fabian,
>>
>> First of all, congratulations on this fabulous framework. I have worked with GDF, and though GDF has some natural pluses, Flink's state management is far more advanced. With Kafka as a source it negates issues GDF has ( GDF integration with pub/sub is organic and that is to be expected, but non-FIFO pub/sub is an issue with windows on event time, etc. ).
>>
>> Coming back to this issue. We have the same Kafka topic feeding a streaming Druid datasource and we do not see any issue there, so data loss at the source ( Kafka ) is not applicable. I am totally certain that the "retention" time was not an issue: it is 4 days of retention and we fixed this issue within 30 minutes. We could replay Kafka with a new consumer group.id and that worked fine.
>>
>> Note these properties and see if they strike a chord.
>>
>> * The setCommitOffsetsOnCheckpoints(boolean) setting for Kafka consumers is at its default, true. I bring this up to see whether Flink would in any circumstance drive consumption from the Kafka-perceived offset rather than the one in the checkpoint. ( I have inlined a sketch of the consumer wiring right below the properties. )
>>
>> * state.backend.fs.memory-threshold: 0 has not been set. The state is big enough, though, so IMHO there is no way the state is stored along with the metadata in the JM ( or ZK ? ). The reason I bring this up is to make sure that when you say the size has to be less than 1024 bytes, you are talking about the cumulative state of the pipeline.
>>
>> * We have a good sense of SP ( savepoint ) and CP ( checkpoint ) and certainly understand that they are not dissimilar. However, in this case there were multiple attempts to restart the pipe before it finally succeeded.
>>
>> * Other HDFS-related properties:
>>
>> state.backend.fs.checkpointdir: hdfs:///flink-checkpoints/<%= flink_hdfs_root %>
>>
>> state.savepoints.dir: hdfs:///flink-savepoints/<%= flink_hdfs_root %>
>>
>> recovery.zookeeper.storageDir: hdfs:///flink-recovery/<%= flink_hdfs_root %>
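( Inline note on the first bullet above: this is roughly how the consumer is wired up. A sketch only, assuming the 0.10 connector, the usual flink-connector-kafka imports, and an existing StreamExecutionEnvironment env; the brokers, topic, and group.id below are placeholders, not the real values. )

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "kafka-broker:9092");   // placeholder
    props.setProperty("group.id", "prod-consumer-group");          // placeholder

    FlinkKafkaConsumer010<String> consumer =
            new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), props);

    // true is the default: offsets are committed back to Kafka on completed
    // checkpoints, but on recovery the consumer is supposed to resume from the
    // offsets stored in the checkpoint itself, not from what Kafka has committed.
    consumer.setCommitOffsetsOnCheckpoints(true);

    DataStream<String> stream = env.addSource(consumer);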
>>
>> Do these make sense? Is there anything else I should look at? Please also note that this is the second time this has happened. The first time I was vacationing and was not privy to the state of the Flink pipeline, but the net effect was similar: the counts for the first window after an internal restart dropped.
>>
>> Thank you for your patience and regards,
>>
>> Vishal
>>
>> On Thu, Oct 5, 2017 at 5:01 AM, Fabian Hueske wrote:
>>
>>> Hi Vishal,
>>>
>>> window operators are always stateful because the operator needs to remember previously received events (WindowFunction) or intermediate results (ReduceFunction).
>>> Given the program you described, a checkpoint should include the Kafka consumer offset and the state of the window operator. If the program eventually successfully (i.e., without an error) recovered from the last checkpoint, all its state should have been restored. Since the last checkpoint was before HDFS went into safe mode, the program would have been reset to that point. If the Kafka retention time is less than the time it took to fix HDFS, you would have lost data because it would have been removed from Kafka. If that's not the case, we need to investigate this further, because a checkpoint recovery must not result in state loss.
>>>
>>> Restoring from a savepoint is not so much different from automatic checkpoint recovery. Given that you have a completed savepoint, you can restart the job from that point. The main difference is that checkpoints are only used for internal recovery and are usually discarded once the job is terminated, while savepoints are retained.
>>>
>>> Regarding your question whether a failed checkpoint should cause the job to fail and recover, I'm not sure what the current status is.
>>> Stefan (in CC) should know what happens if a checkpoint fails.
>>>
>>> Best, Fabian
>>>
>>> 2017-10-05 2:20 GMT+02:00 Vishal Santoshi :
>>>
>>>> To add to it, my pipeline is a simple
>>>>
>>>> keyBy(0)
>>>>     .timeWindow(Time.of(window_size, TimeUnit.MINUTES))
>>>>     .allowedLateness(Time.of(late_by, TimeUnit.SECONDS))
>>>>     .reduce(new ReduceFunction(), new WindowFunction())
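( Inline, spelling that snippet out a bit more concretely. A sketch only, with the usual DataStream API imports assumed: the generics, the 15 minute / 30 second values, and the stand-in ReduceFunction / WindowFunction bodies are illustrative, and "events" stands for the keyed (key, count) tuples parsed off the Kafka source. )

    DataStream<Tuple2<String, Long>> counts = events
            .keyBy(0)
            .timeWindow(Time.of(15, TimeUnit.MINUTES))
            .allowedLateness(Time.of(30, TimeUnit.SECONDS))
            .reduce(
                    // incremental aggregation: a running sum per key, so the
                    // window state is a single record per key
                    new ReduceFunction<Tuple2<String, Long>>() {
                        @Override
                        public Tuple2<String, Long> reduce(Tuple2<String, Long> a,
                                                           Tuple2<String, Long> b) {
                            return Tuple2.of(a.f0, a.f1 + b.f1);
                        }
                    },
                    // emit the pre-aggregated value when the window fires
                    new WindowFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple, TimeWindow>() {
                        @Override
                        public void apply(Tuple key, TimeWindow window,
                                          Iterable<Tuple2<String, Long>> input,
                                          Collector<Tuple2<String, Long>> out) {
                            out.collect(input.iterator().next());
                        }
                    });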
>>>>
>>>> On Wed, Oct 4, 2017 at 8:19 PM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> Hello folks,
>>>>>
>>>>> As far as I know, a checkpoint failure should be ignored and retried with potentially larger state. I had this situation:
>>>>>
>>>>> * HDFS went into safe mode b'coz of NameNode issues
>>>>> * this exception was thrown
>>>>>
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby. Visit https://s.apache.org/sbnn-error
>>>>> ..................
>>>>> at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:453)
>>>>> at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.mkdirs(SafetyNetWrapperFileSystem.java:111)
>>>>> at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.createBasePath(FsCheckpointStreamFactory.java:132)
>>>>>
>>>>> * The pipeline came back after a few restarts and checkpoint failures, once the HDFS issues were resolved.
>>>>>
>>>>> I would not have worried about the restart, but it was evident that I lost my operator state. Either my Kafka consumer kept on advancing its offset between a restart and the next checkpoint failure ( a minute's worth ), or the operator that held the partial aggregates was lost. I have a 15-minute window of counts on a keyed operator.
>>>>>
>>>>> I am using RocksDB and of course have checkpointing turned on.
>>>>>
>>>>> The questions thus are:
>>>>>
>>>>> * Should a pipeline be restarted if a checkpoint fails?
>>>>> * Why, on restart, was the operator state not recreated?
>>>>> * Does the nature of the exception thrown have anything to do with this, b'coz suspend and resume from a savepoint work as expected?
>>>>> * And though I am pretty sure, are operators like the window operator stateful by default, so that if I have timeWindow(Time.of(window_size, TimeUnit.MINUTES)).reduce(new ReduceFunction(), new WindowFunction()), the state is managed by Flink?
>>>>>
>>>>> Thanks.
>>>>
>>>
>>
>
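P.S. For completeness, checkpointing and the RocksDB backend are enabled in the job roughly as below. This is a sketch, not the literal code: the one minute interval matches the checkpoint cadence described above, the checkpoint URI mirrors the state.backend.fs.checkpointdir value from flink-conf.yaml, and the usual flink-statebackend-rocksdb imports are assumed.

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // checkpoint every minute, exactly-once ( EXACTLY_ONCE is also the default mode )
    env.enableCheckpointing(60 * 1000);
    env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

    // RocksDB state backend; checkpoint data goes to HDFS
    // ( the RocksDBStateBackend(String) constructor declares IOException )
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink-checkpoints/prod"));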