From: Fabian Hueske <fhueske@gmail.com>
Date: Tue, 23 Jan 2018 10:15:11 +0100
Subject: Re: Failing to recover once checkpoint fails
To: Vishal Santoshi
Cc: Aljoscha Krettek, user@flink.apache.org, Stefan Richter

Sorry for the late reply.

I created FLINK-8487 [1] to track this problem.

@Vishal, can you have a look and check if I forgot some details? I logged the issue for Flink 1.3.2, is that correct?
Please add more information if you think it is relevant.

Thanks,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-8487

2018-01-18 22:14 GMT+01:00 Vishal Santoshi <vishal.santoshi@gmail.com>:

Or this one

https://issues.apache.org/jira/browse/FLINK-4815

On Thu, Jan 18, 2018 at 4:13 PM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

ping.

This happened again on production, and it seems reasonable to abort when a checkpoint is not found rather than behave as if it is a brand new pipeline.

On Tue, Jan 16, 2018 at 9:33 AM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

Folks, sorry for being late on this. Can somebody with knowledge of this code base create a JIRA issue for the above? We have seen this more than once on production.

On Mon, Oct 9, 2017 at 10:21 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:

Hi Vishal,

Some relevant Jira issues for you are:

- https://issues.apache.org/jira/browse/FLINK-4808: Allow skipping failed checkpoints
- https://issues.apache.org/jira/browse/FLINK-4815: Automatic fallback to earlier checkpoint when checkpoint restore fails
- https://issues.apache.org/jira/browse/FLINK-7783: Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()

Best,
Aljoscha

On 9. Oct 2017, at 09:06, Fabian Hueske <fhueske@gmail.com> wrote:

Hi Vishal,

it would be great if you could create a JIRA ticket with Blocker priority.
Please add all relevant information from your detailed analysis, add a link to this email thread (see [1] for the web archive of the mailing list), and post the id of the JIRA issue here.

Thanks for looking into this!

Best regards,
Fabian

[1] https://lists.apache.org/list.html?user@flink.apache.org

2017-10-06 15:59 GMT+02:00 Vishal Santoshi <vishal.santoshi@gmail.com>:

Thank you for confirming.

I think this is a critical bug. In essence, any checkpoint store (HDFS/S3/file) will lose state if it is unavailable at resume. This becomes all the more painful with your confirming that "failed checkpoints kill the job", because essentially it means that if the remote store is unavailable during a checkpoint, then you have lost state (unless, of course, you have a retry of none or an unbounded retry delay, a delay that you *hope* the store revives in).
Remember: the first retry failure will cause brand-new state, according to the code as written, iff the remote store is down. We would rather have a configurable property that establishes our desire to abort, something like an "abort_retry_on_chkretrevalfailure".

In our case it is very important that we do not undercount a window (one reason we use Flink and its awesome failure guarantees), as various alarms sound (we do anomaly detection on the time series).

Please create a JIRA ticket for us to follow, or we could do it.

PS: Not aborting on checkpoint failure, up to a configurable limit, is very important too.

On Fri, Oct 6, 2017 at 2:36 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:

Hi Vishal,

I think you're right! And thanks for looking into this so deeply.

With your last mail you're basically saying that the checkpoint could not be restored because your HDFS was temporarily down. If Flink had not deleted that checkpoint, it might have been possible to restore it at a later point, right?

Regarding failed checkpoints killing the job: yes, this is currently the expected behaviour, but there are plans to change this.

Best,
Aljoscha

On 5. Oct 2017, at 17:40, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

I think this is the offending piece. There is a catch-all Exception, which IMHO should distinguish a recoverable exception from an unrecoverable one.

    try {
        completedCheckpoint = retrieveCompletedCheckpoint(checkpointStateHandle);
        if (completedCheckpoint != null) {
            completedCheckpoints.add(completedCheckpoint);
        }
    } catch (Exception e) {
        LOG.warn("Could not retrieve checkpoint. Removing it from the completed " +
            "checkpoint store.", e);
        // remove the checkpoint with broken state handle
        removeBrokenStateHandle(checkpointStateHandle.f1, checkpointStateHandle.f0);
    }
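For illustration, a minimal sketch of that loop rewritten so that a transient store outage aborts the recovery attempt instead of silently dropping the pointer. This is not the actual Flink code; it assumes the same surrounding method and helpers as the snippet above, and the exception types are only meant to separate "store unreachable" from "handle permanently unreadable":

    try {
        completedCheckpoint = retrieveCompletedCheckpoint(checkpointStateHandle);
        if (completedCheckpoint != null) {
            completedCheckpoints.add(completedCheckpoint);
        }
    } catch (IOException e) {
        // The backing store (e.g. HDFS in safe mode) is unreachable: fail this
        // recovery attempt and keep the ZooKeeper pointer intact, so a later
        // attempt can still restore the checkpoint once the store is back.
        throw new FlinkException("Checkpoint store unavailable, aborting recovery.", e);
    } catch (Exception e) {
        // Only a handle that can never be read again should be treated as broken
        // and removed from the completed checkpoint store.
        LOG.warn("Discarding permanently unreadable checkpoint state handle.", e);
        removeBrokenStateHandle(checkpointStateHandle.f1, checkpointStateHandle.f0);
    }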
On Thu, Oct 5, 2017 at 10:57 AM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

So this is the issue; tell us if this reading is wrong. ZK had some state (backed by HDFS) that referred to a checkpoint (the same, exact, last checkpoint that completed successfully before the NameNode screwed us). When the JM tried to recreate the state, it failed to retrieve the CHK handle from HDFS because the NameNode was down, and conveniently (and I think very wrongly) removed the CHK from consideration and cleaned the pointer (though the cleanup also failed, since the NameNode was down, as is obvious from the dangling file in recovery). The metadata itself was on HDFS, and a failure to retrieve it should have been a stop-all, rather than trying to do magic around the exception and starting from a blank state.

org.apache.flink.util.FlinkException: Could not retrieve checkpoint 44286 from state handle under /0000000000000044286. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.

On Thu, Oct 5, 2017 at 10:13 AM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

Also note that the ZooKeeper recovery data (sadly on the same HDFS cluster) showed the same behavior. It had the pointers to the checkpoint (I think that is what it does: keeps metadata of where the checkpoint is, etc.). It too decided to keep the recovery file from the failed state.

-rw-r--r--   3 root hadoop       7041 2017-10-04 13:55 /flink-recovery/prod/completedCheckpoint6c9096bb9ed4
-rw-r--r--   3 root hadoop       7044 2017-10-05 10:07 /flink-recovery/prod/completedCheckpoint7c5a19300092

This is getting a little interesting. What say you :)

On Thu, Oct 5, 2017 at 9:26 AM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

Another thing I noted was this:

drwxr-xr-x   - root hadoop          0 2017-10-04 13:54 /flink-checkpoints/prod/c4af8dfa864e2f9a51764de9f0725b39/chk-44286
drwxr-xr-x   - root hadoop          0 2017-10-05 09:15 /flink-checkpoints/prod/c4af8dfa864e2f9a51764de9f0725b39/chk-45428

Generally what Flink does, IMHO, is replace the checkpoint directory with a new one. I see it happening now: every minute it replaces the old directory. In this job's case, however, it did not delete the 2017-10-04 13:54 directory and hence the chk-44286 directory. That was the last checkpoint (I think) successfully created before the NameNode had issues, but unlike the usual behavior it did not delete this chk-44286. It looks as if the job started with a blank slate. Does this strike a chord?

On Thu, Oct 5, 2017 at 8:56 AM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

Hello Fabian,

First of all, congratulations on this fabulous framework. I have worked with GDF, and though GDF has some natural pluses, Flink's state management is far more advanced. With Kafka as a source it negates issues GDF has (GDF integration with Pub/Sub is organic and that is to be expected, but non-FIFO Pub/Sub is an issue with windows on event time, etc.).

Coming back to this issue: we have that same Kafka topic feeding a streaming Druid datasource and we do not see any issue there, so data loss at the source (Kafka) is not applicable. I am totally certain that the "retention" time was not an issue. It is 4 days of retention and we fixed this issue within 30 minutes. We could replay Kafka with a new consumer group.id and that worked fine.

Note these properties and see if they strike a chord.

* setCommitOffsetsOnCheckpoints(boolean) for the Kafka consumers is the default, true. I bring this up to see whether Flink will, in any circumstance, drive consumption from the Kafka-perceived offset rather than the one in the checkpoint (see the sketch after this message).

* state.backend.fs.memory-threshold: 0 has not been set. The state is big enough, though, so IMHO there is no way the state is stored along with the metadata in the JM (or ZK?). The reason I bring this up is to make sure that when you say the size has to be less than 1024 bytes, you are talking about the cumulative state of the pipeline.

* We have a good sense of SP (savepoint) and CP (checkpoint) and certainly understand that they actually are not dissimilar. However, in this case there were multiple attempts to restart the pipe before it finally succeeded.

* Other HDFS-related properties:

state.backend.fs.checkpointdir: hdfs:///flink-checkpoints/<%= flink_hdfs_root %>
state.savepoints.dir: hdfs:///flink-savepoints/<%= flink_hdfs_root %>
recovery.zookeeper.storageDir: hdfs:///flink-recovery/<%= flink_hdfs_root %>

Do these make sense? Is there anything else I should look at? Please also note that this is the second time this has happened. The first time I was vacationing and was not privy to the state of the Flink pipeline, but the net effect was similar: the counts for the first window after an internal restart dropped.

Thank you for your patience and regards,

Vishal
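To make the offset question concrete, here is a minimal sketch of the consumer setup being discussed; the connector class, topic and broker names are placeholders rather than the production configuration. The point it illustrates: when a job restores from a checkpoint, the offsets stored inside the checkpoint take precedence, and the offsets committed back to Kafka (the group offsets) only matter when there is no checkpoint or savepoint state to restore from.

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "broker:9092");   // placeholder
    props.setProperty("group.id", "prod-consumer-group");    // placeholder

    FlinkKafkaConsumer010<String> consumer =
            new FlinkKafkaConsumer010<>("events-topic", new SimpleStringSchema(), props);

    // Default is true: offsets are committed back to Kafka when a checkpoint
    // completes, but only for monitoring/bookkeeping purposes.
    consumer.setCommitOffsetsOnCheckpoints(true);

    // Only consulted when the job starts without any checkpoint/savepoint state.
    consumer.setStartFromGroupOffsets();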
On Thu, Oct 5, 2017 at 5:01 AM, Fabian Hueske <fhueske@gmail.com> wrote:

Hi Vishal,

window operators are always stateful because the operator needs to remember previously received events (WindowFunction) or intermediate results (ReduceFunction).
Given the program you described, a checkpoint should include the Kafka consumer offset and the state of the window operator. If the program eventually successfully (i.e., without an error) recovered from the last checkpoint, all its state should have been restored. Since the last checkpoint was before HDFS went into safe mode, the program would have been reset to that point. If the Kafka retention time is less than the time it took to fix HDFS, you would have lost data because it would have been removed from Kafka. If that's not the case, we need to investigate this further because a checkpoint recovery must not result in state loss.

Restoring from a savepoint is not so much different from automatic checkpoint recovery. Given that you have a completed savepoint, you can restart the job from that point. The main difference is that checkpoints are only used for internal recovery and usually discarded once the job is terminated, while savepoints are retained.

Regarding your question whether a failed checkpoint should cause the job to fail and recover: I'm not sure what the current status is.
Stefan (in CC) should know what happens if a checkpoint fails.

Best, Fabian

2017-10-05 2:20 GMT+02:00 Vishal Santoshi <vishal.santoshi@gmail.com>:

To add to it, my pipeline is a simple

keyBy(0)
    .timeWindow(Time.of(window_size, TimeUnit.MINUTES))
    .allowedLateness(Time.of(late_by, TimeUnit.SECONDS))
    .reduce(new ReduceFunction(), new WindowFunction())
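For context, a self-contained sketch of that pipeline shape with checkpointing enabled. The source, key extraction, window sizes and sink are placeholders (the real job reads from Kafka and runs on event time, where allowedLateness actually takes effect), so this only illustrates the structure and the function signatures, not the production job:

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    public class KeyedCountWindowsSketch {

        public static void main(String[] args) throws Exception {
            final long windowSize = 15;  // minutes, stand-in for window_size
            final long lateBy = 30;      // seconds, stand-in for late_by

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);  // checkpoint every minute, as in the thread

            // Stand-in for the Kafka source: (key, count) pairs.
            DataStream<Tuple2<String, Long>> events =
                    env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 1L), Tuple2.of("a", 1L));

            events
                .keyBy(0)
                .timeWindow(Time.of(windowSize, TimeUnit.MINUTES))
                .allowedLateness(Time.of(lateBy, TimeUnit.SECONDS))
                .reduce(
                    new ReduceFunction<Tuple2<String, Long>>() {
                        @Override
                        public Tuple2<String, Long> reduce(Tuple2<String, Long> a, Tuple2<String, Long> b) {
                            return Tuple2.of(a.f0, a.f1 + b.f1);  // running count per key
                        }
                    },
                    new WindowFunction<Tuple2<String, Long>, Tuple3<String, Long, Long>, Tuple, TimeWindow>() {
                        @Override
                        public void apply(Tuple key, TimeWindow window,
                                          Iterable<Tuple2<String, Long>> reduced,
                                          Collector<Tuple3<String, Long, Long>> out) {
                            // With a ReduceFunction attached, the iterable holds one pre-aggregated element.
                            String k = key.getField(0);
                            long count = reduced.iterator().next().f1;
                            out.collect(Tuple3.of(k, window.getEnd(), count));
                        }
                    })
                .print();

            env.execute("keyed-count-windows-sketch");
        }
    }

With a bounded stand-in source like this, the job simply finishes; the relevant point for the thread is that both the running reduce aggregate and the window contents are operator state that a checkpoint must capture and a recovery must restore.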
On Wed, Oct 4, 2017 at 8:19 PM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:

Hello folks,

As far as I know, checkpoint failure should be ignored and retried with potentially larger state. I had this situation:

* HDFS went into safe mode because of NameNode issues
* an exception was thrown:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby. Visit https://s.apache.org/sbnn-error
..................
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:453)
    at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.mkdirs(SafetyNetWrapperFileSystem.java:111)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.createBasePath(FsCheckpointStreamFactory.java:132)

* the pipeline came back after a few restarts and checkpoint failures, once the HDFS issues were resolved.

I would not have worried about the restart, but it was evident that I lost my operator state. Either it was my Kafka consumer that kept on advancing its offset between a start and the next checkpoint failure (a minute's worth), or the operator that held partial aggregates was lost. I have a 15-minute window of counts on a keyed operator.

I am using RocksDB and of course have checkpointing turned on.

The questions thus are:

* Should a pipeline be restarted if a checkpoint fails?
* Why, on restart, was the operator state not recreated?
* Does the nature of the exception thrown have anything to do with this, given that suspend and resume from a savepoint work as expected?
* And though I am pretty sure, are operators like the window operator stateful by default, so that if I have timeWindow(Time.of(window_size, TimeUnit.MINUTES)).reduce(new ReduceFunction(), new WindowFunction()), the state is managed by Flink?

Thanks.