From Elias Levy <fearsome.lucid...@gmail.com>
Subject Cluster resurrects old job
Date Wed, 20 Jun 2018 16:31:51 GMT
We had an unusual situation last night.  One of our Flink clusters
experienced some connectivity issues, which led to the single job
running on the cluster failing and then being restored.

And then something odd happened.  The cluster decided to also restore an
old version of the job.  One we were running a month ago.  That job was
canceled on June 5 with a savepoint:

June 5th 2018, 15:00:43.865 Trying to cancel job
c59dd3133b1182ce2c05a5e2603a0646 with savepoint to
June 5th 2018, 15:00:44.438 Savepoint stored in
s3://bucket/flink/foo/savepoints/savepoint-c59dd3-f748765c67df. Now
cancelling c59dd3133b1182ce2c05a5e2603a0646.
June 5th 2018, 15:00:44.438 Job IOC Engine
(c59dd3133b1182ce2c05a5e2603a0646) switched from state RUNNING to
June 5th 2018, 15:00:44.495 Job IOC Engine
(c59dd3133b1182ce2c05a5e2603a0646) switched from state CANCELLING to
June 5th 2018, 15:00:44.507 Removed job graph
c59dd3133b1182ce2c05a5e2603a0646 from ZooKeeper.
June 5th 2018, 15:00:44.508 Removing
/flink/foo/checkpoints/c59dd3133b1182ce2c05a5e2603a0646 from ZooKeeper
June 5th 2018, 15:00:44.732 Job c59dd3133b1182ce2c05a5e2603a0646 has been
archived at s3://bucket/flink/foo/archive/c59dd3133b1182ce2c05a5e2603a0646.

But then yesterday:

June 19th 2018, 17:55:31.917 Attempting to recover job
June 19th 2018, 17:55:32.155 Recovered
SubmittedJobGraph(c59dd3133b1182ce2c05a5e2603a0646, JobInfo(clients:
start: 1524514537697)).
June 19th 2018, 17:55:32.157 Submitting job
c59dd3133b1182ce2c05a5e2603a0646 (Some Job) (Recovery).
June 19th 2018, 17:55:32.157 Using restart strategy
delayBetweenRestartAttempts=30000) for c59dd3133b1182ce2c05a5e2603a0646.
June 19th 2018, 17:55:32.157 Submitting recovered job
June 19th 2018, 17:55:32.158 Running initialization on master for job Some
Job (c59dd3133b1182ce2c05a5e2603a0646).
June 19th 2018, 17:55:32.165 Initialized in
June 19th 2018, 17:55:32.170 Job Some Job
(c59dd3133b1182ce2c05a5e2603a0646) switched from state CREATED to RUNNING.
June 19th 2018, 17:55:32.170 Scheduling job
c59dd3133b1182ce2c05a5e2603a0646 (Some Job).

Anyone seen anything like this?  Any ideas what the cause may have been?

I am guessing that the state in ZK or S3 may have been somewhat corrupted
when the job was previously shut down, and that when the cluster encountered
the networking problems yesterday that led to the cancel and restore of the
currently running job, the restore logic scanned ZK or S3 looking for jobs to
restore, came across the old job with bad state, and decided to bring it back
to life.

Any way to scan ZooKeeper or S3 for such jobs?
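
The kind of check I have in mind is roughly the following. This is only a
sketch under assumptions: I am guessing the job graphs live under
/flink/foo/jobgraphs (by analogy with the /flink/foo/checkpoints path in the
log above and Flink's default jobgraphs path setting), the function name and
the placeholder job ID are made up for illustration, and the ZooKeeper listing
is passed in as a plain callable so the comparison logic can be tried without
a live cluster.

```python
from typing import Callable, Iterable, List


def find_unexpected_job_graphs(
    list_children: Callable[[str], Iterable[str]],
    jobgraphs_path: str,
    expected_job_ids: Iterable[str],
) -> List[str]:
    """Return job IDs found under jobgraphs_path that are not expected to run.

    list_children is any function that returns the child node names of a
    ZooKeeper path. The path layout is an assumption, not confirmed.
    """
    expected = {job_id.lower() for job_id in expected_job_ids}
    found = list_children(jobgraphs_path)
    return sorted(job_id for job_id in found if job_id.lower() not in expected)


if __name__ == "__main__":
    # Stand-in for a real ZooKeeper listing. The second ID is a hypothetical
    # placeholder for the job that is supposed to be running.
    fake_zk = {
        "/flink/foo/jobgraphs": [
            "c59dd3133b1182ce2c05a5e2603a0646",
            "00000000000000000000000000000001",
        ]
    }
    leftovers = find_unexpected_job_graphs(
        lambda path: fake_zk.get(path, []),
        "/flink/foo/jobgraphs",
        ["00000000000000000000000000000001"],
    )
    print(leftovers)
```

Against a real ensemble, a client call such as kazoo's
KazooClient.get_children could presumably be passed as list_children. Any ID
this reports would still need to be checked by hand before deleting anything.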

