flink-user mailing list archives

From vino yang <yanghua1...@gmail.com>
Subject Re: Old job resurrected during HA failover
Date Thu, 02 Aug 2018 02:34:16 GMT
Hi Elias,

Your analysis is correct: yes, in theory the old job graph should be
deleted, but Flink currently uses a locking scheme and deletes the path
asynchronously, so it cannot give you an acknowledgment that the deletion
actually succeeded. That is a risk point.
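
To make the locking scheme concrete, here is a minimal sketch with the plain
ZooKeeper client (not Flink's actual code: the class name, connection string,
node names and payloads are invented for the example, and the parent
/flink/cluster_1/jobgraphs path is assumed to exist already). Every
JobManager that references a job graph creates an ephemeral child under the
graph's znode, and while such a child exists the parent cannot be removed:

    import java.util.UUID;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class JobGraphLockSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

            // Persistent node holding the pointer to the serialized job graph.
            zk.create("/flink/cluster_1/jobgraphs/jobgraph-1",
                    "state-handle".getBytes(),
                    Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Ephemeral "lock" child created by a JobManager that uses the
            // graph. It disappears only when that JobManager's ZK session ends
            // or when it is deleted explicitly.
            zk.create("/flink/cluster_1/jobgraphs/jobgraph-1/" + UUID.randomUUID(),
                    "10.210.42.62".getBytes(),
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // While the ephemeral child exists, deleting the parent fails with
            // KeeperException.NotEmptyException ("Directory not empty"), the
            // same error that shows up in the ZK log further down the thread.
            zk.delete("/flink/cluster_1/jobgraphs/jobgraph-1", -1);
        }
    }

If the old leader's ZK session stays alive, its ephemeral lock child stays as
well, and the asynchronous delete of the parent keeps failing without anyone
being notified.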

cc Till: other users have encountered this problem before. I personally
think the asynchronous deletion is risky, because it can lead to a canceled
job being revived by the JM after a failover.

Thanks, vino.

2018-08-02 5:25 GMT+08:00 Elias Levy <fearsome.lucidity@gmail.com>:

> I can see in the logs that JM 1 (10.210.22.167), the one that became
> leader after failover, thinks it deleted the
> 2a4eff355aef849c5ca37dbac04f2ff1 job from ZK when it was canceled:
>
> July 30th 2018, 15:32:27.231 Trying to cancel job with ID
> 2a4eff355aef849c5ca37dbac04f2ff1.
> July 30th 2018, 15:32:27.232 Job Some Job (2a4eff355aef849c5ca37dbac04f2ff1)
> switched from state RESTARTING to CANCELED.
> July 30th 2018, 15:32:27.232 Stopping checkpoint coordinator for job
> 2a4eff355aef849c5ca37dbac04f2ff1
> July 30th 2018, 15:32:27.239 Removed job graph
> 2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper.
> July 30th 2018, 15:32:27.245 Removing /flink/cluster_1/checkpoints/
> 2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper
> July 30th 2018, 15:32:27.251 Removing /checkpoint-counter/
> 2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper
>
> Both /flink/cluster_1/checkpoints/2a4eff355aef849c5ca37dbac04f2ff1
> and /flink/cluster_1/checkpoint-counter/2a4eff355aef849c5ca37dbac04f2ff1
> no longer exist, but for some reason the job graph is still there.
>
> Looking at the ZK logs I find the problem:
>
> July 30th 2018, 15:32:27.241 Got user-level KeeperException when
> processing sessionid:0x2000001d2330001 type:delete cxid:0x434c
> zxid:0x60009dd94 txntype:-1 reqpath:n/a Error Path:/flink/cluster_1/
> jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1 Error:KeeperErrorCode =
> Directory not empty for /flink/cluster_1/jobgraphs/
> 2a4eff355aef849c5ca37dbac04f2ff1
>
> Looking in ZK, we see:
>
> [zk: localhost:2181(CONNECTED) 0] ls /flink/cluster_1/jobgraphs/
> 2a4eff355aef849c5ca37dbac04f2ff1
> [d833418c-891a-4b5e-b983-080be803275c]
>
> From the comments in ZooKeeperStateHandleStore.java I gather that this
> child node is used as a deletion lock.  Looking at the contents of this
> ephemeral lock node:
>
> [zk: localhost:2181(CONNECTED) 16] get /flink/cluster_1/jobgraphs/
> 2a4eff355aef849c5ca37dbac04f2ff1/d833418c-891a-4b5e-b983-080be803275c
> *10.210.42.62*
> cZxid = 0x60002ffa7
> ctime = Tue Jun 12 20:01:26 UTC 2018
> mZxid = 0x60002ffa7
> mtime = Tue Jun 12 20:01:26 UTC 2018
> pZxid = 0x60002ffa7
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x30000003f4a0003
> dataLength = 12
> numChildren = 0
>
> and compared to the ephemeral node lock of the currently running job:
>
> [zk: localhost:2181(CONNECTED) 17] get /flink/cluster_1/jobgraphs/
> d77948df92813a68ea6dfd6783f40e7e/596a4add-9f5c-4113-99ec-9c942fe91172
> *10.210.22.167*
> cZxid = 0x60009df4b
> ctime = Mon Jul 30 23:01:04 UTC 2018
> mZxid = 0x60009df4b
> mtime = Mon Jul 30 23:01:04 UTC 2018
> pZxid = 0x60009df4b
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x2000001d2330001
> dataLength = 13
> numChildren = 0
>
> Assuming the content of the nodes represents the owner, it seems the job
> graph for the old canceled job, 2a4eff355aef849c5ca37dbac04f2ff1, is
> locked by the previous JM leader, JM 2 (10.210.42.62), while the running
> job is locked by the current JM leader, JM 1 (10.210.22.167).
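>
> As a sanity check, the owner can also be read programmatically. A tiny
> sketch, assuming a connected org.apache.zookeeper.ZooKeeper client named zk
> and the org.apache.zookeeper.data.Stat import (purely for inspection, not
> something Flink does itself):
>
>     Stat stat = zk.exists(
>             "/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1/"
>                     + "d833418c-891a-4b5e-b983-080be803275c", false);
>     // ephemeralOwner is the session id of the client that created the node.
>     // 0x30000003f4a0003 above is not the current leader's session
>     // 0x2000001d2330001, which is consistent with JM 2 still holding it.
>     System.out.printf("lock owned by session 0x%x%n", stat.getEphemeralOwner());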
>
> Somehow the previous leader, JM 2, did not give up the lock when
> leadership failed over to JM 1.
>
> Shouldn't something call ZooKeeperStateHandleStore.releaseAll during HA
> failover to release the locks on the graphs?
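>
> For what it is worth, a rough sketch of what such a release would have to do
> (again only an illustration, assuming a connected ZooKeeper client named zk,
> not the actual ZooKeeperStateHandleStore code): remove the ephemeral lock
> children owned by the current session so that a later delete of the job
> graph node can actually succeed:
>
>     String root = "/flink/cluster_1/jobgraphs";
>     for (String graph : zk.getChildren(root, false)) {
>         String graphPath = root + "/" + graph;
>         for (String lock : zk.getChildren(graphPath, false)) {
>             Stat s = zk.exists(graphPath + "/" + lock, false);
>             // Only drop the lock nodes this session created.
>             if (s != null && s.getEphemeralOwner() == zk.getSessionId()) {
>                 zk.delete(graphPath + "/" + lock, -1);
>             }
>         }
>     }
>
> Since the stale lock here belongs to JM 2's session, that kind of release
> would presumably have to run on JM 2 at the moment it loses leadership.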
>
>
> On Wed, Aug 1, 2018 at 9:49 AM Elias Levy <fearsome.lucidity@gmail.com>
> wrote:
>
>> Thanks for the reply.  Looking in ZK I see:
>>
>> [zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
>> [d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]
>>
>> Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even
>> though that job is no longer running (it was canceled while it was in a
>> loop attempting to restart, but failing because of a lack of cluster slots).
>>
>> Any idea why that may be the case?
>>
