flink-user mailing list archives

From Elias Levy <fearsome.lucid...@gmail.com>
Subject Re: Old job resurrected during HA failover
Date Wed, 01 Aug 2018 16:49:21 GMT
Vino,

Thanks for the reply.  Looking in ZK I see:

[zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
[d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]

Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even though
that job is no longer running (it was canceled while it was in a loop
attempting to restart, but failing because of a lack of cluster slots).
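
For anyone who wants to script this check, here is a minimal sketch in Python. It only does the comparison: it assumes the jobgraph IDs have already been collected (for example from the `ls` output above) and that the IDs of the running jobs have been collected separately (for example from `flink list`). The function name `stale_job_graphs` and the assumption that d77948df92813a68ea6dfd6783f40e7e is the job still running are illustrative, not something confirmed in this thread.

```python
def stale_job_graphs(zk_job_ids, running_job_ids):
    """Return job IDs that have a jobgraph node in ZK but are not running.

    Both arguments are iterables of hex job ID strings. The comparison is
    case-insensitive, and the result is sorted for stable output.
    """
    running = {job_id.lower() for job_id in running_job_ids}
    return sorted(job_id for job_id in zk_job_ids if job_id.lower() not in running)


# Children of /flink/cluster_1/jobgraphs, copied from the zkCli output above.
zk_job_ids = [
    "d77948df92813a68ea6dfd6783f40e7e",
    "2a4eff355aef849c5ca37dbac04f2ff1",
]

# ASSUMPTION: the first ID is the only job currently running.
running_job_ids = ["d77948df92813a68ea6dfd6783f40e7e"]

print(stale_job_graphs(zk_job_ids, running_job_ids))
```

Under that assumption, any ID it reports is a jobgraph node that a newly elected JM leader could pick up and recover on failover.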

Any idea why that may be the case?


On Wed, Aug 1, 2018 at 8:38 AM vino yang <yanghua1127@gmail.com> wrote:

> If a job is explicitly canceled, its jobgraph node in ZK is deleted.
> Note, however, that Flink deletes the jobgraph node asynchronously in a
> background thread, so there may be cases where the deletion does not
> happen.
> On the other hand, the jobgraph node in ZK is the only basis the JM
> leader has for restoring a job, so a leftover node can lead to an
> unexpected recovery, i.e. an old job being resurrected.
>
