flink-dev mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Zookeeper failure handling
Date Mon, 25 Sep 2017 12:10:39 GMT
Hi Gyula,

Internally, Flink uses the Curator LeaderLatch recipe for leader election.
The LeaderLatch revokes a contender's leadership when its connection to the
ZooKeeper quorum becomes SUSPENDED or LOST. The assumption is that a
contender that cannot talk to ZooKeeper can no longer be sure that it is
still the leader.
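
To make the rule concrete, here is a small toy model in Python. It is only a
sketch, not Curator's actual code: `ToyLeaderLatch` and its methods are
hypothetical names, the state names merely mirror Curator's connection
states, and it assumes the contender wins the election again as soon as it
is reconnected, which a real contender need not.

```python
from enum import Enum


class ConnectionState(Enum):
    # Names mirror Curator's connection states; the model itself is hypothetical.
    CONNECTED = "CONNECTED"
    SUSPENDED = "SUSPENDED"
    RECONNECTED = "RECONNECTED"
    LOST = "LOST"


class ToyLeaderLatch:
    """Toy model of the revocation rule described above (not Curator's code)."""

    def __init__(self):
        self.is_leader = False
        self.events = []  # history of leadership changes, for inspection

    def _set_leadership(self, new_value):
        # Record an event only when leadership actually changes.
        if new_value != self.is_leader:
            self.is_leader = new_value
            self.events.append("granted" if new_value else "revoked")

    def on_state_change(self, state):
        if state in (ConnectionState.SUSPENDED, ConnectionState.LOST):
            # Cannot talk to ZooKeeper: leadership can no longer be assumed.
            self._set_leadership(False)
        else:
            # Assumption: this contender wins the election once (re)connected.
            self._set_leadership(True)


latch = ToyLeaderLatch()
for state in (
    ConnectionState.CONNECTED,
    ConnectionState.SUSPENDED,
    ConnectionState.RECONNECTED,
):
    latch.on_state_change(state)
print(latch.events)
```

In this model a single SUSPENDED event is enough to produce a revoke followed
by a new grant, even though the session itself was never lost.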

Consequently, if a rolling update of your ZooKeeper cluster causes client
connections to be lost or suspended, the JobManager loses leadership, and the
Flink job is restarted once leadership has been reacquired.
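
A rough way to see the effect of a rolling update is to count how often the
client's own ZooKeeper server gets restarted. The sketch below is a
hypothetical model, not Flink or ZooKeeper code: the function name, the
server names, and the `pick_next` reconnect policy are all made up for
illustration, and it assumes that every suspended connection leads to
exactly one job restart after reconnecting.

```python
def rolling_restart_job_restarts(servers, connected_to, pick_next):
    """Count job restarts while each server in `servers` is restarted once, in order.

    Hypothetical model: the client is connected to one server. Restarting that
    server suspends the connection (leadership revoked, job suspended);
    reconnecting to another server leads to leadership being reacquired and
    the job being restarted.
    """
    restarts = 0
    for restarting in servers:
        if restarting == connected_to:
            # Connection SUSPENDED -> leadership revoked -> job suspended.
            others = [s for s in servers if s != restarting]
            connected_to = pick_next(others)
            # Reconnected -> leadership reacquired -> job restarted.
            restarts += 1
    return restarts


servers = ["zk1", "zk2", "zk3"]


def first_available(others):
    # Hypothetical reconnect policy: take the first server that is still up.
    return others[0]


# Client starts on zk2 and reconnects to zk1, which was already restarted.
lucky = rolling_restart_job_restarts(servers, "zk2", first_available)
# Client starts on zk1 and reconnects to zk2, which is restarted next.
unlucky = rolling_restart_job_restarts(servers, "zk1", first_available)
print(lucky, unlucky)
```

Under these assumptions the job restarts at least once per rolling update,
because the server the client is connected to is always restarted at some
point. Here the first client sees one restart and the second sees two.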

Cheers,
Till

On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.fora@gmail.com> wrote:

> We are using 1.3.2
>
> Gyula
>
> On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Which release are you using?
> >
> > Flink 1.3.2 uses Curator 2.12.0, which solves some leader election issues.
> >
> > Mind giving 1.3.2 a try?
> >
> > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.fora@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > We have observed that in case some nodes of the ZK cluster are restarted
> > > (for a rolling restart) the Flink Streaming jobs fail (and restart).
> > >
> > > Log excerpt:
> > >
> > > 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
> > > 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
> > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
> > > 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
> > > 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
> > > java.lang.Exception: JobManager is no longer the leader.
> > >
> > >
> > > Is this the expected behaviour?
> > >
> > > Thanks,
> > > Gyula
> > >
> >
>
