flink-user mailing list archives

From Andrew Ge Wu <andrew.ge...@eniro.com>
Subject Cluster failure after zookeeper glitch.
Date Thu, 19 Jan 2017 12:16:57 GMT
Hi,


We recently had several ZooKeeper glitches, and each time it seems to take the Flink cluster
down with it.

We are running Flink 1.03.

It started like this:


2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0008, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0009, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,151 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-01-19 11:52:13,151 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-01-19 11:52:13,166 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
2017-01-19 11:52:13,169 INFO  org.apache.flink.runtime.jobmanager.JobManager - JobManager akka://flink/user/jobmanager#1976923422 was revoked leadership.
2017-01-19 11:52:13,179 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - op1 -> (Map, Map -> op2) (18/24) (5336dd375eb12616c5a0e93c84f93465) switched from RUNNING to FAILED



Then our web UI stopped serving and the JobManager got stuck in an exception loop like this:
2017-01-19 13:05:13,521 WARN  org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(0318ecf5-7069-41b2-a793-2f24bdbaa287,01/19/2017 13:05:13 Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(0318ecf5-7069-41b2-a793-2f24bdbaa287).
2017-01-19 13:05:13,521 INFO  org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy - Delaying retry of job execution for xxxxx ms …


Have we misconfigured something, or is this expected behavior? When this happens, we
have to restart the cluster to bring it back.


Thanks!


Andrew
