hadoop-hdfs-issues mailing list archives

From "Bharat Viswanadham (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-12834) DFSZKFailoverController on error exits with 0 error code
Date Fri, 17 Nov 2017 18:19:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharat Viswanadham updated HDFS-12834:
--------------------------------------
    Status: Patch Available  (was: In Progress)

> DFSZKFailoverController on error exits with 0 error code
> --------------------------------------------------------
>
>                 Key: HDFS-12834
>                 URL: https://issues.apache.org/jira/browse/HDFS-12834
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0-alpha4, 2.7.3
>            Reporter: Zbigniew Kostrzewa
>            Assignee: Bharat Viswanadham
>         Attachments: HDFS-12834.00.patch
>
>
> On error, {{DFSZKFailoverController}} exits with return code 0, which causes problems when integrating it with scripts and monitoring tools. For example, systemd, when configured to restart the service only on failure, does not restart ZKFC because it exited with 0.
> In my case, systemd reported that zkfc exited successfully, but I found this in the logs:
> {noformat}
> 2017-11-14 05:33:55,075 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x15fb794bd240001, closing socket connection and attempting reconnect
> 2017-11-14 05:33:55,178 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
> 2017-11-14 05:33:55,564 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:33:55,566 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.73/10.9.4.73:2182, initiating session
> 2017-11-14 05:33:55,569 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.9.4.73/10.9.4.73:2182, sessionid = 0x15fb794bd240001, negotiated timeout = 5000
> 2017-11-14 05:33:55,570 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
> 2017-11-14 05:33:58,230 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:33:58,335 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
> 2017-11-14 05:33:58,402 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.138/10.9.4.138:2181. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:33:58,403 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.138/10.9.4.138:2181, initiating session
> 2017-11-14 05:33:58,406 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:33:59,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.228/10.9.4.228:2183. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:33:59,219 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.228/10.9.4.228:2183, initiating session
> 2017-11-14 05:33:59,221 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1773ms for sessionid 0x15fb794bd240001, closing socket connection and attempting reconnect
> 2017-11-14 05:34:01,196 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2017-11-14 05:34:02,153 INFO org.apache.zookeeper.ZooKeeper: Session: 0x15fb794bd240001 closed
> 2017-11-14 05:34:02,154 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2017-11-14 05:34:02,154 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2017-11-14 05:34:05,208 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
> 2017-11-14 05:34:05,487 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
> 2017-11-14 05:34:05,488 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2017-11-14 05:34:05,487 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
> 2017-11-14 05:34:05,488 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
> 2017-11-14 05:34:05,490 FATAL org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Got a fatal error, exiting now
> java.lang.RuntimeException: ZK Failover Controller failed: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
>         at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
>         at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
>         at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
>         at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
>         at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
>         at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
>         at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
> {noformat}
> The code that seems responsible is in {{DFSZKFailoverController.java}}:
> {code}
>   public static void main(String args[])
>       throws Exception {
> ...
>     int retCode = 0;
>     try {
>       retCode = zkfc.run(parser.getRemainingArgs());
>     } catch (Throwable t) {
>       LOG.fatal("Got a fatal error, exiting now", t); 
>     }   
>     System.exit(retCode);
>   }
> {code}
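
In the snippet quoted above, {{retCode}} is initialized to 0 and the catch block only logs, so an exception thrown by {{zkfc.run(...)}} leaves the exit code at 0. One possible shape of a fix is sketched below. This is not the attached patch: the class name {{ExitCodeSketch}}, the helper {{runAndGetExitCode}}, and the constant {{FATAL_EXIT_CODE}} (value 1) are hypothetical, a {{Callable}} stands in for {{zkfc.run(parser.getRemainingArgs())}}, and {{System.err}} stands in for {{LOG.fatal}}.

```java
import java.util.concurrent.Callable;

class ExitCodeSketch {
    // Hypothetical non-zero code for fatal errors; the real fix may pick another value.
    static final int FATAL_EXIT_CODE = 1;

    // Runs the task and returns the process exit code to use.
    // The Callable is a stand-in for zkfc.run(parser.getRemainingArgs()).
    static int runAndGetExitCode(Callable<Integer> task) {
        try {
            return task.call();
        } catch (Throwable t) {
            // Stand-in for LOG.fatal("Got a fatal error, exiting now", t)
            System.err.println("Got a fatal error, exiting now: " + t);
            // Unlike the quoted code, report the failure to the caller
            return FATAL_EXIT_CODE;
        }
    }

    public static void main(String[] args) {
        // Pass "fail" as the first argument to simulate the fatal error path.
        boolean simulateFailure = args.length > 0 && args[0].equals("fail");
        int retCode = runAndGetExitCode(() -> {
            if (simulateFailure) {
                throw new RuntimeException("ZK Failover Controller failed");
            }
            return 0;
        });
        if (retCode != 0) {
            // A supervisor restarting only on failure can now see the error
            System.exit(retCode);
        }
    }
}
```

With this structure, any {{Throwable}} escaping the run call yields a non-zero exit status, which is what a supervisor configured to restart only on failure would need in order to act.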



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)