hadoop-hdfs-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
Date Tue, 10 Feb 2015 08:04:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313776#comment-14313776 ]

Hadoop QA commented on HDFS-7763:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12697678/jstack.4936
  against trunk revision e0ec071.

    {color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9513//console

This message is automatically generated.

> fix zkfc hung issue due to not catching exception in a corner case
> ------------------------------------------------------------------
>
>                 Key: HDFS-7763
>                 URL: https://issues.apache.org/jira/browse/HDFS-7763
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.6.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>         Attachments: HDFS-7763.txt, jstack.4936
>
>
> In our production cluster, both zkfc processes hung after a ZooKeeper network outage.
> The zkfc log said:
> {code}
> 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
> 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed
> 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300
> 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300
> {code}
> The thread dump has also been uploaded as an attachment.
> From the dump, we can see that the process did not exit because of unknown non-daemon threads (pool-*-thread-*), while the critical threads, such as the health monitor and RPC threads, had already been stopped. As a result, our watchdog (supervisord) did not observe that the zkfc process was down or abnormal, so the subsequent namenode failover could not be done as expected.
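
For readers who want to reproduce the symptom outside Hadoop, here is a minimal sketch of how leftover non-daemon threads can be listed from inside a JVM. It is not part of the attached patch; the class name `NonDaemonThreadLister` and its method are hypothetical. It relies on the fact that an executor created with the default thread factory produces non-daemon threads named `pool-N-thread-M`, which matches the naming pattern seen in the dump.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NonDaemonThreadLister {
    /** Returns the names of all live non-daemon threads other than the calling thread. */
    public static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The default thread factory creates non-daemon threads named pool-N-thread-M.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.execute(() -> { });

        // The idle worker thread stays alive and would keep the JVM running
        // after main() returns if the pool were never shut down.
        System.out.println(liveNonDaemonThreadNames());

        // Shutting the pool down lets the worker terminate so the JVM can exit.
        pool.shutdown();
    }
}
```

Any thread reported by such a listing keeps the JVM alive after the main thread returns, unless something calls `System.exit`.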
> There are two possible fixes here: 1) figure out where the unnamed threads, such as pool-7-thread-1, come from, and either shut them down or set their daemon property. I tried to search but have found nothing so far. 2) catch the exception from ZKFailoverController.run() so that execution continues to System.exit. The attached patch implements 2).
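
The two options described above can be illustrated with a small standalone sketch. This is not the attached patch and does not use any Hadoop classes; `ZkfcExitSketch`, `namedDaemonFactory`, and `runAndGetExitCode` are hypothetical names chosen for illustration. The first helper shows option 1 in general form (named daemon threads cannot block JVM exit), and the second shows option 2 (a failure in the main run loop is caught and turned into an exit code, so `System.exit` is still reached).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class ZkfcExitSketch {
    /** Option 1: a factory whose threads are named and marked as daemon. */
    public static ThreadFactory namedDaemonFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return r -> {
            Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true); // daemon threads do not prevent JVM exit
            return t;
        };
    }

    /** Option 2: run the body and map any Throwable to a nonzero exit code. */
    public static int runAndGetExitCode(Callable<Integer> body) {
        try {
            return body.call();
        } catch (Throwable t) {
            System.err.println("Fatal error, exiting: " + t);
            return 1;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool =
            Executors.newSingleThreadExecutor(namedDaemonFactory("sketch-worker"));
        pool.execute(() -> { });

        // Simulated failure standing in for an exception escaping the run loop.
        int rc = runAndGetExitCode(() -> {
            throw new RuntimeException("simulated fatal ZooKeeper error");
        });

        // System.exit terminates the JVM even if non-daemon threads are still alive.
        System.exit(rc);
    }
}
```

Option 2 is the more defensive of the two in this sketch, since it does not depend on locating every thread pool that some library may have created.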



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
