Date: Thu, 24 Sep 2015 05:53:04 +0000 (UTC)
From: "zengyongping (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-9126) namenode crash in fsimage download/transfer

     [ https://issues.apache.org/jira/browse/HDFS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zengyongping updated HDFS-9126:
-------------------------------
    Description:
In our production Hadoop cluster, when the active NameNode begins downloading/transferring the fsimage from the standby NameNode, the ZKFC health monitor sometimes hits a socket timeout while checking NameNode health. The ZKFC then judges the active NameNode to be in state SERVICE_NOT_RESPONDING, an HA failover is triggered, and the old active NameNode is fenced.

zkfc logs:
2015-09-24 11:44:44,739 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at hostname1/192.168.10.11:8020: Call From hostname1/192.168.10.11 to hostname1:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.10.11:22614 remote=hostname1/192.168.10.11:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hostname1/192.168.10.11:8020 entered state: SERVICE_NOT_RESPONDING
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at hostname1/192.168.10.11:8020 and marking that fencing is necessary
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2015-09-24 11:44:44,761 INFO org.apache.zookeeper.ZooKeeper: Session: 0x54d81348fe503e3 closed
2015-09-24 11:44:44,761 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x54d81348fe503e3
2015-09-24 11:44:44,764 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

  was:
In our production Hadoop cluster, when the active NameNode begins downloading/transferring the fsimage from the standby NameNode, the ZKFC health monitor sometimes hits a socket timeout while checking NameNode health. The ZKFC then judges the active NameNode to be in state SERVICE_NOT_RESPONDING, an HA failover is triggered, and the old active NameNode is fenced.
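Editor's note: the log above shows the ZKFC health-check RPC timing out at its default 45000 ms while the NameNode is busy with the image transfer. As a possible mitigation (a sketch only, not verified against this cluster and not necessarily the fix adopted in this issue), the health-monitor timeout and the checkpoint image transfer bandwidth are both tunable in Hadoop 2.6.0; the values below are purely illustrative:

  <!-- core-site.xml on the NameNode/ZKFC hosts:
       raise the ZKFC health-monitor RPC timeout above the default 45000 ms
       so a slow-but-alive NameNode is not declared SERVICE_NOT_RESPONDING -->
  <property>
    <name>ha.health-monitor.rpc-timeout.ms</name>
    <value>120000</value>
  </property>

  <!-- hdfs-site.xml: throttle the fsimage transfer (bytes per second) so the
       NameNode stays responsive during checkpoint upload; 0 means unthrottled.
       10485760 (~10 MB/s) is an illustrative value, not a recommendation. -->
  <property>
    <name>dfs.image.transfer.bandwidthPerSec</name>
    <value>10485760</value>
  </property>

Whether these settings help depends on why the NameNode stalled (network saturation, disk I/O, or a long FSNamesystem lock hold during the transfer), so they should be treated as assumptions to test rather than a confirmed resolution.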
> namenode crash in fsimage download/transfer
> -------------------------------------------
>
>                 Key: HDFS-9126
>                 URL: https://issues.apache.org/jira/browse/HDFS-9126
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: OS: CentOS 6.5 (Final)
>            Hadoop: 2.6.0
>            NameNode HA based on 5 JournalNodes
>            Reporter: zengyongping
>            Priority: Critical
>
> In our production Hadoop cluster, when the active NameNode begins downloading/transferring the fsimage from the standby NameNode, the ZKFC health monitor sometimes hits a socket timeout while checking NameNode health. The ZKFC then judges the active NameNode to be in state SERVICE_NOT_RESPONDING, an HA failover is triggered, and the old active NameNode is fenced.
>
> zkfc logs:
> 2015-09-24 11:44:44,739 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at hostname1/192.168.10.11:8020: Call From hostname1/192.168.10.11 to hostname1:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.10.11:22614 remote=hostname1/192.168.10.11:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hostname1/192.168.10.11:8020 entered state: SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at hostname1/192.168.10.11:8020 and marking that fencing is necessary
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
> 2015-09-24 11:44:44,761 INFO org.apache.zookeeper.ZooKeeper: Session: 0x54d81348fe503e3 closed
> 2015-09-24 11:44:44,761 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x54d81348fe503e3
> 2015-09-24 11:44:44,764 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)