Date: Thu, 24 Sep 2015 05:53:04 +0000 (UTC)
From: "zengyongping (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-9126) namenode crash in fsimage download/transfer

     [ https://issues.apache.org/jira/browse/HDFS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zengyongping updated HDFS-9126:
-------------------------------
    Description:
In our production Hadoop cluster, when the active NameNode begins downloading/transferring the fsimage from the standby NameNode, the ZKFC health monitor sometimes hits a socket timeout while checking NameNode health. The ZKFC then judges the active NameNode to be in state SERVICE_NOT_RESPONDING, an HA failover is triggered, and the old active NameNode is fenced.

zkfc logs:
2015-09-24 11:44:44,739 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at hostname1/192.168.10.11:8020: Call From hostname1/192.168.10.11 to hostname1:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.10.11:22614 remote=hostname1/192.168.10.11:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hostname1/192.168.10.11:8020 entered state: SERVICE_NOT_RESPONDING
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at hostname1/192.168.10.11:8020 and marking that fencing is necessary
2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2015-09-24 11:44:44,761 INFO org.apache.zookeeper.ZooKeeper: Session: 0x54d81348fe503e3 closed
2015-09-24 11:44:44,761 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x54d81348fe503e3
2015-09-24 11:44:44,764 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

  was:
In our production Hadoop cluster, when the active NameNode begins downloading/transferring the fsimage from the standby NameNode, the ZKFC health monitor sometimes hits a socket timeout while checking NameNode health. The ZKFC then judges the active NameNode to be in state SERVICE_NOT_RESPONDING, an HA failover is triggered, and the old active NameNode is fenced.
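Editor's note: the log above shows the ZKFC health-check RPC timing out at its default 45000 ms while the NameNode is busy with the image transfer. As a possible mitigation (a sketch only, not verified against this cluster and not necessarily the fix adopted in this issue), the health-monitor timeout and the checkpoint image transfer bandwidth are both tunable in Hadoop 2.6.0; the values below are purely illustrative:

  <!-- core-site.xml on the NameNode/ZKFC hosts:
       raise the ZKFC health-monitor RPC timeout above the default 45000 ms
       so a slow-but-alive NameNode is not declared SERVICE_NOT_RESPONDING -->
  <property>
    <name>ha.health-monitor.rpc-timeout.ms</name>
    <value>120000</value>
  </property>

  <!-- hdfs-site.xml: throttle the fsimage transfer (bytes per second) so the
       NameNode stays responsive during checkpoint upload; 0 means unthrottled.
       10485760 (~10 MB/s) is an illustrative value, not a recommendation. -->
  <property>
    <name>dfs.image.transfer.bandwidthPerSec</name>
    <value>10485760</value>
  </property>

Whether these settings help depends on why the NameNode stalled (network saturation, disk I/O, or a long FSNamesystem lock hold during the transfer), so they should be treated as assumptions to test rather than a confirmed resolution.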
> namenode crash in fsimage download/transfer
> -------------------------------------------
>
>                 Key: HDFS-9126
>                 URL: https://issues.apache.org/jira/browse/HDFS-9126
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: OS: CentOS 6.5 (Final)
>            Hadoop: 2.6.0
>            NameNode HA based on 5 JournalNodes
>            Reporter: zengyongping
>            Priority: Critical
>
> In our production Hadoop cluster, when the active NameNode begins downloading/transferring the fsimage from the standby NameNode, the ZKFC health monitor sometimes hits a socket timeout while checking NameNode health. The ZKFC then judges the active NameNode to be in state SERVICE_NOT_RESPONDING, an HA failover is triggered, and the old active NameNode is fenced.
>
> zkfc logs:
> 2015-09-24 11:44:44,739 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at hostname1/192.168.10.11:8020: Call From hostname1/192.168.10.11 to hostname1:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.10.11:22614 remote=hostname1/192.168.10.11:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hostname1/192.168.10.11:8020 entered state: SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at hostname1/192.168.10.11:8020 and marking that fencing is necessary
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
> 2015-09-24 11:44:44,761 INFO org.apache.zookeeper.ZooKeeper: Session: 0x54d81348fe503e3 closed
> 2015-09-24 11:44:44,761 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x54d81348fe503e3
> 2015-09-24 11:44:44,764 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)