hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "LAXMAN KUMAR SAHOO (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7451) Namenode HA failover happens very frequently from active to standby
Date Thu, 27 Nov 2014 09:21:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227422#comment-14227422
] 

LAXMAN KUMAR SAHOO commented on HDFS-7451:
------------------------------------------

Many Thanks [~vinayrpet]. It helps a lot.

> Namenode HA failover happens very frequently from active to standby
> -------------------------------------------------------------------
>
>                 Key: HDFS-7451
>                 URL: https://issues.apache.org/jira/browse/HDFS-7451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: LAXMAN KUMAR SAHOO
>
> We have two namenode having HA enabled. From last couple of days we are observing that
the failover happens very frequently from active to standby mode. Below is the log details
of the active namenode during failover happens. Is there any fix to get rid of this issue?
> Namenode logs:
> {code}
> 2014-11-25 22:24:02,020 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call
org.apache.hadoop.hdfs.protocol.Clie
> ntProtocol.getListing from 10.2.16.214:40751: output error
> 2014-11-25 22:24:02,020 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020
caught an exception
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
>         at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2195)
>         at org.apache.hadoop.ipc.Server.access$2000(Server.java:110)
>         at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:979)
>         at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1045)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1798)
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager:
Finalizing edits file /sda/dfs/namenode/current/edits_inprogress_0000000001643676954 ->
/sda/dfs/namenode/current/edits_0000000001643676954-0000000001643677390
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager:
Closing
> java.lang.Exception
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.close(IPCLoggerChannel.java:182)
>         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.close(AsyncLoggerSet.java:102)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.close(QuorumJournalManager.java:446)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.close(JournalSet.java:107)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:222)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:347)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.close(JournalSet.java:219)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:308)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:939)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1365)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70)
>         at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.setState(ActiveState.java:52)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToStandby(NameNode.java:1278)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToStandby(NameNodeRpcServer.java:1046)
>         at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
>         at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3635)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-11-25 22:24:10,632 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting
services required for standby state
> 2014-11-25 22:24:10,633 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer:
Will roll logs on active node at dc1-had03-m002.dc01.revsci.net/10.2.16.92:8020 every 120
seconds.
> 2014-11-25 22:24:10,634 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer:
Starting standby checkpoint thread...
> Checkpointing active NN at dc1-had03-m002.dc01.revsci.net:50070
> Serving checkpoints at dc1-had03-m001.dc01.revsci.net/10.2.16.91:50070
> {code}
> zkfc logs:
> {code}
> 2014-11-25 22:24:12,192 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional
data from server sessionid 0x449b8
> ce9a110255, likely server has closed socket, closing socket connection and attempting
reconnect
> 2014-11-25 22:24:12,293 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected.
Entering neutral mode...
> 2014-11-25 22:24:12,950 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection
to server dc1-had03-zook06.dc01.re
> vsci.net/10.2.16.205:2181. Will not attempt to authenticate using SASL (unknown error)
> 2014-11-25 22:24:12,951 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to dc1-had03-zook06.dc01.revsc
> i.net/10.2.16.205:2181, initiating session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to
ZooKeeper service, session 0x449b8ce9
> a110255 has expired, closing socket connection
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired.
Entering neutral mode and rejoini
> ng...
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish
ZK session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=dc1-had03-zook01.
> dc01.revsci.net:2181,dc1-had03-zook02.dc01.revsci.net:2181,dc1-had03-zook03.dc01.revsci.net:2181,dc1-had03-zook04.dc01.rev
> sci.net:2181,dc1-had03-zook05.dc01.revsci.net:2181,dc1-had03-zook06.dc01.revsci.net:2181
sessionTimeout=5000 watcher=org.a
> pache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f042529
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection
to server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181. Will not attempt to authenticate
using SASL (unknown error)
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, initiating session
> 2014-11-25 22:24:13,172 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, sessionid = 0x149a7f9b6d60263,
negotiated timeout = 5000
> 2014-11-25 22:24:13,173 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
> 2014-11-25 22:24:13,173 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale
result from old client with sessionId 0x449b8ce9a110255
> 2014-11-25 22:24:13,173 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2014-11-25 22:24:13,389 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated
that NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 should become standby
> 2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully
transitioned NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state
> {code}
> -laxman



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message