hadoop-hdfs-issues mailing list archives

From "linhaiqiang (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (HDFS-9126) namenode crash in fsimage download/transfer
Date Wed, 12 Jul 2017 03:05:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

linhaiqiang updated HDFS-9126:
------------------------------
    Comment: was deleted

(was: We hit a similar issue. At 2017-07-06 17:22:12 the active namenode's health monitor
timed out after 45 seconds, the active namenode quit the election, and its ZK connection
was closed. However, the namenode log shows no new entries after 2017-07-06 17:20:37, no
exception is logged, and the GC log looks normal.
From source-code analysis, the abnormal sequence is as follows:
1. The healthMonitor on the active node times out after 45 seconds while checking the NN
state, so the service enters the SERVICE_NOT_RESPONDING state.
2. The active node quits the election, the zkClient closes its connection to ZooKeeper, and
the ephemeral lock znode on ZK is automatically deleted.
3. The standby node acquires the lock znode on ZK and starts switching itself to active.
To prevent a split brain with two simultaneous actives, it first asks the old active to
transition to standby via an RPC request.
4. If that RPC request is blocked, it connects over SSH and kills the NameNode process.
5. After the fence succeeds, the standby transitions itself to active (a minimal sketch of
this sequence follows this comment).
As for why the 45-second timeout happened: because the process was killed, the jstack stack
trace from that moment cannot be obtained, so there is no way to investigate further.)
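
The five steps above describe the ZKFC failover flow. Below is a minimal, illustrative Java
sketch of that flow; the interfaces and names (HaService, Fencer, onLockAcquired) are
hypothetical stand-ins and are not Hadoop's actual ZKFailoverController or
ActiveStandbyElector API.

// Illustrative sketch only, under the assumptions stated above.
public class FailoverSequenceSketch {

    /** RPC view of the old active NameNode (steps 3-4 above). */
    interface HaService {
        boolean transitionToStandby() throws Exception; // graceful fence via RPC
    }

    /** Fallback fence: connect over SSH and kill the NameNode process (step 4). */
    interface Fencer {
        boolean fenceBySsh();
    }

    /**
     * Called on the standby's ZKFC once it has acquired the lock znode (step 3):
     * fence the old active first, and only then become active (step 5), so two
     * actives can never coexist.
     */
    static void onLockAcquired(HaService oldActive, Fencer fencer, Runnable becomeActive) {
        boolean fenced;
        try {
            fenced = oldActive.transitionToStandby();
        } catch (Exception rpcBlockedOrTimedOut) {
            fenced = false; // e.g. the RPC hangs, as suspected in the comment above
        }
        if (!fenced) {
            fenced = fencer.fenceBySsh();
        }
        if (fenced) {
            becomeActive.run();
        }
        // If fencing fails entirely, the standby must NOT promote itself.
    }
}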

> namenode crash in fsimage download/transfer
> -------------------------------------------
>
>                 Key: HDFS-9126
>                 URL: https://issues.apache.org/jira/browse/HDFS-9126
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: OS:Centos 6.5(final)
> Apache Hadoop:2.6.0
> namenode ha base 5 journalnodes
>            Reporter: zengyongping
>            Priority: Critical
>
> In our production Hadoop cluster, when the active namenode begins downloading/transferring
> the fsimage from the standby namenode, the zkfc health check of the NameNode sometimes hits
> a socket timeout. zkfc then judges the active namenode to be SERVICE_NOT_RESPONDING, an HDFS
> namenode HA failover happens, and the old active namenode is fenced. (The configuration keys
> involved are sketched after the quoted logs below.)
> zkfc logs:
> 2015-09-24 11:44:44,739 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception
trying to monitor health of NameNode at hostname1/192.168.10.11:8020: Call From hostname1/192.168.10.11
to hostname1:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000
millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/192.168.10.11:22614 remote=hostname1/192.168.10.11:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Local service
NameNode at hostname1/192.168.10.11:8020 entered state: SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master
election for NameNode at hostname1/192.168.10.11:8020 and marking that fencing is necessary
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from
election
> 2015-09-24 11:44:44,761 INFO org.apache.zookeeper.ZooKeeper: Session: 0x54d81348fe503e3
closed
> 2015-09-24 11:44:44,761 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale
result from old client with sessionId 0x54d81348fe503e3
> 2015-09-24 11:44:44,764 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> namenode logs:
> 2015-09-24 11:43:34,074 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll
Edit Log from 192.168.10.12
> 2015-09-24 11:43:34,074 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling
edit logs
> 2015-09-24 11:43:34,075 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending
log segment 2317430129
> 2015-09-24 11:43:34,253 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number
of transactions: 272988 Total time for transactions(ms): 5502 Number of transactions batched
in Syncs: 146274 Number of syncs: 32375 SyncTimes(ms): 274465 319599
> 2015-09-24 11:43:46,005 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
> 2015-09-24 11:44:21,054 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
PendingReplicationMonitor timed out blk_1185804191_112164210
> 2015-09-24 11:44:36,076 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager:
Finalizing edits file /software/data/hadoop-data/hdfs/namenode/current/edits_inprogress_0000000002317430129
-> /software/data/hadoop-data/hdfs/namenode/current/edits_0000000002317430129-0000000002317703116
> 2015-09-24 11:44:36,077 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting
log segment at 2317703117
> 2015-09-24 11:45:38,008 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number
of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs:
0 Number of syncs: 0 SyncTimes(ms): 0 61585
> 2015-09-24 11:45:38,009 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage:
Transfer took 222.88s at 63510.29 KB/s
> 2015-09-24 11:45:38,009 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage:
Downloaded file fsimage.ckpt_0000000002317430128 size 14495092105 bytes.
> 2015-09-24 11:45:38,416 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager:
Remote journal 192.168.10.13:8485 failed to write txns 2317703117-2317703117. Will try to
write to this JN again after the next log roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 44 is less than
the last promised epoch 45
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
>         at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>         at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy9.journal(Unknown Source)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 		
> # Similar log entries like the above follow
> 2015-09-24 11:45:38,418 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager:
Aborting QuorumOutputStream starting at txid 2317703117
> 2015-09-24 11:45:38,505 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> 2015-09-24 11:45:38,549 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at hostname1/192.168.10.11
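
Two configuration keys govern the quoted behavior: ha.health-monitor.rpc-timeout.ms (the
ZKFC health-check RPC timeout, 45000 ms by default, which matches the "45000 millis timeout"
SocketTimeoutException in the zkfc log) and dfs.ha.fencing.methods (e.g. sshfence, which
kills the old NameNode over SSH). The "IPC's epoch 44 is less than the last promised epoch
45" errors then show QJM fencing at work: once the new active starts writing with a higher
epoch, the journal nodes reject the old active's writes and it aborts and exits. The snippet
below is a minimal sketch, assuming the Hadoop client libraries and the cluster's
core-site.xml/hdfs-site.xml are on the classpath, that merely prints the effective values; it
is illustrative and not part of any fix discussed in this issue.

import org.apache.hadoop.conf.Configuration;

public class PrintHaFailoverSettings {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml from the classpath, if present.
        Configuration conf = new Configuration();
        // ZKFC HealthMonitor RPC timeout; 45000 ms by default, matching the
        // SocketTimeoutException seen in the zkfc log above.
        long healthTimeoutMs = conf.getLong("ha.health-monitor.rpc-timeout.ms", 45000L);
        // Fencing methods tried after the old active is marked for fencing,
        // e.g. "sshfence" which logs in over SSH and kills the NameNode process.
        String fencingMethods = conf.get("dfs.ha.fencing.methods", "(not set)");
        System.out.println("ha.health-monitor.rpc-timeout.ms = " + healthTimeoutMs);
        System.out.println("dfs.ha.fencing.methods = " + fencingMethods);
    }
}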



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

