hadoop-hdfs-issues mailing list archives

From "Marc Heide (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4176) EditLogTailer should call rollEdits with a timeout
Date Thu, 30 Oct 2014 07:57:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189767#comment-14189767 ]

Marc Heide commented on HDFS-4176:
----------------------------------

So what became of this issue?

I am pretty sure that we have observed exactly this problem on one of our test clusters running
the Cloudera 4.5 (Hadoop 2.0.0-cdh4.5.0) release in Quorum-based HA mode. For a test we intentionally
destroyed one of the active NameNode's disks using the Linux dd command (yeah, it's ugly, but so
is life). The poor thing got stuck in an IO operation trying to close a file. The blocked thread
held locks which in turn blocked a lot of other threads (e.g. the handlers for incoming RPC calls).
That had a fatal impact on the whole cluster, since everything stopped working at once. HBase,
HDFS and all commands failed, either coming back with a timeout or simply hanging forever.
Unfortunately, the liveness checks from the ZKFC seemed to work just fine, so the ZKFC did not
detect a failure and hence did not trigger a failover.
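
To illustrate the mechanism (a simplified, self-contained sketch; the lock and the sleeping IO
call are stand-ins, not actual NameNode code): a single thread parked in an uninterruptible
operation while holding a shared lock leaves every other thread that needs that lock stuck in
BLOCKED state.

{code:java}
import java.util.concurrent.CountDownLatch;

public class StuckLockHolder {
    private static final Object nsLock = new Object(); // stand-in for a shared namesystem lock

    public static void main(String[] args) throws Exception {
        CountDownLatch lockHeld = new CountDownLatch(1);

        // "IO thread": grabs the lock, then blocks forever
        // (the sleep stands in for an IO call against a dead disk).
        new Thread(() -> {
            synchronized (nsLock) {
                lockHeld.countDown();
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) {
                }
            }
        }, "io-thread").start();

        lockHeld.await();

        // "RPC handler": now hangs waiting for the same lock.
        Thread rpcHandler = new Thread(() -> {
            synchronized (nsLock) {
                System.out.println("never reached while io-thread is stuck");
            }
        }, "rpc-handler");
        rpcHandler.start();

        rpcHandler.join(2000); // give it two seconds, then inspect
        System.out.println("rpc-handler state: " + rpcHandler.getState()); // prints BLOCKED
        System.exit(0); // io-thread would otherwise keep the JVM alive forever
    }
}
{code}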

So we tried to stop it manually. After a kill -2 and then a kill -9 on the NameNode process,
the ZKFC finally detected the failure and tried to activate the standby NameNode on another
machine. But that got stuck too. I have attached the pstack of this NameNode process as it
tried to become active but never made it. As far as I can see, it is not able to stop the
EditLogTailerThread.

The root cause is probably that the formerly active NameNode was not really dead. After searching
around for some time we found that it had left a zombie (defunct) process behind, which still held
port 8020 open! You cannot kill such a zombie in Linux without a reboot. So this is exactly the
situation described here: the former NN was frozen but not really dead, and the standby could
not go active.
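
For what it's worth, a connect with an explicit timeout is a quick way to check whether anything
still answers on the old port (a minimal sketch; the hostname is a placeholder):

{code:java}
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "former-nn.example.com"; // placeholder host
        try (Socket socket = new Socket()) {
            // Bound the connect so the probe fails fast instead of
            // hanging on a half-dead peer, the way our clients did.
            socket.connect(new InetSocketAddress(host, 8020), 5000);
            System.out.println("Port 8020 still accepts connections");
        } catch (Exception e) {
            System.out.println("Connect to port 8020 failed: " + e);
        }
    }
}
{code}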

Another sad story is that even restarting this standby NameNode did not fully help. It became
active, that's fine. But as long as the zombie was running and kept its port 8020 open, all
clients got stuck: HBase did not start properly, nor could we access HDFS with the dfs client
commands. Only when we rebooted the former NN's machine did the cluster start up properly. But
this is probably not part of this Jira. So working with interruptible RPC calls and using a
timeout everywhere seems to be vital.
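
As an illustration of what calling rollEdits with a timeout could look like (a minimal sketch,
not the actual EditLogTailer code; rollEditsRpc() is a hypothetical stand-in for the blocking
RPC to the active NN):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedRollEdits {
    // Hypothetical stand-in for the blocking rollEdits() RPC;
    // here it just simulates a frozen active NN.
    static long rollEditsRpc() throws Exception {
        Thread.sleep(Long.MAX_VALUE);
        return 0L;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Long> call = executor.submit(BoundedRollEdits::rollEditsRpc);
        try {
            long txid = call.get(5, TimeUnit.SECONDS); // bound the wait
            System.out.println("rolled edit log, txid = " + txid);
        } catch (TimeoutException e) {
            call.cancel(true); // interrupt the stuck call instead of hanging forever
            System.out.println("rollEdits timed out; not waiting on the frozen NN");
        } finally {
            executor.shutdownNow();
        }
    }
}
{code}

Running the call in a separate thread and bounding the wait with Future.get() keeps the tailer
thread responsive even when the underlying RPC itself is not interruptible.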

> EditLogTailer should call rollEdits with a timeout
> --------------------------------------------------
>
>                 Key: HDFS-4176
>                 URL: https://issues.apache.org/jira/browse/HDFS-4176
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 3.0.0, 2.0.2-alpha
>            Reporter: Todd Lipcon
>
> When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it currently
> does so without a timeout. So, if the active NN has frozen (but not actually crashed),
> this call can hang forever. This can then potentially prevent the standby from becoming
> active.
> This may actually be considered a side effect of HADOOP-6762 -- if the RPC were
> interruptible, that would also fix the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
