hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
Date Mon, 17 Nov 2014 05:31:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214287#comment-14214287 ]

Colin Patrick McCabe commented on HDFS-7400:

The failure detailed in the description doesn't seem like a network issue.  It sounds like
the NN disks stopped working and this was causing major problems for HDFS, but ZKFC continued
to report that things were OK.  We could speculate about possible ways to recover from different
network issues, but none of them would address this particular failure mode.

I would guess that the failure to ssh was because the ssh login shell tried to read some file,
and the kernel got stuck forever waiting for the underlying disk to respond (or even eventually
failed, causing the login to fail). 

I'm not sure what we could do better here.  We could have the ZKFC check write an edit log
op... and then if that wasn't possible, we'd fail.  But that check would not work for NNs in safe
mode.  We could write to an arbitrary file on one of the NN's disks during a ZKFC checkup,
but the NN can use multiple disks.  And anyway, just because you can write to one file doesn't
mean you can write to others, when things in the kernel are getting weird.  Interesting ideas,
maybe you guys can come up with something here...
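The per-disk write probe mentioned above could be sketched roughly as follows. This is not existing ZKFC code; the class name, probe file name, and timeout are all illustrative assumptions. The key point is that the write must be bounded by a deadline, because a hung disk controller makes the write block forever in the kernel rather than fail:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of a bounded per-directory write probe (names and timeout are assumptions). */
public class DiskWriteProbe {
    private static final long TIMEOUT_MS = 5000; // assumed probe deadline

    /** Returns true iff a small file can be written and synced to dir within the deadline. */
    public static boolean probe(File dir, ExecutorService executor) {
        Future<Boolean> f = executor.submit(() -> {
            File probeFile = new File(dir, ".health-probe");
            try (FileOutputStream out = new FileOutputStream(probeFile)) {
                out.write(0);
                out.getFD().sync(); // force the write through the page cache to the device
                return true;
            } finally {
                probeFile.delete();
            }
        });
        try {
            return f.get(TIMEOUT_MS, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // A dead disk typically hangs the task rather than erroring; give up on it.
            f.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService ex = Executors.newCachedThreadPool();
        boolean healthy = probe(new File(System.getProperty("java.io.tmpdir")), ex);
        System.out.println(healthy ? "HEALTHY" : "UNHEALTHY");
        ex.shutdownNow();
    }
}
```

As noted above, this only proves that one file on one disk is writable; with multiple storage directories the probe would need to run against each of them, and even then a partially-failed controller can pass the probe while other writes hang.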

> More reliable namenode health check to detect OS/HW issues
> ----------------------------------------------------------
>                 Key: HDFS-7400
>                 URL: https://issues.apache.org/jira/browse/HDFS-7400
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
> We had this scenario on an active NN machine.
> * The disk array controller firmware had a bug, so the disks stopped working.
> * ZKFC and NN still considered the node healthy; communications between ZKFC and ZK, as
well as between ZKFC and NN, were good.
> * The machine can be pinged.
> * The machine can't be reached over ssh.
> So all clients and DNs can't use the NN. But ZKFC and NN still consider the node healthy.
> The question is how we can have ZKFC and NN detect such OS/HW specific issues quickly?
Some ideas we discussed briefly,
> * Have other machines help to make the decision whether the NN is actually healthy. Then
you have to figure out how to make that decision accurate in the case of network issues, etc.
> * Run OS/HW health check script external to ZKFC/NN on the same machine. If it detects
disk or other issues, it can reboot the machine for example.
> * Run OS/HW health check script inside ZKFC/NN. For example NN's HAServiceProtocol#monitorHealth
can be modified to call such health check script.
> Thoughts?
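The third idea in the quoted list, calling an external health check script from inside ZKFC/NN, might look roughly like the sketch below. The script path and timeout are placeholders, and note that the real HAServiceProtocol#monitorHealth signals failure by throwing an exception rather than returning a boolean; this sketch just shows the bounded script invocation:

```java
import java.util.concurrent.TimeUnit;

/** Sketch of shelling out to an OS/HW health script (path and timeout are placeholders). */
public class HealthScriptCheck {
    /** Runs the script; unhealthy if it exits nonzero or doesn't finish in time. */
    public static boolean runHealthScript(String scriptPath, long timeoutSec)
            throws Exception {
        Process p = new ProcessBuilder(scriptPath).start();
        // If the script itself hangs on a dead disk, the timeout still fails the check.
        if (!p.waitFor(timeoutSec, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            return false;
        }
        return p.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        // "true" exits 0 on any POSIX system; it stands in for a real health script.
        System.out.println(runHealthScript("true", 5) ? "HEALTHY" : "UNHEALTHY");
    }
}
```

Running the script in a separate process also means a wedged script (e.g. one stuck reading a dead disk) can be killed without hanging the NN or ZKFC thread that launched it.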

This message was sent by Atlassian JIRA
