hadoop-hdfs-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
Date Sat, 15 Nov 2014 01:28:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213246#comment-14213246 ]

Allen Wittenauer commented on HDFS-7400:
----------------------------------------

bq. Disk array controller firmware has a bug. So disks stop working.
...
bq. The machine can be pinged.
bq. The machine can't be sshed.

Was ssh actually opening the socket and just not completing the login process?

On the surface, this sounds like typical Linux IO weirdisms, but I want to make sure. 

bq. Out of curiosity, did your failure condition result in a situation where df worked, but
the disk was otherwise non-functional? 

I keep thinking about the situation where there are two controllers but only one went belly
up. Doing things like df or even a write+read combo might not be sufficient unless we do it
across all devices.  I suspect:

bq. Have other machines help to make the decision whether the NN is actually healthy. 

... might be the only truly viable solution under various failure modes.
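To make the "across all devices" point concrete, here is a rough sketch of a write+read probe that runs against every directory in a list and treats a probe that hangs as a failure. It is only an illustration under assumptions: `probe_dir`, `probe_all`, the `.probe-` file prefix and the 5 second timeout are made-up names and values, not anything in ZKFC or the NN, and the list of directories (one per device or controller) would have to come from configuration.

```python
import os
import tempfile
import threading
import time


def probe_dir(path, payload=b"health-probe"):
    """Write, fsync and read back a small file in `path`; True on success."""
    try:
        fd, name = tempfile.mkstemp(dir=path, prefix=".probe-")
    except OSError:
        return False
    try:
        try:
            os.write(fd, payload)
            os.fsync(fd)  # push the write down to the device
        finally:
            os.close(fd)
        # The read may be served from the page cache, so this mostly
        # validates the write path rather than a true device read.
        with open(name, "rb") as f:
            return f.read() == payload
    except OSError:
        return False
    finally:
        try:
            os.unlink(name)
        except OSError:
            pass


def probe_all(paths, timeout_s=5.0):
    """Probe every path in parallel; a probe that hangs counts as failed."""
    results = {p: False for p in paths}

    def worker(p):
        results[p] = probe_dir(p)

    # Daemon threads: a probe blocked on a dead device cannot be cancelled,
    # but it will not keep the checker process alive either.
    threads = [threading.Thread(target=worker, args=(p,), daemon=True)
               for p in paths]
    for t in threads:
        t.start()
    deadline = time.monotonic() + timeout_s
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    return dict(results)
```

A caller would mark the node unhealthy if any entry is False. Note that the probe itself runs on the machine being checked, so it still cannot cover the case where the whole host is wedged, which is why the external-observer idea keeps coming up.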

> More reliable namenode health check to detect OS/HW issues
> ----------------------------------------------------------
>
>                 Key: HDFS-7400
>                 URL: https://issues.apache.org/jira/browse/HDFS-7400
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>
> We had this scenario on an active NN machine.
> * Disk array controller firmware has a bug. So disks stop working.
> * ZKFC and NN still considered the node healthy; communications between ZKFC and ZK as well as ZKFC and NN are good.
> * The machine can be pinged.
> * The machine can't be sshed.
> So all clients and DNs can't use the NN. But ZKFC and NN still consider the node healthy.
> The question is how we can have ZKFC and NN detect such OS/HW specific issues quickly?
> Some ideas we discussed briefly:
> * Have other machines help to make the decision whether the NN is actually healthy. Then you have to figure out how to make the decision accurate in the case of network issues, etc.
> * Run OS/HW health check script external to ZKFC/NN on the same machine. If it detects disk or other issues, it can reboot the machine for example.
> * Run OS/HW health check script inside ZKFC/NN. For example NN's HAServiceProtocol#monitorHealth can be modified to call such health check script.
> Thoughts?
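
Both script-based ideas in the description depend on the caller not getting stuck when the script itself blocks on a dead disk. A minimal sketch of such a caller, assuming a hypothetical `run_health_script` helper and an arbitrary 10 second timeout (neither is part of Hadoop), might look like this:

```python
import subprocess
import sys


def run_health_script(argv, timeout_s=10.0):
    """Return True only if the script exits 0 within `timeout_s` seconds."""
    proc = subprocess.Popen(argv,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill. Do not wait again: a child stuck in
        # uninterruptible I/O may never exit.
        proc.kill()
        return False


if __name__ == "__main__":
    # Hypothetical usage: pass the health check command on the command line.
    healthy = run_health_script(sys.argv[1:]) if len(sys.argv) > 1 else False
    sys.exit(0 if healthy else 1)
```

Starting the script at all may require reading its binary from the affected disk, so the launch could hang before the timeout ever applies. That limitation is one more argument for having another machine take part in the decision.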



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
