Date: Sat, 15 Nov 2014 00:31:35 +0000 (UTC)
From: "Ming Ma (JIRA)"
To: hdfs-dev@hadoop.apache.org
Reply-To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Created] (HDFS-7400) More reliable namenode health check to detect OS/HW issues

Ming Ma created HDFS-7400:
-----------------------------

             Summary: More reliable namenode health check to detect OS/HW issues
                 Key: HDFS-7400
                 URL: https://issues.apache.org/jira/browse/HDFS-7400
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma

We had this scenario on an active NN machine.
* The disk array controller firmware had a bug, so the disks stopped working.
* ZKFC and NN still considered the node healthy; communication between ZKFC and ZK, as well as between ZKFC and NN, was fine.
* The machine could be pinged.
* The machine could not be reached over ssh.

As a result, no client or DN could use the NN, yet ZKFC and NN still considered the node healthy.

The question is how ZKFC and NN can detect such OS/HW-specific issues quickly. Some ideas we discussed briefly:
* Have other machines help decide whether the NN is actually healthy. Then you have to figure out how to keep that decision accurate in the case of network issues, etc.
* Run an OS/HW health check script external to ZKFC/NN on the same machine. If it detects disk or other issues, it can, for example, reboot the machine.
* Run an OS/HW health check script inside ZKFC/NN. For example, NN's HAServiceProtocol#monitorHealth can be modified to call such a health check script.

Thoughts?
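
To make the third idea a bit more concrete, here is a minimal sketch of an in-process disk probe. It assumes the failure shows up as a write that either hangs or errors out, so it writes and fsyncs a tiny file on a worker thread and gives up after a timeout. The class DiskHealthProbe, the method isWritable, and the timeout parameter are hypothetical names for illustration only. None of this is existing Hadoop code, and how the result would be wired into monitorHealth is left open.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of Hadoop. */
public class DiskHealthProbe {

    /** Returns true only if a small write + fsync in dir completes within timeoutMillis. */
    public static boolean isWritable(Path dir, long timeoutMillis) {
        // Daemon thread: a write stuck on a dead disk must not keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-health-probe");
            t.setDaemon(true);
            return t;
        });
        Future<Boolean> result = executor.submit(() -> writeAndSync(dir));
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false; // the I/O hung, which is the scenario described above
        } catch (ExecutionException e) {
            return false; // the I/O failed outright
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            result.cancel(true);
            executor.shutdownNow();
        }
    }

    private static boolean writeAndSync(Path dir) throws IOException {
        Path probe = Files.createTempFile(dir, "health-", ".probe");
        try (FileChannel channel = FileChannel.open(probe, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(new byte[] {1}));
            channel.force(true); // ask the OS to push the data to the device
        } finally {
            Files.deleteIfExists(probe);
        }
        return true;
    }
}
```

The bounded wait is the important part: a thread blocked on a dead disk may never return, so the caller has to treat "no answer in time" as unhealthy instead of waiting for an exception.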