Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 22 Jun 2016 22:58:16 +0000 (UTC)
From: "Allen Wittenauer (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12981047.1466449183000.3139.1466636296305@Atlassian.JIRA>
In-Reply-To: <JIRA.12981047.1466449183000@Atlassian.JIRA>
References: <JIRA.12981047.1466449183000@Atlassian.JIRA> <JIRA.12981047.1466449183634@arcas>
Subject: [jira] [Commented] (YARN-5274) Use smartctl to determine health of
 disks
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 22 Jun 2016 22:58:18 -0000


    [ https://issues.apache.org/jira/browse/YARN-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345328#comment-15345328 ] 

Allen Wittenauer commented on YARN-5274:
----------------------------------------

bq. The node health script is meant for the health of the node. It can't mark a single disk as bad. 

Yes, I'm very familiar with both the health check (esp given I'm the one who pushed for it to get added to begin with...) and smartctl. 

bq. The health test to determine if a disk should be valid whether the disk is a HDD or SSD. We shouldn't use smartctl if it doesn't apply to storage in question, and fallback on the existing checks.

If I configure a file system to use /hadoop/1/tmp and the mount device for /hadoop/1 is hadoop1/1, now what? Is it going to be smart enough to look to see what devices the hadoop1 pool has in it?
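
To make the question concrete, here is a minimal sketch (Python, not Hadoop code) of the lookup a disk checker would need before it could call smartctl at all. The mount table is a made-up sample in /proc/mounts format, and the function names parse_mounts, mount_for, and smartctl_target are hypothetical. The "/dev/" test is a simplifying assumption: it ignores partitions versus whole disks, LVM, md RAID, and device-mapper, all of which would need further resolution.

```python
import posixpath

# Hypothetical sample in /proc/mounts format:
# source, mount point, fs type, options, dump, pass.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
hadoop1/1 /hadoop/1 zfs rw,xattr 0 0
/dev/sdb1 /hadoop/2 ext4 rw,noatime 0 0
"""


def parse_mounts(text):
    """Return (source, mount_point, fs_type) tuples from /proc/mounts-style text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[0], fields[1], fields[2]))
    return entries


def mount_for(path, entries):
    """Pick the entry whose mount point is the longest prefix of path."""
    path = posixpath.normpath(path)
    best = None
    best_len = -1
    for source, mount_point, fs_type in entries:
        mp = posixpath.normpath(mount_point)
        # Match on whole path components so /hadoop/1 does not claim /hadoop/10.
        if path == mp or mp == "/" or path.startswith(mp + "/"):
            if len(mp) > best_len:
                best = (source, mount_point, fs_type)
                best_len = len(mp)
    return best


def smartctl_target(path, entries):
    """Return a device node to hand to smartctl, or None when the mount
    source is not a plain device node (assumption: anything outside /dev/)."""
    entry = mount_for(path, entries)
    if entry is None:
        return None
    source = entry[0]
    return source if source.startswith("/dev/") else None


if __name__ == "__main__":
    entries = parse_mounts(SAMPLE_MOUNTS)
    for directory in ("/hadoop/1/tmp", "/hadoop/2/logs"):
        print(directory, mount_for(directory, entries), smartctl_target(directory, entries))
```

With this sample table, /hadoop/1/tmp resolves to the dataset name hadoop1/1, which is not a device node, so the sketch returns None for it. Finding the physical disks behind the hadoop1 pool would need pool-specific tooling on top of this.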

bq. Where explicit monitoring does not exist, the NM can take some pro-active steps to detect bad disks.

But that's my point: explicit monitoring DOES exist, just not inside Hadoop. There are whole industries based around hardware monitoring that users should be deploying. Trying to do it all is part of why YARN is descending into chaos. There are times when it is appropriate to walk away and say "this isn't our core competency, let someone else do it." This is one of them.


Besides: why is this a YARN-specific problem?  Shouldn't this be in HADOOP so that both HDFS and YARN can take advantage of any code written? 

> Use smartctl to determine health of disks
> -----------------------------------------
>
>                 Key: YARN-5274
>                 URL: https://issues.apache.org/jira/browse/YARN-5274
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>
> It would be nice to add support for smartctl (on machines where it is available) to determine disk health for the YARN local and log dirs (if smartctl is applicable). The current disk checking mechanism misses out on issues like bad sectors, etc.

