Date: Wed, 22 Jun 2016 22:58:16 +0000 (UTC)
From: "Allen Wittenauer (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5274) Use smartctl to determine health of disks

[ https://issues.apache.org/jira/browse/YARN-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345328#comment-15345328 ]

Allen Wittenauer commented on YARN-5274:
----------------------------------------

bq. The node health script is meant for the health of the node. It can't mark a single disk as bad.
Yes, I'm very familiar with both the health check (especially given that I'm the one who pushed for it to be added in the first place...) and smartctl.

bq. The health test to determine if a disk should be valid whether the disk is a HDD or SSD. We shouldn't use smartctl if it doesn't apply to storage in question, and fallback on the existing checks.

If I configure a file system to use /hadoop/1/tmp and /hadoop/1's mount device is hadoop1/1, now what? Is it going to be smart enough to look to see what devices the hadoop1 pool has in it?

bq. Where explicit monitoring does not exist, the NM can take some pro-active steps to detect bad disks.

But that's my point: explicit monitoring DOES exist, just not inside Hadoop. There are whole industries based around hardware monitoring that users should be deploying. Trying to do it all is part of why YARN is descending into chaos. There are times when it is appropriate to walk away and say "this isn't our core competency, let someone else do it." This is one of them.

Besides: why is this a YARN-specific problem? Shouldn't this be in HADOOP so that both HDFS and YARN can take advantage of any code written?

> Use smartctl to determine health of disks
> -----------------------------------------
>
>                 Key: YARN-5274
>                 URL: https://issues.apache.org/jira/browse/YARN-5274
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>
> It would be nice to add support for smartctl (on machines where it is available) to determine disk health for the YARN local and log dirs (if smartctl is applicable). The current disk checking mechanism misses out on issues like bad sectors, etc.