Date: Wed, 1 Oct 2014 22:29:35 +0000 (UTC)
From: "Varun Vasudev (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-90:
------------------------------
    Attachment: apache-yarn-90.8.patch

Thanks for the review [~mingma]!

{quote}
1. What if a dir transitions from the DISK_FULL state to the OTHER state? DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs properly. We could use a state machine for each dir and make sure each transition is covered.
{quote}

Fixed. I've rewritten the checkDirs function, although I haven't used a state machine. Can you please review?

{quote}
2. The DISK_FULL state is counted toward the error-disk threshold by LocalDirsHandlerService.areDisksHealthy; later, the RM could mark the NM NODE_UNUSABLE. If we believe DISK_FULL is mostly a temporary issue, should we consider the disks healthy if they only stay in DISK_FULL for a short period of time?
{quote}

The issue here is that if a disk is full, we can't launch new containers on it. If we can't launch containers, the RM should consider the node unhealthy. Once the disk is cleaned up, the RM will assign containers to the node again.

{quote}
3. In AppLogAggregatorImpl.java, "(Path[]) localAppLogDirs.toArray(new Path[localAppLogDirs.size()])". It seems the (Path[]) cast isn't necessary.
{quote}

Fixed.

{quote}
4. What is the intention of numFailures? The method getNumFailures isn't used.
{quote}

This is a carry-over; it existed as part of the previous implementation.

{quote}
5. Nit: it is better to expand "import java.util.*;" in DirectoryCollection.java and LocalDirsHandlerService.java.
{quote}

Fixed.
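To illustrate the transition handling discussed in point 1, here is a minimal, self-contained sketch. This is not the actual patch: DirStateTracker, DiskState, and classifyDir are hypothetical names, and the real DirectoryCollection tracks more state. It shows why clearing a dir from both failure sets before re-classifying it covers every transition, including DISK_FULL to OTHER.

{code:java}
import java.io.File;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of per-directory state tracking; not the YARN-90 patch. */
public class DirStateTracker {

  enum DiskState { GOOD, FULL, OTHER }

  private final Set<String> errorDirs = new HashSet<String>();
  private final Set<String> fullDirs = new HashSet<String>();

  // Assumed 90% usage threshold; the real limit is configurable.
  private static final double FULL_FRACTION = 0.90;

  /** Re-classifies every dir on each health-check pass. */
  public void checkDirs(List<String> allDirs) {
    for (String dir : allDirs) {
      // Remove the dir from both failure sets first, so that every
      // transition is covered: FULL -> OTHER, OTHER -> FULL, and
      // FULL/OTHER -> GOOD (a failed disk becoming good again).
      errorDirs.remove(dir);
      fullDirs.remove(dir);
      switch (classifyDir(new File(dir))) {
        case FULL:
          fullDirs.add(dir);
          break;
        case OTHER:
          errorDirs.add(dir);
          break;
        case GOOD:
          break; // healthy: belongs to neither failure set
      }
    }
  }

  private DiskState classifyDir(File dir) {
    if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
      return DiskState.OTHER;
    }
    long total = dir.getTotalSpace();
    long used = total - dir.getUsableSpace();
    if (total > 0 && used > FULL_FRACTION * total) {
      return DiskState.FULL;
    }
    return DiskState.GOOD;
  }

  public Set<String> getErrorDirs() { return errorDirs; }
  public Set<String> getFullDirs() { return fullDirs; }
}
{code}

The clear-then-classify pattern keeps the sets self-correcting without an explicit state machine; the trade-off is that there is no per-transition hook (for example, logging when a failed disk recovers).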
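On point 2, the node-health decision reduces to a threshold over the fraction of usable local dirs. A hedged sketch of that check, with an assumed 0.25 minimum healthy fraction (the actual value is configurable):

{code:java}
/**
 * Sketch of the health test. DISK_FULL dirs count as failed here,
 * which is why a full (but otherwise fine) disk can push the node
 * toward NODE_UNUSABLE. The fraction is an assumed value.
 */
static boolean areDisksHealthy(int goodDirs, int totalDirs, double minHealthyFraction) {
  return totalDirs > 0 && ((double) goodDirs / totalDirs) >= minHealthyFraction;
}
{code}

For example, with four local dirs and a 0.25 minimum, the node stays healthy as long as at least one dir remains usable.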
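Point 3 follows from the generic java.util.List.toArray(T[]) contract: the array-taking overload already returns T[], so the (Path[]) cast is redundant. For example (the paths here are placeholders):

{code:java}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.Path;

List<Path> localAppLogDirs =
    Arrays.asList(new Path("/tmp/logs/a"), new Path("/tmp/logs/b"));
// toArray(T[]) is generic, so this already has static type Path[]:
Path[] dirs = localAppLogDirs.toArray(new Path[localAppLogDirs.size()]);
{code}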
> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures, but once a disk goes down, it is marked as failed forever. To reuse that disk after it becomes good again, the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)