Date: Wed, 1 Oct 2014 22:29:35 +0000 (UTC)
From: "Varun Vasudev (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-90:
------------------------------
    Attachment: apache-yarn-90.8.patch

Thanks for the review [~mingma]!

{quote}
1. What if a dir transitions from the DISK_FULL state to the OTHER state? DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs properly. We could use a state machine for each dir and make sure each transition is covered.
{quote}

Fixed. I've rewritten the checkDirs function, although I haven't used a state machine. Can you please review?

{quote}
2. The DISK_FULL state is counted toward the error-disk threshold by LocalDirsHandlerService.areDisksHealthy; later, the RM could mark the NM NODE_UNUSABLE. If we believe DISK_FULL is mostly a temporary issue, should we consider the disks healthy if they only stay in DISK_FULL for a short period of time?
{quote}

The issue here is that if a disk is full, we can't launch new containers on it. If we can't launch containers, the RM should consider the node unhealthy. Once the disk is cleaned up, the RM will assign containers to the node again.

{quote}
3. In AppLogAggregatorImpl.java, "(Path[]) localAppLogDirs.toArray(new Path[localAppLogDirs.size()])". It seems the (Path[]) cast isn't necessary.
{quote}

Fixed.

{quote}
4. What is the intention of numFailures? The method getNumFailures isn't used.
{quote}

This is a carry-over; it existed as part of the previous implementation.

{quote}
5. Nit: it is better to expand "import java.util.*;" in DirectoryCollection.java and LocalDirsHandlerService.java.
{quote}

Fixed.
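To illustrate the transition handling discussed in point 1, here is a minimal, self-contained sketch. This is not the actual patch: DirStateTracker, DiskState, and classifyDir are hypothetical names, and the real DirectoryCollection tracks more state. It shows why clearing a dir from both failure sets before re-classifying it covers every transition, including DISK_FULL to OTHER.

{code:java}
import java.io.File;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of per-directory state tracking; not the YARN-90 patch. */
public class DirStateTracker {

  enum DiskState { GOOD, FULL, OTHER }

  private final Set<String> errorDirs = new HashSet<String>();
  private final Set<String> fullDirs = new HashSet<String>();

  // Assumed 90% usage threshold; the real limit is configurable.
  private static final double FULL_FRACTION = 0.90;

  /** Re-classifies every dir on each health-check pass. */
  public void checkDirs(List<String> allDirs) {
    for (String dir : allDirs) {
      // Remove the dir from both failure sets first, so that every
      // transition is covered: FULL -> OTHER, OTHER -> FULL, and
      // FULL/OTHER -> GOOD (a failed disk becoming good again).
      errorDirs.remove(dir);
      fullDirs.remove(dir);
      switch (classifyDir(new File(dir))) {
        case FULL:
          fullDirs.add(dir);
          break;
        case OTHER:
          errorDirs.add(dir);
          break;
        case GOOD:
          break; // healthy: belongs to neither failure set
      }
    }
  }

  private DiskState classifyDir(File dir) {
    if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
      return DiskState.OTHER;
    }
    long total = dir.getTotalSpace();
    long used = total - dir.getUsableSpace();
    if (total > 0 && used > FULL_FRACTION * total) {
      return DiskState.FULL;
    }
    return DiskState.GOOD;
  }

  public Set<String> getErrorDirs() { return errorDirs; }
  public Set<String> getFullDirs() { return fullDirs; }
}
{code}

The clear-then-classify pattern keeps the sets self-correcting without an explicit state machine; the trade-off is that there is no per-transition hook (for example, logging when a failed disk recovers).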
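On point 2, the node-health decision reduces to a threshold over the fraction of usable local dirs. A hedged sketch of that check, with an assumed 0.25 minimum healthy fraction (the actual value is configurable):

{code:java}
/**
 * Sketch of the health test. DISK_FULL dirs count as failed here,
 * which is why a full (but otherwise fine) disk can push the node
 * toward NODE_UNUSABLE. The fraction is an assumed value.
 */
static boolean areDisksHealthy(int goodDirs, int totalDirs, double minHealthyFraction) {
  return totalDirs > 0 && ((double) goodDirs / totalDirs) >= minHealthyFraction;
}
{code}

For example, with four local dirs and a 0.25 minimum, the node stays healthy as long as at least one dir remains usable.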
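Point 3 follows from the generic java.util.List.toArray(T[]) contract: the array-taking overload already returns T[], so the (Path[]) cast is redundant. For example (the paths here are placeholders):

{code:java}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.Path;

List<Path> localAppLogDirs =
    Arrays.asList(new Path("/tmp/logs/a"), new Path("/tmp/logs/b"));
// toArray(T[]) is generic, so this already has static type Path[]:
Path[] dirs = localAppLogDirs.toArray(new Path[localAppLogDirs.size()]);
{code}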
> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures, but once a disk goes down, it is marked as failed forever. To reuse that disk after it becomes good again, the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)