hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again
Date Wed, 01 Oct 2014 22:29:35 GMT

     [ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-90:
    Attachment: apache-yarn-90.8.patch

Thanks for the review [~mingma]!

1. What if a dir is transitioned from DISK_FULL state to OTHER state? DirectoryCollection.checkDirs
doesn't seem to update errorDirs and fullDirs properly. We can use some state machine for
each dir and make sure each transition is covered.

Fixed. I've rewritten the checkDirs function, but I haven't used a state machine. Can you please
take a look?
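For what it's worth, the bookkeeping problem the review points at (a dir moving from DISK_FULL to OTHER without errorDirs/fullDirs being updated consistently) can be sketched without a full state machine by keeping each dir's last observed state in a single map and deriving the per-state lists from it. The names below (DirState, DirTracker, transition) are illustrative assumptions, not the actual DirectoryCollection API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: DirState/DirTracker are hypothetical names,
// not the real DirectoryCollection types.
enum DirState { NORMAL, DISK_FULL, OTHER }

class DirTracker {
    // Single source of truth: each dir appears exactly once, so a dir can
    // never sit in both the "full" and "error" views at the same time.
    private final Map<String, DirState> states = new HashMap<>();

    // Record the newly observed state; returns the previous state (or null)
    // so every transition, including DISK_FULL -> OTHER, is visible to callers.
    DirState transition(String dir, DirState observed) {
        return states.put(dir, observed);
    }

    // Derived views replace separately maintained errorDirs/fullDirs lists.
    List<String> dirsIn(DirState wanted) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, DirState> e : states.entrySet()) {
            if (e.getValue() == wanted) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

Because the lists are derived rather than maintained by hand, a missed transition can at worst report a stale state, never an inconsistent pair of lists.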

2. DISK_FULL state is counted toward the error disk threshold by LocalDirsHandlerService.areDisksHealthy;
later RM could mark NM NODE_UNUSABLE. If we believe DISK_FULL is mostly temporary issue, should
we consider disks are healthy if disks only stay in DISK_FULL for some short period of time?

The issue here is that if a disk is full, we can't launch new containers on it. If we can't
launch containers, the RM should consider the node unhealthy. Once the disk is cleaned
up, the RM will assign containers to the node again.
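In other words, a full disk counts against the healthy-disk fraction exactly like an errored one, since neither can host new containers. A minimal sketch of that decision (method and parameter names are assumptions, not the real LocalDirsHandlerService API):

```java
// Hypothetical sketch of the health decision described above; not the
// actual LocalDirsHandlerService code.
class DiskHealthCheck {
    // goodDirs excludes both errored and full disks: a full disk cannot
    // host new containers, so it reduces usable capacity the same way
    // an errored disk does.
    static boolean areDisksHealthy(int goodDirs, int totalDirs,
                                   float minHealthyFraction) {
        if (totalDirs == 0) {
            return false;
        }
        return ((float) goodDirs / totalDirs) >= minHealthyFraction;
    }
}
```

Under this view, a disk that recovers from DISK_FULL simply rejoins the good count on the next check, and the node becomes healthy again without any special-casing of "temporary" fullness.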

3. In AppLogAggregatorImpl.java, "(Path[]) localAppLogDirs.toArray(new Path[localAppLogDirs.size()])".
It seems the (Path[]) cast isn't necessary.
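The reviewer is right that the cast is redundant: List.toArray(T[]) is generic and already returns T[]. A small illustration, using java.nio.file.Path in place of Hadoop's org.apache.hadoop.fs.Path (the generics behave the same way):

```java
import java.nio.file.Path;
import java.util.List;

class ToArrayExample {
    // toArray(T[]) returns T[] directly, so no (Path[]) cast is needed.
    static Path[] toPathArray(List<Path> localAppLogDirs) {
        return localAppLogDirs.toArray(new Path[localAppLogDirs.size()]);
    }
}
```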


4. What is the intention of numFailures? Method getNumFailures isn't used.

This is a carry-over - getNumFailures existed as part of the previous implementation.

5. Nit: It is better to expand "import java.util.*;" in DirectoryCollection.java and LocalDirsHandlerService.java.


> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch,
apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch,
apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch,
apache-yarn-90.8.patch
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it
is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs
a restart. This JIRA is to improve NodeManager to reuse good disks (which could have been bad
some time back).

This message was sent by Atlassian JIRA
