hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
Date Mon, 04 Nov 2013 20:00:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813183#comment-13813183 ]

Vinod Kumar Vavilapalli commented on YARN-90:

Thanks for the patch, Song! Some quick comments:
 - Because you are changing the semantics of checkDirs(), there are more changes that are needed:
  -- updateDirsAfterFailure() -> updateConfAfterDirListChange?
  -- The log message in updateDirsAfterFailure: "Disk(s) failed. " should be changed to something
like "Disk-health report changed: ".
 - Web UI and Web-services are fine for now I think, nothing to do there.
 - Drop the extraneous "System.out.println" lines in all of the patch.
 - Let's drop the metrics changes. We need to expose this end-to-end, not just through metrics:
client-side reports, JMX, and metrics. Worth tracking that effort separately.
 - Test:
    -- testAutoDir() -> testDisksGoingOnAndOff ?
    -- Can you also validate the health-report both when disks go off and when they come back?
    -- Also, just let unexpected exceptions propagate from the test instead of catching them and printing the stack trace.
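
To make the checkDirs() point concrete, here is a rough sketch of the semantics being discussed: the check reports a change whenever the good/failed lists differ from the previous run, in either direction, rather than only when a disk fails. This is not the code from the patch. DirHealthTracker, its isHealthy predicate, and the getters are hypothetical names used only for illustration; the real NodeManager classes and signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch only; not the actual NodeManager implementation. */
class DirHealthTracker {
    private List<String> goodDirs = new ArrayList<>();
    private List<String> failedDirs = new ArrayList<>();
    // Stand-in for the real disk check (assumption for this sketch).
    private final Predicate<String> isHealthy;

    DirHealthTracker(List<String> dirs, Predicate<String> isHealthy) {
        this.goodDirs.addAll(dirs);
        this.isHealthy = isHealthy;
    }

    /**
     * Re-checks every known directory, including previously failed ones.
     * Returns true if the good/failed lists changed in either direction.
     */
    synchronized boolean checkDirs() {
        List<String> all = new ArrayList<>(goodDirs);
        all.addAll(failedDirs);

        List<String> newGood = new ArrayList<>();
        List<String> newFailed = new ArrayList<>();
        for (String dir : all) {
            if (isHealthy.test(dir)) {
                newGood.add(dir);
            } else {
                newFailed.add(dir);
            }
        }

        boolean changed = !newGood.equals(goodDirs) || !newFailed.equals(failedDirs);
        goodDirs = newGood;
        failedDirs = newFailed;
        return changed;
    }

    synchronized List<String> getGoodDirs() {
        return new ArrayList<>(goodDirs);
    }

    synchronized List<String> getFailedDirs() {
        return new ArrayList<>(failedDirs);
    }
}
```

With a return value like this, the caller reacts to any change in the directory lists, which is why a name such as updateConfAfterDirListChange and a log message such as "Disk-health report changed: " fit better than the failure-only wording.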

> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it
> is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs
> a restart. This JIRA is to improve NodeManager to reuse good disks (which could be bad some time
