hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again
Date Wed, 12 Mar 2014 20:44:45 GMT

     [ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-90:

    Attachment: apache-yarn-90.1.patch

Uploaded new patch.
    DirectoryCollection: can you put the block where you create and delete a random directory
inside a dir.exists() check? We don't want to create-delete a directory that already exists
but matches with our random string - very unlikely but not impossible.
Fixed. The dir check is now its own function with the exists check.
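A minimal sketch of the kind of check described above, not the code from the attached patch. The class name DirProbe and the method isUsable are hypothetical, and UUID is used here as an assumed source of random names. The probe is only created and deleted once its name is known not to exist, so a real entry is never touched.

```java
import java.io.File;
import java.util.UUID;

final class DirProbe {
    private DirProbe() {}

    /**
     * Returns true if a randomly named subdirectory can be created in
     * {@code dir} and removed again. Hypothetical helper for illustration.
     */
    static boolean isUsable(File dir) {
        File probe = new File(dir, UUID.randomUUID().toString());
        // Very unlikely, but if the random name matches an existing entry,
        // draw a new name instead of creating and deleting that entry.
        while (probe.exists()) {
            probe = new File(dir, UUID.randomUUID().toString());
        }
        if (!probe.mkdir()) {
            return false; // cannot write to this directory
        }
        return probe.delete(); // clean up; failure to delete also counts as bad
    }
}
```

Calling DirProbe.isUsable(new File("/some/local/dir")) on a writable directory leaves nothing behind; on a missing or read-only directory it returns false.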

    ResourceLocalizationService (RLS): What happens to disks that become good after service-init?
We don't create the top level directories there. Depending on our assumptions in the code
in the remaining NM subsystem, this may or may not lead to bad bugs. Should we permanently
exclude bad-disks found during initializing?
    Similarly in RLS service-init, we call cleanUpLocalDir() to delete old files. If disks become
good again, we will have unclean disks. And depending on our assumptions, we may or may not
run into issues. For example, files 'leaked' like that may never get deleted.
Fixed. Local and log dirs undergo a check before use to ensure that they have been set up correctly.

    Add comments to all the tests describing what is being tested

    Add more inline comments for each test-block, for example "changing a disk to be bad"
before a block where you change permissions. For readability.

    In all the tests where you sleep for a time more than disk-checker frequency, it may or
may not pass the test depending on the underlying thread scheduling. Instead of that, you
should explicitly call LocalDirsHandlerService.checkDirs()
Fixed. Used mocks of LocalDirsHandlerService, which removes the timing issue.
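The point about timing can be illustrated with a small stand-in, under stated assumptions: SimpleDirsHandler below is a hypothetical class, not the real LocalDirsHandlerService, and it treats a directory as good when it exists and is writable. A test drives the re-check directly through checkDirs() instead of sleeping past the disk-checker interval, so the outcome does not depend on thread scheduling.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class SimpleDirsHandler {
    private final List<File> allDirs;
    private final List<File> goodDirs = new ArrayList<>();
    private final List<File> failedDirs = new ArrayList<>();

    SimpleDirsHandler(List<File> dirs) {
        this.allDirs = new ArrayList<>(dirs);
        checkDirs();
    }

    /**
     * Re-evaluates every configured dir, including ones that failed before,
     * so a failed dir can move back to the good list.
     */
    synchronized void checkDirs() {
        goodDirs.clear();
        failedDirs.clear();
        for (File dir : allDirs) {
            if (dir.isDirectory() && dir.canWrite()) {
                goodDirs.add(dir);
            } else {
                failedDirs.add(dir);
            }
        }
    }

    synchronized List<File> getGoodDirs() {
        return new ArrayList<>(goodDirs);
    }

    synchronized List<File> getFailedDirs() {
        return new ArrayList<>(failedDirs);
    }
}
```

A test can then make a directory bad (for instance by removing it), call checkDirs(), assert it is listed as failed, restore it, call checkDirs() again, and assert it is good once more, all without any sleep.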

        Nonstandard formatting in method declaration
        There is a bit of code about creating container-dirs. Can we reuse some of it from
Fixed the non-standard formatting. The ContainerLocalizer code creates only the usercache (we
need the filecache and the nmPrivate dirs as well).

        In the existing test-case, you have "actually create the dirs". Why is that needed?
Fixed. Used mocking to remove requirement.

        Can we reuse any code in this test with what exists in TestLogAggregationService?
Seems to me that they both should mostly be the same.
Fixed. Shared code moved into functions.

    TestDirectoryCollection.testFailedDirPassingCheck -> testFailedDisksBecomingGoodAgain

> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch,
> apache-yarn-90.0.patch, apache-yarn-90.1.patch
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it
> is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs
> restart. This JIRA is to improve NodeManager to reuse good disks (which could be bad some time

This message was sent by Atlassian JIRA
