Date: Wed, 24 Sep 2014 21:14:35 +0000 (UTC)
From: "Varun Vasudev (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

    [ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-90:
------------------------------
    Attachment: apache-yarn-90.5.patch

Uploaded a new patch to address [~jlowe]'s comments.

{quote}
It's a bit odd to have a hash map to map disk error types to lists of directories, fill them all in, but in practice we only look at one type in the map, and that's DISK_FULL. It'd be simpler (and faster, and less space, since there's no hashmap involved) to just track full disks as a separate collection, like we already do for localDirs and failedDirs.
{quote}

Fixed. I renamed failedDirs to errorDirs and added a separate list for fullDirs. The getFailedDirs() function returns the union of the two (see the first sketch below).

{quote}
Nit: DISK_ERROR_CAUSE should be DiskErrorCause (if we keep the enum) to match the style of other enum types in the code.
{quote}

Fixed.

{quote}
In verifyDirUsingMkdir, if an error occurs during the finally clause then that exception will mask the original exception.
{quote}

Fixed (see the second sketch below).
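Roughly, the direction of the errorDirs/fullDirs change is sketched below. This is only an illustration of the idea, not the actual patch; the class and field names are taken from the comments above and may not match the real NodeManager code.

{noformat}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DirectoryCollectionSketch {
  // Directories that are currently healthy and usable.
  private final List<String> localDirs = new ArrayList<String>();
  // Directories that failed a health check (permissions, mkdir, I/O errors).
  private final List<String> errorDirs = new ArrayList<String>();
  // Directories whose utilization is above the configured percentage cutoff.
  private final List<String> fullDirs = new ArrayList<String>();

  // "Failed" dirs are simply the union of error dirs and full dirs,
  // so no map from error type to directory list is needed.
  synchronized List<String> getFailedDirs() {
    Set<String> failed = new HashSet<String>(errorDirs);
    failed.addAll(fullDirs);
    return new ArrayList<String>(failed);
  }
}
{noformat}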
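The finally-clause fix follows the usual pattern of only reporting the cleanup failure when nothing else went wrong. A minimal sketch, assuming the check works by creating and then deleting a probe directory; the probe name and exact structure here are assumptions, not the code in the patch:

{noformat}
import java.io.File;
import java.io.IOException;

final class MkdirProbeSketch {
  static void verifyDirUsingMkdir(File dir) throws IOException {
    File probeDir = new File(dir, "probe-" + System.nanoTime());
    IOException original = null;
    try {
      if (!probeDir.mkdir()) {
        throw new IOException("Could not create " + probeDir);
      }
    } catch (IOException e) {
      original = e;
      throw e;
    } finally {
      // Only surface a cleanup failure if the mkdir itself succeeded,
      // so the finally clause can never mask the original exception.
      if (!probeDir.delete() && original == null) {
        throw new IOException("Could not delete probe dir " + probeDir);
      }
    }
  }
}
{noformat}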
{quote}
isDiskUsageUnderPercentageLimit is named backwards. Disk usage being under the configured limit shouldn't be a full-disk error, and the error message is inconsistent with the method name (the method talks about being under, but the error message says it's above).
{noformat}
if (isDiskUsageUnderPercentageLimit(testDir)) {
  msg = "used space above threshold of " + diskUtilizationPercentageCutoff
      + "%, removing from the list of valid directories.";
{noformat}
{quote}

Yep, thanks for catching it. Fixed.

{quote}
We should only call getDisksHealthReport() once in the following code:
{noformat}
+      String report = getDisksHealthReport();
+      if (!report.isEmpty()) {
+        LOG.info("Disk(s) failed. " + getDisksHealthReport());
{noformat}
{quote}

Fixed.

{quote}
Should updateDirsAfterTest always say "Disk(s) failed" if the report isn't empty? Thinking of the case where two disks go bad, then one is later restored. The health report will still have something, but that last update is a disk turning good, not failing. Before, this code was only called when a new disk failed, and now that's not always the case. Maybe it should just be something like "Disk health update: " instead?
{quote}

I've changed it to "Disk(s) health report: ". My only concern with this is that there might be scripts looking for the "Disk(s) failed" log line for monitoring. What do you think?

{quote}
Is it really necessary to stat a directory before we try to delete it? Seems like we can just try to delete it.
{quote}

Just wanted to avoid an unnecessary attempt. If a disk comes back as good while a container is running, it won't have the container directories, so deleting without the check would be an unnecessary attempt.

{quote}
The idiom of getting the directories and adding the full directories seems pretty common. Might be good to have dirhandler methods that already do this, like getLocalDirsForCleanup or getLogDirsForCleanup.
{quote}

Fixed.

{quote}
I'm a bit worried that getInitializedLocalDirs could potentially try to delete an entire directory tree for a disk. If this fails in some sector-specific way but other containers are currently using their files from other sectors just fine on the same disk, removing these files from underneath active containers could be very problematic and difficult to debug.
{quote}

Fixed. Directories are only cleaned up during startup. The code checks for the existence of the directories and the correct permissions. This does mean that container directories left behind for any reason won't get cleaned up until the NodeManager is restarted. Is that ok?


> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch
>
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)