hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again
Date Tue, 30 Sep 2014 20:24:36 GMT

     [ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-90:
    Attachment: apache-yarn-90.7.patch

Uploaded a new patch to address the comments by Jason.


bq.    I've changed it to "Disk(s) health report: ". My only concern with this is that there
might be scripts looking for the "Disk(s) failed" log line for monitoring. What do you think?

If that's true then the code should bother to do a diff between the old disk list and the
new one, logging which disks turned bad using the "Disk(s) failed" line and which disks became
healthy with some other log message.

Fixed. We now have two log messages - one indicating when disks go bad and one when disks
get marked as good.
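
For illustration, a minimal sketch of that diff-and-log approach (the method, variable, and second log message names here are assumptions, not the actual patch; only the "Disk(s) failed" prefix comes from the existing log line):

      // Hypothetical sketch: diff the old and new failed-dir sets and log each
      // transition separately, so monitoring that greps for "Disk(s) failed"
      // keeps working while recoveries get their own message.
      private void logDiskChanges(Set<String> previousFailed, Set<String> currentFailed) {
        Set<String> newlyFailed = new HashSet<String>(currentFailed);
        newlyFailed.removeAll(previousFailed);
        Set<String> newlyHealthy = new HashSet<String>(previousFailed);
        newlyHealthy.removeAll(currentFailed);
        if (!newlyFailed.isEmpty()) {
          LOG.warn("Disk(s) failed: " + newlyFailed);
        }
        if (!newlyHealthy.isEmpty()) {
          LOG.info("Disk(s) turned good: " + newlyHealthy);
        }
      }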

bq.    Directories are only cleaned up during startup. The code tests for existence of the
directories and the correct permissions. This does mean that container directories left behind
for any reason won't get cleaned up until the NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we could try to
delete an entire disk tree with active containers on it due to a hiccup when the NM restarts.
I think a better approach is a periodic cleanup scan that looks for directories under yarn-local
and yarn-logs that shouldn't be there. This could be part of the health check scan or done
separately. That way we don't have to wait for a disk to turn good or bad to catch leaked
entities on the disk due to some hiccup. Sorta like an fsck for the NM state on disk. That
is best done as a separate JIRA, as I think this functionality is still an incremental improvement
without it.

The current code will only clean up if NM recovery can't be carried out:
      if (!stateStore.canRecover()) {
        cleanUpLocalDirs(lfs, delService);
Will that handle the case you mentioned?
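
For reference, a rough sketch of the kind of periodic "fsck"-style scan described above (all method and parameter names are assumptions, and as noted the real work would belong in a separate JIRA):

      // Hypothetical sketch: walk each yarn-local/yarn-logs directory and
      // remove entries that no known application accounts for.
      private void scanForLeakedEntries(FileContext lfs, List<String> dirs,
          Set<String> knownEntries) throws IOException {
        for (String dir : dirs) {
          RemoteIterator<FileStatus> it = lfs.listStatus(new Path(dir));
          while (it.hasNext()) {
            Path entry = it.next().getPath();
            if (!knownEntries.contains(entry.getName())) {
              // In the NM this would likely be handed to the DeletionService
              // rather than deleted inline like this.
              lfs.delete(entry, true);
            }
          }
        }
      }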

bq. checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.


bq. isDiskFreeSpaceOverLimt is now named backwards, as the code returns true if the free space
is under the limit.
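
That is, given the current behaviour, a name along the following lines would read correctly (the name and the cutoff parameter are suggestions, not taken from the patch):

      // Returns true when the directory's free space has fallen below the
      // configured minimum (cutoff assumed to be in megabytes).
      private boolean isDiskFreeSpaceUnderLimit(File dir, long minFreeSpaceMB) {
        long freeSpaceMB = dir.getUsableSpace() / (1024 * 1024);
        return freeSpaceMB < minFreeSpaceMB;
      }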

bq. getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments like the
other methods.


Nit: The union utility function doesn't technically perform a union but rather a concatenation,
and it'd be a little clearer if the name reflected that. Also the function knows how big the
ArrayList will be after the operation, so it should pass that size to the constructor as a
capacity hint to avoid reallocations.
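
For example, something along these lines (sketch only; the name is a suggestion):

      // Concatenates two lists, pre-sizing the ArrayList so it never has to
      // grow while the elements are copied in.
      private static <T> List<T> concat(List<T> first, List<T> second) {
        List<T> result = new ArrayList<T>(first.size() + second.size());
        result.addAll(first);
        result.addAll(second);
        return result;
      }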


> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch,
apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch,
apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it
is marked as failed forever. To reuse that disk (after it becomes good), the NodeManager needs
a restart. This JIRA is to improve the NodeManager to reuse good disks (which could be bad some time back).

