hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-3474) NM disk failure detection only covers local dirs
Date Sat, 08 Sep 2012 00:15:09 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3474:
-----------------------------------------------

    Issue Type: Bug  (was: Sub-task)
        Parent:     (was: MAPREDUCE-3121)
    
> NM disk failure detection only covers local dirs 
> -------------------------------------------------
>
>                 Key: MAPREDUCE-3474
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3474
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Eli Collins
>
> This is the MR counterpart to HDFS-1848. Like HDFS volume failure detection, NM disk
> failure detection checks only a subset of the disks, and only a subset of the directories on them. E.g., the TT
> and the NM do not check the root disk for errors unless a local dir resides on it. Even
> if a local dir resides on the root disk, the disk-checking code only checks the local dirs,
> so a failure that is only seen when accessing a part of the disk not hosting the local dirs will not
> be noticed. The disk that hosts the logs, pid, tmp dirs, etc. is critical, so it needs to be
> checked as well, and the NM should shut down if a critical disk is not available (to prevent
> MR issues similar to HDFS-1848 and HDFS-2095). Currently, people typically work around this
> limitation (aside from ignoring it) by using RAID-1 for the root disk or a health script
> that checks the root disk health.
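The health-script workaround mentioned above can be sketched roughly as follows. This is a minimal illustration, not the NM's own disk checker: it assumes the NodeManager health-checker convention that a stdout line beginning with "ERROR" marks the node unhealthy, and the directory list in `CRITICAL_DIRS` (plus the helper names `check_dir` and `check_all`) are hypothetical placeholders for wherever the logs, pid files and tmp dirs actually live on a node.

```python
import os
import tempfile

# Hypothetical directories on "critical" disks (logs, pid, tmp); adjust per node.
CRITICAL_DIRS = ["/var/log/hadoop-yarn", "/var/run/hadoop-yarn", "/tmp"]


def check_dir(path):
    """Return None if `path` survives a small write/read probe, else an error string."""
    if not os.path.isdir(path):
        return f"{path}: missing or not a directory"
    try:
        # Write, fsync and read back a probe file so that a read-only remount
        # or a write-path I/O error in this directory surfaces here.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".nm-disk-probe-") as f:
            f.write(b"probe")
            f.flush()
            os.fsync(f.fileno())
            f.seek(0)
            if f.read() != b"probe":
                return f"{path}: read-back mismatch"
    except OSError as e:
        return f"{path}: {e.strerror or e}"
    return None


def check_all(dirs):
    """Collect one error string per directory that fails the probe."""
    return [err for err in (check_dir(d) for d in dirs) if err is not None]


if __name__ == "__main__":
    errors = check_all(CRITICAL_DIRS)
    if errors:
        # Assumed health-checker convention: a line starting with "ERROR"
        # tells the NodeManager the node is unhealthy.
        print("ERROR " + "; ".join(errors))
    else:
        print("OK")
```

Like the disk checker the issue describes, a probe of this kind only exercises the directories it is given, so a fault elsewhere on the same disk can still go unnoticed; it narrows the gap rather than closing it.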

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
