hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures
Date Thu, 17 Nov 2011 09:28:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151940#comment-13151940
] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3121:
----------------------------------------------------

Okay, still a monster patch, but in far better shape. I anticipate just one more iteration.

 - Please don't include util-like APIs in the NodeHealthStatus record; we want to keep the record implementations to the bare essentials.
 - Pass DiskHandlerService everywhere instead of NodeHealthCheckerService.
 - Change DISKS_FAILED to not use 144. Maybe -1001.
 - Remove unused imports in ContainersLauncher.
 - Remove the commented out init() code in ContainerExecutor.
 - Rename LocalStorage to DirectoryCollection?
 - getHealthScriptTimer() belongs to the HealthScriptRunner itself. Let's make nodeHealthScriptRunner.getTimerTask() public and drop TimerTask getHealthScriptTimer() from NodeHealthCheckerService (see the sketch after this list).
 - Trivial: (java)doc NodeHealthCheckerService class.
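
To be concrete on the getTimerTask() point, a minimal sketch of the shape I mean (the classes below are simplified stand-ins for illustration, not the patch code): the runner owns its TimerTask and exposes it publicly, and NodeHealthCheckerService just schedules it, with no getHealthScriptTimer() wrapper.
{code:java}
import java.util.Timer;
import java.util.TimerTask;

// Simplified stand-ins for illustration only.
class NodeHealthScriptRunner {
  private final TimerTask timerTask = new TimerTask() {
    @Override
    public void run() {
      // run the configured health script and record its output
    }
  };

  // Public, so callers can schedule the runner's own task directly.
  public TimerTask getTimerTask() {
    return timerTask;
  }
}

class NodeHealthCheckerService {
  private final NodeHealthScriptRunner nodeHealthScriptRunner =
      new NodeHealthScriptRunner();
  private final Timer timer = new Timer("NodeHealthMonitor", true);

  void start(long intervalMs) {
    // no TimerTask getHealthScriptTimer() wrapper any more
    timer.scheduleAtFixedRate(nodeHealthScriptRunner.getTimerTask(), 0, intervalMs);
  }
}
{code}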

ContainerLaunch:
 - When all disks have failed, use the health-report in the exception *and* also add diagnostics to the event.
 - Same in ResourceLocalizationService.
 - DiskHandlerService: when major-percentage disks are gone, log the report. (+108)

ResourceLocalizationService:
 - Take a snapshot of dirs before the health-check for startLocalizer()?
 - PublicLocalizer uses a LocalDirAllocator for downloading files. Should it instead use DiskHandlerService? Maybe also check for min-percentage disks to be alive for each addResource() request. You will need changes to FSDownload too.
 - Remove PUBCACHE_CTXT after doing above.

AppLogAggregatorImpl:
 - Existing log message at +120 can also list the good dirs. Bad dirs can be deduced from
the DHS logs.

DiskHandlerService:
 - The APIs with size are either not needed or don't need the size parameter itself.
 - Take a lock on the cloned config wherever the configuration is accessed, e.g. via updateDirsInConfiguration(), getLocalPathForWrite(String pathStr), etc.
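
Something along these lines is what I mean, as a rough sketch only (the DiskHandlerService body and the config key names here are assumptions for illustration, not the patch code):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.LocalDirAllocator;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: every read/write of the cloned Configuration goes through
// the same lock, so dir-list updates and path lookups cannot race.
class DiskHandlerService {
  private final Configuration conf;  // cloned copy owned by this service
  private final LocalDirAllocator dirAllocator =
      new LocalDirAllocator("yarn.nodemanager.local-dirs");

  DiskHandlerService(Configuration original) {
    this.conf = new Configuration(original);  // clone so callers can't mutate it
  }

  /** Rewrite the good-dirs lists in the cloned config after a health check. */
  void updateDirsInConfiguration(String goodLocalDirs, String goodLogDirs) {
    synchronized (conf) {
      conf.set("yarn.nodemanager.local-dirs", goodLocalDirs);
      conf.set("yarn.nodemanager.log-dirs", goodLogDirs);
    }
  }

  /** Path lookups read the config under the same lock. */
  Path getLocalPathForWrite(String pathStr) throws IOException {
    synchronized (conf) {
      return dirAllocator.getLocalPathForWrite(pathStr, conf);
    }
  }
}
{code}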

MiniYARNCluster:
 - Change the defaults for numLocalDirs and numLogDirs to 4? Also, consolidate the constructors? I can see the N-constructors pattern of MiniMRCluster creeping in; let's avoid that.
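
For the constructor consolidation, the shape I have in mind is roughly this (signatures are assumed for illustration): one full constructor does the work, and the convenience overloads just delegate with defaults of 4.
{code:java}
// Sketch only, not the actual MiniYARNCluster code.
class MiniYARNCluster {
  static final int DEFAULT_NUM_DIRS = 4;

  private final String testName;
  private final int numNodeManagers;
  private final int numLocalDirs;
  private final int numLogDirs;

  MiniYARNCluster(String testName) {
    this(testName, 1, DEFAULT_NUM_DIRS, DEFAULT_NUM_DIRS);
  }

  MiniYARNCluster(String testName, int numNodeManagers) {
    this(testName, numNodeManagers, DEFAULT_NUM_DIRS, DEFAULT_NUM_DIRS);
  }

  // The single constructor that actually sets everything up.
  MiniYARNCluster(String testName, int numNodeManagers, int numLocalDirs,
      int numLogDirs) {
    this.testName = testName;
    this.numNodeManagers = numNodeManagers;
    this.numLocalDirs = numLocalDirs;
    this.numLogDirs = numLogDirs;
  }
}
{code}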

conf/controller.cfg
 - Update to not have the removed configs.
 - Can you also add banned.users and min.user.id with the default values?
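
For example (the values below are just the commonly used defaults, not necessarily what we should ship):
{code}
# users that are never allowed to run containers
banned.users=hdfs,yarn,mapred,bin
# smallest uid allowed to run containers
min.user.id=1000
{code}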

TestDiskFailure:
 - verifyDisksHealth(): Loop through and wait for a max of, say, 10 seconds for the node to turn unhealthy.
 - waitForDiskHealthCheck(): We can capture DiskHandlerService's last report time and wait till it changes at least once. Of course, that should be capped by an upper limit on the wait time.
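
As a sketch of the waiting pattern I mean (method names like getLastDisksCheckTime() are assumptions about the patch, not existing API):
{code:java}
// Poll with a bounded wait instead of asserting right after injecting the failure.
private void waitForNodeUnhealthy(NodeHealthCheckerService healthChecker)
    throws InterruptedException {
  long deadline = System.currentTimeMillis() + 10 * 1000;
  while (healthChecker.isHealthy() && System.currentTimeMillis() < deadline) {
    Thread.sleep(500);
  }
  Assert.assertFalse("Node did not turn unhealthy within 10 seconds",
      healthChecker.isHealthy());
}

// Wait until DiskHandlerService publishes at least one new report, capped at 10s.
private void waitForDiskHealthCheck(DiskHandlerService dhs, long previousReportTime)
    throws InterruptedException {
  long deadline = System.currentTimeMillis() + 10 * 1000;
  while (dhs.getLastDisksCheckTime() == previousReportTime
      && System.currentTimeMillis() < deadline) {
    Thread.sleep(500);
  }
}
{code}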

Can you run the linux-container-executor tests: TestLinuxContainerExecutor and TestContainerManagerWithLCE?
Create a separate ticket for handling the disks that come back up online.
Create a separate ticket for having a metric for numFailedDirs.

-----

Test plan:
 - RM stops scheduling when major-percentage of disks go bad: Done
 - Node's DiskHandler recognises bad disks: Done
 - Node's DiskHandler recognises minimum percentage of good disks: Done
 - Integration test: Run a mapreduce job (so that Shuffle is also verified), offline some
disks, run one more job and verify that both the apps pass. TODO
 - LogAggregation test: Verify that logs written on bad disks are ignored for aggregation (augment TestLogAggregationService): TODO
 - ContainerLaunch: Verify that
   -- new containers don't use bad directories (by testing the LOCAL_DIRS env in a custom map job; see the sketch after this list): TODO
   -- if major percentage disks turn bad,
      -- container should exit with proper exit code (should be easy with a custom application). TODO
      -- localization for a resource fails TODO 
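
For the LOCAL_DIRS check, a custom mapper along these lines should do (sketch only; the failed-disk prefix is test-specific and hypothetical):
{code:java}
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Fail fast if the container was handed a local dir on a disk the test took offline.
public class LocalDirsCheckingMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private static final String FAILED_DISK_PREFIX = "/grid/2/";  // hypothetical

  @Override
  protected void setup(Context context) throws IOException {
    String localDirs = System.getenv("LOCAL_DIRS");
    for (String dir : localDirs.split(",")) {
      if (dir.startsWith(FAILED_DISK_PREFIX)) {
        throw new IOException("Container got a local dir on a failed disk: " + dir);
      }
    }
  }
}
{code}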
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 3121.v2.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact of transient/permanent disk failures on containers. With a larger number of disks per node, the ability to continue running containers on the other disks is crucial.
