hadoop-mapreduce-issues mailing list archives

From "Bharath Mundlapudi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2413) TaskTracker should handle disk failures at both startup and runtime
Date Mon, 12 Sep 2011 18:21:10 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102879#comment-13102879 ]

Bharath Mundlapudi commented on MAPREDUCE-2413:
-----------------------------------------------

Hi Eli,

Please note that a lot of the testing has to be done as root, for example cases where we
need to mount a disk as 'ro' or inject a failure. These are cases where we can't write unit
tests, so a lot of manual testing went into this feature.
Of course, we can add some more unit tests, which is true for any feature. That is the nature
of this problem.

Regarding your question about N disks, I think Owen answered it, and I agree. It's
reasonable to make the TT run without the DN and vice versa. If you want the old behavior,
you can do the following:

1. Set the threshold in the DN to, say, 'k' disks.
2. Send an 'ERROR' message from the health check script after 'k' disks fail, so the TT can
be blacklisted as it is today.

You can have this behavior today with the existing code. 
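
A minimal sketch of step 2, assuming the health check script is written in Python. The directory list, the threshold k, the write probe, and the function names are illustrative assumptions and are not part of the patch. The only behavior relied on from Hadoop is that an output line starting with ERROR marks the node as unhealthy.

```python
#!/usr/bin/env python3
"""Hypothetical node health check: report ERROR once k local dirs have failed."""
import os
import sys
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created and removed in directory."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers missing directories, read-only mounts, and I/O errors.
        return False


def health_report(local_dirs, k):
    """Return an 'ERROR ...' line if at least k dirs failed, else an empty string."""
    failed = [d for d in local_dirs if not is_writable(d)]
    if len(failed) >= k:
        return "ERROR %d of %d local dirs failed: %s" % (
            len(failed), len(local_dirs), ",".join(failed))
    return ""


# Assumed invocation: health_check.py <k> <dir1> [<dir2> ...]
if __name__ == "__main__" and len(sys.argv) > 2:
    report = health_report(sys.argv[2:], int(sys.argv[1]))
    if report:
        print(report)
```

With fewer than k failed directories the script prints nothing, so the node stays healthy and the TT keeps running on the remaining disks.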


> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2413
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.204.0
>
>         Attachments: MR-2413.v0.1.patch, MR-2413.v0.2.patch, MR-2413.v0.3.patch, MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly both at startup and runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is on a bad
> disk. TaskTracker should ignore that particular mapred-local-dir and start up and use only
> the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, currently TaskTracker doesn't do
> anything special. This results in either
>    (a) TaskTracker continues to "try to use that bad disk" and this results in lots of
> task failures and possibly job failures (because of multiple TTs having bad disks) and eventually
> these TTs getting graylisted for all jobs. And this needs manual restart of TT with modified
> configuration of mapred-local-dirs avoiding the bad disk. OR
>    (b) Health check script identifying the disk as bad and the TT gets blacklisted. And
> this also needs manual restart of TT with modified configuration of mapred-local-dirs avoiding
> the bad disk.
> This JIRA is to make TaskTracker more fault-tolerant to disk failures solving (1) and
> (2). i.e. TT should start even if at least one of the mapred-local-dirs is on a good disk
> and TT should adjust its in-memory list of mapred-local-dirs and avoid using bad mapred-local-dirs.
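
The behavior described in (1) and (2) can be sketched as follows. This is an illustration in Python, not the TaskTracker's Java code; the function names and the permission check are assumptions made for the example.

```python
import os


def usable_local_dirs(configured_dirs):
    """Return the configured dirs that exist and are readable, writable, and searchable."""
    mode = os.R_OK | os.W_OK | os.X_OK
    return [d for d in configured_dirs if os.path.isdir(d) and os.access(d, mode)]


def start_with_good_dirs(configured_dirs):
    """Keep only the good dirs; refuse to start when none are left."""
    good = usable_local_dirs(configured_dirs)
    if not good:
        raise RuntimeError("no usable mapred-local-dirs")
    return good
```

Calling usable_local_dirs again at intervals while running would give the in-memory list adjustment described in (2), under the same assumptions.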

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
