hadoop-common-dev mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4480) data node process should not die if one dir goes bad
Date Mon, 27 Oct 2008 18:24:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642998#action_12642998 ]

Allen Wittenauer commented on HADOOP-4480:
------------------------------------------

I specifically targeted the HDFS framework in this bug primarily because the MR framework
issues are actually worse.  There is a very good chance that if you have multiple disks, you
have swap spread across those disks.  In the case of drive failure, this means you lose a
chunk of swap.  Loss of swap == less memory for streaming jobs == job failure in many, many
instances.  So let's not get distracted by the issues around MR, job failure, job speed, etc.

What I'm seeing is that at any given time we have 10-20% of our nodes down. The vast majority
have a single failed disk.  This means we're leaving capacity on the floor, waiting for a
drive replacement.  Why can't these machines just stay up, providing blocks and providing
space on the good drives?  For large clusters this might be a minor inconvenience, but for
small clusters it could be deadly.

The current fix is done with wetware, a source of additional strain on traditionally overloaded
operations teams.  Random failure times vs. letting the ops team decide when a data node goes
down?  This seems like a no-brainer from a practicality perspective.  Yes, this is clearly
more difficult than just killing the node completely.  But over the long haul, it is going
to be cheaper in human labor to fix this in Hadoop than to throw more admins at it.
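
Purely as a sketch of the behavior I'm after (the class and method names below are made up
for illustration, not the actual datanode code): check each configured data directory, drop
the ones that fail, and only shut down when nothing usable is left.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class DataDirHealthCheck {

      /** Return only the configured directories that are currently usable. */
      static List<File> usableDirs(List<File> configuredDirs) {
        List<File> good = new ArrayList<File>();
        for (File dir : configuredDirs) {
          if (dir.isDirectory() && dir.canRead() && dir.canWrite()) {
            good.add(dir);
          } else {
            // Log and skip the bad directory instead of killing the whole process.
            System.err.println("WARN: marking data dir unusable: " + dir);
          }
        }
        return good;
      }

      public static void main(String[] args) {
        List<File> configured = new ArrayList<File>();
        for (String path : args) {
          configured.add(new File(path));
        }
        List<File> good = usableDirs(configured);
        if (good.isEmpty()) {
          // Only give up when *every* configured directory is gone.
          System.err.println("FATAL: no usable data directories, shutting down");
          System.exit(1);
        }
        System.out.println("Continuing with " + good.size() + " of "
            + configured.size() + " configured data directories");
      }
    }

Run it against the same list of paths you'd put in dfs.data.dir and it keeps going as long
as at least one of them checks out; that's the trade I want the datanode to make.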

> data node process should not die if one dir goes bad
> ----------------------------------------------------
>
>                 Key: HADOOP-4480
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4480
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>            Reporter: Allen Wittenauer
>
> When multiple directories are configured for the data node process to use to store blocks,
> it currently exits when one of them is not writable.  Instead, it should either completely
> ignore that directory or attempt to continue reading and then mark it unusable if reads
> fail.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

