hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1848) Datanodes should shutdown when a critical volume fails
Date Thu, 21 Apr 2011 16:36:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022816#comment-13022816 ]

Eli Collins commented on HDFS-1848:
-----------------------------------

bq. I am wondering if this is necessary? Typically, critical volume (eg the volume that hosts
the OS, logs, pid, tmp dir etc.) is RAID-1 and if this goes down we can safely assume Datanode
to be down.

I don't think we should require that datanodes use RAID-1. Mirroring the boot disk (OS, logs,
pids, etc.) on every datanode wastes an extra disk per node and requires that each datanode
have a hardware RAID controller or use software RAID. More importantly, RAID-1 only lowers
the probability of this volume failing; we still have to handle the failure when it happens,
and as you point out a datanode cannot survive the failure of the boot disk.

bq. I too am not clear why the datanode process has to watch over "critical" disks. It would
be nice if the datanode considers all disks the same.

The idea is that the datanode can gracefully handle some types of volume failures but not
others. For example, the datanode should be able to survive the failure of a disk that just
hosts blocks, but it cannot survive the failure of a volume that resides on the boot disk.

Therefore, if the volume that resides on the boot disk fails, the datanode should fail-stop
and fail-fast (because it cannot tolerate this failure), but if a volume that lives on one
of the data disks fails it should continue operating (or decommission itself if the threshold
of volume failures has been reached). If the datanode treats all disks the same, it doesn't
know whether it should fail itself or tolerate the failure. Make sense?
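
To make the distinction concrete, here's a rough sketch of the decision I have in mind. The names (VolumeFailurePolicy, onVolumeFailure, the critical-volume set) are made up for illustration and are not the actual DataNode code; the only existing knob it refers to is the dfs.datanode.failed.volumes.tolerated threshold from HDFS-1161, and the critical set would come from whatever admin-supplied config key we add here.

{code}
import java.io.File;
import java.util.Set;

// Sketch only -- illustrative, not the DataNode implementation.
class VolumeFailurePolicy {
  /** What the DN should do when a volume check fails. */
  enum Action { SHUTDOWN, DECOMMISSION, REMOVE_VOLUME }

  private final Set<File> criticalVolumes;   // admin-specified critical volumes (new config, this jira)
  private final int failedVolumesTolerated;  // dfs.datanode.failed.volumes.tolerated (HDFS-1161)
  private int numFailedDataVolumes = 0;

  VolumeFailurePolicy(Set<File> criticalVolumes, int failedVolumesTolerated) {
    this.criticalVolumes = criticalVolumes;
    this.failedVolumesTolerated = failedVolumesTolerated;
  }

  Action onVolumeFailure(File failedVolume) {
    if (criticalVolumes.contains(failedVolume)) {
      // Losing the boot/OS/log volume is not survivable: fail-stop and fail-fast.
      return Action.SHUTDOWN;
    }
    if (++numFailedDataVolumes > failedVolumesTolerated) {
      // Threshold of data-disk failures exceeded: take the node out of service (HDFS-1847).
      return Action.DECOMMISSION;
    }
    // A lone data-disk failure: drop the volume and keep serving blocks.
    return Action.REMOVE_VOLUME;
  }
}
{code}

Note that a failure of a volume in the critical set bypasses the tolerated-failures count entirely, which matches the description below: it is neither counted against the threshold nor a trigger for decommissioning.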

> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>
> A DN should shutdown when a critical volume (eg the volume that hosts the OS, logs, pid,
> tmp dir etc.) fails. The admin should be able to specify which volumes are critical, eg they
> might specify the volume that lives on the boot disk. A failure in one of these volumes would
> not be subject to the threshold (HDFS-1161) or result in host decommissioning (HDFS-1847)
> as the decommissioning process would likely fail.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
