hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
Date Fri, 14 Nov 2014 20:04:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212757#comment-14212757 ]

Yongjun Zhang commented on HDFS-4239:

Hi [~qwertymaniac],

My bad that I did not notice your earlier comment until now:

{quote}
I just noticed Steve's comment referring to the same - should've gone through properly before spending google cycles. I feel HDFS-1362 implemented would solve half of this - and the other half would be to make the removals automatic. Right now checkDiskError does not eject a disk if it is merely slow - only if the check fails - and that would have to be done via this JIRA I think. The re-add would be possible via HDFS-1362.
{quote}

So we need to use the functionality provided by HDFS-1362 to automatically remove a sick disk. It seems the original goal of HDFS-4239 is the same as HDFS-1362 (right?), and we could create a new jira for automatically removing a sick disk.
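For reference, the operator-driven hot-swap flow that HDFS-1362 enables looks roughly like the sketch below. The hostname, IPC port, and volume path are illustrative, not taken from this issue:

```shell
# Remove a sick volume without a full datanode restart (sketch, assuming
# the HDFS-1362 hot-swap support is available in the running version).

# 1. On the datanode, edit hdfs-site.xml and drop the bad volume
#    (e.g. /data/3/dfs/dn) from the dfs.datanode.data.dir list.

# 2. Ask the datanode to reload the property (dn1.example.com:50020 is
#    an illustrative datanode IPC address):
hdfs dfsadmin -reconfig datanode dn1.example.com:50020 start

# 3. Poll until the reconfiguration task reports completion:
hdfs dfsadmin -reconfig datanode dn1.example.com:50020 status
```

Re-adding a replaced disk would be the same flow in reverse: put the volume back into dfs.datanode.data.dir and trigger another `-reconfig`.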


> Means of telling the datanode to stop using a sick disk
> -------------------------------------------------------
>                 Key: HDFS-4239
>                 URL: https://issues.apache.org/jira/browse/HDFS-4239
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Yongjun Zhang
>         Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch,
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally,
or just exhibiting high latency -- your choices are:
> 1. Decommission the entire datanode.  If the datanode is carrying 6 or 12 disks of data,
especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed
datanode's data can be pretty disruptive, especially if the cluster is doing low latency serving:
e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (You can't unmount
the disk while it is in use).  This latter is better in that only the bad disk's data is rereplicated,
not all datanode data.
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a
disk an operator has designated 'bad'?  This would be like option #2 above minus the need
to stop and restart the datanode.  Ideally the disk would become unmountable after a while.
> Nice to have would be being able to tell the datanode to resume using a disk after it's
been replaced.

This message was sent by Atlassian JIRA
