hadoop-hdfs-issues mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-457) better handling of volume failure in Data Node storage
Date Wed, 05 Aug 2009 21:44:15 GMT

    [ https://issues.apache.org/jira/browse/HDFS-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739773#action_12739773 ]

Tsz Wo (Nicholas), SZE commented on HDFS-457:
---------------------------------------------

- There is a second version of LOG.warn(..) which accepts an Exception object as a parameter.
 You may consider changing
{code}
+      DataNode.LOG.warn("IOException in BlockReceiver constructor. Cause is " + 
+          cause);
{code}
to
{code}
+      DataNode.LOG.warn("IOException in BlockReceiver constructor.", cause);
{code}
and likewise for the other LOG.warn/info/error calls.
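
As a minimal sketch of the difference, using java.util.logging as a stand-in for the DataNode logger (the class, handler and method names below are hypothetical, not from the patch): concatenating the exception only puts its toString() into the message, while passing it as a separate argument lets the logger keep the exception object itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogWithCauseSketch {
  /** Hypothetical handler that keeps log records in memory for inspection. */
  static class CapturingHandler extends Handler {
    final List<LogRecord> records = new ArrayList<>();
    @Override public void publish(LogRecord r) { records.add(r); }
    @Override public void flush() {}
    @Override public void close() {}
  }

  /** Logs the same failure twice: by concatenation, then with the Throwable attached. */
  static List<LogRecord> logBothWays(IOException cause) {
    Logger log = Logger.getAnonymousLogger();
    log.setUseParentHandlers(false); // keep the sketch quiet on the console
    CapturingHandler handler = new CapturingHandler();
    log.addHandler(handler);

    // Concatenation: only cause.toString() ends up in the message text.
    log.log(Level.WARNING, "IOException in BlockReceiver constructor. Cause is " + cause);
    // Separate argument: the record carries the exception, stack trace included.
    log.log(Level.WARNING, "IOException in BlockReceiver constructor.", cause);
    return handler.records;
  }

  public static void main(String[] args) {
    List<LogRecord> records = logBothWays(new IOException("disk failed"));
    System.out.println("first record has Throwable:  " + (records.get(0).getThrown() != null));
    System.out.println("second record has Throwable: " + (records.get(1).getThrown() != null));
  }
}
```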

- DataNode.checkDiskError() should not be public.

- The return type of checkDirs() should be List<FSVolume>.

- It looks like removed_vols.size() is always equal to removed, or removed_vols == null when
removed == 0.  So we may drop removed.
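
To illustrate the last two points together, here is a rough sketch with a hypothetical Volume class standing in for FSVolume (the "healthy" flag is an assumption, not the patch's actual disk check). If checkDirs() returns the list of removed volumes, the count can be derived from the list, and returning an empty list avoids the null case.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CheckDirsSketch {
  /** Hypothetical stand-in for FSVolume. */
  static class Volume {
    final String name;
    final boolean healthy; // assumed result of some disk check
    Volume(String name, boolean healthy) {
      this.name = name;
      this.healthy = healthy;
    }
    @Override public String toString() { return name; }
  }

  /** Removes failed volumes from the given list and returns them; never returns null. */
  static List<Volume> checkDirs(List<Volume> volumes) {
    List<Volume> removedVols = new ArrayList<>();
    for (Iterator<Volume> it = volumes.iterator(); it.hasNext();) {
      Volume v = it.next();
      if (!v.healthy) {
        it.remove();
        removedVols.add(v);
      }
    }
    return removedVols;
  }

  public static void main(String[] args) {
    List<Volume> volumes = new ArrayList<>();
    volumes.add(new Volume("/data1", true));
    volumes.add(new Volume("/data2", false));
    List<Volume> removed = checkDirs(volumes);
    // No separate counter is needed: the size of the list is the count.
    System.out.println("removed " + removed.size() + " volume(s): " + removed);
  }
}
```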

- We may omit the toString() calls in the following:
{code}
+          DataNode.LOG.warn("Removing failed volume " + fsv.toString() + " - " + e.getLocalizedMessage());
{code}
{code}
+          "volumes. List of current volumes: " +   toString());
{code}
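
For reference, string concatenation already converts its operands to strings, calling toString() on non-null objects and producing "null" for a null reference instead of throwing. A small sketch with a hypothetical Volume class (not the real FSVolume):

```java
public class ImplicitToStringSketch {
  /** Hypothetical stand-in for FSVolume. */
  static class Volume {
    @Override public String toString() { return "/data/vol1"; }
  }

  /** Relies on the implicit string conversion done by the + operator. */
  static String describe(Volume fsv) {
    return "Removing failed volume " + fsv;
  }

  /** Same message with an explicit call; throws NullPointerException if fsv is null. */
  static String describeExplicit(Volume fsv) {
    return "Removing failed volume " + fsv.toString();
  }

  public static void main(String[] args) {
    Volume v = new Volume();
    System.out.println(describe(v).equals(describeExplicit(v)));
    System.out.println(describe(null));
  }
}
```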

- It is hard to relate the name "keepRunning" to "checks how many valid storage volumes
are there in the DataNode".  How about changing it to something like "hasEnoughVolumes", "hasEnoughResource"
or "isSystemHealthy"?

- The second "f = null" is redundant since f == null if validateBlockFile throws an exception.
{code}
+    File f = null;;
+    try {
+      f = validateBlockFile(b);
+    } catch(IOException e) {
+      f = null;
+    }
{code}
I suggest logging a message in the catch block.
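
Something along these lines, as a sketch only: validateBlockFile(..) here is a hypothetical stand-in that takes a path rather than a Block, and java.util.logging replaces the DataNode logger.

```java
import java.io.File;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ValidateBlockSketch {
  private static final Logger LOG = Logger.getLogger("ValidateBlockSketch");

  /** Hypothetical stand-in for validateBlockFile(b): fails if the path does not exist. */
  static File validateBlockFile(String path) throws IOException {
    File f = new File(path);
    if (!f.exists()) {
      throw new IOException("Block file " + path + " does not exist");
    }
    return f;
  }

  /** Returns the validated file, or null if validation failed. */
  static File getFileOrNull(String path) {
    File f = null;
    try {
      f = validateBlockFile(path);
    } catch (IOException e) {
      // f is still null here because the assignment never completed,
      // so no second "f = null" is needed; just record what went wrong.
      LOG.log(Level.WARNING, "Validation failed for block file " + path, e);
    }
    return f;
  }

  public static void main(String[] args) {
    System.out.println(getFileOrNull(".") != null);
    System.out.println(getFileOrNull("no-such-dir/blk_0") == null);
  }
}
```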

- checkDataDir() changes volumeMap but is not synchronized on it.
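
A minimal sketch of the kind of locking I have in mind, assuming a simplified map from block id to volume name (the real volumeMap has different key and value types, and all method names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class VolumeMapSyncSketch {
  // Simplified assumption: block id -> name of the volume holding the block.
  private final Map<Long, String> volumeMap = new HashMap<>();

  void addBlock(long blockId, String volume) {
    synchronized (volumeMap) {
      volumeMap.put(blockId, volume);
    }
  }

  /** Drops all blocks stored on the failed volume; returns how many were dropped. */
  int removeBlocksOn(String failedVolume) {
    int removed = 0;
    // Hold the same lock as the other accessors while mutating the map.
    synchronized (volumeMap) {
      Iterator<Map.Entry<Long, String>> it = volumeMap.entrySet().iterator();
      while (it.hasNext()) {
        if (it.next().getValue().equals(failedVolume)) {
          it.remove();
          removed++;
        }
      }
    }
    return removed;
  }

  int size() {
    synchronized (volumeMap) {
      return volumeMap.size();
    }
  }

  public static void main(String[] args) {
    VolumeMapSyncSketch s = new VolumeMapSyncSketch();
    s.addBlock(1L, "/data1");
    s.addBlock(2L, "/data2");
    System.out.println(s.removeBlocksOn("/data2") + " removed, " + s.size() + " left");
  }
}
```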

- Also, would the new checkDataDir() implementation take a long time to execute?  We may need
some performance tests for this.

- It is better not to change the existing constant values.
{code}
-  final static int DISK_ERROR = 1;
-  final static int INVALID_BLOCK = 2;
+  final static int DISK_ERROR = 1; // there are still valid volumes on DN
+  final static int FATAL_DISK_ERROR = 2; // no valid volumes left on DN
+  final static int INVALID_BLOCK = 3;
{code}
How about setting FATAL_DISK_ERROR to 3 instead, so that INVALID_BLOCK stays at 2?
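
A sketch of the suggested numbering. The describe(..) helper is hypothetical and only shows that anything still reading the codes with the old numbering would keep seeing 2 as INVALID_BLOCK.

```java
public class ErrorCodesSketch {
  static final int DISK_ERROR = 1;       // there are still valid volumes on DN
  static final int INVALID_BLOCK = 2;    // existing value, left unchanged
  static final int FATAL_DISK_ERROR = 3; // new code: no valid volumes left on DN

  /** Hypothetical helper that maps an error code to a readable name. */
  static String describe(int code) {
    switch (code) {
      case DISK_ERROR: return "DISK_ERROR";
      case INVALID_BLOCK: return "INVALID_BLOCK";
      case FATAL_DISK_ERROR: return "FATAL_DISK_ERROR";
      default: return "UNKNOWN(" + code + ")";
    }
  }

  public static void main(String[] args) {
    for (int code = 1; code <= 3; code++) {
      System.out.println(code + " -> " + describe(code));
    }
  }
}
```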

> better handling of volume failure in Data Node storage
> ------------------------------------------------------
>
>                 Key: HDFS-457
>                 URL: https://issues.apache.org/jira/browse/HDFS-457
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>            Reporter: Boris Shkolnik
>            Assignee: Boris Shkolnik
>         Attachments: HDFS-457.patch
>
>
> Current implementation shuts DataNode down completely when one of the configured volumes of the storage fails.
> This is rather wasteful behavior because it decreases utilization (good storage becomes unavailable) and imposes extra load on the system (replication of the blocks from the good volumes). These problems will become even more prominent when we move to mixed (heterogeneous) clusters with many more volumes per Data Node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

