hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-7722) DataNode#checkDiskError should also remove Storage when error is found.
Date Tue, 10 Feb 2015 22:15:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315064#comment-14315064 ]

Colin Patrick McCabe edited comment on HDFS-7722 at 2/10/15 10:14 PM:
----------------------------------------------------------------------

{{FsDatasetSpi#checkDataDir}} is really checking volumes and identifying the ones that
have failed.  But it returns a {{Set<File>}} instead, and then we look up the {{FsVolume}}
objects again based on that.  Why?  Why not simply pass back a list of {{FsVolumeRef}}
objects that we could use directly?

Similarly, {{FsVolumeList#checkDirs}} should be returning a list of {{FsVolumeRef}}, not {{FsVolumeImpl}}.
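
For illustration, the reworked signatures might look something like this (just a sketch;
beyond the method and type names already under discussion, everything here is an assumption):

{code}
// Sketch only: hand back references to the failed volumes directly, so
// callers never have to map a Set<File> back to FsVolume objects.
Set<FsVolumeRef> checkDataDir();    // in FsDatasetSpi

List<FsVolumeRef> checkDirs();      // in FsVolumeList
{code}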

We should be rate-limiting {{FsVolumeList#checkDirs}} as well.  Please remember that this
scans *all* files on a volume, which is an expensive operation.  If we hit a bunch of I/O
errors, we could end up calling checkDirs over and over within the same minute or two.  We
should rate-limit this so that a checkDirs call on a volume is ignored if it comes less
than, say, 10 minutes after the previous one.
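
Something like the following would do (a minimal sketch: the field and constant names are
invented, it throttles the whole scan rather than per-volume for brevity, and
{{Time#monotonicNow}} is the existing org.apache.hadoop.util.Time utility):

{code}
// Hypothetical throttle inside FsVolumeList: skip the expensive full scan
// if one already ran within the last 10 minutes.
private static final long MIN_CHECK_DIRS_INTERVAL_MS = 10 * 60 * 1000;
private boolean checkedDirsBefore = false;
private long lastCheckDirsMs;

synchronized List<FsVolumeRef> checkDirs() {
  long now = Time.monotonicNow();
  if (checkedDirsBefore && now - lastCheckDirsMs < MIN_CHECK_DIRS_INTERVAL_MS) {
    return Collections.emptyList();  // a scan ran recently; ignore this call
  }
  checkedDirsBefore = true;
  lastCheckDirsMs = now;
  // ... existing scan of every volume's files, collecting failed volumes ...
}
{code}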

{code}
+        // Disable the volume from the service
+        asyncDiskService.removeVolume(fv.getCurrentDir());
{code}

Why are we duplicating all the logic of {{FsDatasetImpl#removeVolumes}} here?  If we want
{{checkDirs}} to remove a volume, it should just call that function.  That would also avoid
the need to have {{DataNode#checkDiskError}} do anything but simply call {{FsDatasetSpi#checkDirs}}.
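
Roughly (again a sketch; the {{getVolume}}/{{getCurrentDir}} wiring and the exact argument
type {{removeVolumes}} takes are assumptions here):

{code}
// Hypothetical: checkDirs reuses the existing removal path instead of
// duplicating its bookkeeping inline.
for (FsVolumeRef ref : volumes.checkDirs()) {
  removeVolumes(Collections.singleton(ref.getVolume().getCurrentDir()));
}
{code}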


> DataNode#checkDiskError should also remove Storage when error is found.
> -----------------------------------------------------------------------
>
>                 Key: HDFS-7722
>                 URL: https://issues.apache.org/jira/browse/HDFS-7722
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-7722.000.patch
>
>
> When {{DataNode#checkDiskError}} finds disk errors, it removes all block metadata from
> {{FsDatasetImpl}}. However, it does not remove the corresponding {{DataStorage}} and
> {{BlockPoolSliceStorage}}.
>
> The result is that we cannot directly run {{reconfig}} to hot-swap the failed disks
> without changing the configuration file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
