hadoop-hdfs-issues mailing list archives

From "Eli Collins (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3044) fsck move should be non-destructive by default
Date Sat, 10 Mar 2012 00:46:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226646#comment-13226646 ]

Eli Collins commented on HDFS-3044:
-----------------------------------

- The new boolean destructive is unused
- FsckOperation is kind of overkill; it's probably simpler to have two booleans, since these are independent operations (see the sketch after this list):
-- salvageCorruptFiles, whether to copy whatever blocks are left to lost+found
-- deleteCorruptFiles, whether to delete corrupt files
- Let's rename lostFoundMove to something like copyBlocksToLostFound to reflect what this method actually does, and update the warning accordingly, since we didn't really copy the file (perhaps "copied accessible blocks for file X")
- Let's rename testFsckMove to testFsckMoveAndDelete and add a testFsckMove that tests that fsck move is not destructive
- Per the last bullet in the description, it would be good to at least log at INFO level the number of datanodes that have checked in, so an admin can see whether the number looks off (and doesn't run a destructive operation before waiting for DNs to check in)
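
To make that concrete, here's a minimal sketch of the two-boolean shape; the names (FsckOptions, handleCorruptFile, copyBlocksToLostFound) are illustrative stand-ins, not the actual NamenodeFsck internals:

{code:java}
/**
 * Hypothetical sketch only, not the actual NamenodeFsck code: two
 * independent booleans instead of an FsckOperation enum. All names
 * here are illustrative.
 */
public class FsckOptions {

  /** Whether to copy whatever blocks are left to lost+found. */
  final boolean salvageCorruptFiles;
  /** Whether to delete corrupt files. */
  final boolean deleteCorruptFiles;

  FsckOptions(boolean salvageCorruptFiles, boolean deleteCorruptFiles) {
    this.salvageCorruptFiles = salvageCorruptFiles;
    this.deleteCorruptFiles = deleteCorruptFiles;
  }

  /** The two operations are independent, so they compose freely. */
  void handleCorruptFile(String path) {
    if (salvageCorruptFiles) {
      copyBlocksToLostFound(path); // formerly lostFoundMove
    }
    if (deleteCorruptFiles) {
      deleteFile(path);
    }
  }

  void copyBlocksToLostFound(String path) {
    // Would copy each accessible block of 'path' into /lost+found,
    // leaving the original file untouched.
    System.out.println("copied accessible blocks for file " + path);
  }

  void deleteFile(String path) {
    System.out.println("deleted " + path);
  }

  public static void main(String[] args) {
    // Salvage-only run: copies blocks, never deletes the original.
    new FsckOptions(true, false).handleCorruptFile("/user/example/data");
  }
}
{code}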

                
> fsck move should be non-destructive by default
> ----------------------------------------------
>
>                 Key: HDFS-3044
>                 URL: https://issues.apache.org/jira/browse/HDFS-3044
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3044.001.patch
>
>
> The fsck move behavior in the code, as originally articulated in HADOOP-101, is:
> {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to "fix" them would be to recover chains of blocks and put them into lost+found{quote}
> A directory is created with the file name, the blocks that are accessible are copied into it as individual files, and then the original file is removed.
> I suspect the rationale for this behavior was that you can't use files that are missing block locations, and copying the blocks out as files at least makes part of the file accessible. However, this behavior can also result in permanent data loss, e.g.:
> - Some datanodes don't come up and check in on cluster startup (e.g. due to HW issues); files whose blocks have all their replicas on this set of datanodes are marked corrupt
> - The admin runs fsck move, which deletes the "corrupt" files and saves whatever blocks were available
> - The HW issues with the datanodes are resolved; they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files, since the files were deleted.
> I think we should:
> - Make fsck move non-destructive by default (e.g. just does a move into lost+found)
> - Make the destructive behavior optional (e.g. a "--destructive" flag, so admins have to think about what they're doing)
> - Provide better sanity checks and warnings, e.g. if you're running fsck and not all the slaves have checked in (if using dfs.hosts), fsck should print a warning that an admin has to explicitly override before doing anything destructive (see the sketch below)
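
For the last bullet, here's a minimal sketch of such a check; the method names and counts (safeToRunDestructive, expectedDatanodes) are illustrative assumptions, not the actual NameNode API:

{code:java}
/**
 * Hypothetical sketch of the proposed sanity check, not actual HDFS
 * code: before anything destructive, compare the number of datanodes
 * that have checked in against the number expected (e.g. from
 * dfs.hosts) and require an explicit admin override when they differ.
 */
public class FsckSanityCheck {

  static boolean safeToRunDestructive(int liveDatanodes,
                                      int expectedDatanodes,
                                      boolean adminOverride) {
    // Log at INFO level so the admin can eyeball the count.
    System.out.printf("INFO: %d of %d expected datanodes have checked in%n",
        liveDatanodes, expectedDatanodes);
    if (liveDatanodes < expectedDatanodes) {
      System.err.println("WARNING: not all datanodes have checked in; "
          + "blocks may be reported missing spuriously.");
      return adminOverride; // destructive ops blocked unless overridden
    }
    return true;
  }

  public static void main(String[] args) {
    // 47 of 50 datanodes up: destructive fsck is refused by default.
    System.out.println(safeToRunDestructive(47, 50, false)); // prints false
  }
}
{code}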


        
