Return-Path:
X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 06CB590FF
    for ; Thu, 29 Mar 2012 21:11:51 +0000 (UTC)
Received: (qmail 24783 invoked by uid 500); 29 Mar 2012 21:11:50 -0000
Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org
Received: (qmail 24753 invoked by uid 500); 29 Mar 2012 21:11:50 -0000
Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hdfs-issues@hadoop.apache.org
Delivered-To: mailing list hdfs-issues@hadoop.apache.org
Received: (qmail 24744 invoked by uid 99); 29 Mar 2012 21:11:50 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 29 Mar 2012 21:11:50 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD
X-Spam-Check-By: apache.org
Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 29 Mar 2012 21:11:48 +0000
Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116])
    by hel.zones.apache.org (Postfix) with ESMTP id 4C97634DE9A
    for ; Thu, 29 Mar 2012 21:11:27 +0000 (UTC)
Date: Thu, 29 Mar 2012 21:11:27 +0000 (UTC)
From: "Colin Patrick McCabe (Updated) (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <1697573758.35110.1333055487328.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <684894537.22835.1330965958382.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[
https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-3044:
---------------------------------------

    Attachment:     (was: HDFS-3050-b1.001.patch)

> fsck move should be non-destructive by default
> ----------------------------------------------
>
>                 Key: HDFS-3044
>                 URL: https://issues.apache.org/jira/browse/HDFS-3044
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.0.0
>
>         Attachments: HDFS-3044-b1.002.patch, HDFS-3044.002.patch, HDFS-3044.003.patch
>
>
> The fsck move behavior in the code and originally articulated in HADOOP-101 is:
> {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to "fix" them would be to recover chains of blocks and put them into lost+found{quote}
> A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed.
> I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the blocks as files at least makes part of the file accessible. However, this behavior can also result in permanent data loss. E.g.:
> - Some datanodes don't come up (e.g. due to hardware issues) and check in on cluster startup; files with blocks whose replicas are all on this set of datanodes are marked corrupt
> - Admin does fsck move, which deletes the "corrupt" files and saves whatever blocks were available
> - The hardware issues with the datanodes are resolved; they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files since the files were deleted.
> I think we should:
> - Make fsck move non-destructive by default (e.g. it just does a move into lost+found)
> - Make the destructive behavior optional (e.g. "--destructive", so admins think about what they're doing)
> - Provide better sanity checks and warnings, e.g. if you're running fsck and not all the slaves have checked in (if using dfs.hosts), then fsck should print a warning indicating this, which an admin should have to override if they want to do something destructive

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
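
The failure scenario and the proposed default from the issue description can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyNamespace`, `fsck_move`, `datanode_rejoins`, and `run_scenario` are hypothetical names invented for this example, and none of it is the NameNode's real data model or the code in the attached patches. It assumes one file `/a` with two blocks, where the only replica of `b2` lives on a datanode (`dn2`) that has not yet checked in.

```python
from dataclasses import dataclass


@dataclass
class ToyNamespace:
    """Hypothetical stand-in for NameNode state, not the real HDFS data model."""
    files: dict              # path -> ordered list of block ids
    live_replicas: dict      # block id -> set of datanodes reporting it
    expected_datanodes: set  # e.g. the hosts listed via dfs.hosts
    checked_in: set          # datanodes that have reported so far

    def is_corrupt(self, path):
        # A file counts as corrupt if any of its blocks has no live replica.
        return any(not self.live_replicas.get(b) for b in self.files[path])


def fsck_move(ns, path, destructive=False, override=False):
    """Toy 'fsck move': non-destructive unless explicitly requested."""
    if not ns.is_corrupt(path):
        return
    target = "/lost+found" + path
    if not destructive:
        # Proposed default: a plain rename, so the block list stays intact.
        ns.files[target] = ns.files.pop(path)
        return
    # Sanity check from the proposal: refuse while datanodes are missing,
    # unless the admin explicitly overrides.
    missing = ns.expected_datanodes - ns.checked_in
    if missing and not override:
        raise RuntimeError(f"datanodes not checked in: {sorted(missing)}")
    # Behavior described in the issue: salvage reachable blocks as
    # individual files in a directory, then remove the original file.
    for i, block in enumerate(ns.files[path]):
        if ns.live_replicas.get(block):
            ns.files[f"{target}/{i}"] = [block]
    del ns.files[path]


def datanode_rejoins(ns, datanode, reported_blocks):
    """Process a block report; return the blocks the datanode must delete."""
    ns.checked_in.add(datanode)
    referenced = {b for blocks in ns.files.values() for b in blocks}
    to_delete = set()
    for block in reported_blocks:
        if block in referenced:
            ns.live_replicas.setdefault(block, set()).add(datanode)
        else:
            to_delete.add(block)  # no file references it any more
    return to_delete


def run_scenario(destructive, override=False):
    ns = ToyNamespace(
        files={"/a": ["b1", "b2"]},
        live_replicas={"b1": {"dn1"}},
        expected_datanodes={"dn1", "dn2"},
        checked_in={"dn1"},
    )
    fsck_move(ns, "/a", destructive=destructive, override=override)
    return ns, datanode_rejoins(ns, "dn2", {"b2"})
```

In this model, the destructive path (with the override) leaves no file referencing `b2`, so the returning datanode is told to delete it. The non-destructive path keeps the full block list under `/lost+found/a`, and the file becomes healthy again once `dn2` reports in. Without the override, the destructive path is refused because `dn2` has not checked in.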