Date: Fri, 4 May 2012 07:43:18 +0000 (UTC)
From: "Konstantin Shvachko (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Message-ID: <1911477551.26100.1336117398309.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <2086549363.26065.1336115568729.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.
     [ https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-3368:
--------------------------------------

         Description: All replicas of a block can be removed if bad DataNodes come up and down during cluster restart resulting in data loss.  (was: All replicas of a block can be removed if bad DataNodes come up and down during cluter restart resulting in data loss.)
    Target Version/s: 0.22.1, 2.0.0, 3.0.0  (was: 3.0.0, 2.0.0, 0.22.1)

- A block b has 3 replicas initially located on DNs do1, do2, do3.
- At different times all three nodes malfunctioned and died, causing the replicas to be migrated to dn1, dn2, dn3.
- do1, do2, do3 were not added to the exclude list, so when the cluster restarts they are brought up along with dn1, dn2, dn3.
- NN sees 6 replicas for block b and correctly decides to remove 3 of them. {{BlockPlacementPolicyDefault.chooseReplicaToDelete()}} selects three targets to be deleted based on the free space remaining on the DNs deemed to possess replicas. dn1, dn2, dn3 are the most likely targets for replica deletion because they have been on the cluster longer than do1, do2, do3 and therefore are likely to have less free space.
- As expected, do1, do2, do3 malfunction again and go down shortly after reporting their blocks to NN.
- It will take 10 minutes for NN to recognize that do1, do2, do3 are dead. By that time the replicas will have been removed from the good nodes, resulting in data loss.

This is a real story seen in production. I verified that all major versions are affected.

> Missing blocks due to bad DataNodes comming up and down.
> --------------------------------------------------------
>
>                 Key: HDFS-3368
>                 URL: https://issues.apache.org/jira/browse/HDFS-3368
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>
> All replicas of a block can be removed if bad DataNodes come up and down during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
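
The failure sequence described in this update can be illustrated with a small toy model. This is only a sketch, not Hadoop code: the DataNode class, the function names, and the free-space figures are all hypothetical, and the selection rule models only the free-space criterion mentioned in the description, not everything chooseReplicaToDelete() may take into account.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    name: str
    free_space_gb: int  # hypothetical figure, for illustration only
    healthy: bool       # False for the nodes that go down again


def choose_replicas_to_delete(holders, replication):
    """Toy stand-in for the free-space heuristic: pick the excess
    replicas on the nodes with the least free space."""
    excess = len(holders) - replication
    if excess <= 0:
        return []
    by_free_space = sorted(holders, key=lambda dn: dn.free_space_gb)
    return by_free_space[:excess]


def surviving_replicas(holders, replication):
    """Names of nodes still holding the block after the excess replicas
    are deleted and the unhealthy nodes have gone down again."""
    deleted = {dn.name for dn in choose_replicas_to_delete(holders, replication)}
    remaining = [dn for dn in holders if dn.name not in deleted]
    return [dn.name for dn in remaining if dn.healthy]


# Assumed numbers: the old, faulty nodes were out of service for a while
# and so have more free space than the nodes that took over their replicas.
old_nodes = [DataNode("do1", 900, False),
             DataNode("do2", 910, False),
             DataNode("do3", 920, False)]
new_nodes = [DataNode("dn1", 200, True),
             DataNode("dn2", 210, True),
             DataNode("dn3", 220, True)]
holders = old_nodes + new_nodes

targets = [dn.name for dn in choose_replicas_to_delete(holders, 3)]
print("deleted from:", targets)
print("surviving replicas:", surviving_replicas(holders, 3))
```

Under these assumed numbers the three healthy nodes are selected for deletion and no replica survives once the faulty nodes go down, which matches the data loss reported above.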