Date: Tue, 21 Oct 2014 16:54:35 +0000 (UTC)
From: "Kihwal Lee (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-6964) NN fails to fix under replication leading to data loss

     [ https://issues.apache.org/jira/browse/HDFS-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6964:
-----------------------------
    Target Version/s: 2.7.0  (was: 2.6.0)

> NN fails to fix under replication leading to data loss
> ------------------------------------------------------
>
>                 Key: HDFS-6964
>                 URL: https://issues.apache.org/jira/browse/HDFS-6964
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Priority: Blocker
>
> We've encountered lost blocks due to node failure even when there was ample time to fix the under-replication.
> Two nodes were lost. The third node, which held the last remaining replicas, averaged one block copy per heartbeat (3s) until it too was lost about 7 hours later, resulting in over 50 lost blocks. When that node was restarted and sent its block report, the NN immediately began fixing the replication.
> In another data-loss event, over 150 blocks were lost due to node failure, but the timing of the node loss is not known, so unlike the first case there may not have been enough time to fix the under-replication.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
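
The first incident above implies the NameNode was scheduling re-replication at roughly one block per 3-second heartbeat. The sketch below is a back-of-the-envelope illustration of why such a rate leaves a long data-loss window; it is not the actual NameNode/BlockManager logic, and the class name, backlog size, and per-heartbeat rate are assumptions chosen only to show the arithmetic.

// Illustrative sketch only -- not the actual HDFS BlockManager code.
// It models the behavior described in the report: if the NameNode schedules
// only a small number of re-replications per 3-second heartbeat, a large
// backlog of under-replicated blocks takes hours to drain, leaving a window
// in which further node failures cause data loss.
public class ReplicationBacklogEstimate {

    // Default DataNode heartbeat interval, as referenced in the report.
    static final long HEARTBEAT_INTERVAL_SECONDS = 3;

    /** Seconds needed to re-replicate `backlog` blocks at `blocksPerHeartbeat`. */
    static long secondsToClear(long backlog, long blocksPerHeartbeat) {
        long heartbeats = (backlog + blocksPerHeartbeat - 1) / blocksPerHeartbeat;
        return heartbeats * HEARTBEAT_INTERVAL_SECONDS;
    }

    public static void main(String[] args) {
        // At ~1 block copy per heartbeat (the rate observed in the first incident),
        // a hypothetical backlog of 10,000 under-replicated blocks needs ~8.3 hours.
        long secs = secondsToClear(10_000, 1);
        System.out.printf("backlog=10000 blocks, 1 copy/heartbeat -> %.1f hours%n",
                secs / 3600.0);
    }
}

At that rate, a backlog in the thousands of blocks keeps a lone surviving replica exposed for hours, which is consistent with the ~7-hour window described in the first incident.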