Date: Fri, 20 Jan 2012 23:50:39 +0000 (UTC)
From: "Todd Lipcon (Updated) (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <1817431958.62166.1327103439968.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1099139534.315.1325566341782.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test


     [ https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-2742:
------------------------------

    Attachment: hdfs-2742.txt

OK, here's a patch which addresses several issues found since the previous revision. I've been testing on a real cluster running HBase with a combination of graceful failovers, full restarts, etc., and I think I've ironed out the bugs. I also added a number of new asserts to expose any places where we might have further bugs (and I am running my cluster with assertions enabled).

> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
>                 Key: HDFS-2742
>                 URL: https://issues.apache.org/jira/browse/HDFS-2742
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, log-colorized.txt
>
>
> The replication stress test case failed over the weekend because one of the replicas went missing. Still diagnosing the issue, but it seems like the chain of events was something like:
> - A block report was generated on one of the nodes while the block was being written, so the block report listed the block as RBW.
> - When the standby replayed this queued message, it was replayed after the file was marked complete. Thus it marked this replica as corrupt.
> - It asked the DN holding the corrupt replica to delete it and, I think, removed it from the block map at this time.
> - That DN then did another block report before receiving the deletion. This caused the replica to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file, and the standby counted the above replica as non-corrupt and asked for the other replicas to be deleted.
> - All replicas were lost.

--
This message is automatically generated by JIRA.
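
The chain of events in the quoted description can be replayed in a toy model, which may help when reasoning about where a guard is needed. The following is only a sketch under stated assumptions: `ReplicaLossSketch`, `ToyBlockMap`, `processReport`, and `lowerReplication` are hypothetical names invented for illustration and are not HDFS classes or methods, the DataNode names are made up, and the "keep the first replica in sorted order" policy is chosen only so that the stale replica is the one retained, as in the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the reported event chain; none of these names are real HDFS classes. */
final class ReplicaLossSketch {

    enum ReportedState { RBW, FINALIZED }

    /** Hypothetical stand-in for the standby's view of a single block. */
    static final class ToyBlockMap {
        final Set<String> countedReplicas = new TreeSet<>();      // DNs counted as non-corrupt
        final List<String> pendingDeletions = new ArrayList<>();  // deletions sent, not yet applied
        boolean fileComplete;

        void processReport(String dn, ReportedState state) {
            if (fileComplete && state == ReportedState.RBW) {
                // Stale RBW report replayed after completion: the replica looks corrupt.
                countedReplicas.remove(dn);
                pendingDeletions.add(dn);
            } else {
                // The flaw being illustrated: a later FINALIZED report re-adds the
                // replica even though a deletion for it is still in flight.
                countedReplicas.add(dn);
            }
        }

        void lowerReplication(int target) {
            // Assumption: keep the first `target` replicas in sorted order, delete the rest.
            List<String> ordered = new ArrayList<>(countedReplicas);
            int keep = Math.min(target, ordered.size());
            for (String dn : ordered.subList(keep, ordered.size())) {
                countedReplicas.remove(dn);
                pendingDeletions.add(dn);
            }
        }
    }

    /** Replays the chain of events and returns how many physical replicas survive. */
    static int simulate() {
        Set<String> onDisk = new TreeSet<>(List.of("dn1", "dn2", "dn3"));
        ToyBlockMap map = new ToyBlockMap();

        // dn2 and dn3 report finalized replicas; dn1's RBW report is still queued.
        map.processReport("dn2", ReportedState.FINALIZED);
        map.processReport("dn3", ReportedState.FINALIZED);
        map.fileComplete = true;

        map.processReport("dn1", ReportedState.RBW);        // queued report replayed late
        map.processReport("dn1", ReportedState.FINALIZED);  // new report beats the deletion
        map.lowerReplication(1);                            // keeps dn1, deletes dn2 and dn3

        // The DataNodes finally act on every deletion they were sent.
        onDisk.removeAll(map.pendingDeletions);
        return onDisk.size();
    }

    public static void main(String[] args) {
        System.out.println("surviving replicas: " + simulate());
    }
}
```

Running `main` prints `surviving replicas: 0`. In this toy, refusing to re-add a replica while a deletion for it is still pending would leave dn2 as the retained copy; whether the attached patch takes that approach is not stated in this message.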