Date: Thu, 17 Jun 2010 00:10:42 -0400 (EDT)
From: "Thanh Do (JIRA)"
To: hdfs-dev@hadoop.apache.org
Subject: [jira] Created: (HDFS-1225) Block lost when primary crashes in recoverBlock

Block lost when primary crashes in recoverBlock
-----------------------------------------------

                 Key: HDFS-1225
                 URL: https://issues.apache.org/jira/browse/HDFS-1225
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: data-node
    Affects Versions: 0.20.1
            Reporter: Thanh Do

- Summary:
Block is lost if the primary datanode crashes in the middle of tryUpdateBlock.

- Setup:
# available datanodes = 2
# replicas = 2
# disks / datanode = 1
# failures = 1
# failure type = crash
When/where failure happens = (see below)

- Details:
Suppose we have 2 datanodes, dn1 and dn2, and dn1 is the primary.
The client appends to blk_X_1001 and a crash happens during dn1.recoverBlock,
at the point after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.

**Interesting**: in this case, block X is eventually lost. Why?
After dn1.recoverBlock crashes at the rename, what is left in dn1's current directory is:
1) blk_X
2) blk_X_1001.meta_tmp1002
==> this is an invalid block, because it has no meta file associated with it.

dn2 (after the dn1 crash) now contains:
1) blk_X
2) blk_X_1002.meta
(note that the rename at dn2 has completed, because dn1 called dn2.updateBlock()
before calling its own updateBlock())

But namenode.commitBlockSynchronization is never called, because dn1 has crashed.
Therefore, from the namenode's point of view, block X still has generation stamp (GS) 1001,
which matches neither replica: dn1's replica is invalid and dn2's replica has GS 1002.
Hence, the block is lost.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html

For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
Haryadi Gunawi (haryadi@eecs.berkeley.edu)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
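
The failure sequence described under "Details" can be illustrated with a small, self-contained model. This is only a sketch, not Hadoop code: the classes DataNode and NameNode, the functions recover_block and run_scenario, and the crash_after_tmp_rename flag are all hypothetical names invented for the illustration. The sketch assumes that a replica is usable only if it has a block file plus a .meta file whose generation stamp equals the one the namenode has recorded, which is the reasoning used in the report above.

```python
class Crash(Exception):
    """Simulated datanode crash (hypothetical, for illustration only)."""


class DataNode:
    def __init__(self, name, block_id, gen_stamp):
        self.name = name
        # File names in this datanode's "current" directory.
        self.files = {f"blk_{block_id}", f"blk_{block_id}_{gen_stamp}.meta"}

    def update_block(self, block_id, old_gs, new_gs, crash_after_tmp_rename=False):
        old_meta = f"blk_{block_id}_{old_gs}.meta"
        tmp_meta = f"{old_meta}_tmp{new_gs}"
        new_meta = f"blk_{block_id}_{new_gs}.meta"
        # Step 1: rename the meta file to a temporary name.
        self.files.remove(old_meta)
        self.files.add(tmp_meta)
        if crash_after_tmp_rename:
            raise Crash(self.name)  # the failure point described in the report
        # Step 2: rename the temporary file to carry the new generation stamp.
        self.files.remove(tmp_meta)
        self.files.add(new_meta)

    def valid_gen_stamp(self, block_id):
        """Return the generation stamp of a valid replica, or None if invalid."""
        if f"blk_{block_id}" not in self.files:
            return None
        prefix, suffix = f"blk_{block_id}_", ".meta"
        for name in self.files:
            if name.startswith(prefix) and name.endswith(suffix):
                return int(name[len(prefix):-len(suffix)])
        return None  # block file without a meta file


class NameNode:
    def __init__(self):
        self.gen_stamp = {}

    def commit_block_synchronization(self, block_id, new_gs):
        self.gen_stamp[block_id] = new_gs


def recover_block(primary, others, namenode, block_id, new_gs, crash_primary):
    old_gs = namenode.gen_stamp[block_id]
    # The primary updates the other replicas before its own copy.
    for dn in others:
        dn.update_block(block_id, old_gs, new_gs)
    primary.update_block(block_id, old_gs, new_gs,
                         crash_after_tmp_rename=crash_primary)
    # Never reached if the primary crashes above.
    namenode.commit_block_synchronization(block_id, new_gs)


def run_scenario():
    nn = NameNode()
    nn.gen_stamp["X"] = 1001
    dn1 = DataNode("dn1", "X", 1001)  # primary
    dn2 = DataNode("dn2", "X", 1001)
    try:
        recover_block(dn1, [dn2], nn, "X", 1002, crash_primary=True)
    except Crash:
        pass
    expected = nn.gen_stamp["X"]
    usable = [dn.name for dn in (dn1, dn2)
              if dn.valid_gen_stamp("X") == expected]
    return dn1.files, dn2.files, expected, usable


if __name__ == "__main__":
    print(run_scenario())
```

In this model, run_scenario() ends with the namenode still at GS 1001 while no datanode holds a valid replica with that stamp, which mirrors the lost-block outcome reported above. How a real datanode treats the leftover meta_tmp file after restart is outside what this sketch models.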