Date: Thu, 3 Apr 2014 02:40:16 +0000 (UTC)
From: "Tsz Wo Nicholas Sze (JIRA)"
To: hdfs-dev@hadoop.apache.org
Reply-To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Resolved] (HDFS-1336) TruncateBlock does not update in-memory information correctly

     [ https://issues.apache.org/jira/browse/HDFS-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo Nicholas Sze resolved HDFS-1336.
---------------------------------------
    Resolution: Not a Problem

I guess that this is not a problem anymore.  Please feel free to reopen this if I am wrong.  Resolving ...

> TruncateBlock does not update in-memory information correctly
> -------------------------------------------------------------
>
>                 Key: HDFS-1336
>                 URL: https://issues.apache.org/jira/browse/HDFS-1336
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.20-append
>            Reporter: Thanh Do
>
> - Component: data node
>
> - Version: 0.20-append
>
> - Summary: we found a case where, when a block is truncated during updateBlock,
> the length in ongoingCreates is not updated, which leads to a failed append.
>
> - Setup:
> # disks / datanode = 3
> # failures = 2
> failure type = crash
> When/where failure happens = (see below)
>
> - Details:
> 1) The client writes to dn1-dn2-dn3. The write succeeds.
> 2) Now the client tries to append. It first calls dn1.recoverBlock(). This recoverBlock succeeds.
> 3) Suppose the pipeline is dn3-dn2-dn1. The client sends a packet to dn3.
> dn3 forwards the packet to dn2 and writes it to its own disk (i.e. dn3's disk).
> Now, *dn2 crashes*, so dn1 has not received this packet yet.
> 4) The client calls dn1.recoverBlock() again, this time with dn3-dn1 in the pipeline.
> dn1 then calls dn3.startBlockRecovery() to terminate the writer thread in dn3,
> get the *in-memory* metadata info of the block, and verify that info against
> the real file on disk.
> dn3 maintains an in-memory data structure called *ongoingCreates* to record
> information about blocks that are currently being created. Once a block is finalized,
> its info is removed from *ongoingCreates*.
>
> Now suppose that at the time dn3 receives the startBlockRecovery() request from dn1,
> it has:
> + finished writing data to disk (hence, the block length on disk is 1024)
> + set the visible in-memory length (hence, the in-memory length is also 1024)
> but it *has not* finalized the block, hence the block info is still in *ongoingCreates*.
> (Note: interrupting the writer thread means the finalization never happens.)
>
> Because of all of the above, dn3 gives dn1 info about the block with length 1024.
>
> 5. Now dn1 calls its own startBlockRecovery() successfully (because the on-disk
> file length and the in-memory file length match; both are 512 bytes).
>
> 6. Now dn1 has a sync list (block_X_GS1 at dn1 with length 512, block_X_GS1 at dn3 with length 1024).
> It needs to make sure all datanodes in the pipeline agree on the new GS and length.
> dn1 calls NN.nextGS() to get the new GS2. It forms the new block_X_GS2 with length 512 and
> calls updateBlock on dn3 and on itself.
>
> 7. dn3, on receiving the updateBlock request from dn1, does the following:
> + renames the block from block_X_GS1 ==> block_X_GS2
> + truncates the block file length from 1024 to 512
> But, and here is the key, it *does not update the length of the block kept in ongoingCreates*
> + returns to dn1 successfully
>
> 8. Now dn1 calls its own updateBlock and *crashes*.
>
> 9. From the client's point of view, dn1.recoverBlock fails.
> It retries dn1.recoverBlock six times, then declares dn1 bad.
>
> 10. The client now calls dn3.recoverBlock().
>
> 11. dn3 in turn calls its startBlockRecovery() to
> + interrupt block writer threads, if any
> + getBlockMetadataInfo (as part of forming the syncList, and updateBlock later)
>
> It first looks into ongoingCreates to see whether the block info is there,
> and finds it (because the block is not finalized).
> Hence, the in-memory length is 1024 (even though truncateBlock was called before).
>
> It then verifies the in-memory length (1024) against the on-disk length (512).
> Hence, the *un-matched file length exception*.
>
> 12. From the client's point of view, recoverBlock fails because *All data nodes are bad*.
> The client retries dn3.recoverBlock five more times and gets the same exception.
> Hence, the append fails.
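
Steps 7 and 11 above can be reproduced in miniature. The following is only a toy sketch of the reported behaviour, not the actual datanode code: the class ToyDataNode, its two maps, and its method names are all hypothetical stand-ins for the block file on disk and for ongoingCreates, and the rename from GS1 to GS2 is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of dn3 for steps 7 and 11. All names are hypothetical;
// this is not the real datanode implementation.
class ToyDataNode {
    // Stand-in for block file lengths on disk, keyed by block name.
    private final Map<String, Long> onDiskLength = new HashMap<>();
    // Stand-in for ongoingCreates: lengths of blocks that are not yet finalized.
    private final Map<String, Long> ongoingCreates = new HashMap<>();

    // The writer has written `length` bytes and set the visible length,
    // but the block is never finalized (step 4).
    void write(String block, long length) {
        onDiskLength.put(block, length);
        ongoingCreates.put(block, length);
    }

    // Step 7 as reported: the file is truncated, ongoingCreates is left untouched.
    void updateBlockAsReported(String block, long newLength) {
        onDiskLength.put(block, newLength);
    }

    // Step 11: use the in-memory entry if present and verify it against the disk.
    long getBlockMetadataLength(String block) {
        long diskLen = onDiskLength.get(block);
        Long memLen = ongoingCreates.get(block);
        if (memLen != null && memLen != diskLen) {
            throw new IllegalStateException(
                "unmatched file length: in-memory=" + memLen + ", on-disk=" + diskLen);
        }
        return diskLen;
    }

    public static void main(String[] args) {
        ToyDataNode dn3 = new ToyDataNode();
        dn3.write("block_X", 1024);                 // step 4
        dn3.updateBlockAsReported("block_X", 512);  // step 7
        try {
            dn3.getBlockMetadataLength("block_X");  // step 11
            System.out.println("recovery succeeded");
        } catch (IllegalStateException e) {
            System.out.println("recovery failed: " + e.getMessage());
        }
    }
}
```

In this model the stale 1024 entry survives the truncation, so every later call to getBlockMetadataLength throws, which mirrors the repeated failure the client sees in step 12.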
>
> Note:
> - To fix it, I think that when truncating the file we need to update ongoingCreates too
> (but I am not sure whether fixing it this way would affect any other workload).
> - Interestingly, NN.leaseRecovery fails because of the same exception at dn3.
> - Until a dead node restarts and NN.leaseRecovery is triggered again, the NN is still the lease holder of the file.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
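
The fix suggested in the reporter's note (update ongoingCreates when the block file is truncated) might look roughly like the sketch below. It is an illustration under assumptions, not a patch: ToyDataNodeFixed and its methods are hypothetical names, the two maps merely stand in for the on-disk file and for ongoingCreates, and whether such a change affects other workloads is exactly the open question raised in the note.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the suggested fix. All names are hypothetical;
// this is not the real datanode implementation.
class ToyDataNodeFixed {
    // Stand-in for block file lengths on disk, keyed by block name.
    private final Map<String, Long> onDiskLength = new HashMap<>();
    // Stand-in for ongoingCreates: lengths of blocks that are not yet finalized.
    private final Map<String, Long> ongoingCreates = new HashMap<>();

    // The writer has written `length` bytes but the block is never finalized.
    void write(String block, long length) {
        onDiskLength.put(block, length);
        ongoingCreates.put(block, length);
    }

    // Truncate the file and keep the in-memory record in sync with it.
    void updateBlock(String block, long newLength) {
        onDiskLength.put(block, newLength);
        // Map.replace only changes the entry if the block is still being created.
        ongoingCreates.replace(block, newLength);
    }

    // Same verification as in the failing scenario: in-memory vs. on-disk length.
    long getBlockMetadataLength(String block) {
        long diskLen = onDiskLength.get(block);
        Long memLen = ongoingCreates.get(block);
        if (memLen != null && memLen != diskLen) {
            throw new IllegalStateException(
                "unmatched file length: in-memory=" + memLen + ", on-disk=" + diskLen);
        }
        return diskLen;
    }

    public static void main(String[] args) {
        ToyDataNodeFixed dn3 = new ToyDataNodeFixed();
        dn3.write("block_X", 1024);
        dn3.updateBlock("block_X", 512);
        System.out.println("length after recovery: " + dn3.getBlockMetadataLength("block_X"));
    }
}
```

Using replace rather than put keeps finalized blocks out of the in-memory map, so in this model the truncation only touches entries that are still being created.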