From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Wed, 23 Jun 2010 21:13:50 -0400 (EDT)
Subject: [jira] Updated: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps
Message-ID: <6116053.30681277342030895.JavaMail.jira@thor>
In-Reply-To: <11977491.5271277253052195.JavaMail.jira@thor>

[ https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1260:
------------------------------

Attachment:
hdfs-1260.txt

Here's a patch that moves the call over to the adapter. Also added a bit of javadoc to DelayAnswer.

> 0.20: Block lost when multiple DNs trying to recover it to different genstamps
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-1260
>                 URL: https://issues.apache.org/jira/browse/HDFS-1260
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>             Fix For: 0.20-append
>
>         Attachments: hdfs-1260.txt, hdfs-1260.txt
>
>
> Saw this issue on a cluster where some ops people were doing network changes without shutting down DNs first. As a result, recovery was started at multiple different DNs at the same time, and a race condition caused a block to get permanently stuck in recovery mode. What seems to have happened is the following:
> - FSDataset.tryUpdateBlock was called with old genstamp 7091 and new genstamp 7094, while the block in the volumeMap (and on the filesystem) was at genstamp 7093
> - we find the block file and meta file based on block ID only, without comparing gen stamp
> - we rename the meta file to the new genstamp _7094
> - in updateBlockMap, we do the comparison in the volumeMap by oldblock *without* wildcard GS, so it does *not* update volumeMap
> - validateBlockMetadata now fails with "blk_7739687463244048122_7094 does not exist in blocks map"
> After this point, all future recovery attempts to that node fail in getBlockMetaDataInfo: it finds the _7094 gen stamp in getStoredBlock (since the meta file got renamed above) and then fails in validateBlockMetadata, since _7094 isn't in volumeMap.
> Making a unit test for this is probably going to be difficult, but doable.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
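
The sequence of steps in the issue description can be modelled in a few lines of Java. This is only a rough sketch and not the actual FSDataset code: the class GenstampRaceSketch, its nested Block class, the metaGenStampOnDisk map, and the methods tryUpdateBlockBuggy and isConsistent are hypothetical stand-ins invented for illustration. It assumes the on-disk side is matched by block ID alone while the in-memory map is keyed by the exact (ID, genstamp) pair, as the description suggests.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical, simplified model of the HDFS-1260 failure; not the real FSDataset code. */
class GenstampRaceSketch {

    /** Block identity: ID plus generation stamp (stand-in for the real Block class). */
    static final class Block {
        final long id;
        final long genStamp;

        Block(long id, long genStamp) {
            this.id = id;
            this.genStamp = genStamp;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Block && ((Block) o).id == id && ((Block) o).genStamp == genStamp;
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, genStamp);
        }
    }

    /** In-memory map keyed by exact (id, genstamp), i.e. a lookup without a wildcard GS. */
    final Map<Block, String> volumeMap = new HashMap<>();

    /** Genstamp encoded in the meta file name on disk, keyed by block ID only. */
    final Map<Long, Long> metaGenStampOnDisk = new HashMap<>();

    void addReplica(Block b) {
        volumeMap.put(b, "replica");
        metaGenStampOnDisk.put(b.id, b.genStamp);
    }

    /** Buggy update: the disk side matches on block ID only, the map side on the exact old block. */
    void tryUpdateBlockBuggy(Block oldBlock, Block newBlock) {
        if (metaGenStampOnDisk.containsKey(oldBlock.id)) {
            // Models renaming the meta file to the new genstamp.
            metaGenStampOnDisk.put(oldBlock.id, newBlock.genStamp);
        }
        // Misses when the stored genstamp differs from oldBlock's, so the map is left untouched.
        String info = volumeMap.remove(oldBlock);
        if (info != null) {
            volumeMap.put(newBlock, info);
        }
    }

    /** Models the validation step: the genstamp found on disk must also be a key in volumeMap. */
    boolean isConsistent(long blockId) {
        Long onDisk = metaGenStampOnDisk.get(blockId);
        return onDisk != null && volumeMap.containsKey(new Block(blockId, onDisk));
    }

    public static void main(String[] args) {
        GenstampRaceSketch ds = new GenstampRaceSketch();
        long id = 7739687463244048122L;
        ds.addReplica(new Block(id, 7093)); // replica is already at genstamp 7093
        ds.tryUpdateBlockBuggy(new Block(id, 7091), new Block(id, 7094)); // stale recovery request
        System.out.println("meta genstamp on disk: " + ds.metaGenStampOnDisk.get(id));
        System.out.println("consistent: " + ds.isConsistent(id));
    }
}
```

In this model the stale request moves the on-disk genstamp to 7094 while volumeMap still holds the 7093 entry, so every later consistency check for that block fails, which mirrors the "does not exist in blocks map" error quoted in the description.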