Message-ID: <13141677.492731282453157869.JavaMail.jira@thor>
Date: Sun, 22 Aug 2010 00:59:17 -0400 (EDT)
From: "sam rash (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
In-Reply-To: <24081230.11671277279269916.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901134#action_12901134 ]

sam rash commented on HDFS-1262:
--------------------------------

Re: RPC compatibility. I'm not 100% sure this is a good idea. If we start enumerating the ways a client can interact with the server, bugs seem more likely. It makes sense with a single method, but if RPC changes become interdependent...

What's the case that mandates using a new client against an old namenode? Is it not possible to use the appropriately versioned client? Or is it the case of heterogeneous sets of clusters, where a single client code base simplifies management?

Any other thoughts on this?

> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Assignee: sam rash
>            Priority: Critical
>             Fix For: 0.20-append
>
>         Attachments: hdfs-1262-1.txt, hdfs-1262-2.txt, hdfs-1262-3.txt, hdfs-1262-4.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that the client can think it failed to open a file for append while the NN thinks the writer holds a lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
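
Steps 4 to 6 of the quoted description can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyNameNode, abandon(), open_for_append() and retry_succeeds() are hypothetical names invented for illustration, not HDFS APIs, and the cleanup-on-failure path shows one possible remedy rather than what the attached patches necessarily do.

```python
class LeaseHeldError(Exception):
    """Raised when a path already has a lease holder."""


class ToyNameNode:
    """Toy stand-in for the NN lease table (hypothetical, not the HDFS API)."""

    def __init__(self):
        self.leases = {}  # path -> name of the client holding the lease

    def append(self, path, client):
        # Refuse the call while any lease is recorded, even for the same client.
        holder = self.leases.get(path)
        if holder is not None:
            raise LeaseHeldError(f"{path} is already leased to {holder}")
        self.leases[path] = client

    def abandon(self, path, client):
        # Hypothetical cleanup call: drop the lease only if this client holds it.
        if self.leases.get(path) == client:
            del self.leases[path]


def open_for_append(nn, path, client, pipeline_ok, cleanup_on_failure):
    """Return True if the append stream opened, False if pipeline setup failed."""
    nn.append(path, client)  # step 4: append succeeds on the NN
    if pipeline_ok:
        return True
    # Step 5: pipeline creation failed after the lease was already granted.
    if cleanup_on_failure:
        nn.abandon(path, client)
    return False


def retry_succeeds(cleanup_on_failure):
    """Fail one append attempt, then report whether a later retry can open the file."""
    nn = ToyNameNode()
    path, client = "/toy/log", "client-1"
    open_for_append(nn, path, client, pipeline_ok=False,
                    cleanup_on_failure=cleanup_on_failure)
    try:
        # Step 6: the same client tries append again some time later.
        return open_for_append(nn, path, client, pipeline_ok=True,
                               cleanup_on_failure=cleanup_on_failure)
    except LeaseHeldError:
        return False
```

In this model, retry_succeeds(False) reproduces the hang: the first attempt leaves the lease recorded, so the retry is refused. With retry_succeeds(True) the failed attempt releases the lease and the retry goes through.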