Date: Mon, 17 Nov 2014 17:59:36 +0000 (UTC)
From: "Ravi Prakash (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop

    [ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214910#comment-14214910 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

Thanks for your investigation, Yongjun! In scenario #2 (and in fact in every scenario I traced), shouldn't there be the warning message logged? I did NOT see this message. Please see the log excerpt I have posted on HDFS-7342. This makes me slightly suspicious that scenario #2 is the only failure case in which leases are not recovered.
In our instance the nodes with the penultimate block were decommissioned during the file write.

> Namenode LeaseManager checkLeases() runs into infinite loop
> -----------------------------------------------------------
>
>                 Key: HDFS-4882
>                 URL: https://issues.apache.org/jira/browse/HDFS-4882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 2.0.0-alpha, 2.5.1
>            Reporter: Zesheng Wu
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 are down
> 6. client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE
> 7. client continues writing the last block, and tries to close the file after writing all the data
> 8. NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), and the client's close runs into an indefinite loop (HDFS-2936); at the same time, NN sets the last block's state to COMPLETE
> 9. shutdown the client
> 10. the file's lease exceeds the hard limit
> 11. LeaseManager realizes that and begins lease recovery by calling fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers the lease manager's infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease.  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd log line is a debug log added by us)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
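
For readers following steps 11 and 12 of the quoted scenario, here is a toy Java model of how a lease-checking loop can spin when a release attempt leaves the lease in place. It is a sketch under stated assumptions, not the Hadoop code: the class LeaseLoopSketch, its Lease type, tryRelease(), the iteration cap, and the one-hour hard limit are all hypothetical names and values chosen for illustration. The assumption that the loop keeps re-reading the oldest lease is mine, and the guarded variant shows only one possible way to stop, not necessarily what the attached patches do.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Toy model only; none of these names come from the Hadoop code base.
class LeaseLoopSketch {
    // Hypothetical hard limit, chosen for the example.
    static final long HARD_LIMIT_MILLIS = 60L * 60L * 1000L;

    // Hypothetical stand-in for a lease on a file under construction.
    static final class Lease {
        final String holder;
        final long lastUpdateMillis;
        final boolean lastBlockComplete;

        Lease(String holder, long lastUpdateMillis, boolean lastBlockComplete) {
            this.holder = holder;
            this.lastUpdateMillis = lastUpdateMillis;
            this.lastBlockComplete = lastBlockComplete;
        }
    }

    // Leases ordered oldest first, with the holder name as a tie-breaker.
    static TreeSet<Lease> newLeaseSet() {
        return new TreeSet<>(
            Comparator.comparingLong((Lease l) -> l.lastUpdateMillis)
                      .thenComparing((Lease l) -> l.holder));
    }

    static boolean isExpired(Lease lease, long nowMillis) {
        return nowMillis - lease.lastUpdateMillis > HARD_LIMIT_MILLIS;
    }

    // Assumed behavior for step 12: when the last block is already COMPLETE,
    // the release attempt gives up and the lease stays in the set.
    static boolean tryRelease(TreeSet<Lease> leases, Lease lease) {
        if (lease.lastBlockComplete) {
            return false;
        }
        leases.remove(lease);
        return true;
    }

    // Naive loop: always re-reads the oldest lease. If tryRelease() makes no
    // progress, the same lease is picked again. maxIterations exists only so
    // that this demo terminates.
    static int checkLeasesNaive(TreeSet<Lease> leases, long nowMillis, int maxIterations) {
        int iterations = 0;
        while (!leases.isEmpty() && iterations < maxIterations) {
            Lease oldest = leases.first();
            if (!isExpired(oldest, nowMillis)) {
                break;
            }
            tryRelease(leases, oldest);
            iterations++;
        }
        return iterations;
    }

    // Guarded loop: stops the current pass as soon as a release attempt
    // leaves the lease in place, and leaves the retry to a later pass.
    static int checkLeasesGuarded(TreeSet<Lease> leases, long nowMillis) {
        int iterations = 0;
        while (!leases.isEmpty()) {
            Lease oldest = leases.first();
            if (!isExpired(oldest, nowMillis)) {
                break;
            }
            iterations++;
            if (!tryRelease(leases, oldest)) {
                break;
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        TreeSet<Lease> leases = newLeaseSet();
        leases.add(new Lease("client-1", 0L, true));
        long now = HARD_LIMIT_MILLIS + 1;
        System.out.println("naive iterations:   " + checkLeasesNaive(leases, now, 1000));
        System.out.println("guarded iterations: " + checkLeasesGuarded(leases, now));
    }
}
```

In this model the naive loop uses up its whole iteration budget on a single stuck lease, which matches the flood of repeated log lines in the report, while the guarded loop touches the lease once per pass.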