Date: Wed, 17 May 2017 20:31:04 +0000 (UTC)
From: "Kihwal Lee (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-11817) A faulty node can cause a lease leak and NPE on accessing data

[ https://issues.apache.org/jira/browse/HDFS-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014712#comment-16014712 ]

Kihwal Lee commented on HDFS-11817:
-----------------------------------

*Summary:* Investigating the incident uncovered three flaws.

1) A block recovery can involve dead nodes, which can corrupt a data structure and cause an NPE.
2) If a block cannot be completed, {{commitBlockSynchronization()}} will fail, since it requires all blocks to be complete, unlike regular file closings.
3) If a block has experienced 2) and remains committed (not complete), the next lease recovery will result in a lease state corruption.
   (The lease is removed from LeaseManager, but the INode stays under-construction.)

> A faulty node can cause a lease leak and NPE on accessing data
> --------------------------------------------------------------
>
>                 Key: HDFS-11817
>                 URL: https://issues.apache.org/jira/browse/HDFS-11817
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When the namenode performs a lease recovery for a failed write, {{commitBlockSynchronization()}} will fail if none of the new targets has sent a received-IBR. At this point, the data is inaccessible, as the namenode will throw a {{NullPointerException}} upon {{getBlockLocations()}}.
> The lease recovery will be retried in about an hour by the namenode. If the nodes are faulty (usually when there is only one new target), they may not have sent a block report by this point. If this happens, lease recovery throws an {{AlreadyBeingCreatedException}}, which causes LeaseManager to simply remove the lease without finalizing the inode.
> This results in an inconsistent lease state. The inode stays under-construction, but no more lease recovery is attempted. A manual lease recovery is also not allowed.
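
To make the inconsistent state from flaw 3) easier to picture, here is a minimal toy model in Java. It is not the HDFS code: {{ToyInode}}, {{ToyNamesystem}}, {{RecoveryRejectedException}}, {{checkLeasesBuggy()}} and {{findLeakedLeases()}} are hypothetical names made up for this sketch, and the real namenode logic is far more involved. The only assumption it encodes is the behaviour described above: when lease recovery is rejected for a file whose last block is committed but not complete, the lease is dropped while the inode stays under-construction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the lease bookkeeping described in this issue; NOT the HDFS implementation. */
class LeaseLeakSketch {

    /** Hypothetical stand-in for an inode that is open for write. */
    static final class ToyInode {
        final String path;
        boolean underConstruction = true;
        // false models a last block that is committed but not complete
        boolean lastBlockComplete = false;

        ToyInode(String path) {
            this.path = path;
        }
    }

    /** Hypothetical stand-in for the exception thrown when recovery is rejected. */
    static final class RecoveryRejectedException extends Exception {
        RecoveryRejectedException(String path) {
            super("cannot recover lease for " + path);
        }
    }

    /** Hypothetical, heavily simplified namesystem plus lease table. */
    static final class ToyNamesystem {
        final Map<String, ToyInode> inodes = new HashMap<>();
        final Set<String> leases = new HashSet<>(); // paths that currently hold a lease

        void openForWrite(String path) {
            inodes.put(path, new ToyInode(path));
            leases.add(path);
        }

        /** Closes the file and releases the lease only if the last block is complete. */
        void recoverLease(String path) throws RecoveryRejectedException {
            ToyInode inode = inodes.get(path);
            if (!inode.lastBlockComplete) {
                throw new RecoveryRejectedException(path);
            }
            inode.underConstruction = false;
            leases.remove(path);
        }

        /** Models flaw 3: on a rejected recovery the lease is dropped, the inode is untouched. */
        void checkLeasesBuggy() {
            for (String path : new ArrayList<>(leases)) {
                try {
                    recoverLease(path);
                } catch (RecoveryRejectedException e) {
                    leases.remove(path); // lease is gone, inode stays under-construction
                }
            }
        }

        /** Returns the paths that are still under-construction but no longer hold a lease. */
        List<String> findLeakedLeases() {
            List<String> leaked = new ArrayList<>();
            for (ToyInode inode : inodes.values()) {
                if (inode.underConstruction && !leases.contains(inode.path)) {
                    leaked.add(inode.path);
                }
            }
            Collections.sort(leaked);
            return leaked;
        }
    }

    public static void main(String[] args) {
        ToyNamesystem ns = new ToyNamesystem();
        ns.openForWrite("/stuck");   // last block never becomes complete
        ns.openForWrite("/healthy");
        ns.inodes.get("/healthy").lastBlockComplete = true;

        ns.checkLeasesBuggy();
        System.out.println("leaked: " + ns.findLeakedLeases());
    }
}
```

In this model the file with the incomplete last block ends up with no lease while still marked under-construction, so nothing will ever retry its recovery. A check along the lines of {{findLeakedLeases()}} is one way such files could be detected; whether the actual fix does anything like this is not stated in this message.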