Date: Tue, 7 Mar 2017 01:28:33 +0000 (UTC)
From: "Manoj Govindassamy (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-11499) Decommissioning stuck because of failing recovery

    [ https://issues.apache.org/jira/browse/HDFS-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898537#comment-15898537 ]

Manoj Govindassamy commented on HDFS-11499:
-------------------------------------------

[~lukmajercak], are you referring to the timeout in TestDecommission#testDecommissionWithOpenFileAndDatanodeFailing(), which was part of patch v01? In patch v02 I added a Maintenance State related test. I am not sure extending the timeout for the failed test will solve the problem, because the nodes never moved to the DECOMMISSIONED state the test expects.
{noformat}
2017-03-06 23:33:49,462 [Thread-782] INFO  hdfs.AdminStatesBaseTest (AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node 127.0.0.1:33069 to change state to Decommissioned current state: Decommission In Progress
2017-03-06 23:33:49,462 [Thread-782] INFO  hdfs.AdminStatesBaseTest (AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node 127.0.0.1:33069 to change state to Decommissioned current state: Decommission In Progress
[test timeout]
2017-03-06 23:33:49,486 [main] INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1951)) - Shutting down the Mini HDFS Cluster
{noformat}

> Decommissioning stuck because of failing recovery
> -------------------------------------------------
>
>                 Key: HDFS-11499
>                 URL: https://issues.apache.org/jira/browse/HDFS-11499
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs, namenode
>    Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha2
>            Reporter: Lukas Majercak
>            Assignee: Lukas Majercak
>              Labels: blockmanagement, decommission, recovery
>             Fix For: 3.0.0-alpha3
>
>         Attachments: HDFS-11499.02.patch, HDFS-11499.patch
>
>
> Block recovery will fail to finalize the file if the locations of the last, incomplete block are being decommissioned. Vice versa, the decommissioning will be stuck, waiting for the last block to be completed.
> {code:xml}
> org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): Failed to finalize INodeFile testRecoveryFile since blocks[255] is non-complete, where blocks=[blk_1073741825_1001, blk_1073741826_1002...
> {code}
> The fix is to count replicas on decommissioning nodes when completing the last block in BlockManager.commitOrCompleteLastBlock, as we know that the DecommissionManager will not decommission a node that has UC blocks.
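The replica-counting idea behind the fix can be illustrated with a small standalone sketch. This is NOT the actual Hadoop BlockManager code; the class, record, and method names below are hypothetical simplifications. The point it shows is the deadlock-breaking rule: a replica on a DECOMMISSION_IN_PROGRESS node is counted toward completing the last block, because the DecommissionManager will not finish decommissioning a node that still holds under-construction blocks.

```java
import java.util.List;

public class LastBlockCompletion {
    // Hypothetical simplification of a DataNode's admin state.
    enum AdminState { NORMAL, DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    // One replica of the last block, tagged with its node's admin state.
    record Replica(AdminState nodeState) {}

    // Count replicas usable for completing the last block. Replicas on
    // decommissioning nodes are counted (the HDFS-11499 idea); replicas on
    // already-decommissioned nodes are not.
    static int usableReplicaCount(List<Replica> replicas) {
        int count = 0;
        for (Replica r : replicas) {
            if (r.nodeState() == AdminState.NORMAL
                    || r.nodeState() == AdminState.DECOMMISSION_IN_PROGRESS) {
                count++;
            }
        }
        return count;
    }

    // The last block may be completed once enough usable replicas exist.
    static boolean canCompleteLastBlock(List<Replica> replicas, int minReplication) {
        return usableReplicaCount(replicas) >= minReplication;
    }

    public static void main(String[] args) {
        // All replicas of the last block sit on decommissioning nodes:
        // without counting them, block completion and decommissioning
        // would each wait on the other forever.
        List<Replica> allDecommissioning = List.of(
                new Replica(AdminState.DECOMMISSION_IN_PROGRESS),
                new Replica(AdminState.DECOMMISSION_IN_PROGRESS));
        System.out.println(canCompleteLastBlock(allDecommissioning, 1)); // prints "true"
    }
}
```

Under this rule the scenario from the bug report no longer deadlocks: the block completes first, which in turn unblocks the decommission.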
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)