Date: Tue, 7 Mar 2017 01:28:33 +0000 (UTC)
From: "Manoj Govindassamy (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-11499) Decommissioning stuck because of failing recovery

    [ https://issues.apache.org/jira/browse/HDFS-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898537#comment-15898537 ]

Manoj Govindassamy commented on HDFS-11499:
-------------------------------------------

[~lukmajercak], are you referring to the timeout in TestDecommission#testDecommissionWithOpenFileAndDatanodeFailing(), which was part of patch v01? In patch v02 I added a Maintenance State related test. I am not sure extending the timeout for the failed test will solve the problem, because the nodes never moved to the DECOMMISSIONED state the test expects.
{noformat}
2017-03-06 23:33:49,462 [Thread-782] INFO  hdfs.AdminStatesBaseTest (AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node 127.0.0.1:33069 to change state to Decommissioned current state: Decommission In Progress
2017-03-06 23:33:49,462 [Thread-782] INFO  hdfs.AdminStatesBaseTest (AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node 127.0.0.1:33069 to change state to Decommissioned current state: Decommission In Progress
[test timeout]
2017-03-06 23:33:49,486 [main] INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1951)) - Shutting down the Mini HDFS Cluster
{noformat}

> Decommissioning stuck because of failing recovery
> -------------------------------------------------
>
>                 Key: HDFS-11499
>                 URL: https://issues.apache.org/jira/browse/HDFS-11499
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs, namenode
>    Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha2
>            Reporter: Lukas Majercak
>            Assignee: Lukas Majercak
>              Labels: blockmanagement, decommission, recovery
>             Fix For: 3.0.0-alpha3
>
>         Attachments: HDFS-11499.02.patch, HDFS-11499.patch
>
>
> Block recovery will fail to finalize the file if the locations of the last, incomplete block are being decommissioned. Vice versa, the decommissioning will be stuck, waiting for the last block to be completed.
> {code:xml}
> org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): Failed to finalize INodeFile testRecoveryFile since blocks[255] is non-complete, where blocks=[blk_1073741825_1001, blk_1073741826_1002...
> {code}
> The fix is to count replicas on decommissioning nodes when completing the last block in BlockManager.commitOrCompleteLastBlock, as we know that the DecommissionManager will not decommission a node that has UC blocks.
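The replica-counting idea behind the fix can be illustrated with a small standalone sketch. This is NOT the actual Hadoop BlockManager code; the class, record, and method names below are hypothetical simplifications. The point it shows is the deadlock-breaking rule: a replica on a DECOMMISSION_IN_PROGRESS node is counted toward completing the last block, because the DecommissionManager will not finish decommissioning a node that still holds under-construction blocks.

```java
import java.util.List;

public class LastBlockCompletion {
    // Hypothetical simplification of a DataNode's admin state.
    enum AdminState { NORMAL, DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    // One replica of the last block, tagged with its node's admin state.
    record Replica(AdminState nodeState) {}

    // Count replicas usable for completing the last block. Replicas on
    // decommissioning nodes are counted (the HDFS-11499 idea); replicas on
    // already-decommissioned nodes are not.
    static int usableReplicaCount(List<Replica> replicas) {
        int count = 0;
        for (Replica r : replicas) {
            if (r.nodeState() == AdminState.NORMAL
                    || r.nodeState() == AdminState.DECOMMISSION_IN_PROGRESS) {
                count++;
            }
        }
        return count;
    }

    // The last block may be completed once enough usable replicas exist.
    static boolean canCompleteLastBlock(List<Replica> replicas, int minReplication) {
        return usableReplicaCount(replicas) >= minReplication;
    }

    public static void main(String[] args) {
        // All replicas of the last block sit on decommissioning nodes:
        // without counting them, block completion and decommissioning
        // would each wait on the other forever.
        List<Replica> allDecommissioning = List.of(
                new Replica(AdminState.DECOMMISSION_IN_PROGRESS),
                new Replica(AdminState.DECOMMISSION_IN_PROGRESS));
        System.out.println(canCompleteLastBlock(allDecommissioning, 1)); // prints "true"
    }
}
```

Under this rule the scenario from the bug report no longer deadlocks: the block completes first, which in turn unblocks the decommission.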
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)