hadoop-hdfs-issues mailing list archives

From "Zhe Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3599) Better expose when under-construction files are preventing DN decommission
Date Wed, 06 Jan 2016 00:20:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15084422#comment-15084422 ]

Zhe Zhang commented on HDFS-3599:

Even with the HDFS-5579 change, UC files can still block decommission if {{minReplication}} is configured to be larger than 1. In that case the last block of a UC file can be under-replicated, the NN won't try to re-replicate it, and decommission stalls.

Another issue is that HDFS-7411 removed the below logic (introduced by HDFS-5579):
              if (block.equals(bc.getLastBlock()) && curReplicas > minReplication)

Pinging [~andrew.wang] to confirm whether we should add it back.
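To make the discussion above concrete, here is a minimal, self-contained sketch of the HDFS-5579-style guard in isolation. The class and method names are illustrative only, not actual NameNode identifiers; it just models the rule that the last block of a UC file is tolerated during decommission once it has more than {{minReplication}} replicas:

```java
// Hypothetical, simplified model of the guard that HDFS-7411 dropped.
// Names here are illustrative, not real NameNode code.
public class DecommissionGuard {

    // Returns true when an under-construction last block should NOT
    // stall decommission: per the HDFS-5579 rule, the last block of a
    // UC file is acceptable once curReplicas exceeds minReplication.
    static boolean lastUcBlockSufficient(boolean isLastBlockOfUcFile,
                                         int curReplicas,
                                         int minReplication) {
        return isLastBlockOfUcFile && curReplicas > minReplication;
    }

    public static void main(String[] args) {
        // minReplication = 1 (default): 2 live replicas are tolerated.
        System.out.println(lastUcBlockSufficient(true, 2, 1)); // true
        // minReplication = 2: the same 2 replicas stall decommission,
        // and the NN will not schedule re-replication for a UC block.
        System.out.println(lastUcBlockSufficient(true, 2, 2)); // false
    }
}
```

This also illustrates why raising {{minReplication}} re-opens the stall: the strict inequality can no longer be satisfied by the replica count a UC block typically holds.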

> Better expose when under-construction files are preventing DN decommission
> --------------------------------------------------------------------------
>                 Key: HDFS-3599
>                 URL: https://issues.apache.org/jira/browse/HDFS-3599
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>            Assignee: Zhe Zhang
> Filing on behalf of Konstantin Olchanski:
> {quote}
> I have been trying to decommission a data node, but the process
> stalled. I followed the correct instructions, observed my node
> listed in "Decommissioning Nodes", etc, observed "Under Replicated Blocks"
> decrease, etc. But the count went down to "1" and the decommission process stalled.
> There was no visible activity anywhere, nothing was happening (well,
> maybe in some hidden log file somewhere something complained,
> but I did not look).
> It turns out that I had some files stuck in "OPENFORWRITE" mode,
> as reported by "hdfs fsck / -openforwrite -files -blocks -locations -racks":
> {code}
> /users/trinat/data/.fuse_hidden0000177e00000002 0 bytes, 0 block(s), OPENFORWRITE:  OK
> /users/trinat/data/.fuse_hidden0000178d00000003 0 bytes, 0 block(s), OPENFORWRITE:  OK
> /users/trinat/data/.fuse_hidden00001da300000004 0 bytes, 1 block(s), OPENFORWRITE:  OK
> 0. BP-88378204-{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[|RBW], ReplicaUnderConstruction[|RBW], ReplicaUnderConstruction[|RBW]]} len=0 repl=3 [/detfac/, /isac2/, /isac2/]
> {code}
> After I deleted those files, the decommission process completed successfully.
> Perhaps one can add some visible indication somewhere on the HDFS status web page
> that the decommission process is stalled and maybe report why it is stalled?
> Maybe the number of "OPENFORWRITE" files should be listed on the status page
> next to the "Number of Under-Replicated Blocks"? (Since I know that nobody is writing
> to my HDFS, the non-zero count would give me a clue that something is wrong).
> {quote}
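The open files in the report above can be picked out by filtering the fsck output for the OPENFORWRITE marker. A small sketch, using a here-string with hypothetical paths in place of live output from {{hdfs fsck / -openforwrite -files -blocks -locations -racks}}:

```shell
# Sample fsck output; on a real cluster this would come from:
#   hdfs fsck / -openforwrite -files -blocks -locations -racks
fsck_output='/users/a/data/file1 0 bytes, 0 block(s), OPENFORWRITE:  OK
/users/a/data/file2 1024 bytes, 1 block(s):  OK
/users/a/data/file3 0 bytes, 1 block(s), OPENFORWRITE:  OK'

# Keep only the files still open for write.
echo "$fsck_output" | grep OPENFORWRITE
```

Closing (or deleting, as the reporter did) the listed files lets a stalled decommission finish.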

This message was sent by Atlassian JIRA
