hadoop-mapreduce-user mailing list archives

From Hariharan <hariharan...@gmail.com>
Subject Re: HDFS - Corrupt replicas preventing decommissioning?
Date Tue, 15 Nov 2016 12:34:42 GMT
Thanks Brahma. That certainly cleared up a lot of doubts - the file did
indeed show up in *fsck -openforwrite* and deleting it made the node move
to the "decommissioned" state.

So the recommendation here is to wait for all files with blocks on the
node to be closed before adding it to the excludes file (assuming the
number of replicas is fine)?
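
For reference, a rough pre-check along these lines seems to do it (a sketch
only, using the standard 2.x fsck flags; the file path and the node address
below are just placeholders):

  # list files still open for write anywhere in the namespace
  hdfs fsck / -openforwrite -files | grep -i OPENFORWRITE

  # for a given open file, check whether any of its blocks sit on the node
  # about to be decommissioned (10.0.8.185:50010 is only an example)
  hdfs fsck /path/to/open/file -files -blocks -locations -openforwrite | grep 10.0.8.185:50010

If the second command returns anything, the decommission will likely stall
until that file is closed or deleted.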

Thanks,
Hariharan

On Tue, Nov 15, 2016 at 5:27 PM Brahma Reddy Battula <
brahmareddy.battula@huawei.com> wrote:

> Please see my inline comments on your queries. I hope I have answered all
> of your questions.
>
>
>
>
>
> Regards
>
> Brahma Reddy Battula
>
>
>
> *From:* Hariharan [mailto:hariharan022@gmail.com]
> *Sent:* 15 November 2016 18:55
> *To:* user@hadoop.apache.org
> *Subject:* HDFS - Corrupt replicas preventing decommissioning?
>
>
>
> Hello folks,
>
> I'm running Apache Hadoop 2.6.0 and I'm running into a weird problem where
> corrupt replicas keep turning up. Example:
> 2016-11-15 06:42:38,104 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block:
> *blk_1073747320_231160*{blockUCState=COMMITTED, primaryNodeIndex=0,
> replicas=[ReplicaUnderConstruction[[DISK]DS-11d5d492-a608-4bc0-9a04-048b8127bb32:NORMAL:10.0.8.185:50010|RBW]]},
> Expected Replicas: 2, *live replicas: 0, corrupt replicas: 2*,
> decommissioned replicas: 1, excess replicas: 0, Is Open File: true,
> Datanodes having this block: 10.0.8.185:50010 10.0.8.148:50010
> 10.0.8.149:50010 , Current Datanode: 10.0.8.185:50010, Is current
> datanode decommissioning: true
>
> But I can't figure out which file this block belongs to - *hadoop fsck /
> -files -blocks -locations | grep blk_1073747320_231160* returns nothing.
>
> *>> It looks like the files are in an open state. You can run fsck with the
> -openforwrite option, which will also list the open files.*
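>
> For example, something like the following (a sketch only - fsck prints each
> file's path a few lines before its block list, so widen the -B context for
> files with many blocks):
>
>   hdfs fsck / -files -blocks -locations -openforwrite | grep -B 5 blk_1073747320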
>
> So I'm unable to delete the file, and my concern is that this seems to be
> blocking decommissioning of my datanode (going on for ~18 hours now):
> looking at the code in BlockManager.java, a DN will not be marked as
> decommissioned while it still holds blocks with no live replicas.
>
> My questions are:
>
> 1. What causes corrupt replicas, and how can I avoid them? I seem to be
> seeing these frequently:
>
> (examples from prior runs)
>
> *>> Since the files are in an open state, their blocks can end up being
> reported as corrupt, because the datanode may not have sent the
> block-received notification to the Namenode yet.*
>
> *So before going for a decommission, ensure that the files are closed and
> check the under-replicated block count.*
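>
> A quick way to sanity-check both before starting a decommission (a sketch
> only; the grep patterns match the summary lines of 2.x output and may need
> adjusting):
>
>   hdfs fsck / -openforwrite | grep -i openforwrite        # any files still open for write
>   hdfs dfsadmin -report | grep -i "under replicated"      # under-replicated block count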
>
>
> hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block:
> blk_1074063633_2846521{blockUCState=COMMITTED, primaryNodeIndex=0,
> replicas=[ReplicaUnderConstruction[[DISK]DS-7b8e7b76-6066-43fb-8340-d93f7ab9c6ea:NORMAL:10.0.8.75:50010|RBW]]},
> Expected Replicas: 2, *live replicas: 0*, *corrupt replicas: 4*,
> decommissioned replicas: 1, excess replicas: 0, Is Open File: true,
> Datanodes having this block: 10.0.8.75:50010 10.0.8.156:50010
> 10.0.8.188:50010 10.0.8.34:50010 10.0.8.74:50010 , Current Datanode:
> 10.0.8.75:50010, Is current datanode decommissioning: true
> hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block:
> blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0,
> replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]},
> Expected Replicas: 2, *live replicas: 0, corrupt replicas: 3*,
> decommissioned replicas: 1, excess replicas: 0, Is Open File: true,
> Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010
> 10.0.8.7:50010 10.0.8.198:50010 , Current Datanode: 10.0.8.153:50010, Is
> current datanode decommissioning: true
> hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block:
> blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0,
> replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]},
> Expected Replicas: 2, *live replicas: 0, corrupt replicas: 3*,
> decommissioned replicas: 1, excess replicas: 0, Is Open File: true,
> Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010
> 10.0.8.7:50010 10.0.8.198:50010 , Current Datanode: 10.0.8.7:50010, Is
> current datanode decommissioning: true
>
> 2. Is this possibly a JIRA that's fixed in recent versions (I realize I'm
> running a very old version)?
>
> *>> The relevant JIRA IDs depend on the exact root cause of the corruption;
> we would need to check all of your logs to say.*
>
> 3. Anything I can do to "force" decommissioning of such nodes (apart from
> forcefully terminating them)?
>
> *>> As of now there is no "forceful" decommission. But you can delete the
> files with corrupt blocks using "hdfs fsck <filePath> -delete".*
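>
> For example (the paths below are only placeholders):
>
>   # delete corrupted files under the given path via fsck
>   hdfs fsck /path/to/stuck/file -delete
>
>   # or, once you know which open file is holding the block, remove it directly
>   hdfs dfs -rm /path/to/stuck/file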
>
> Thanks,
>
> Hari
