hadoop-user mailing list archives

From Brahma Reddy Battula <brahmareddy.batt...@huawei.com>
Subject RE: HDFS - Corrupt replicas preventing decommissioning?
Date Tue, 15 Nov 2016 11:57:07 GMT
Please check my inline comments on your queries. I hope I have answered all your questions…


Regards
Brahma Reddy Battula

From: Hariharan [mailto:hariharan022@gmail.com]
Sent: 15 November 2016 18:55
To: user@hadoop.apache.org
Subject: HDFS - Corrupt replicas preventing decommissioning?

Hello folks,
I'm running Apache Hadoop 2.6.0 and I'm hitting a weird problem where I keep seeing corrupt
replicas. Example:
2016-11-15 06:42:38,104 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073747320_231160{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-11d5d492-a608-4bc0-9a04-048b8127bb32:NORMAL:10.0.8.185:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.185:50010 10.0.8.148:50010 10.0.8.149:50010, Current Datanode: 10.0.8.185:50010, Is current datanode decommissioning: true
But I can't figure out which file this block belongs to - hadoop fsck / -files -blocks -locations
| grep blk_1073747320_231160 returns nothing.
>> It looks like the files are in an open state. You can run fsck with the -openforwrite option,
which will also list the open files.
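For example (a sketch; the path and block ID come from your own command and log line above, so adjust them to your cluster; grepping only the block ID, without the generation stamp, so a bumped generation stamp still matches):

    hadoop fsck / -openforwrite -files -blocks -locations | grep blk_1073747320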
So I'm unable to delete the file, and my concern is that this seems to be blocking decommissioning
of my datanode (going on for ~18 hours now): looking at the code in BlockManager.java,
a DN is not marked as decommissioned while it still has blocks with no live replicas.
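(A sketch of one way to watch the node's progress from the command line, assuming the dfsadmin report format in this version, which prints a "Decommission Status" line per datanode:

    hdfs dfsadmin -report | grep -B 3 "Decommission in progress"
)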
My questions are:
1. What causes corrupt replicas, and how can I avoid them? I seem to be seeing these frequently
(examples from prior runs):
>> Since the files are in an open state, there is a chance the blocks end up in a corrupt state,
because the datanode might not have sent the block-received report to the Namenode.
So before going for decommission, ensure that the files are closed and check the under-replicated
block count.
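Two quick ways to watch that count (a sketch, assuming you run these from a node with the HDFS client configs in place):

    hadoop fsck / -openforwrite | grep -i "Under-replicated"
    hdfs dfsadmin -report | grep -i "Under replicated"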

hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1074063633_2846521{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-7b8e7b76-6066-43fb-8340-d93f7ab9c6ea:NORMAL:10.0.8.75:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 4, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.75:50010 10.0.8.156:50010 10.0.8.188:50010 10.0.8.34:50010 10.0.8.74:50010, Current Datanode: 10.0.8.75:50010, Is current datanode decommissioning: true
hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 3, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.153:50010, Is current datanode decommissioning: true
hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 3, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.7:50010, Is current datanode decommissioning: true
2. Is this possibly a JIRA that's fixed in recent versions (I realize I'm running a very old
version)?
>> We can only point to specific JIRA IDs once the exact root cause of the corruption is known;
that would require checking all of your logs.
3. Anything I can do to "force" decommissioning of such nodes (apart from forcefully terminating
them)?
>> As of now there is no “forceful” decommission. But you can delete the files with corrupt blocks
using “hdfs fsck <filePath> -delete”.
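For example (a sketch; <filePath> is a placeholder for the path reported by fsck, and -list-corruptfileblocks may help locate it first):

    hdfs fsck / -list-corruptfileblocks
    hdfs fsck <filePath> -delete

Note that -delete removes the corrupt files themselves from HDFS, not just the bad replicas, so only use it on files you can afford to lose.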
Thanks,
Hari


