hadoop-hdfs-user mailing list archives

From jeff whiting <je...@qualtrics.com>
Subject Re: Unbalanced Datanode and Lots of Blocks Waiting for Deletion
Date Thu, 03 Jun 2010 15:56:18 GMT
Thanks, that fixed my problem!  I patched my 0.20.2 branch with both HDFS-611 and HADOOP-5124.
However, HADOOP-5124 only had a patch for trunk, which needed some manual intervention to
apply to 0.20.2.  So I've created a patch file for 0.20.2 and uploaded it to HADOOP-5124
in case anybody else runs into the same problem.

Thanks again,
~Jeff

p.s.

Here is how it looks now:

Node  Last     Admin State  Configured     Used  Non DFS    Remaining  Used   Remaining  Blocks
      Contact               Capacity (TB)  (TB)  Used (TB)  (TB)       (%)    (%)
ds1   2        In Service   5.37           1.85  0.27       3.25       34.47  60.44      50932
ds2   0        In Service   5.37           1.85  0.27       3.25       34.48  60.44      50743
ds3   2        In Service   5.37           1.85  0.27       3.25       34.48  60.43      55598

(much better)
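
For anyone who wants to watch for this kind of skew without reading the web interface by eye, here is a minimal sketch. It only uses the "Used (%)" figures from the two tables in this thread; the function name `used_spread` is made up for illustration, and what counts as an acceptable spread is left to the reader.

```python
# "Used (%)" per datanode, copied from the two "live nodes" tables in this thread.
before = {"ds1": 30.19, "ds2": 94.9, "ds3": 32.91}
after = {"ds1": 34.47, "ds2": 34.48, "ds3": 34.48}


def used_spread(used_pct):
    """Return (fullest node, emptiest node, difference in percentage points).

    Hypothetical helper, not part of any Hadoop tool.
    """
    fullest = max(used_pct, key=used_pct.get)
    emptiest = min(used_pct, key=used_pct.get)
    return fullest, emptiest, used_pct[fullest] - used_pct[emptiest]


for label, snapshot in (("before", before), ("after", after)):
    fullest, emptiest, spread = used_spread(snapshot)
    print(f"{label}: {fullest} vs {emptiest}, spread {spread:.2f} points")
```

The spread drops from roughly 65 percentage points before the patches to about 0.01 after.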

On Jun 2, 2010, at 3:34 PM, Todd Lipcon wrote:

> Hi Jeff,
> 
> This issue is caused by a confluence of factors.
> 
> The first is this bug fixed in trunk:
> https://issues.apache.org/jira/browse/HADOOP-5124
> 
> The second, which causes a lot of deletion that shouldn't happen, especially with
> HBase, is this one, not fixed yet:
> https://issues.apache.org/jira/browse/HDFS-1172
> 
> And lastly, the thing that can cause deletions to hold up heartbeats and cause HDFS-1172:
> https://issues.apache.org/jira/browse/HDFS-611
> 
> We'll likely include HDFS-611 and HADOOP-5124 in the next beta release of CDH3.
> 
> Thanks
> -Todd
> 
> On Wed, Jun 2, 2010 at 2:27 PM, jeff whiting <jeffw@qualtrics.com> wrote:
> I'm running a 3-node HDFS cluster and am having major data distribution issues.  Looking
> at "live nodes" in the web interface, I'm seeing the following:
> 
> Node  Last     Admin State  Configured     Used  Non DFS    Remaining  Used   Remaining  Blocks
>       Contact               Capacity (TB)  (TB)  Used (TB)  (TB)       (%)    (%)
> ds1   2        In Service   5.37           1.62  0.27       3.48       30.19  64.73      81969
> ds2   0        In Service   5.37           5.1   0.27       0          94.9   0.01       72692
> ds3   0        In Service   5.37           1.77  0.27       3.33       32.91  62.01      84412
> 
> In a non-HTML-formatted way:
> 
> node   capacity   %used
> ds1      5.37TB     30%
> ds2      5.37TB     95%      
> ds3      5.37TB     32%      
> 
> 
> I ran a dfsadmin metasave and got the following:
> 
> Metasave: Blocks waiting for replication: 0
> Metasave: Blocks being replicated: 0
> Metasave: Blocks 692759 waiting deletion from 3 datanodes.
> 
> It looks like all of the space being used on ds2 is due to blocks not being deleted.
> The vast majority of blocks that need to be deleted are attributed to ds2 (I didn't include
> the list here because it is so large). Checking the logs, I see the occasional:
> 
> (FSNamesystem.java:invalidateWorkForOneNode(2717)) - BLOCK* ask 192.168.0.81:50010 to
> delete  blk_8850139985669106987_2950393 blk_6677512006515381913_3142239 blk_-7534196842342813001_2880360
> blk_6575946937866450337_3280570 blk_-3722158283806045364_3118632 blk_-3490603823691151224_3036593
> blk_-897396045616120182_2930553 blk_-4660390234299740937_3117083 blk_4605672167531794646_3042444
> blk_-2793729264523330063_3046949 blk_-1069835590195826211_2928578 blk_-3689480462529026793_3284707
> blk_-2100166843619194516_3265408 blk_5162047185501320447_3278539 blk_3664800743566330457_3065400
> blk_3369418146997398320_3111317 blk_5964743871832843148_3031713 blk_-8218489376644120438_2987780
> blk_367071346032512828_3180655 blk_-442303570139272169_3314076 blk_5419190113922354447_3205121
> blk_-2101734991458420810_3075412 blk_1957248302788390163_2955454 blk_8699145900031080784_2957098
> blk_7385528884584110838_3058451 blk_4447871550951654682_3039010 blk_1887493293417017989_3223726
> blk_6157668188087364422_2901764 blk_-8576478885691122637_3268999 blk_1151511910147641335_3222139
> blk_8085841381003430120_2901077 blk_-7657800079806100653_3240574 blk_234746170041166777_3211835
> blk_7314545895906772373_2975491 blk_613366993704120940_2873518 blk_-7668134916749889355_2904183
> blk_64385028396804451_3109940
> 
> but it is very infrequent for ds2.  For ds1 and ds3 the requests are much more regular.
> Any idea what is going on?  Why isn't it sending the delete commands? What do I need to do
> or check to solve the problem?
> 
> Thanks,
> ~Jeff
> 
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> jeffw@qualtrics.com
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera

--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com
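
For anyone checking for the same symptom, the frequency of the "BLOCK* ask ... to delete" entries quoted earlier in this thread can be tallied per datanode with a short script. This is only a sketch: it assumes each entry sits on a single line of the NameNode log in the format shown in the thread, the function name is made up, and the second sample line (192.168.0.82 with a dummy block) is hypothetical, added just to show more than one node.

```python
import re
from collections import Counter

# Matches the entry format quoted in this thread:
#   ... BLOCK* ask <ip:port> to delete blk_<id>_<genstamp> blk_<id>_<genstamp> ...
ASK_DELETE = re.compile(r"BLOCK\* ask (\S+) to delete\s+(.*)")
BLOCK_NAME = re.compile(r"blk_-?\d+_\d+")


def count_delete_requests(lines):
    """Count delete requests and blocks named in them, per datanode address."""
    requests = Counter()
    blocks = Counter()
    for line in lines:
        match = ASK_DELETE.search(line)
        if match is None:
            continue  # not a delete request
        node, block_list = match.groups()
        requests[node] += 1
        blocks[node] += len(BLOCK_NAME.findall(block_list))
    return requests, blocks


sample_log = [
    # Shortened copy of the entry quoted in the thread.
    "(FSNamesystem.java:invalidateWorkForOneNode(2717)) - BLOCK* ask "
    "192.168.0.81:50010 to delete  blk_8850139985669106987_2950393 "
    "blk_6677512006515381913_3142239 blk_-7534196842342813001_2880360",
    # Hypothetical entry for a second datanode.
    "(FSNamesystem.java:invalidateWorkForOneNode(2717)) - BLOCK* ask "
    "192.168.0.82:50010 to delete  blk_1_2",
    "some unrelated log line",
]

requests, blocks = count_delete_requests(sample_log)
for node in sorted(requests):
    print(node, "requests:", requests[node], "blocks:", blocks[node])
```

To run it against a real log, pass an open file object instead of `sample_log`; a datanode whose request count is far below the others over the same period matches what I saw with ds2.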





