hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Could only be replicated to 0 nodes, instead of 1
Date Thu, 21 May 2009 20:32:57 GMT

On May 21, 2009, at 3:10 PM, Stas Oskin wrote:

> Hi.
> If this analysis is right, I would add it can happen even on large
> clusters!
>> I've seen this error at our cluster when we're very full (>97%) and very
>> few nodes have any empty space.  This usually happens because we have two
>> very large nodes (10x bigger than the rest of the cluster), and HDFS tends
>> to distribute writes randomly -- meaning the smaller nodes fill up quickly,
>> until the balancer can catch up.
> A bit off topic: do you run the balancer manually, or do you have some
> scheduler that does it?

A cron job runs it for us, once an hour.  We're always importing data, so
the cluster is always out of balance.

If the previous balancer is still running, the new one simply exits.

The real trick has been to make sure the balancer doesn't get stuck --
a Nagios plugin checks that the balancer has written to stdout within
the last hour or so, and kills the running balancer if it hasn't.
Stuck balancers have been an issue in the past.
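
The stuck-balancer check described above could be sketched roughly as
follows. This is not the actual Nagios plugin, only a minimal
illustration under stated assumptions: the balancer's stdout is
redirected to a log file, so "no output in the last hour" can be
approximated by that file's modification time, and the balancer's PID
is known to the caller. The names is_stale and kill_if_stale, the
one-hour default, and the choice of SIGTERM are all hypothetical.

```python
import os
import signal
import time


def is_stale(log_path, max_age_seconds=3600, now=None):
    """Return True if log_path was not modified within max_age_seconds."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # Assumption: a missing log means the balancer produced no output,
        # so it is treated as stale.
        return True
    return (now - last_write) > max_age_seconds


def kill_if_stale(pid, log_path, max_age_seconds=3600, kill=os.kill):
    """Send SIGTERM to pid if its log is stale; return True if a signal was sent."""
    if not is_stale(log_path, max_age_seconds):
        return False
    try:
        kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        # The balancer already exited on its own; nothing to kill.
        return False
    return True
```

The kill parameter exists only so the function can be exercised
without signalling a real process. Once the stuck balancer has been
killed, the next hourly cron run can start a fresh one.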

