hadoop-hdfs-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: unhealthy NN after startup
Date Tue, 03 Jul 2012 15:45:39 GMT
On 07/02/2012 09:04 PM, Jianhui Zhang wrote:
> Hi,
> I was restarting the DFS cluster. At first, the DNs did not join. But
> after I repeatedly stopped and started each DN, all DNs eventually
> joined the NN. However, the NN doesn't look healthy.
> The machine has 16 cores. The NN process's CPU stayed at 20% and the
> "system CPU" constantly took up 50%. Here is the top output:
> top - 18:01:34 up 144 days, 16:05,  5 users,  load average: 12.65, 12.06, 12.34
> Tasks: 363 total,   6 running, 357 sleeping,   0 stopped,   0 zombie
> Cpu(s): 18.2%us, 48.7%sy,  0.0%ni, 28.9%id,  0.0%wa,  0.0%hi,  4.1%si,  0.0%st
> Mem:  33000560k total,  6449412k used, 26551148k free,   596812k buffers
> Swap: 64452600k total,        0k used, 64452600k free,  3318352k cached
> And it has been in this state for a long time - several hours.
> Has anybody seen this before?
> Thanks,
> James

Over the weekend, many people's Hadoop systems (including mine) were hit 
by the leap second bug in the Linux kernel, which also brought down many 
major web sites.  Perhaps your namenode was hit by it as well?

As a result of the bug, many people's Java or MySQL processes began 
using excessive CPU.  The problem occurred on machines that were running 
NTP for time synchronization.  The fix was either to reboot the 
server or, if you can't reboot for whatever reason, to 
execute a particular date command.  Either of those clears out the 
erroneous state in the kernel.
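
For what it's worth, here is a rough sketch of how one might check a host 
for the symptom described above, using the "Cpu(s)" line from top in the 
quoted message.  The function names and the 40% threshold are my own 
assumptions, not anything from Hadoop or the kernel, and a high system-CPU 
share is only a hint, not proof of the leap second problem.

```python
import re

# Matches fields such as "18.2%us" or "48.7%sy" in the Cpu(s) line printed by top.
_FIELD = re.compile(r"(\d+(?:\.\d+)?)%(\w+)")


def parse_top_cpu(line):
    """Return a dict like {"us": 18.2, "sy": 48.7, ...} from a top 'Cpu(s):' line."""
    # Tolerate the "> " quoting prefix used in email replies.
    if not line.lstrip("> ").startswith("Cpu(s):"):
        raise ValueError("not a top Cpu(s) line")
    return {name: float(value) for value, name in _FIELD.findall(line)}


def system_cpu_dominates(cpu, threshold=40.0):
    """Hypothetical heuristic: kernel (sy) time is high and exceeds user (us) time."""
    sy = cpu.get("sy", 0.0)
    return sy >= threshold and sy > cpu.get("us", 0.0)


if __name__ == "__main__":
    line = "Cpu(s): 18.2%us, 48.7%sy,  0.0%ni, 28.9%id,  0.0%wa,  0.0%hi,  4.1%si,  0.0%st"
    cpu = parse_top_cpu(line)
    print(cpu["us"], cpu["sy"], system_cpu_dominates(cpu))
```

If this flags a machine that runs NTP and was up over the weekend, the 
leap second bug is at least worth ruling out before digging further.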

I have no idea if this is in fact your issue, but figured I'd mention it 
since it sounded plausible.

More details here:



