hadoop-common-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: monit? daemontools? jsvc? something else?
Date Thu, 06 Jan 2011 19:20:54 GMT

On Jan 6, 2011, at 12:39 AM, Otis Gospodnetic wrote:
>>    In the case of Hadoop, no.  There has usually been at least a core dump,
>> message in syslog, message in datanode log, etc, etc.  [You *do* have cores
>> enabled, right?]
> 
> Hm, "cores enabled".... what do you mean by that?  Are you referring to JVM heap 
> dump -XX JVM argument (-XX:+HeapDumpOnOutOfMemoryError)?  If not, I'm all 
> eyes/ears!

	I'm talking about system-level core dumps, i.e., ulimit -c and friends.  [I'm much more of
a systems programmer than a Java guy, so ... ]  You can definitely write Java code that will
make the JVM crash through misuse of the threading libraries.  There are also CPU, kernel, and
BIOS bugs that I've seen cause the JVM to crash.  Usually jstack or a core will lead the
way to patching the system to work around these issues.

> 
>>    We also have in place a monitor that checks the # of active nodes.  If it
>> falls below a certain percentage, then we get alerted and check on them en
>> masse.  Worrying about one or two nodes going down probably means you need
>> more nodes. :D
>> 
> 
> That's probably right. :)
> So what do you use for monitoring the # of active nodes?

	We currently have a custom plugin for Nagios that screen-scrapes the NN and JT web UIs.  When
a certain percentage of nodes dies, we get alerted so we can take a look and start bringing
stuff back up.  [We used the same approach at Y!, so it does work at scale.]
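
	A minimal sketch of that style of check, in Python, is below.  It assumes the NameNode
web UI exposes live/dead node counts on its front page (dfshealth.jsp on port 50070 in the
0.20-era layout); the hostname, URL, and regexes are illustrative rather than what our actual
plugin does, since the page markup changes between Hadoop versions.  Exit codes follow the
usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and the same pattern
applies to the JT UI.

#!/usr/bin/env python
# Illustrative Nagios-style check: scrape the NameNode web UI and alert
# when the fraction of dead datanodes crosses a threshold.  The URL and
# the regexes are assumptions -- adjust them for whatever your NN serves.
import re
import sys
import urllib.request

NN_URL = "http://namenode.example.com:50070/dfshealth.jsp"  # hypothetical host
WARN_DEAD_PCT = 5.0
CRIT_DEAD_PCT = 10.0

def main():
    try:
        html = urllib.request.urlopen(NN_URL, timeout=10).read().decode("utf-8", "replace")
    except Exception as exc:
        print("CRITICAL - could not fetch NN UI: %s" % exc)
        return 2

    # Pull "Live Nodes : N" / "Dead Nodes : N" style counters out of the page.
    live = re.search(r"Live\s*Nodes\s*[:<][^0-9]*(\d+)", html)
    dead = re.search(r"Dead\s*Nodes\s*[:<][^0-9]*(\d+)", html)
    if not (live and dead):
        print("UNKNOWN - could not parse node counts from NN UI")
        return 3

    live_n, dead_n = int(live.group(1)), int(dead.group(1))
    total = live_n + dead_n
    dead_pct = 100.0 * dead_n / total if total else 100.0

    msg = "%d/%d datanodes dead (%.1f%%)" % (dead_n, total, dead_pct)
    if dead_pct >= CRIT_DEAD_PCT:
        print("CRITICAL - " + msg)
        return 2
    if dead_pct >= WARN_DEAD_PCT:
        print("WARNING - " + msg)
        return 1
    print("OK - " + msg)
    return 0

if __name__ == "__main__":
    sys.exit(main())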

	I'm hoping to replace this (and Ganglia) with something better over the next year.... ;)