hadoop-mapreduce-user mailing list archives

From Allen Wittenauer ...@apache.org>
Subject Re: How to query a slave node for monitoring information
Date Tue, 12 Jul 2011 22:13:42 GMT
On Jul 12, 2011, at 3:02 PM, <samdispmail-trust3@yahoo.com> wrote:
> I am new to Hadoop, and I apologize if this was answered before, or if this is 
> not the right list for my question.

	common-user@ would likely have been better, but I'm too lazy to forward you there today.

> I am trying to do the following:
> 1- Read monitoring information from slave nodes in hadoop
> 2- Process the data to detect node failures (node crash, problems in requests 
> ... etc) and decide if I need to restart the whole machine.
> 3- Restart the machine running the slave facing problems

	At scale, one doesn't monitor individual nodes for up/down.  Verifying the up/down of a given
node will drive you insane and is pretty much a waste of time unless the grid itself is under-configured
to the point that *every* *node* *counts*.  (If that is the case, then there are bigger issues.)

	Instead, one should monitor the namenode and jobtracker and alert based on a percentage of
availability.  This can be done in a variety of ways, depending upon which version of Hadoop
is in play.  For 0.20.2, a simple screen scrape is good enough.  I recommend warn on 10%,
alert on 20%, panic on 30%.
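
	As a rough illustration of those thresholds, here is a minimal sketch in Python.  It assumes the percentages refer to the share of nodes reported dead, and that the live/dead counts have already been obtained somehow (for example from a screen scrape of the namenode or jobtracker status page, which is not shown).  The function name and the "ok" level are hypothetical, not part of Hadoop.

```python
def classify_availability(live_nodes: int, dead_nodes: int) -> str:
    """Map live/dead node counts to a severity level.

    Assumption: the 10/20/30 thresholds are percentages of dead nodes
    out of all nodes the master knows about.
    """
    total = live_nodes + dead_nodes
    if total == 0:
        # The master reports no nodes at all: treat as the worst case.
        return "panic"

    dead_pct = 100.0 * dead_nodes / total
    if dead_pct >= 30:
        return "panic"
    if dead_pct >= 20:
        return "alert"
    if dead_pct >= 10:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Hypothetical counts, as if read from the namenode status page.
    print(classify_availability(live_nodes=170, dead_nodes=30))
```

	The point of keying on a percentage rather than on individual hosts is that a single dead node never pages anyone; only a grid-wide loss of capacity does.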

> My question is for step 1- collecting monitoring information.
> I have checked Hadoop's monitoring features, but currently you can only forward the 
> monitoring data to files or to Ganglia.

	Do you want monitoring information or metrics information?  Ganglia is purely a metrics tool.
Metrics are a different animal.  While it is possible to alert on them, in most cases they
aren't particularly useful in a monitoring context other than up/down.
