hadoop-mapreduce-user mailing list archives

From Allen Wittenauer ...@apache.org>
Subject Re: How to query a slave node for monitoring information
Date Wed, 13 Jul 2011 00:37:26 GMT

On Jul 12, 2011, at 4:34 PM, <samdispmail-trust3@yahoo.com> wrote:
> I am working on deploying Hadoop on a small cluster. For now, I am interested in 
> restarting (restart the node or even reboot the OS) the nodes Hadoop detects as 
> crashed.

	There are quite a few scenarios where one service may be up while another is down.  So
monitoring per service is usually the better way to go.

> "Instead, one should monitor the namenode and jobtracker and alert based on a 
> percentage of availability.  ... "
> Indeed.
> I use Hadoop 0.20.203.

	OK, then that means...

> "This can be done in a variety of ways, ..."
> Can you please provide any pointers.

	... you're pretty much required to use JMX to query the NN and JT for node information,
since the rest of the APIs weren't forward-ported as promised, and Ganglia is out of the equation
anyway.  Luckily, it is fairly trivial to set up a Nagios script to poll that information
(and in our experience, that information actually works; some stuff in the metricsv2
API doesn't appear to work properly on the DN and TT).
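	For the percentage-based alerting quoted earlier in this message, the decision half of such a Nagios check might look like the sketch below. It assumes the live and dead node counts have already been obtained from the NN or JT; the class name, method names, and thresholds are hypothetical. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```java
// Sketch only: DeadNodeCheck and its thresholds are illustrative, not part of Hadoop or Nagios.
final class DeadNodeCheck {
    // Standard Nagios plugin exit codes.
    static final int OK = 0;
    static final int WARNING = 1;
    static final int CRITICAL = 2;
    static final int UNKNOWN = 3;

    /** Percentage of known nodes that are dead, or -1 if the counts are unusable. */
    static double deadPercent(int live, int dead) {
        if (live < 0 || dead < 0 || live + dead == 0) {
            return -1.0;
        }
        return 100.0 * dead / (live + dead);
    }

    /** Map the dead-node percentage onto a Nagios exit code. */
    static int status(int live, int dead, double warnPct, double critPct) {
        double pct = deadPercent(live, dead);
        if (pct < 0) {
            return UNKNOWN;
        }
        if (pct >= critPct) {
            return CRITICAL;
        }
        if (pct >= warnPct) {
            return WARNING;
        }
        return OK;
    }

    public static void main(String[] args) {
        // Usage: java DeadNodeCheck <live> <dead> <warnPct> <critPct>
        if (args.length != 4) {
            System.out.println("UNKNOWN - usage: DeadNodeCheck <live> <dead> <warnPct> <critPct>");
            System.exit(UNKNOWN);
        }
        int live = Integer.parseInt(args[0]);
        int dead = Integer.parseInt(args[1]);
        int code = status(live, dead,
                Double.parseDouble(args[2]), Double.parseDouble(args[3]));
        String[] labels = {"OK", "WARNING", "CRITICAL", "UNKNOWN"};
        // Nagios reads the first line of output as the status text.
        System.out.printf("%s - %d of %d nodes dead%n", labels[code], dead, live + dead);
        System.exit(code);
    }
}
```

	Alerting on a percentage rather than on any single dead node keeps the check quiet when one slave drops out of a cluster that is otherwise healthy.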

> Do you know how I can access the monitoring information of the namenode or the 
> jobtracker so I can extract a list of failed  nodes?

	Take a look at the DeadNodes and LiveNodes attributes in the NameNode and JobTracker section
of the Hadoop MBean.  That's likely your best bet.  
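	As a rough illustration, reading those two attributes over remote JMX can be sketched with the JDK's own `javax.management` classes. The attribute names come from the paragraph above; everything else is an assumption to verify against your own cluster (for example with jconsole): the MBean object name used as a default, the host and port arguments, and that the daemon's JVM was started with remote JMX enabled.

```java
import java.io.IOException;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch only: NodeListProbe is a hypothetical helper, not a Hadoop class.
final class NodeListProbe {
    // ASSUMPTION: the object name differs between Hadoop versions; confirm it with jconsole.
    static final String ASSUMED_NN_BEAN = "Hadoop:service=NameNode,name=NameNodeInfo";

    /** Build the usual RMI-based JMX service URL for host:port. */
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    /** Read one attribute of one MBean and return its string form. */
    static String readAttribute(MBeanServerConnection conn, String bean, String attr) {
        try {
            return String.valueOf(conn.getAttribute(new ObjectName(bean), attr));
        } catch (JMException | IOException e) {
            // Wrap so callers such as a monitoring script can report one failure type.
            throw new IllegalStateException("cannot read " + attr + " from " + bean, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NodeListProbe <host> <jmxPort> [objectName]
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        String bean = args.length > 2 ? args[2] : ASSUMED_NN_BEAN;

        try (JMXConnector connector =
                 JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl(host, port)))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            System.out.println("LiveNodes: " + readAttribute(conn, bean, "LiveNodes"));
            System.out.println("DeadNodes: " + readAttribute(conn, bean, "DeadNodes"));
        }
    }
}
```

	The sketch prints the attribute values as returned and makes no assumption about their format; a monitoring script would parse them to count or list the dead nodes.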

> The reason I thought of using metrics information is that they are periodic and 
> seemed easy to access. I thought of using them as heartbeats only (i.e. if I do 
> not receive the metric in 2-3 periods I reset the node).

	You end up essentially doing the same thing that the NN and JT are doing... so you might as well just
ask them rather than doing it again and generating even more network traffic than necessary.
Additionally, there are some failures where the NN or JT may view a service daemon as down
even though it still responds to other queries (because of thread death or lock-up).  For example, we've
got a job that has on occasion tripped up the 0.20.2 DN with OOM issues.  The process lies
in a pseudo-dead state due to some weird exception handling down in the bowels of the code.
The NN rightfully declares it dead, but depending upon how you ask the node itself, it
may respond!

	So be careful out there.