hadoop-mapreduce-user mailing list archives

From samdispmail-tru...@yahoo.com
Subject Re: How to query a slave node for monitoring information
Date Wed, 13 Jul 2011 06:16:19 GMT
Thank you very much Allen and Nathan for your help.

I will follow your suggestions and check your pointers.

Thank you 

-sam



________________________________
From: Nathan Milford <nathan@milford.io>
To: mapreduce-user@hadoop.apache.org
Sent: Tue, July 12, 2011 6:21:07 PM
Subject: Re: How to query a slave node for monitoring information


If you want a quick-and-dirty check on a 5 node test cluster you could:

# -sf: silent, and exit non-zero on HTTP errors as well as connection failures
if curl -sf http://namenode:50070/ > /dev/null; then
  echo "YAY it's up"
else
  echo "It's down"
  restartNamenodeCommand   # whatever you use to bounce the namenode
fi
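
If you do go that route, something like this in cron would run it every five
minutes (the script path and log location are just placeholders):

*/5 * * * * /path/to/check_namenode.sh >> /var/log/check_namenode.log 2>&1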

But if you have a proper cluster, you're better off using Nagios
or something similar.

If you want down/up information in Nagios:

service_description NameNode
check_command  check_http!50070

service_description JobTracker
check_command  check_http!50030

service_description TaskTracker
check_command  check_http!50060

service_description Secondary NameNode
check_command  check_http!50090

service_description DataNode
check_command  check_http!50075
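
Those entries assume a check_http command definition that takes the port as its
argument, so the full definitions would be something along these lines (the host
name and the generic-service template are just examples from a stock Nagios setup):

define command {
    command_name    check_http
    command_line    $USER1$/check_http -I $HOSTADDRESS$ -p $ARG1$
}

define service {
    use                     generic-service
    host_name               namenode01
    service_description     NameNode
    check_command           check_http!50070
}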

If you want metrics for thresholds in Nagios:

Modify hadoop-metrics.properties to expose the /metrics URL and run something 
like: 
http://exchange.nagios.org/directory/Plugins/Others/check_hadoop_metrics/details
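
On 0.20.x that mostly means pointing the metrics contexts at something other than
the default NullContext (which throws the data away). If your build ships
NoEmitMetricsContext, a hadoop-metrics.properties along these lines should make
the /metrics page useful (adjust to whatever contexts your version defines):

# keep metrics in memory for the /metrics servlet without pushing them anywhere
dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext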

Similar to what Allen suggested, we also have a script that scrapes the NameNode
and JobTracker pages, gets the number of nodes reporting, and alerts if we
fall below a threshold.
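
A rough sketch of that idea, using dfsadmin -report instead of scraping HTML and
borrowing Allen's percentage thresholds (the "Datanodes available:" parsing and
the exact numbers are placeholders; the report format shifts between versions):

#!/bin/sh
# warn past 10% of datanodes down, go critical past 20%
report=$(hadoop dfsadmin -report 2>/dev/null)
live=$(echo "$report" | awk '/^Datanodes available:/ {print $3}')
total=$(echo "$report" | awk '/^Datanodes available:/ {print $4}' | tr -d '(')
if [ -z "$total" ] || [ "$total" -eq 0 ]; then
  echo "UNKNOWN: could not parse dfsadmin -report"
  exit 3
fi
dead_pct=$(( (total - live) * 100 / total ))
if [ "$dead_pct" -ge 20 ]; then
  echo "CRITICAL: ${dead_pct}% of datanodes down ($live/$total reporting)"
  exit 2
elif [ "$dead_pct" -ge 10 ]; then
  echo "WARNING: ${dead_pct}% of datanodes down ($live/$total reporting)"
  exit 1
fi
echo "OK: $live/$total datanodes reporting"
exit 0

Exit codes follow the Nagios convention, so the same script can be dropped in as
a check_command if you want.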

- Nathan Milford


On Tue, Jul 12, 2011 at 7:34 PM, <samdispmail-trust3@yahoo.com> wrote:

>Thank you very much Allen,
>
>
>"common-user@ would likely have been better, but I'm too lazy to forward you 
>there today. :)"
>Thank you :-)
>
>
>"Do you want monitoring information or metrics information? "
>I need monitoring information. 
>I am working on deploying Hadoop on a small cluster. For now, I am interested in
>restarting the nodes Hadoop detects as crashed (restarting the node or even
>rebooting the OS).
>
>"Instead, one should monitor the namenode and jobtracker and alert based on a 
>percentage of availability.  ... "
>Indeed.
>I use Hadoop 0.20.203.
>
>"This can be done in a variety of ways, ..."
>Can you please provide any pointers?
>Do you know how I can access the monitoring information of the namenode or the
>jobtracker so I can extract a list of failed nodes?
>
>Thank you very much for your help
>
>P.S.:
>Why I thought of using metrics information is that they are periodic and
>seemed easy to access. I thought of using them as heartbeats only (i.e. if I do
>not receive a metric for 2-3 periods I reset the node).
>
>Thank you 
>
>-sam
>
>
>
>________________________________
>From: Allen Wittenauer <aw@apache.org>
>To: mapreduce-user@hadoop.apache.org
>Sent: Tue, July 12, 2011 3:13:42 PM
>Subject: Re: How to query a slave node for monitoring information
>
>
>On Jul 12, 2011, at 3:02 PM, <samdispmail-trust3@yahoo.com> wrote:
>
>> I am new to Hadoop, and I apologize if this was answered before, or if this is
>> not the right list for my question.
>
>    common-user@ would likely have been better, but I'm too lazy to forward you 
>there today. :)
>
>>
>> I am trying to do the following:
>> 1- Read monitoring information from slave nodes in Hadoop
>> 2- Process the data to detect node failures (node crash, problems in requests
>> ... etc) and decide if I need to restart the whole machine.
>> 3- Restart the machine running the slave that is facing problems
>
>
>    At scale, one doesn't monitor individual nodes for up/down.  Verifying  the 
>up/down of a given node will drive you insane and is pretty much a waste of time 
>unless the grid itself is under-configured to the point that *every* *node* 
>*counts*.  (If that is the case, then there are bigger issues afoot...)
>
>    Instead, one should monitor the namenode and jobtracker and alert based on a 
>percentage of availability.  This can be done in a variety of ways, depending 
>upon which version of Hadoop is in play.  For 0.20.2, a simple screen scrape is 
>good enough.  I recommend warn on 10%, alert on 20%, panic on 30%.
>
>> My question is for step 1- collecting monitoring information.
>> I have checked Hadoop's monitoring features. But currently you can forward the
>> monitoring data to files, or to Ganglia.
>
>    
>    Do you want monitoring information or metrics information?  Ganglia is
>purely a metrics tool.  Metrics are a different animal.  While it is possible
>to alert on them, in most cases they aren't particularly useful in a monitoring
>context other than up/down.