hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: nagios to monitor hadoop datanodes!
Date Wed, 08 Oct 2008 14:01:14 GMT
Hey Edward,

The JMX documentation for Hadoop is non-existent, but here's about  
what you need to do:

1) download and install the check_jmx Nagios plugin
2) Open up the hadoop JMX install to the outside world.  I added the  
following lines to hadoop-env.sh
export HADOOP_OPTS=" -Dcom.sun.management.jmxremote.authenticate=false  
-Dcom.sun.management.jmxremote.port=8004 "

Note the potential security issue I'm opening up.  You could also  
switch things to SSL auth, but I have not explored that thoroughly in  
combination with Nagios.
3) Restart Hadoop
4) Use jconsole to connect to Hadoop's JVM.  Look in the "MBeans" tab
and decide what metrics you want to monitor.  If you look at the
"info" tab (the last on the right), you'll see the MBean Name; you'll
need it later.
5) Add Nagios probes like so:
./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O
java.lang:type=Memory -A HeapMemoryUsage -K used -w 10000000 -c 100000000
This connects to "node182" on port 8004.  It then looks at the Memory
statistics (java.lang:type=Memory), at the HeapMemoryUsage attribute, and
the "used" field inside that attribute (in jconsole, if you see a value
in bold, you need to double-click to expand its contents).  I then set
the critical level of the metric to 100000000 bytes of memory used
and the warning level to 10000000 bytes.

The result is like this:

[brian@red plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi \
    -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 10000000 -c 100000000
JMX OK HeapMemoryUsage.used=9780336

If I poke a dead JVM (or change to the wrong port), I get the following:

[brian@red plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8005/jmxrmi \
    -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 10000000 -c 100000000
JMX CRITICAL Connection refused

If I lower the critical level to below the current usage, I get:

[brian@red plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi \
    -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 100000 -c 1000000
JMX CRITICAL HeapMemoryUsage.used=4846000
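
The runs above follow the usual Nagios plugin convention of exit status 0 for OK, 1 for WARNING and 2 for CRITICAL. A minimal sketch of the threshold comparison in shell, assuming a value at or above a threshold counts as exceeding it (check_jmx's exact boundary behaviour is not verified here); classify is a hypothetical helper, not part of check_jmx:

```shell
#!/bin/sh
# classify VALUE WARN CRIT
# Hypothetical helper: prints a Nagios-style status word and returns the
# matching plugin exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
# Assumption: a value equal to a threshold already counts as exceeding it.
classify() {
    value=$1
    warn=$2
    crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}
```

With the numbers from the transcripts, "classify 9780336 10000000 100000000" prints OK and "classify 4846000 100000 1000000" prints CRITICAL.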

THE BIG PROBLEM here is that Hadoop hides a lot of interesting data
node statistics behind an MBean name that contains a random string.
Want the max time it took to do the block reports?  For me, the query
looks like:

[root@node182 plugin]# ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi \
    -O hadoop.dfs:service=DataNode-DS-1394617310-,name=DataNodeStatistics \
    -A BlockReportsMaxTime -w 10 -c 100
JMX CRITICAL hadoop.dfs:service=DataNode- 

Here the service is called "DataNode-DS-1394617310-", which really causes
Hadoop to shoot itself in the foot with regard to Nagios monitoring.
Locally, we patch things so the random string goes away:

[root@node182 plugin]# ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi \
    -O hadoop.dfs:service=DataNode,name=DataNodeStatistics \
    -A BlockReportsMaxTime -w 10 -c 150
JMX WARNING BlockReportsMaxTime=141

Anyone care to file a bug for that?

I assume you can set up Nagios from there.
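
As a starting point, the heap probe above could be wrapped in a small shell helper so that the Nagios side only has to pass a host name. This is a sketch under assumptions: jmx_url, check_heap and the PLUGIN_DIR default are hypothetical names and paths, not part of check_jmx or Hadoop, and the port and thresholds are simply the ones used in the examples above.

```shell
#!/bin/sh
# Assumed install location of the check_jmx plugin; adjust to your setup.
PLUGIN_DIR=${PLUGIN_DIR:-/usr/lib/nagios/plugins}

# jmx_url HOST PORT
# Hypothetical helper: prints the JMX service URL in the form used above.
jmx_url() {
    printf 'service:jmx:rmi:///jndi/rmi://%s:%s/jmxrmi' "$1" "$2"
}

# check_heap HOST [PORT] [WARN] [CRIT]
# Hypothetical helper: runs check_jmx against the JVM heap usage and
# passes its exit status through unchanged.
check_heap() {
    "$PLUGIN_DIR/check_jmx" -U "$(jmx_url "$1" "${2:-8004}")" \
        -O java.lang:type=Memory -A HeapMemoryUsage -K used \
        -w "${3:-10000000}" -c "${4:-100000000}"
}
```

Saved as a script that ends with a call such as check_heap "$@", this can be referenced from a Nagios command definition or run through NRPE; the status Nagios sees is whatever check_jmx returns.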


On Oct 8, 2008, at 8:20 AM, Edward Capriolo wrote:

> The simple way would be to use nrpe and check_procs. I have never
> tested, but a command like 'ps -ef | grep java  | grep NameNode' would
> be a fairly decent check. That is not very robust but it should let
> you know if the process is alive.
> You could also monitor the web interfaces associated with the
> different servers remotely.
> check_tcp!hadoop1:56070
> Both the methods I suggested are quick hacks. I am going to
> investigate the JMX options as well and work them into cacti.
