hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: best command line way to check up/down status of HDFS?
Date Wed, 02 Jul 2008 10:19:00 GMT
Meng Mao wrote:
> For a Nagios script I'm writing, I'd like a command-line method that checks
> if HDFS is up and running.
> Is there a better way than to attempt a hadoop dfs command and check the
> error code?

1. There is JMX support built in to Hadoop. If you can bring up Hadoop 
running a JMX agent that is compatible with Nagios, you can keep a close 
eye on the internals.

2. I'm making some lifecycle changes to Hadoop; if/when they are
accepted, every service (name, data, job, ...) will have an internal
ping() operation to check its health. This can be checked in-process
only. I'm also adding the SmartFrog support to do that in-process
pinging, fallback, etc. I don't know how Nagios would work there, but
JMX support for these ops should also be possible.

3. When a datanode comes up it starts Jetty on a specific port, so you
can do a GET against that Jetty instance to see if it is responding.
This is a good test as it really does verify that the service is live
and responding. Indeed, that is the official definition of "liveness",
at least according to Lamport.
  * Review the code to make sure it turns caching off, or you can be
burned probing for health long-haul: you see a cached copy of the happy
page and think all is well. I forgot to do that in happyaxis.jsp, which
is why Axis 1.x health checks don't work long-haul.
  * I could imagine improving those pages with better ones, like
something that checks that the available free space is within a certain
range and returns an error code if there is less, e.g.
  http://datanode7:5000/checkDiskSpace?mingb=1500
would test for a minimum free disk space of 1500GB.
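
A minimal sketch of such a GET probe as a Nagios-style check, in Python.
The function name probe() and the choice to send no-cache request
headers are assumptions for illustration, not part of Hadoop; the URL to
probe is whatever page the daemon's Jetty instance actually serves. The
exit codes follow the usual Nagios plugin convention (0 = OK,
2 = CRITICAL). Request headers only ask intermediaries not to serve a
cached copy; the page itself still has to turn caching off.

```python
import http.client
import sys
import urllib.error
import urllib.request

# Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def probe(url, timeout=10):
    """Do one GET against url and return (exit_code, message)."""
    # Ask any cache on the path not to answer on the server's behalf.
    request = urllib.request.Request(
        url, headers={"Cache-Control": "no-cache", "Pragma": "no-cache"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return OK, "OK - HTTP %d from %s" % (response.status, url)
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (4xx/5xx).
        return CRITICAL, "CRITICAL - HTTP %d from %s" % (e.code, url)
    except (urllib.error.URLError, http.client.HTTPException, OSError) as e:
        # Connection refused, timeout, DNS failure, malformed reply, ...
        return CRITICAL, "CRITICAL - %s: %s" % (url, e)


if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = probe(sys.argv[1])
    print(message)
    sys.exit(code)
```

Saved under a hypothetical name such as check_http_live.py, it would be
run with the page URL as its only argument. A page that returns an
error status when something is wrong (as the imagined checkDiskSpace
page would) shows up as CRITICAL without any extra parsing.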

There are also web pages for job trackers & the like; these are better
for remote health checking than jps checks. jps (and killall) is better
for fallback when things stop responding, but not adequate for liveness
checks.
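
For the fallback case, a rough sketch of a jps-based check in Python. It
assumes jps is on the PATH of the user running the daemons and that the
daemon appears under its main class name (e.g. DataNode); the helper
names daemon_names() and is_process_listed() are made up for this
example. It only shows that a JVM exists, not that the service is
answering requests.

```python
import subprocess


def daemon_names(jps_output):
    """Extract the class names from jps output ("<pid> <name>" per line)."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        # Skip lines with no name and jps's "-- ..." placeholder entries.
        if len(parts) == 2 and not parts[1].startswith("--"):
            names.add(parts[1].strip())
    return names


def is_process_listed(daemon, jps_command="jps"):
    """Return True if jps lists a JVM whose name equals daemon."""
    result = subprocess.run([jps_command], capture_output=True,
                            text=True, check=False)
    return daemon in daemon_names(result.stdout)
```

A monitoring script could call is_process_listed("DataNode") after the
HTTP probe fails, to tell "process gone" apart from "process hung".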