hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools
Date Wed, 02 Jun 2010 10:17:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874521#action_12874521 ]

Steve Loughran commented on HADOOP-6473:

I think the checks/entry point should have a production/developer switch, or a set of maskable checks. In production, the checks would be strict:

* HDFS: secondary namenode defined, hostname must resolve from NN
* DNS/rDNS must work against specifically defined hosts
* Maybe: stricter requirements about which interfaces services come up on (e.g. a valid range of IP addresses for each service)
* log directory space requirements
* temp dir space requirements
Return a failure/error code if any requirement isn't met. Also consider running these checks on every service startup.

In developer mode, the checks would be relaxed:
* allow people to work on laptops with no external network, and to play in incomplete clusters
* lower disk space requirements
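
To make the production/developer split concrete, here is a minimal sketch using only the JDK. Every name in it (EnvironmentChecks, Mode, the thresholds) is hypothetical and not part of Hadoop; the disk-space figures are placeholders, and the rDNS test relies on InetAddress.getCanonicalHostName() falling back to the textual IP address when the reverse lookup fails.

```java
import java.io.File;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Hadoop.
final class EnvironmentChecks {
    enum Mode { PRODUCTION, DEVELOPER }

    // Illustrative thresholds only; real values would come from configuration.
    static long requiredFreeBytes(Mode mode) {
        return mode == Mode.PRODUCTION ? 10L * 1024 * 1024 * 1024 : 100L * 1024 * 1024;
    }

    // Returns null on success, or a diagnostic message on failure.
    static String checkFreeSpace(File dir, long requiredBytes) {
        long usable = dir.getUsableSpace(); // 0 if the directory does not exist
        if (usable < requiredBytes) {
            return dir + ": " + usable + " bytes usable, " + requiredBytes + " required";
        }
        return null;
    }

    // Forward lookup, then reverse lookup of the resulting address.
    static String checkDns(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            String reverse = addr.getCanonicalHostName();
            // A failed reverse lookup yields the textual IP address.
            if (reverse.equals(addr.getHostAddress())) {
                return host + ": no reverse DNS entry for " + addr.getHostAddress();
            }
            return null;
        } catch (UnknownHostException e) {
            return host + ": hostname does not resolve";
        }
    }

    // Runs the checks enabled for the given mode and collects the failures.
    static List<String> run(Mode mode, File logDir, File tmpDir, List<String> hosts) {
        List<String> failures = new ArrayList<String>();
        long required = requiredFreeBytes(mode);
        for (File dir : new File[] {logDir, tmpDir}) {
            String problem = checkFreeSpace(dir, required);
            if (problem != null) failures.add(problem);
        }
        if (mode == Mode.PRODUCTION) { // developer mode masks the network checks
            for (String host : hosts) {
                String problem = checkDns(host);
                if (problem != null) failures.add(problem);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        List<String> failures = run(Mode.DEVELOPER, tmp, tmp, new ArrayList<String>());
        for (String f : failures) System.err.println("FAILED: " + f);
        System.out.println(failures.isEmpty()
            ? "all checks passed" : failures.size() + " check(s) failed");
    }
}
```

A command-line wrapper could turn a non-empty failure list into a non-zero exit code, and a service could call run() during startup.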

Logging also raises some questions:
* Check/print log levels of the various services, warn if at DEBUG level in production
* Work out which back end to commons-logging is running, print its classname
* Print out the commons-logging, slf4j, jetty and log4j JVM config options
* Print out which log4j.properties/XML file is resolving on the classpath
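
Some of the logging diagnostics need nothing beyond the JDK. The sketch below is an assumption-laden illustration: LoggingDiagnostics and its methods are hypothetical names, and the property list covers only the log4j 1.x "log4j.configuration" property and the commons-logging "org.apache.commons.logging.Log" / "org.apache.commons.logging.LogFactory" overrides. Finding the live commons-logging back end would additionally need the commons-logging API on the classpath, which is left out here.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Hadoop.
final class LoggingDiagnostics {
    // JVM properties assumed relevant; extend the list as needed.
    static final String[] LOGGING_PROPERTIES = {
        "log4j.configuration",                   // log4j 1.x config location
        "org.apache.commons.logging.Log",        // commons-logging back-end override
        "org.apache.commons.logging.LogFactory"  // commons-logging factory override
    };

    // Maps each property name to its current value (null when unset).
    static Map<String, String> loggingProperties() {
        Map<String, String> values = new LinkedHashMap<String, String>();
        for (String name : LOGGING_PROPERTIES) {
            values.put(name, System.getProperty(name));
        }
        return values;
    }

    // Returns the URL the classloader resolves for a resource, or null if none.
    static URL resolveOnClasspath(String resource) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) cl = ClassLoader.getSystemClassLoader();
        return cl.getResource(resource);
    }

    // True when a verbose level is configured while running in production.
    static boolean tooVerboseForProduction(String level, boolean production) {
        return production
            && ("DEBUG".equalsIgnoreCase(level) || "TRACE".equalsIgnoreCase(level));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : loggingProperties().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
        for (String resource : new String[] {"log4j.properties", "log4j.xml"}) {
            System.out.println(resource + " -> " + resolveOnClasspath(resource));
        }
    }
}
```

Printing the resolved URL shows which copy wins when several log4j.properties files sit on the classpath.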

> Add hadoop health check/diagnostics to run from command line, JSP pages, other tools
> ------------------------------------------------------------------------------------
>                 Key: HADOOP-6473
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6473
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Steve Loughran
>            Priority: Minor
> If the lifecycle ping() is for short-duration "are we still alive" checks, Hadoop still needs something bigger to check the overall system health. This would be for end users, but also for automated cluster deployment: a complete validation of the cluster.
> It could be a command line tool, and something that runs on different nodes, checked via IPC or JSP. The idea would be to do thorough checks with good diagnostics. Oh, and they should be executable through JUnit too.
> For example:
>  - if running on Windows, check that Cygwin is on the path; fail with a pointer to a wiki issue if not
>  - datanodes should check that they can create locks on the filesystem and create files, and that timestamps are (roughly) aligned with local time
>  - namenodes should try to create files/locks in the filesystem
>  - task trackers should try to exec() something
>  - run through the classpath and look for problems: duplicate JARs, unsupported Java, Xerces versions, etc.
> * The number of tests should be extensible: rather than one single class with all the tests, there'd be something separate for name, task, data, job tracker nodes
> * They can't be in the nodes themselves, as they should be executable even if the nodes don't come up.
> * Output could be in human-readable text or HTML, and a form that could be processed through Hadoop itself in future
> * These tests could have side effects, such as actually trying to submit work to a cluster
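
The extensible, node-independent checks described in the issue could be sketched roughly as follows. This is only an illustration under stated assumptions: HealthChecks, HealthCheck, DuplicateJarCheck and FileCreateCheck are hypothetical names, not Hadoop classes. The duplicate-JAR check compares file names only, so it would miss two versions of one library (for example foo-1.0.jar next to foo-1.1.jar), and the timestamp check compares a new file's modification time with the local clock.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names exist in Hadoop.
final class HealthChecks {
    // One pluggable check; an empty result means the check passed.
    interface HealthCheck {
        String name();
        List<String> run();
    }

    // Flags JAR file names that occur more than once on a classpath string.
    static final class DuplicateJarCheck implements HealthCheck {
        private final String classpath;
        DuplicateJarCheck(String classpath) { this.classpath = classpath; }
        public String name() { return "classpath-duplicates"; }
        public List<String> run() {
            Map<String, List<String>> byName = new TreeMap<String, List<String>>();
            for (String entry : classpath.split(File.pathSeparator)) {
                if (!entry.toLowerCase().endsWith(".jar")) continue;
                String jar = new File(entry).getName();
                List<String> paths = byName.get(jar);
                if (paths == null) {
                    paths = new ArrayList<String>();
                    byName.put(jar, paths);
                }
                paths.add(entry);
            }
            List<String> problems = new ArrayList<String>();
            for (Map.Entry<String, List<String>> e : byName.entrySet()) {
                if (e.getValue().size() > 1) {
                    problems.add("duplicate JAR " + e.getKey() + ": " + e.getValue());
                }
            }
            return problems;
        }
    }

    // Creates and deletes a probe file, and compares its timestamp with local time.
    static final class FileCreateCheck implements HealthCheck {
        private final File dir;
        private final long maxSkewMillis;
        FileCreateCheck(File dir, long maxSkewMillis) {
            this.dir = dir;
            this.maxSkewMillis = maxSkewMillis;
        }
        public String name() { return "file-create"; }
        public List<String> run() {
            List<String> problems = new ArrayList<String>();
            try {
                File probe = File.createTempFile("healthcheck", ".tmp", dir);
                long skew = Math.abs(probe.lastModified() - System.currentTimeMillis());
                if (skew > maxSkewMillis) {
                    problems.add("file timestamp differs from local time by " + skew + " ms");
                }
                if (!probe.delete()) problems.add("cannot delete " + probe);
            } catch (IOException e) {
                problems.add("cannot create a file in " + dir + ": " + e.getMessage());
            }
            return problems;
        }
    }

    // Runs every check and renders a plain-text report, one line per check.
    static String report(List<HealthCheck> checks) {
        StringBuilder sb = new StringBuilder();
        for (HealthCheck check : checks) {
            List<String> problems = check.run();
            sb.append(check.name())
              .append(problems.isEmpty() ? ": OK" : ": FAILED " + problems)
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<HealthCheck> checks = new ArrayList<HealthCheck>();
        checks.add(new DuplicateJarCheck(System.getProperty("java.class.path", "")));
        checks.add(new FileCreateCheck(new File(System.getProperty("java.io.tmpdir")), 60000L));
        System.out.print(report(checks));
    }
}
```

Because each check is a plain object with no dependency on a running node, the same list could be driven from a command line tool, a JSP page, or a JUnit test that asserts run() returns an empty list.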

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
