hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: detecting stalled daemons?
Date Fri, 09 Oct 2009 01:20:35 GMT
Hi James,
This doesn't quite answer your original question, but if you want to help
track down these kinds of bugs, you should grab a stack trace next time this
happens.

You can do this by running "jstack" from the command line, by visiting
/stacks on the HTTP interface, or by sending the process a SIGQUIT (kill
-QUIT <pid>). If you go the SIGQUIT route, the stack dump will show up in
that daemon's stdout log (logs/hadoop-....out).
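For reference, the three routes above can be sketched like this. It's a sketch, not an exact recipe: the pid lookup pattern and the log path are illustrative and will vary by install.

```shell
# The [T] bracket trick keeps pgrep from matching this script's own
# command line; the pattern itself is an assumption about the daemon name.
TT_PID=$(pgrep -f '[T]askTracker' | head -n 1)

# Route 2 -- the HTTP interface serves the same dump at /stacks (no pid needed):
# curl http://localhost:50060/stacks

if [ -z "$TT_PID" ]; then
    echo "no TaskTracker process found"
else
    # Route 1: jstack (ships with the JDK) prints the dump to stdout.
    jstack "$TT_PID" > /tmp/tasktracker-stack.txt

    # Route 3: SIGQUIT makes the JVM append the dump to its stdout log,
    # typically something like logs/hadoop-<user>-tasktracker-<host>.out.
    kill -QUIT "$TT_PID"
fi
```

jstack and /stacks are the least disruptive options on a box you can't restart, since neither touches the process's signal handling.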

Oftentimes the stack trace will be enough for the developers to track down a
deadlock, or it may point to some sort of configuration issue on your
machine.

-Todd


On Wed, Oct 7, 2009 at 11:19 PM, james warren <james@rockyou.com> wrote:

> Quick question for the hadoop / linux masters out there:
>
> I recently observed a stalled tasktracker daemon on our production cluster,
> and was wondering if there were common tests to detect failures so that
> administration tools (e.g. monit) can automatically restart the daemon. The
> particular observed symptoms were:
>
>   - the node was dropped by the jobtracker
>   - information in /proc listed the tasktracker process as sleeping, not
>   zombie
>   - the web interface (port 50060) was unresponsive, though telnet did
>   connect
>   - no error information in the hadoop logs -- they simply were no longer
>   being updated
>
> I certainly cannot be the first person to encounter this - anyone have a
> neat and tidy solution they could share?
>
> (And yes, we will eventually go down the nagios / ganglia / cloudera
> desktop path, but we're waiting until we're running CDH2.)
>
> Many thanks,
> -James Warren
>
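On the detection side of the question above: given these symptoms (telnet connects, but the web interface never answers), a plain TCP port probe will report the stalled daemon as healthy, while an actual HTTP request with a timeout will not. A minimal monit/cron-style check might look like this; the host, port, and page are assumptions, so substitute whatever your TaskTracker UI normally serves.

```shell
# Sketch of a liveness probe for a daemon whose port still accepts TCP
# connections but whose web interface has stopped responding.
# The URL is an assumption -- any page the daemon normally serves will do.
URL="http://localhost:50060/tasktracker.jsp"

# -s silent, -f fail on HTTP error status, --max-time caps the whole
# request, so a daemon that accepts the connection and then hangs on the
# response still counts as a failure here.
if curl -sf --max-time 10 "$URL" > /dev/null; then
    echo "tasktracker healthy"
else
    echo "tasktracker stalled or down"
fi
```

As written the script always exits 0; for monit you'd typically `exit 1` in the failure branch (or use monit's own HTTP protocol test) so the watchdog can trigger a restart.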
