hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3628) Add a lifecycle interface for Hadoop components: namenodes, job clients, etc.
Date Tue, 02 Sep 2008 13:15:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627663#action_12627663 ]

Steve Loughran commented on HADOOP-3628:

Konstantin Shvachko has proposed that instead of throwing an exception on a health failure,
different error text should be returned. His concern is that the cost of constructing and
marshalling an exception can be high, and that exceptions should not be used if they are
likely to be returned regularly.

This is an interesting thought. The arguments in favour of an exception are:

1. For individual nodes, failure is the unexpected state. Normally all should be well.
2. Much of the cost of creating an exception is the cost of creating a stack trace; this can
be useful for later diagnostics.
3. By returning different exceptions for different problems, callers can diagnose and act on
different failures. It is much harder for programs to act on simple strings.
4. Exceptions are going to be raised if the far end is unreachable, so the caller needs to
be prepared for those exceptions, and know that any exception raised on a call is a sign of
a failure.
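
To make points 3 and 4 concrete, here is a rough sketch of how a caller might act on typed
exceptions. Every name in it (HealthCheckException, DiskFullException,
WorkerThreadDeadException, Pingable, HealthProbe) is hypothetical and is not taken from the
attached patch; the only assumption is that ping() returns normally when the node is healthy.

```java
import java.io.IOException;

// Hypothetical base class for health failures; not part of the attached patch.
class HealthCheckException extends IOException {
    HealthCheckException(String message) { super(message); }
}

// Hypothetical specific failures that a caller can tell apart by type.
class DiskFullException extends HealthCheckException {
    DiskFullException(String message) { super(message); }
}

class WorkerThreadDeadException extends HealthCheckException {
    WorkerThreadDeadException(String message) { super(message); }
}

// Hypothetical component interface: ping() returns normally when healthy.
interface Pingable {
    void ping() throws IOException;
}

class HealthProbe {
    // Maps the outcome of a ping() to the name of an action the caller would take.
    static String probe(Pingable node) {
        try {
            node.ping();
            return "healthy";
        } catch (DiskFullException e) {
            // A specific failure type lets the caller pick a specific remedy.
            return "free-disk-space";
        } catch (HealthCheckException e) {
            // Any other health failure reported by the node itself.
            return "restart-service";
        } catch (IOException e) {
            // The far end could not be reached at all (point 4).
            return "unreachable";
        }
    }
}
```

The catch order matters: the most specific types come first, and the plain IOException
branch handles transport failures that the caller has to be ready for anyway.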

However, there are some good arguments in favour of returning a structured response instead.

1. Stack traces are less useful when they are just code inside the health check.
2. Unmarshalling exceptions reliably requires the caller to have a set of exception classes
and versions in their JVM. With nested exceptions, that implies the entire exception list
needs to be unmarshallable.
3. The aggregate health checks of the clusters themselves will inevitably include failed nodes.
Should an aggregate health check include those node failures in a report that says "overall,
we are healthy, here are the nodes that are not"?

The alternative to sending exceptions back on a ping() would be to return a NodeHealth structure
that included node name, IPAddress, service type and a list of what was wrong with the node,
as well as an aggregate "live/not live" response. The list of what was wrong could include
standard constant values for machine interpretation, as well as human-readable messages.
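
As a starting point for discussion, such a structure might look roughly like the sketch
below. The field names, the ProblemCode constants and the ClusterHealthReport helper are all
assumptions made for illustration; nothing here is in the attached patch, and a real version
would also need to be made serializable for Hadoop's RPC layer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical constant values for machine interpretation.
enum ProblemCode { DISK_FULL, WORKER_THREAD_DEAD, UNREACHABLE }

// One thing that is wrong with a node: a constant plus a human-readable message.
final class Problem {
    final ProblemCode code;
    final String message;

    Problem(ProblemCode code, String message) {
        this.code = code;
        this.message = message;
    }
}

// Hypothetical response to a ping(): identity, aggregate state, and a problem list.
final class NodeHealth {
    final String nodeName;
    final String ipAddress;
    final String serviceType;
    final boolean live;            // the aggregate "live/not live" response
    final List<Problem> problems;  // empty when nothing is wrong

    NodeHealth(String nodeName, String ipAddress, String serviceType,
               boolean live, List<Problem> problems) {
        this.nodeName = nodeName;
        this.ipAddress = ipAddress;
        this.serviceType = serviceType;
        this.live = live;
        // Defensive copy so the report cannot change after construction.
        this.problems = Collections.unmodifiableList(new ArrayList<>(problems));
    }
}

// Hypothetical helper for a cluster-wide report that lists the nodes that are not live.
final class ClusterHealthReport {
    static List<NodeHealth> notLive(List<NodeHealth> nodes) {
        List<NodeHealth> failed = new ArrayList<>();
        for (NodeHealth node : nodes) {
            if (!node.live) {
                failed.add(node);
            }
        }
        return failed;
    }
}
```

Keeping "live" as an explicit field, rather than deriving it from an empty problem list,
would let a node report warnings while still counting as live; that is a design choice, not
a requirement.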

What do others think?

> Add a lifecycle interface for Hadoop components: namenodes, job clients, etc.
> -----------------------------------------------------------------------------
>                 Key: HADOOP-3628
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3628
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs, mapred
>    Affects Versions: 0.19.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: AbstractHadoopComponent.java, hadoop-3628.patch, hadoop-3628.patch,
hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch,
hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch
> I'd like to propose we have a standard interface for Hadoop components, the things that
get started or stopped when you bring up a namenode. Currently, some of these classes have
a stop() or shutdown() method, with no standard name or interface, and there is no way of
seeing if they are live, checking their health, or shutting them down reliably. Indeed, there
is a tendency for the spawned threads to not want to die, requiring the entire process to be
killed to stop the workers.
> Having a standard interface would make it easier for 
>  * management tools to manage the different things
>  * monitoring the state of things
>  * subclassing
> The last is interesting, as right now TaskTracker and JobTracker start up threads in
their constructors; that's very dangerous, as subclasses may have their methods called before
they are fully initialised. Adding this interface would be the right time to clean up the startup
process so that subclassing is less risky.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
