tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-1267) Exception handling when Routing Events
Date Wed, 22 Oct 2014 08:25:34 GMT

    [ https://issues.apache.org/jira/browse/TEZ-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179718#comment-14179718

Siddharth Seth commented on TEZ-1267:

bq. The issue I mention is that we have inline call of ROUTE_EVENT_TRANSITION when handling
the events in pendingList, and we assume it would always succeed, just ignore the returned
state. So I separate the real handling TezEvent work into another method handleRoutedTezEvents(),
and call this method when we'd like to handle the events in the pendingList.
Nice catch! Would've ended up skipping a bunch of exceptions without some changes to this
On the way the code is changed to handle this right now, I think it's a little safer to just
move the entire RouteEventTransition.handle(...) function into a separate method, instead
of splitting it up into one method which Routes the events upstream, and another one which
handles events meant for this vertex. A couple of reasons for this - 1) there's multiple ways
to split the RouteEvent logic - along these lines (meant for self vs meant for other) vs All
logic for a event type goes together (separate by event type), and 2) It's a big change which
copies some code from the method into the other method - which most likely is correct, but
I'd prefer if this were made in 0.6.0, instead of the 0.5 line (especially not in 0.5.2),
3) Even after this split, there's one invocation of ROUTE_EVENT_TRANSITION.transition outside
of the state machine. The split of the method can be a separate jira.
For this, just moving the handle logic into a separate method and invoking it should be sufficient.
The RouteTransition would end up changing to just invoke this method, handle exceptions and
report the appropriate state.

One correction from my previous comment. ROUTE_EVENT should be handled in INITIALIZING and
INITED state - since the VMs and InputInitializers are setup by this point, and event arrival
doesn't depend on the state of the vertex - and need to be forwarded to the VM/II early. Sorry
about that.

The VertexManagerException is a useful change - since it is translated within the VertexManager
- There's some inconsistent usage of this though. There's some places where VertexManagerException
is caught, in others a generic Exception is caught (especially the invocation of ROUTE_EVENT_TRANSITION.transition).
Catching an Exception and reporting it as a VertexManager error in diagnostics can be incorrect
since the exception could have originated from Edge routing.
- Also, do you plan on introducing an EdgeManagerException + IIException in TEZ-1689 ? It
may be better to just use a single type of exception, with an enumeration defining the source.
- This needs to be annotated as private, and some limited javadoc on the intent of the exception
would be useful.
- The final exception reported back to the user should be VertexManagerException.getCause()
- users don't need to be aware of this wrapping.
- Is the entire stack trace too much information for the diagnostic message, or is that standard
practice for other cases as well ?

In the new tests, is it possible to verify the Termination cause ? That's a good set of tests

Minor stuff in the code
- Remove commented out line in TestExceptionPropagation
- Typo - "VertexManageEvent"
- s/Exception happen(s)/Exception in/

> Exception handling when Routing Events
> --------------------------------------
>                 Key: TEZ-1267
>                 URL: https://issues.apache.org/jira/browse/TEZ-1267
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Siddharth Seth
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: TEZ-1267-2.patch, TEZ-1267-3.patch, Tez-1267.patch
> Events are generated by user code. In some places they're also handled by user code within
the AM. Currently, exceptions which are generated when handling user code will end up killing
the AM (and hence leading to a retry).
> Instead, failure to handle such events, should cause the application to fail.

This message was sent by Atlassian JIRA

View raw message