tez-issues mailing list archives

From "Hitesh Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt
Date Thu, 23 Jul 2015 21:07:04 GMT

    [ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639496#comment-14639496 ]

Hitesh Shah commented on TEZ-2311:

bq. Change on DAGImpl is due to one scenario (some vertices' recoveredState is KILLED, while
others are still RUNNING). In that case, we need to kill other RUNNING vertices in DAGImpl#VertexCompletedTransition.

I think the fix here should probably be the other way around. One vertex being in the KILLED
state on recovery should not make the DAG start killing everything else. The better fix seems
to be to write the DAG kill event to the recovery log when it is received. If the DAG kill does
not finish before the AM crashes, then on recovery, process the recovery log and complete the
kill as needed.
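
As a rough illustration of that idea (a minimal sketch only, not the Tez implementation): the kill request is appended to the recovery log on receipt, and a recovering attempt replays the log to decide whether an unfinished kill has to be completed. Every name below (KillRecoverySketch, EventType, RecoveryAction, decide) is hypothetical and is not a Tez class or API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are Tez classes or APIs.
final class KillRecoverySketch {

    // Assumed kinds of events written to the recovery log.
    enum EventType { DAG_STARTED, DAG_KILL_REQUESTED, DAG_FINISHED }

    // Assumed decisions a recovering attempt can take after replaying the log.
    enum RecoveryAction { RESUME_DAG, COMPLETE_KILL, NOTHING_TO_DO }

    // Replays the log in order. An explicit kill request that was logged but
    // never followed by a finish event means the kill must be completed.
    // No kill is inferred from the state of individual vertices.
    static RecoveryAction decide(List<EventType> recoveryLog) {
        boolean killRequested = false;
        for (EventType event : recoveryLog) {
            if (event == EventType.DAG_FINISHED) {
                return RecoveryAction.NOTHING_TO_DO;
            }
            if (event == EventType.DAG_KILL_REQUESTED) {
                killRequested = true;
            }
        }
        return killRequested ? RecoveryAction.COMPLETE_KILL : RecoveryAction.RESUME_DAG;
    }

    public static void main(String[] args) {
        List<EventType> log = new ArrayList<>();
        log.add(EventType.DAG_STARTED);
        log.add(EventType.DAG_KILL_REQUESTED); // logged when the kill is received
        // The previous attempt crashes here, before DAG_FINISHED is written.
        System.out.println(decide(log)); // prints COMPLETE_KILL
    }
}
```

With this shape, a log that holds no kill request yields RESUME_DAG even if some vertices were recovered as KILLED, so the reason for their termination is not overwritten by an inferred DAG kill.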

Inferring a kill seems a bit confusing, as there can be multiple scenarios in which a vertex
was killed. Consider the case I mentioned above. In a normal flow, all vertices apart from A
will end up as KILLED with termination cause "other vertex failure". On recovery, the vertices
would instead have termination cause "dag kill", which is incorrect.

If the hang issue is being resolved by the vertex impl changes, we can converge on a fix for
that in this JIRA and treat the DAG handling as a separate one, unless you believe that
the hang will not be completely resolved without the DAGImpl change.

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
> We saw an instance of a Tez job hanging despite receiving multiple kill requests from
> clients.  The AM was recovering from a prior attempt when the first kill request arrived.
