tez-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt
Date Thu, 23 Jul 2015 15:12:04 GMT

    [ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638970#comment-14638970 ]

Jason Lowe commented on TEZ-2311:

Ideally we would like this fixed in a 0.7 patch release since 0.8 is probably a ways out,
at least in terms of when we would be able to deploy it.  Could you elaborate on what's not
clear from the above analysis or what context is missing?  It seems wrong to me that the VertexImpl
recorded the fact that it wanted to recover into the KILLED state but then ignored that fact
when it later executed the recovery of tasks.  Here's the breakdown in more detail:

# We recover the fact that VertexImpl is supposed to recover into the KILLED state
# That causes it to generate TaskRecoverEvents to try to recover the tasks into the KILLED
state.  The vertex sends those task recover events to all the tasks, and the vertex itself
recovers into the RUNNING state to wait for all tasks to finish recovering
# In the task recovering code, it explicitly ignores the desired recovering state because
taskEventRecoverTask.recoverData() is true.
# The tasks get an event with recoverData = true because of the first code block in the above
analysis.  When it generates the task recover events it's calling the event constructor form
that implicitly defaults recoverData to true.
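
To make the sequence above concrete, here is a minimal, self-contained sketch of the interaction. It is not the Tez source: RecoverEvent, recoverTask and the State enum are simplified, hypothetical stand-ins for the task recover event and the task recovery transition, and the only behavior modeled is what the breakdown describes (a constructor form that defaults recoverData to true, and recovery code that ignores the desired state when recoverData is true).

```java
public class RecoverySketch {
    // Hypothetical, reduced set of states for illustration only.
    enum State { RUNNING, SUCCEEDED, KILLED }

    /** Simplified stand-in for a task recover event; not the real Tez class. */
    static final class RecoverEvent {
        final State desiredState;
        final boolean recoverData;

        // Short constructor form: recoverData implicitly defaults to true.
        RecoverEvent(State desiredState) {
            this(desiredState, true);
        }

        RecoverEvent(State desiredState, boolean recoverData) {
            this.desiredState = desiredState;
            this.recoverData = recoverData;
        }
    }

    /**
     * Models the task-side recovery described above: when recoverData is
     * true, the state restored from the recovery log wins and the desired
     * state carried by the event is ignored.
     */
    static State recoverTask(RecoverEvent event, State stateFromRecoveryLog) {
        if (event.recoverData) {
            return stateFromRecoveryLog; // desiredState is never consulted
        }
        return event.desiredState;
    }

    public static void main(String[] args) {
        State fromLog = State.RUNNING;

        // The vertex wants KILLED but uses the short constructor form.
        System.out.println(recoverTask(new RecoverEvent(State.KILLED), fromLog));

        // Same request with recoverData passed explicitly as false.
        System.out.println(recoverTask(new RecoverEvent(State.KILLED, false), fromLog));
    }
}
```

In this model the first call leaves the task in RUNNING even though the vertex asked for KILLED, which is the kind of mismatch that could leave the vertex waiting on tasks that never reach a terminal state.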

It looks like we need a fix similar to the last patch hunk in TEZ-1011.  I don't think we
should be passing recoverData as true in the task recover event for this scenario, but I could
be mistaken since I'm a bit unclear on when recoverData is valid.  Maybe the bug is we should
try to recover data for the tasks but not forget that we're trying to recover them into the
killed state.
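
The two possible directions mentioned above could look roughly like the following. This is only a sketch under assumptions, not the TEZ-1011 hunk or a proposed patch: recoverDataFlagFor and recoverTask are hypothetical helpers, and the State enum is a reduced model of the real task states.

```java
public class RecoveryFixSketch {
    // Hypothetical, reduced set of states for illustration only.
    enum State { RUNNING, SUCCEEDED, KILLED }

    /**
     * Option A (vertex side): do not ask tasks to recover data when the
     * vertex itself is recovering into the KILLED state.
     */
    static boolean recoverDataFlagFor(State vertexDesiredState) {
        return vertexDesiredState != State.KILLED;
    }

    /**
     * Option B (task side): still restore data from the recovery log when
     * asked, but never forget that the requested end state is KILLED.
     */
    static State recoverTask(State desiredState, boolean recoverData,
                             State stateFromRecoveryLog) {
        State restored = recoverData ? stateFromRecoveryLog : desiredState;
        if (desiredState == State.KILLED) {
            return State.KILLED; // the kill request overrides the restored state
        }
        return restored;
    }

    public static void main(String[] args) {
        // Option A: the flag is false when the vertex wants KILLED.
        System.out.println(recoverDataFlagFor(State.KILLED));

        // Option B: data is recovered, yet the task still ends up KILLED.
        System.out.println(recoverTask(State.KILLED, true, State.RUNNING));
    }
}
```

Which option is appropriate depends on when recoverData is actually valid, which is the open question raised above.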

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>              Labels: Recovery
> We saw an instance of a Tez job hanging despite receiving multiple kill requests from
clients.  The AM was recovering from a prior attempt when the first kill request arrived.
