tez-issues mailing list archives

From "Jeff Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt
Date Thu, 23 Jul 2015 21:30:04 GMT

    [ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639538#comment-14639538 ]

Jeff Zhang commented on TEZ-2311:
---------------------------------

I considered adding a recovery log entry for the DAG kill operation, but thought it might be a
little heavy here. On second thought, it does not seem difficult (I will post another patch).
The change in VertexImpl doesn't resolve the hang issue completely. Consider the case where all
the other vertices are recovered to KILLED, and one vertex is recovered to RUNNING and a new task
attempt is scheduled. That new task attempt may wait indefinitely for data movement events from
its upstream. Or, if the task attempt is not scheduled, its VertexManager may wait for something
from upstream.
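
To make the scenario concrete, here is a toy sketch of the situation. It is not Tez code: ToyState,
ToyVertex, and RecoveryHangSketch are hypothetical names, and the only assumption modeled is that a
vertex recovered to KILLED never sends anything downstream. Under that assumption, any vertex
recovered to RUNNING that has a KILLED upstream is a candidate for waiting forever.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these types are hypothetical and are NOT Tez classes.
enum ToyState { RUNNING, SUCCEEDED, KILLED }

final class ToyVertex {
    final String name;
    final ToyState recoveredState;               // state restored by recovery
    final List<ToyVertex> upstream = new ArrayList<>();

    ToyVertex(String name, ToyState recoveredState) {
        this.name = name;
        this.recoveredState = recoveredState;
    }

    // Assumption: a KILLED upstream vertex never produces further events,
    // so a RUNNING vertex that depends on it can wait indefinitely.
    boolean isStuckAfterRecovery() {
        if (recoveredState != ToyState.RUNNING) {
            return false;
        }
        for (ToyVertex v : upstream) {
            if (v.recoveredState == ToyState.KILLED) {
                return true;
            }
        }
        return false;
    }
}

final class RecoveryHangSketch {
    // Returns the names of vertices that would wait on a KILLED upstream.
    static List<String> stuckVertices(List<ToyVertex> dag) {
        List<String> stuck = new ArrayList<>();
        for (ToyVertex v : dag) {
            if (v.isStuckAfterRecovery()) {
                stuck.add(v.name);
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        ToyVertex v1 = new ToyVertex("v1", ToyState.KILLED);
        ToyVertex v2 = new ToyVertex("v2", ToyState.RUNNING);
        v2.upstream.add(v1);
        System.out.println(stuckVertices(Arrays.asList(v1, v2))); // prints [v2]
    }
}
```

In this toy model, v2 is the vertex that would hang. It only illustrates why the VertexImpl change
alone may not be sufficient; it says nothing about how the actual fix should be implemented.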


> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill requests from
> clients.  The AM was recovering from a prior attempt when the first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
