hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5502) History link in resource manager is broken for KILLED jobs
Date Wed, 18 Sep 2013 20:45:53 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771216#comment-13771216 ]

Jason Lowe commented on MAPREDUCE-5502:
---------------------------------------

Hmm, something must be wrong with the mapred client then as it explicitly checks with the
RM to see if the application is running and if so, tries to connect to the AM to kill it.

Looking deeper, it may be this code in YARNRunner.killJob:

{code}
    /* check if the status is not running, if not send kill to RM */
    JobStatus status = clientCache.getClient(arg0).getJobStatus(arg0);
    if (status.getState() != JobStatus.State.RUNNING) {
      try {
        resMgrDelegate.killApplication(TypeConverter.toYarn(arg0).getAppId());
      } catch (YarnException e) {
        throw new IOException(e);
      }
      return;
    }
{code}

So in this scenario the AM has finished the job but has not unregistered yet.  The AM is telling
clients that connect to it that the job status is SUCCEEDED/FAILED/KILLED (i.e., not RUNNING but
some terminal state), but since the AM has yet to unregister with the RM, the RM is still directing
clients to the AM when asked.  The client therefore sees a non-RUNNING status and takes the branch
above, sending the kill straight to the RM.  If the RM kills the app, I think there aren't many
options for getting history consistently, per the discussion above.

We could fix this particular scenario by having YARNRunner not try to kill the application
if the reported status is already a terminal state.  There's the risk of an insane AM that
thinks the job is completed and continues to report that but refuses to unregister from the
RM.  mapred job -kill would then be ineffective at killing such an application.  That seems an
unlikely scenario in practice, and there's always yarn application -kill as a workaround if it
did happen.
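
A minimal sketch of that decision logic, for illustration only.  This is not the actual
YARNRunner code: the State enum is a stand-in that mirrors the values of JobStatus.State, and
the Action enum, the decide method and the class name are hypothetical names invented here.
The assumption is that a job reported in a terminal state is left alone, a RUNNING job is
killed through the AM, and anything else (e.g. PREP) is killed through the RM as before.

```java
import java.util.EnumSet;
import java.util.Set;

public class KillDecisionSketch {
    // Stand-in for org.apache.hadoop.mapreduce.JobStatus.State (same value names).
    enum State { PREP, RUNNING, SUCCEEDED, FAILED, KILLED }

    // Hypothetical: what the client would do in response to "mapred job -kill".
    enum Action { NONE, KILL_VIA_RM, KILL_VIA_AM }

    // States in which the AM already considers the job finished.
    static final Set<State> TERMINAL =
        EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);

    static Action decide(State reported) {
        if (TERMINAL.contains(reported)) {
            // Proposed change: do not ask the RM to kill an app whose job is already done.
            return Action.NONE;
        }
        if (reported != State.RUNNING) {
            // Existing behavior: job not running yet, so kill the application via the RM.
            return Action.KILL_VIA_RM;
        }
        // Job is running: ask the AM to kill it.
        return Action.KILL_VIA_AM;
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + " -> " + decide(s));
        }
    }
}
```

With the current code, a reported state of KILLED would fall into the kill-via-RM branch; in
this sketch it results in no action, which is also why an AM that reports a terminal state but
never unregisters could not be killed this way.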

MAPREDUCE-5497 probably made the race window for this scenario very small in practice, as
it no longer waits 5 seconds after the job completes before unregistering.
                
> History link in resource manager is broken for KILLED jobs
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-5502
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5502
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>              Labels: ui
>
> History link in resource manager is broken for KILLED jobs.
> Seems to happen with jobs with State 'KILLED' and FinalStatus 'KILLED'. If the State
> is 'FINISHED' and FinalStatus is 'KILLED', then the "History" link is fine.
> It isn't easy to reproduce the problem since the time at which the app is killed determines
> the state it ends up in, which is hard to guess. These particular jobs seem to get a Diagnostics
> message of "Application killed by user.", whereas the other killed jobs get "Kill Job received
> from client job_1378766187901_0002
> Job received Kill while in RUNNING state."
