hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3131) YarnClientImpl should check FAILED and KILLED state in submitApplication
Date Fri, 20 Feb 2015 14:54:12 GMT

    [ https://issues.apache.org/jira/browse/YARN-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329001#comment-14329001 ]

Jason Lowe commented on YARN-3131:

bq. I do not think that continuously polling until RUNNING is a good idea. The most common
case on a busy cluster is that an app can be submitted at time X but not start running until
a long time later.

The patch does not cause the client to poll until the job is RUNNING.  It polls until the
job has progressed past the SUBMITTED state.  The SUBMITTED state is a brief transient state
before the ACCEPTED state.  So the client will wait approximately as long as it does today,
and it fixes that flaky submit unit test in Tez.  It will not block until the AM is actually
running.
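
For illustration only, the polling behavior described above might look roughly like the
following. This is a sketch under assumptions, not the code in the patch: the AppState enum
is a local stand-in for the YARN application states, SubmitPoller and waitUntilPastSubmitted
are hypothetical names, and treating NEW and NEW_SAVING as states to wait through is assumed
here as well.

```java
import java.util.function.Supplier;

class SubmitPoller {
    // Local stand-in for the YARN application states; not the real Hadoop enum.
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    /**
     * Polls the supplied report until the state has progressed past SUBMITTED.
     * Returns the first state at or beyond ACCEPTED, and throws if the
     * application was FAILED or KILLED instead.
     */
    static AppState waitUntilPastSubmitted(Supplier<AppState> report, long pollMillis) {
        while (true) {
            AppState state = report.get();
            switch (state) {
                case NEW:
                case NEW_SAVING:
                case SUBMITTED:
                    // Still in a transient pre-ACCEPTED state: sleep, then poll again.
                    try {
                        Thread.sleep(pollMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting for submission", e);
                    }
                    break;
                case FAILED:
                case KILLED:
                    throw new IllegalStateException("Application failed to submit: " + state);
                default:
                    // ACCEPTED or later: return without waiting for RUNNING.
                    return state;
            }
        }
    }
}
```

Note that the loop returns as soon as it sees ACCEPTED, so a busy cluster that leaves the app
queued for a long time does not keep the client blocked.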

bq. As I mentioned earlier, I still believe that doing some basic checks in-line in ClientRMService
itself and throwing an exception back straight away is probably a better idea than polling
for any RUNNING/FAILED state. 

I agree that a blocking method is much easier on the client, but I don't think this is an
easy change to make in the short term.  Again I think it requires a major change to the RPC
layer and the RM to support server-side asynchronous call handling, otherwise we have to throw
an army of threads at the client service to avoid blocking other clients and that has scaling
issues.  We could probably add an API to the scheduler to do an in-line sanity check on the
requested queue (which is a backwards-incompatible change for schedulers not in the Hadoop
repo).  However, there are many other things that could go wrong during submission that take
a long time to perform, such as saving the application state and renewing delegation tokens.
I'm not sure it's a win if we check for one thing in-line that could go wrong but still have
to poll for all the other things that could go wrong.  In the end, Tez and other YARN clients
need to know if the app was accepted or not.  The queue being wrong is just one of the ways
the submit could fail.

Continuing to poll in the SUBMITTED state also meshes with the thoughts on the SUBMITTED state
being something the client probably shouldn't see anyway.  See the discussion about NEW_SAVING
and SUBMITTED in YARN-3230.

Thanks, Chang, for updating the patch.  Please investigate the unit test failure, as it looks
like it could be related.  My only nit on the patch is it would be a bit clearer and more
efficient if we used EnumSet constants to capture the set of states we're waiting the app
to leave and the set of states that are failed-to-submit states.
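
To make the EnumSet suggestion concrete, here is a minimal sketch of what I have in mind.
The AppState enum is a local stand-in for the YARN application states, and the class,
constant, and method names are hypothetical. Which states belong in the waiting set is an
assumption for illustration.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

class SubmitStates {
    // Local stand-in for the YARN application states; not the real Hadoop enum.
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // States we are waiting for the app to leave (assumed membership).
    static final Set<AppState> WAITING_STATES = Collections.unmodifiableSet(
        EnumSet.of(AppState.NEW, AppState.NEW_SAVING, AppState.SUBMITTED));

    // States that mean the submission did not succeed.
    static final Set<AppState> FAILED_TO_SUBMIT_STATES = Collections.unmodifiableSet(
        EnumSet.of(AppState.FAILED, AppState.KILLED));

    static boolean stillWaiting(AppState state) {
        return WAITING_STATES.contains(state);
    }

    static boolean failedToSubmit(AppState state) {
        return FAILED_TO_SUBMIT_STATES.contains(state);
    }
}
```

Each check becomes a single contains() call on a named set, which reads better than a chain
of equals() comparisons and keeps the state lists in one place.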

I suppose another way to solve this problem is to take the approach discussed in YARN-3230
and have the RM not expose the NEW_SAVING and SUBMITTED states to the client -- they would
just see NEW.  We'd have to leave the states in the enumeration for backwards compatibility,
but we'd stop exposing them in app reports.  Any thoughts on that [~zjshen] or [~jianhe]?
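
As a rough sketch of that alternative (not an actual RM change), the mapping applied when
building an app report could be as simple as the following. The AppState enum is again a
local stand-in, and ReportedState and toClientVisible are hypothetical names.

```java
class ReportedState {
    // Local stand-in for the YARN application states; not the real Hadoop enum.
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    /**
     * Maps an internal state to the state shown to clients.
     * NEW_SAVING and SUBMITTED stay in the enum for compatibility
     * but are reported as NEW; all other states pass through unchanged.
     */
    static AppState toClientVisible(AppState internal) {
        switch (internal) {
            case NEW_SAVING:
            case SUBMITTED:
                return AppState.NEW;
            default:
                return internal;
        }
    }
}
```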

> YarnClientImpl should check FAILED and KILLED state in submitApplication
> ------------------------------------------------------------------------
>                 Key: YARN-3131
>                 URL: https://issues.apache.org/jira/browse/YARN-3131
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: yarn_3131_v1.patch
> Just ran into an issue when submitting a job to a non-existent queue: YarnClient raises
> no exception. Though the job is indeed submitted successfully and fails immediately
> afterward, it would be better if YarnClient handled the immediate-failure case as YarnRunner does.

This message was sent by Atlassian JIRA
