hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
Date Tue, 10 Jun 2014 21:07:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027012#comment-14027012
] 

Jian He commented on YARN-1365:
-------------------------------

Anubhav, thanks for working on the patch. some comments:

The following two pieces of code can be merged in the same if block like this ?
{code}
          // Need to register an app attempt before AM can register
          appAttempt.masterService
              .registerAppAttempt(appAttempt.applicationAttemptId);

          appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(
            appAttempt.getAppAttemptId(), false, false));
{code}

we can print the current state of RMAppAttempt also which will be useful for debugging.
{code}
LOG.info("Skipping notifying ATTEMPT_ADDED");
{code}

We should remove the following transition also and do the same to pass a flag in AppAddedSchedulerEvent
to notify not re-send APP_ACCEPTED event for recovered apps.
{code}
    // ACCECPTED state can once again receive APP_ACCEPTED event, because on
    // recovery the app returns ACCEPTED state and the app once again go
    // through the scheduler and triggers one more APP_ACCEPTED event at
    // ACCEPTED state.
    .addTransition(RMAppState.ACCEPTED, RMAppState.ACCEPTED,
        RMAppEventType.APP_ACCEPTED)
{code}

This message is not exactly true. If RM didn’t reboot and AM can just do allocate without
registration.
{code}
        String message =
            "Application Master is not registered for known application: "
                + appAttemptId.getApplicationId()
                + ". Looks like RM rebooted. Let AM resync.";
{code}

The following code is removed in finishApplicationMaster to not throw exception if the app
did not register before.
If an app did not register and do unregister directly, it’ll go through successfully.
should we return resync command here also and client is expected to do re-register and then
unregister?

{code}
if (!hasApplicationMasterRegistered(applicationAttemptId)) {
        String message =
            "Application Master is trying to unregister before registering for: "
                + applicationAttemptId.getApplicationId();
        LOG.error(message);
        RMAuditLogger.logFailure(
            this.rmContext.getRMApps()
                .get(applicationAttemptId.getApplicationId()).getUser(),
            AuditConstants.UNREGISTER_AM, "", "ApplicationMasterService",
            message, applicationAttemptId.getApplicationId(),
            applicationAttemptId);
        throw new InvalidApplicationMasterRequestException(message);
      }
{code}


> ApplicationMasterService to allow Register and Unregister of an app that was running
before restart
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1365
>                 URL: https://issues.apache.org/jira/browse/YARN-1365
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch,
YARN-1365.initial.patch
>
>
> For an application that was running before restart, the ApplicationMasterService currently
throws an exception when the app tries to make the initial register or final unregister call.
These should succeed and the RMApp state machine should transition to completed like normal.
Unregistration should succeed for an app that the RM considers complete since the RM may have
died after saving completion in the store but before notifying the AM that the AM is free
to exit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message