hadoop-common-user mailing list archives

From Ravi Prakash <ravi...@ymail.com>
Subject Re: Job end notification does not always work (Hadoop 2.x)
Date Sun, 23 Jun 2013 13:30:39 GMT
Hi Alejandro,

Thanks for your reply! I was thinking more along the lines Prashant suggested, i.e. a failure
during init() should still trigger an attempt to notify (by the AM). But now that you mention
it, maybe we would be better off including this as a YARN feature after all (especially with
all the new AMs being written). We could let the NM of the AM handle the notification burden,
so that the RM doesn't get unduly taxed. Thoughts?

Thanks
Ravi




________________________________
 From: Alejandro Abdelnur <tucu@cloudera.com>
To: "common-user@hadoop.apache.org" <user@hadoop.apache.org> 
Sent: Saturday, June 22, 2013 7:37 PM
Subject: Re: Job end notification does not always work (Hadoop 2.x)
 


If the AM fails before doing the job end notification, at any stage of the execution and for whatever
reason, the job end notification will never be delivered. There is no way to fix this unless
the notification is done by a YARN service. The 2 'candidate' services for doing this would
be the RM and the HS. The job notification URL is in the job conf. The RM never sees the job
conf, which rules the RM out unless we add, at AM registration time, the possibility of
specifying a callback URL. The HS has access to the job conf, but the HS is currently a 'passive'
service.

thx


On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy <acm@hortonworks.com> wrote:

Prashant,
>
>
> Please file a jira.
>
>
> One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance
- which means we can't just assume that failure of a single AM is equivalent to failure of
the job.
>
>
> Only the ResourceManager is in the appropriate position to judge failure of the AM vs. failure of the job.
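
The distinction drawn above can be sketched as a small decision rule. This is only an illustration, not YARN code: the class `AttemptPolicy`, its enum, and the 1-based attempt numbering are all hypothetical, and in a real cluster the ResourceManager makes this judgment.

```java
// Hypothetical helper (not part of Hadoop): decides whether a failed
// AM attempt should be treated as a retry or as failure of the whole job.
final class AttemptPolicy {
  enum Outcome { RETRY_AM, JOB_FAILED }

  // failedAttempt is assumed to be 1-based; maxAttempts is the assumed
  // upper bound on AM attempts for the application.
  static Outcome onAttemptFailure(int failedAttempt, int maxAttempts) {
    if (failedAttempt < 1 || maxAttempts < 1) {
      throw new IllegalArgumentException("attempt counts must be >= 1");
    }
    // Only when no attempts remain does an AM failure imply a job failure.
    return failedAttempt >= maxAttempts ? Outcome.JOB_FAILED : Outcome.RETRY_AM;
  }
}
```

Under this rule a "job failed" notification would only be appropriate for the `JOB_FAILED` outcome; an earlier attempt that dies may still be followed by a successful one.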
>
>
>hth,
>Arun
>
>
>On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi <prash1784@gmail.com> wrote:
>
>Thanks Ravi.
>>
>>Well, in this case it's a no-effort :) Shouldn't a failure of AM init be considered a
failure of the job? I looked at the code and best-effort makes sense with respect to retry
logic etc. You make a good point that there would be no notification in case the AM OOMs, but
I do feel AM init failure should send a notification by other means.
>>
>>
>>
>>
>>
>>On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash <ravihoo@ymail.com> wrote:
>>
>>Hi Prashant,
>>>
>>>I would tend to agree with you. Although job-end notification is only a "best-effort"
mechanism (i.e. we cannot always guarantee notification for example when the AM OOMs), I agree
with you that we can do more. If you feel strongly about this, please create a JIRA and possibly
upload a patch.
>>>
>>>Thanks
>>>Ravi
>>>
>>>
>>>
>>>
>>>
>>>
>>>________________________________
>>> From: Prashant Kommireddi <prash1784@gmail.com>
>>>To: "user@hadoop.apache.org" <user@hadoop.apache.org> 
>>>Sent: Thursday, June 20, 2013 9:45 PM
>>>Subject: Job end notification does not always work (Hadoop 2.x)
>>> 
>>>
>>>
>>>Hello,
>>>
>>>I came across an issue that occurs with the job notification callbacks in MR2.
It works fine if the ApplicationMaster has started, but no callback is sent if initialization
of the AM fails.
>>>
>>>Here is the code from MRAppMaster.java
>>>
>>>.....
>>>.......
>>>
>>>      // set job classloader if configured
      MRApps.setJobClassLoader(conf);
      initAndStartAppMaster(appMaster, conf, jobUserName);
    } catch (Throwable t) {
      LOG.fatal("Error starting MRAppMaster", t);
      System.exit(1);
    }
  }
>>>
>>>protected static void initAndStartAppMaster(final MRAppMaster appMaster,
      final YarnConfiguration conf, String jobUserName) throws IOException,
      InterruptedException {
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation appMasterUgi = UserGroupInformation
        .createRemoteUser(jobUserName);
    appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
      @Override
      public Object run() throws Exception {
        appMaster.init(conf);
        appMaster.start();
        if(appMaster.errorHappenedShutDown) {
          throw new IOException("Was asked to shut down.");
        }
        return null;
      }
    });
  }
>>>appMaster.init(conf) does not dispatch JobFinishEventHandler, which is responsible
for sending an HTTP callback (via shutDownJob()). If there is an exception at this point, the
process simply terminates (via System.exit(1)).
>>>
>>>appMaster.start() however rightly uses the JobFinishEventHandler and things work
fine.
>>>
>>>Shouldn't a failure on init(..) also send a callback suggesting the job failed?
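
A minimal sketch of the behaviour being asked for, assuming nothing about the real MRAppMaster internals: wrap the init step so that a failure still triggers a best-effort HTTP callback before the process exits. The class `InitFailureNotifier` and its methods are hypothetical and use only the JDK; the `$jobId` and `$jobStatus` placeholders follow the convention of the job end notification URL in the job conf. The retry count and timeout are arbitrary illustrative values.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper (not part of Hadoop): best-effort callback on init failure.
final class InitFailureNotifier {

  // Substitutes the $jobId and $jobStatus placeholders in the URL template.
  static String expandUrl(String template, String jobId, String status) {
    return template.replace("$jobId", jobId).replace("$jobStatus", status);
  }

  // Tries the callback up to `attempts` times; returns false instead of throwing.
  static boolean notifyBestEffort(String url, int attempts, int timeoutMs) {
    for (int i = 0; i < attempts; i++) {
      HttpURLConnection conn = null;
      try {
        conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(timeoutMs);
        conn.setReadTimeout(timeoutMs);
        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
          return true;
        }
      } catch (IOException | RuntimeException e) {
        // Best effort only: ignore the error and try again.
      } finally {
        if (conn != null) {
          conn.disconnect();
        }
      }
    }
    return false;
  }

  // Runs the init step; on any failure sends a FAILED callback and
  // returns the exit code the caller would pass to System.exit().
  static int runInit(Runnable init, String urlTemplate, String jobId) {
    try {
      init.run();
      return 0;
    } catch (Throwable t) {
      notifyBestEffort(expandUrl(urlTemplate, jobId, "FAILED"), 2, 2000);
      return 1;
    }
  }
}
```

Note that this sketch reports FAILED for any init failure of a single AM process; it makes no attempt to decide whether the job as a whole has failed.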
>>>
>>>Thanks,
>>>
Prashant
>>>
>>>
>>>
>>>
>>>
>>
>
>--
>Arun C. Murthy
>Hortonworks Inc.
>http://hortonworks.com/
>
> 
>


-- 
Alejandro 