hadoop-user mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: Job end notification does not always work (Hadoop 2.x)
Date Sat, 22 Jun 2013 22:48:24 GMT
Prashant,

 Please file a jira.

 One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance
- which means we can't just assume that failure of a single AM is equivalent to failure of
the job.

 Only the ResourceManager is in the appropriate position to judge failure of an AM vs. failure of the job.
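
To illustrate the distinction between a failed AM attempt and a failed job, here is a minimal Java sketch. It is not the actual ResourceManager code: the class `AttemptTracker` and the `maxAttempts` parameter are hypothetical names chosen for illustration. The only idea it shows is that one failed AM attempt is not reported as a job failure until the allowed attempts are exhausted.

```java
/** Hypothetical sketch: counts failed AM attempts and decides when the job itself has failed. */
final class AttemptTracker {
    private final int maxAttempts;   // assumed limit on AM attempts for one job
    private int failedAttempts = 0;

    AttemptTracker(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /** Records one failed AM attempt; returns true only once the job counts as failed. */
    boolean recordAttemptFailure() {
        if (failedAttempts < maxAttempts) {
            failedAttempts++;
        }
        return jobFailed();
    }

    /** The job is considered failed only after every allowed attempt has failed. */
    boolean jobFailed() {
        return failedAttempts >= maxAttempts;
    }
}
```

With `new AttemptTracker(2)`, the first call to `recordAttemptFailure()` returns false (the AM may still be restarted) and the second returns true. A single AM attempt does not have this view of the other attempts, which is why the decision sits better outside it.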

hth,
Arun

On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi <prash1784@gmail.com> wrote:

> Thanks Ravi.
> 
> Well, in this case it's a no-effort :) Shouldn't a failure of AM init be considered a failure
> of the job? I looked at the code, and best-effort makes sense with respect to retry logic etc.
> You make a good point that there would be no notification in case the AM OOMs, but I do feel AM
> init failure should send a notification by other means.
> 
> 
> 
> On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash <ravihoo@ymail.com> wrote:
> Hi Prashant,
> 
> I would tend to agree with you. Although job-end notification is only a "best-effort"
> mechanism (i.e. we cannot always guarantee notification, for example when the AM OOMs), I agree
> with you that we can do more. If you feel strongly about this, please create a JIRA and possibly
> upload a patch.
> 
> Thanks
> Ravi
> 
> 
> From: Prashant Kommireddi <prash1784@gmail.com>
> To: "user@hadoop.apache.org" <user@hadoop.apache.org> 
> Sent: Thursday, June 20, 2013 9:45 PM
> Subject: Job end notification does not always work (Hadoop 2.x)
> 
> Hello,
> 
> I came across an issue that occurs with the job notification callbacks in MR2. It works
> fine if the ApplicationMaster has started, but does not send a callback if initialization
> of the AM fails.
> 
> Here is the code from MRAppMaster.java
> 
> .....
> .......
>       // set job classloader if configured
>       MRApps.setJobClassLoader(conf);
>       initAndStartAppMaster(appMaster, conf, jobUserName);
>     } catch (Throwable t) {
>       LOG.fatal("Error starting MRAppMaster", t);
>       System.exit(1);
>     }
>   }
> 
> protected static void initAndStartAppMaster(final MRAppMaster appMaster,
>       final YarnConfiguration conf, String jobUserName) throws IOException,
>       InterruptedException {
>     UserGroupInformation.setConfiguration(conf);
>     UserGroupInformation appMasterUgi = UserGroupInformation
>         .createRemoteUser(jobUserName);
>     appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
>       @Override
>       public Object run() throws Exception {
>         appMaster.init(conf);
>         appMaster.start();
>         if(appMaster.errorHappenedShutDown) {
>           throw new IOException("Was asked to shut down.");
>         }
>         return null;
>       }
>     });
>   }
> 
> appMaster.init(conf) does not dispatch JobFinishEventHandler, which is responsible for
> sending an HTTP callback (via shutDownJob()). If there is an exception at this point, the process
> simply terminates (via System.exit(1)).
> 
> appMaster.start(), however, correctly uses the JobFinishEventHandler and things work fine.
> 
> Shouldn't a failure on init(..) also send a callback suggesting the job failed?
> 
> Thanks,
> Prashant
> 
> 
> 
> 
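
For reference, the behaviour asked about in the original message could look roughly like the sketch below. It is a hedged illustration, not a patch against MRAppMaster: `InitFailureNotifier`, `runWithNotification`, and the `notifier` callback are hypothetical names, and the method returns an exit code instead of calling System.exit(1) so that it can be exercised in isolation. Given the restart behaviour described at the top of this message, a real change would also need to decide whether the failing attempt is the last one before notifying.

```java
import java.util.function.Consumer;

/** Hypothetical sketch: one best-effort notification attempt when initialization fails. */
final class InitFailureNotifier {

    /**
     * Runs the init step. On failure, tries once to send a "FAILED" notification
     * through the supplied callback, then returns a non-zero exit code.
     */
    static int runWithNotification(Runnable init, Consumer<String> notifier) {
        try {
            init.run();
            return 0; // init succeeded; the normal job-finish path handles notification
        } catch (Throwable t) {
            try {
                notifier.accept("FAILED");
            } catch (RuntimeException e) {
                // Best-effort only: a failed notification must not mask the init failure.
            }
            return 1; // the caller would pass this to System.exit
        }
    }
}
```

The notification is attempted inside its own try block so that a broken callback endpoint cannot change the exit code, which keeps the mechanism best-effort as discussed in the thread.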

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/


