hadoop-user mailing list archives

From Alejandro Abdelnur <t...@cloudera.com>
Subject Re: Job end notification does not always work (Hadoop 2.x)
Date Tue, 25 Jun 2013 13:21:47 GMT
Devaraj,

If a job can finish but you cannot determine its status after it has ended,
then the system is not usable. Thus, the HS is a required component.

thx


On Tue, Jun 25, 2013 at 6:11 AM, Devaraj k <devaraj.k@huawei.com> wrote:

> I agree, for getting status/counters we need the HS. I mean the job can
> finish without the HS as well.
>
> Thanks
>
> Devaraj k
>
> *From:* Alejandro Abdelnur [mailto:tucu@cloudera.com]
> *Sent:* 25 June 2013 18:05
> *To:* common-user@hadoop.apache.org
>
> *Subject:* Re: Job end notification does not always work (Hadoop 2.x)
>
> Devaraj,
>
> If you don't run the HS, once your jobs have finished you cannot retrieve
> status/counters from them, from the Java API or the Web UI. So I'd say that
> for any practical usage, you need it.
>
> thx
>
> On Mon, Jun 24, 2013 at 8:42 PM, Devaraj k <devaraj.k@huawei.com> wrote:
>
> It is not mandatory to have a running HS in the cluster. The user can still
> submit a job without an HS in the cluster, and may expect the Job/App
> End Notification.
>
> Thanks
>
> Devaraj k
>
> *From:* Alejandro Abdelnur [mailto:tucu@cloudera.com]
> *Sent:* 24 June 2013 21:42
> *To:* user@hadoop.apache.org
> *Cc:* user@hadoop.apache.org
>
> *Subject:* Re: Job end notification does not always work (Hadoop 2.x)
>
> If we ought to do this in a YARN service, it should be the RM or the HS.
> The RM is, IMO, the natural fit. The HS would be a good choice if we are
> concerned about the extra work this would cause in the RM. The problem with
> the current HS is that it is MR-specific; we should generalize it for
> different AM types.
>
> thx
>
> Alejandro
>
> (phone typing)
>
> On Jun 23, 2013, at 23:28, Devaraj k <devaraj.k@huawei.com> wrote:
>
> Even if we handle all the failure cases in the AM for Job End Notification,
> we may miss cases like an abrupt kill of the AM when it is on its last
> retry. If we choose the NM to give the notification, the RM again needs to
> identify which NM should give the end notification, as we don't have any
> direct protocol between the AM and the NM.
>
> I feel it would be better to move the End-Notification responsibility to
> the RM as a YARN service, because it ensures 100% notification and is also
> useful for other types of applications.
>
> Thanks
>
> Devaraj K
>
> *From:* Ravi Prakash [mailto:ravihoo@ymail.com]
> *Sent:* 23 June 2013 19:01
> *To:* user@hadoop.apache.org
> *Subject:* Re: Job end notification does not always work (Hadoop 2.x)
>
> Hi Alejandro,
>
> Thanks for your reply! I was thinking more along the lines Prashant
> suggested, i.e. a failure during init() should still trigger an attempt to
> notify (by the AM). But now that you mention it, maybe we would be better
> off including this as a YARN feature after all (especially with all the new
> AMs being written). We could let the NM of the AM handle the notification
> burden, so that the RM doesn't get unduly taxed. Thoughts?
>
> Thanks
> Ravi
>
>    ------------------------------
>
> *From:* Alejandro Abdelnur <tucu@cloudera.com>
> *To:* "common-user@hadoop.apache.org" <user@hadoop.apache.org>
> *Sent:* Saturday, June 22, 2013 7:37 PM
> *Subject:* Re: Job end notification does not always work (Hadoop 2.x)
>
> If the AM fails before doing the job end notification, at any stage of the
> execution and for whatever reason, the job end notification will never be
> delivered. There is no way to fix this unless the notification is done by a
> YARN service. The two candidate services for doing this would be the RM and
> the HS. The job notification URL is in the job conf. The RM never sees the
> job conf, which rules the RM out unless we add, at AM registration time,
> the possibility to specify a callback URL. The HS has access to the job
> conf, but the HS is currently a 'passive' service.
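
To illustrate the point that the callback URL lives in the job conf, here is a minimal sketch. It assumes the job conf is modeled as a plain Map rather than Hadoop's Configuration class. The property name mapreduce.job.end-notification.url and the $jobId / $jobStatus placeholders are the ones described in mapred-default.xml; the class name, method name, and example URL are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: models the job conf as a Map, not Hadoop's Configuration.
final class EndNotificationUrlSketch {
    // Property name as described in mapred-default.xml.
    static final String URL_KEY = "mapreduce.job.end-notification.url";

    /** Returns the callback URL with placeholders expanded, or null if none is configured. */
    static String expand(Map<String, String> jobConf, String jobId, String jobStatus) {
        String url = jobConf.get(URL_KEY);
        if (url == null || url.isEmpty()) {
            return null; // no notification requested for this job
        }
        // String.replace does literal (non-regex) substitution.
        return url.replace("$jobId", jobId).replace("$jobStatus", jobStatus);
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        // Hypothetical endpoint, for illustration only.
        jobConf.put(URL_KEY, "http://example.com/notify?id=$jobId&status=$jobStatus");
        System.out.println(expand(jobConf, "job_1371800000000_0001", "FAILED"));
    }
}
```

Whoever performs the callback needs this conf entry, which is why a service that never reads the job conf would need the URL handed to it some other way.
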
>
>
> thx
>
> On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy <acm@hortonworks.com>
> wrote:
>
> Prashant,
>
> Please file a JIRA.
>
> One thing to be aware of - AMs get restarted a certain number of times
> for fault-tolerance - which means we can't just assume that failure of a
> single AM is equivalent to failure of the job.
>
> Only the ResourceManager is in the appropriate position to judge failure
> of the AM vs. failure of the job.
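
The distinction between a failed AM attempt and a failed job can be sketched as follows. This is a minimal illustration, assuming the decider knows the current attempt number and the configured maximum number of AM attempts; the class, enum, and method names are hypothetical, and this is not the ResourceManager's actual logic.

```java
// Sketch only: hypothetical names, not ResourceManager code.
final class AttemptOutcomeSketch {
    enum Outcome { RETRY_AM, JOB_FAILED }

    /**
     * Decides what a failed AM attempt means for the job.
     * attemptNumber is 1-based; maxAttempts is the configured retry limit.
     */
    static Outcome onAmFailure(int attemptNumber, int maxAttempts) {
        if (attemptNumber < 1 || maxAttempts < 1) {
            throw new IllegalArgumentException("attempt numbers must be >= 1");
        }
        // Only the final attempt failing means the job itself has failed.
        return attemptNumber >= maxAttempts ? Outcome.JOB_FAILED : Outcome.RETRY_AM;
    }

    public static void main(String[] args) {
        System.out.println(onAmFailure(1, 2)); // first of two attempts
        System.out.println(onAmFailure(2, 2)); // last attempt
    }
}
```

A failure callback sent on RETRY_AM would be premature, since a later attempt may still complete the job.
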
>
> hth,
>
> Arun
>
> On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi <prash1784@gmail.com>
> wrote:
>
> Thanks Ravi.
>
> Well, in this case it's a no-effort :) Shouldn't a failure of AM init be
> considered a failure of the job? I looked at the code, and best-effort
> makes sense with respect to the retry logic etc. You make a good point that
> there would be no notification in case the AM OOMs, but I do feel an AM init
> failure should send a notification by other means.
>
> On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash <ravihoo@ymail.com> wrote:
>
> Hi Prashant,
>
> I would tend to agree with you. Although job-end notification is only a
> "best-effort" mechanism (i.e. we cannot always guarantee notification, for
> example when the AM OOMs), I agree with you that we can do more. If you
> feel strongly about this, please create a JIRA and possibly upload a patch.
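
What "best-effort" means in practice can be sketched as a bounded retry loop. This is a minimal illustration under stated assumptions: the notification attempt is modeled as a BooleanSupplier that returns true on success, and the class, method, and parameter names are hypothetical rather than taken from the MR2 source.

```java
import java.util.function.BooleanSupplier;

// Sketch only: a bounded retry loop, not the MR2 notifier implementation.
final class BestEffortRetrySketch {
    /**
     * Tries the notification up to maxAttempts times, sleeping between tries.
     * Returns true on the first success, false if every attempt fails or the
     * thread is interrupted. Nothing here helps if the process dies first.
     */
    static boolean notifyWithRetries(BooleanSupplier attempt, int maxAttempts, long intervalMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
        }
        return false; // gave up: the caller never hears about the job
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated endpoint that succeeds on the third call.
        boolean delivered = notifyWithRetries(() -> ++calls[0] == 3, 5, 10L);
        System.out.println("delivered=" + delivered + " after " + calls[0] + " calls");
    }
}
```

The loop bounds how long shutdown can be delayed, at the cost of silently dropping the notification when the endpoint stays unreachable.
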
>
> Thanks
> Ravi
>
>    ------------------------------
>
> *From:* Prashant Kommireddi <prash1784@gmail.com>
> *To:* "user@hadoop.apache.org" <user@hadoop.apache.org>
> *Sent:* Thursday, June 20, 2013 9:45 PM
> *Subject:* Job end notification does not always work (Hadoop 2.x)
>
> Hello,
>
> I came across an issue that occurs with the job notification callbacks in
> MR2. It works fine if the Application Master has started, but does not send
> a callback if the initialization of the AM fails.
>
> Here is the code from MRAppMaster.java
>
> .....
> .......
>
>       // set job classloader if configured
>       MRApps.setJobClassLoader(conf);
>       initAndStartAppMaster(appMaster, conf, jobUserName);
>     } catch (Throwable t) {
>       LOG.fatal("Error starting MRAppMaster", t);
>       System.exit(1);
>     }
>   }
>
>   protected static void initAndStartAppMaster(final MRAppMaster appMaster,
>       final YarnConfiguration conf, String jobUserName) throws IOException,
>       InterruptedException {
>     UserGroupInformation.setConfiguration(conf);
>     UserGroupInformation appMasterUgi = UserGroupInformation
>         .createRemoteUser(jobUserName);
>     appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
>       @Override
>       public Object run() throws Exception {
>         appMaster.init(conf);
>         appMaster.start();
>         if (appMaster.errorHappenedShutDown) {
>           throw new IOException("Was asked to shut down.");
>         }
>         return null;
>       }
>     });
>   }
>
> appMaster.init(conf) does not dispatch JobFinishEventHandler, which is
> responsible for sending an HTTP callback (via shutDownJob()). If there were
> an exception at this point, the process would simply terminate (via
> System.exit(1)).
>
> appMaster.start(), however, rightly uses the JobFinishEventHandler and
> things work fine.
>
> Shouldn't a failure in init(..) also send a callback indicating that the job
> failed?
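
One possible shape for the behavior asked about here is a launcher that attempts a callback before exiting when startup fails. This is a minimal sketch, not a patch against MRAppMaster: the Notifier and Startup interfaces, the launch method, and the lastAttempt flag are all hypothetical stand-ins, and the method returns an exit code instead of calling System.exit so that it can be exercised in isolation.

```java
// Sketch only: hypothetical stand-ins, not MRAppMaster code.
final class InitFailureNotifySketch {
    /** Stand-in for whatever sends the HTTP callback. */
    interface Notifier { void notifyJobEnd(String status); }

    /** Stand-in for appMaster.init(conf) followed by appMaster.start(). */
    interface Startup { void run() throws Exception; }

    /** Runs startup; on failure, tries to notify before reporting exit code 1. */
    static int launch(Startup startup, Notifier notifier, boolean lastAttempt) {
        try {
            startup.run();
            return 0;
        } catch (Throwable t) {
            System.err.println("Error starting application master: " + t);
            // Assumption: notify only when no further AM attempt will follow,
            // since an earlier attempt failing does not mean the job failed.
            if (lastAttempt) {
                try {
                    notifier.notifyJobEnd("FAILED");
                } catch (RuntimeException e) {
                    // Best effort: a failed callback must not mask the original error.
                    System.err.println("Notification failed: " + e);
                }
            }
            return 1;
        }
    }

    public static void main(String[] args) {
        int code = launch(
            () -> { throw new IllegalStateException("init failed"); },
            status -> System.out.println("callback sent, status=" + status),
            true);
        System.out.println("exit code " + code);
    }
}
```

This still cannot cover an AM that is killed abruptly or runs out of memory before reaching the catch block.
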
>
> Thanks,
>
> Prashant
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
> --
> Alejandro



-- 
Alejandro
