From: Arun C Murthy
To: user@hadoop.apache.org
Subject: Re: Job end notification does not always work (Hadoop 2.x)
Date: Sat, 22 Jun 2013 15:48:24 -0700

Prashant,

Please file a JIRA.

One thing to be aware of: AMs get restarted a certain number of times for fault tolerance, which means we can't just assume that failure of a single AM is equivalent to failure of the job.

Only the ResourceManager is in the appropriate position to judge failure of an AM vs. failure of the job.

hth,
Arun

On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi <prash1784@gmail.com> wrote:
Thanks Ravi.

Well, in this case it's a no-effort :) Shouldn't a failure of AM init be considered a failure of the job? I looked at the code, and best-effort makes sense with respect to retry logic etc. You make a good point that there would be no notification if the AM OOMs, but I do feel an AM init failure should send a notification by other means.



On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash <ravihoo@ymail.com> wrote:
Hi Prashant,

I would tend to agree with you. Although job-end notification is only a "best-effort" mechanism (i.e. we cannot always guarantee notification, for example when the AM OOMs), I agree with you that we can do more. If you feel strongly about this, please create a JIRA and possibly upload a patch.
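To make "best-effort" concrete, here is a minimal Java sketch of a notifier that retries a bounded number of times and then gives up quietly. It is not the Hadoop implementation: the class BestEffortNotifierSketch, the Sender interface and the example URL are all hypothetical names chosen for illustration.

```java
// Hypothetical sketch; none of these names come from Hadoop.
final class BestEffortNotifierSketch {

    /** Hypothetical transport: returns true if the callback was delivered. */
    interface Sender {
        boolean send(String url) throws Exception;
    }

    /**
     * Tries to deliver the callback at most maxAttempts times.
     * Failures are swallowed: the caller learns the outcome from the
     * return value but is never interrupted by an exception.
     */
    static boolean notifyBestEffort(Sender sender, String url, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (sender.send(url)) {
                    return true;
                }
            } catch (Exception e) {
                // Best effort: report the problem and fall through to the next attempt.
                System.err.println("Notification attempt " + attempt + " failed: " + e);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake sender that fails twice and then succeeds.
        Sender flaky = url -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new java.io.IOException("connection refused");
            }
            return true;
        };
        boolean delivered = notifyBestEffort(flaky, "http://example.invalid/notify", 5);
        System.out.println("delivered=" + delivered + " after " + calls[0] + " attempts");
    }
}
```

The point of the sketch is only that a best-effort sender can report that it gave up, but it cannot run at all if the process dies before it is reached, which is the OOM case mentioned above.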

Thanks
Ravi



From: Prashant Kommireddi <prash1784@gmail.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Sent: Thursday, June 20, 2013 9:45 PM
Subject: Job end notification does not always work (Hadoop 2.x)

Hello,

I came across an issue with the job notification callbacks in MR2. The callback works fine once the ApplicationMaster has started, but none is sent if initialization of the AM fails.

Here is the code from MRAppMaster.java:

.....
.......
      // set job classloader if configured
      MRApps.setJobClassLoader(conf);
      initAndStartAppMaster(appMaster, conf, jobUserName);
    } catch (Throwable t) {
      LOG.fatal("Error starting MRAppMaster", t);
      System.exit(1);
    }
  }

  protected static void initAndStartAppMaster(final MRAppMaster appMaster,
      final YarnConfiguration conf, String jobUserName) throws IOException,
      InterruptedException {
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation appMasterUgi = UserGroupInformation
        .createRemoteUser(jobUserName);
    appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
      @Override
      public Object run() throws Exception {
        appMaster.init(conf);
        appMaster.start();
        if (appMaster.errorHappenedShutDown) {
          throw new IOException("Was asked to shut down.");
        }
        return null;
      }
    });
  }

appMaster.init(conf) does not dispatch JobFinishEventHandler, which is responsible for sending the HTTP callback (via shutDownJob()). If an exception is thrown at this point, the process simply terminates (via System.exit(1)).

appMaster.start(), however, rightly uses the JobFinishEventHandler, and things work fine.

Shouldn't a failure on init(..) also send a callback suggesting the job failed?
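One way to picture the behaviour being asked for is the sketch below. It is not a patch against MRAppMaster: Lifecycle, Notifier, run and the lastAttempt flag are hypothetical names, and the method returns an exit code instead of calling System.exit so that it can be exercised in isolation. The lastAttempt flag is there because AMs can be restarted, so a single failed init is not necessarily a failed job; how an AM would reliably learn that it is on its last attempt is left open here.

```java
// Hypothetical sketch; these types only stand in for the real AM classes.
final class InitFailureNotificationSketch {

    /** Stand-in for the init()/start() calls made on the app master. */
    interface Lifecycle {
        void init() throws Exception;
        void start() throws Exception;
    }

    /** Stand-in for whatever component sends the HTTP callback. */
    interface Notifier {
        void jobFailed(String reason);
    }

    /**
     * Runs init() and start(). If either throws and this is the last
     * attempt, a failure notification is sent before returning a
     * non-zero exit code. Earlier attempts exit without notifying,
     * since a later attempt may still succeed.
     */
    static int run(Lifecycle am, Notifier notifier, boolean lastAttempt) {
        try {
            am.init();
            am.start();
            return 0;
        } catch (Throwable t) {
            if (lastAttempt) {
                try {
                    notifier.jobFailed("AM init/start failed: " + t);
                } catch (RuntimeException e) {
                    // Best effort: a failing notifier must not mask the original error.
                    System.err.println("Could not send failure notification: " + e);
                }
            }
            return 1;
        }
    }

    public static void main(String[] args) {
        Lifecycle broken = new Lifecycle() {
            @Override public void init() throws Exception {
                throw new IllegalStateException("bad configuration");
            }
            @Override public void start() { }
        };
        Notifier console = reason -> System.out.println("NOTIFY: " + reason);

        // Not the last attempt: exits with 1 and sends nothing.
        System.out.println("exit=" + run(broken, console, false));
        // Last attempt: sends the notification, then exits with 1.
        System.out.println("exit=" + run(broken, console, true));
    }
}
```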

Thanks,
Prashant





--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
