Mailing-List: contact yarn-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-dev@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of bikas@hortonworks.com
 designates 209.85.217.182 as permitted sender)
From: Bikas Saha <bikas@hortonworks.com>
References: 
 <CALwhT972JBKKXgmM4RtKrQFFBzb+_CJAQBL6xCa63Mf3NZMw0w@mail.gmail.com>
	<6001ea89020f379d32bbe4764175225c@mail.gmail.com>
 <CALwhT943ODnPiQWJRnWMEWb63rrp=yFX8ecBdpOu6FAURYuF=w@mail.gmail.com>
In-Reply-To: 
 <CALwhT943ODnPiQWJRnWMEWb63rrp=yFX8ecBdpOu6FAURYuF=w@mail.gmail.com>
MIME-Version: 1.0
Thread-Index: AQH4OhTbW4jkuthvsCKGUUxiLAZCigG0WFW1AkNY2QeZH2qAoA==
Date: Mon, 12 Aug 2013 19:38:34 -0700
Message-ID: <07e3f9240461dee632b4bd3757aeb056@mail.gmail.com>
Subject: RE: AM timeout on RM failure?
To: yarn-dev@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1

We can fix it once we have an idea of how long the RM takes to restart on
some large clusters. I am hoping it will be considerably shorter than 15
mins.

-----Original Message-----
From: Karthik Kambatla [mailto:kasha@cloudera.com]
Sent: Monday, August 12, 2013 11:38 AM
To: yarn-dev@hadoop.apache.org
Subject: Re: AM timeout on RM failure?

The RMProxy code, by default, uses 15 minutes for connect.max-wait, but
the AM stops trying to connect only after 20 mins. I wonder where the
additional 5 minutes come from. Let me run it again and see.

Also, 15 minutes seems a little excessive, given that other similar
timeouts are 10 mins. I can fix this as part of YARN-1056 if you agree
we should bring it down.

Thanks
Karthik


On Mon, Aug 12, 2013 at 10:22 AM, Bikas Saha <bikas@hortonworks.com> wrote:

> You should probably look at the RMProxy code and the configs it uses.
> I am hoping that all clients including the MR AM now use that proxy
> and so older configs are no longer valid.
>
> Bikas
>
> -----Original Message-----
> From: Karthik Kambatla [mailto:kasha@cloudera.com]
> Sent: Sunday, August 11, 2013 8:45 PM
> To: yarn-dev@hadoop.apache.org
> Subject: AM timeout on RM failure?
>
> Hi YARN devs,
>
> I am working on the ZKRMStateStore and had a very basic question: on
> RM failure, how long does the AM wait before crashing and, more
> importantly, what controls that?
>
> Looking into the code, I see the following two parameters:
>
>    1. yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms - set
>    to 1 min
>    2. Fix configs
>    yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>    - set by default to 15 mins and 30 seconds respectively
>
> The AM crashes only after 20 minutes.
>
> Are there any other configs that influence this?
>
> Thanks
> Karthik
>
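
For reference, the timeout arithmetic discussed above can be sketched as
follows. This is only a rough model, not the RMProxy implementation: it
assumes a fixed-sleep retry policy whose max-wait budget counts only the
sleep between attempts, so any time an attempt itself spends before
failing is added on top. The function names and the per-attempt cost are
hypothetical; the thread does not establish where the extra 5 minutes
actually come from.

```python
def retry_count(max_wait_s: int, retry_interval_s: int) -> int:
    """Number of retries a fixed-sleep policy allows within the wait budget."""
    if retry_interval_s <= 0:
        raise ValueError("retry interval must be positive")
    return max_wait_s // retry_interval_s


def wall_clock_s(max_wait_s: int, retry_interval_s: int, attempt_cost_s: int) -> int:
    """Elapsed seconds if each attempt takes attempt_cost_s before failing.

    Assumption: the budget covers only the sleeps, not the attempts.
    """
    retries = retry_count(max_wait_s, retry_interval_s)
    return retries * (retry_interval_s + attempt_cost_s)


# Defaults quoted in the thread: 15 min max wait, 30 s retry interval.
MAX_WAIT_S = 15 * 60
RETRY_INTERVAL_S = 30

print(retry_count(MAX_WAIT_S, RETRY_INTERVAL_S))       # 30 retries
print(wall_clock_s(MAX_WAIT_S, RETRY_INTERVAL_S, 0))   # 900 s = 15 min
# attempt_cost_s=10 is an illustrative figure, not a measured one.
print(wall_clock_s(MAX_WAIT_S, RETRY_INTERVAL_S, 10))  # 1200 s = 20 min
```

Under this model, a per-attempt cost of roughly 10 seconds would be
enough to stretch a 15-minute budget to the observed 20 minutes, which
is one thing to check when re-running the experiment.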