From: Bikas Saha
Date: Mon, 12 Aug 2013 19:38:34 -0700
Subject: RE: AM timeout on RM failure?
To: yarn-dev@hadoop.apache.org

We can fix it once we have an idea on how long RM takes to restart for
some large clusters. I am hoping it will be considerably shorter than 15
mins.

-----Original Message-----
From: Karthik Kambatla [mailto:kasha@cloudera.com]
Sent: Monday, August 12, 2013 11:38 AM
To: yarn-dev@hadoop.apache.org
Subject: Re: AM timeout on RM failure?

The RMProxy code, by default, uses 15 minutes for connect.max-wait, but
the AM aborts trying to connect only after 20 mins. Wonder where the
additional 5 minutes comes from? Let me run it again and see.

Also, 15 minutes seems a little excessive, compared to other similar
timeouts being 10 mins. I can fix this as part of YARN-1056 if you agree
we should bring it down.

Thanks
Karthik

On Mon, Aug 12, 2013 at 10:22 AM, Bikas Saha wrote:

> You should probably look at the RMProxy code and the configs it uses.
> I am hoping that all clients including the MR AM now use that proxy
> and so older configs are no longer valid.
>
> Bikas
>
> -----Original Message-----
> From: Karthik Kambatla [mailto:kasha@cloudera.com]
> Sent: Sunday, August 11, 2013 8:45 PM
> To: yarn-dev@hadoop.apache.org
> Subject: AM timeout on RM failure?
>
> Hi YARN devs,
>
> I am working on the ZKRMStateStore, and had a very basic question - on
> RM failure, how long does the AM wait before crashing, and more
> importantly, what controls it?
>
> Looking into the code, I see the following two parameters:
>
> 1. yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms - set to
> 1 min
> 2. Fix configs
>
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> - set by default to 15 mins and 30 seconds respectively
>
> The AM crashes only after 20 minutes.
>
> Are there any other configs that influence this?
>
> Thanks
> Karthik
>
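
For readers following the thread, here is a minimal sketch of how a
"maximum wait plus fixed retry interval" pair of settings typically
interacts. It is not the RMProxy implementation; the function name
connect_with_retries and its parameters are hypothetical, and the sketch
assumes that only the sleep between attempts counts against the budget.

```python
from typing import Callable, Tuple


def connect_with_retries(
    connect: Callable[[], None],
    max_wait_secs: int,
    retry_interval_secs: int,
    sleep: Callable[[float], None],
) -> Tuple[bool, int]:
    """Retry connect() at a fixed interval until it succeeds or the wait budget runs out.

    Hypothetical helper for illustration only. Returns (succeeded, seconds_slept).
    Assumption: time spent inside a failed connect() call is NOT counted,
    only the sleeps between attempts.
    """
    waited = 0
    while True:
        try:
            connect()
            return True, waited
        except ConnectionError:
            # Give up if one more sleep would exceed the configured maximum wait.
            if waited + retry_interval_secs > max_wait_secs:
                return False, waited
            sleep(retry_interval_secs)
            waited += retry_interval_secs
```

With the defaults quoted above (15 minutes and 30 seconds, i.e. 900 and
30), this sketch sleeps 30 times before giving up. Passing time.sleep as
the sleep argument gives real delays; a no-op callable is handy for
experimenting. Because time spent inside each failed attempt is ignored
here, the sketch says nothing about where the extra 5 minutes observed in
the thread come from.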