From: John Lilley
To: user@hadoop.apache.org
Subject: very long timeout on failed RM connect
Date: Mon, 10 Feb 2014 14:44:43 +0000

Our application (running outside the Hadoop cluster) connects to the RM through YarnClient. This works fine, except we've found that if the RM address or port is misconfigured in our software, or a firewall blocks access, the first call into the client (in this case getNodeReports) hangs for a very long time. I've tried

    conf.set("ipc.client.connect.max.retries", "2");

But this doesn't help. Is there a configuration setting I can make on the YarnClient that will reduce this hang time?

I understand why this long-winded retry strategy exists, in order to prevent a highly-loaded cluster from failing jobs. But it is not appropriate for an interactive application.
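
To make the interactive case concrete, here is a minimal sketch of one possible client-side workaround. It is not a YarnClient setting and not something confirmed by the Hadoop documentation: it probes the RM address with a plain TCP connect and a short timeout before the first YarnClient call is made. The class name RmProbe, the method isReachable, and the 2-second timeout are hypothetical names and values chosen for illustration; port 8032 is assumed only because it is the usual default RM port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Hadoop: a fail-fast reachability probe
// that an interactive client could run before creating a YarnClient.
class RmProbe {

    // Returns true if a TCP connection to host:port is established within
    // timeoutMs milliseconds; returns false on refusal, timeout, or an
    // unresolvable host name.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connection refused, connect timeout, and unknown host.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults for illustration only.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8032;

        if (isReachable(host, port, 2000)) {
            System.out.println("RM address accepts connections; proceed with YarnClient.");
        } else {
            System.out.println("RM address not reachable; report the error to the user now.");
        }
    }
}
```

A probe like this only shows that something accepts TCP connections at that address. It does not prove that the listener is a ResourceManager, so it would complement a shorter retry setting rather than replace one.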

Thanks

John
