From: Xuan Gong
To: "user@hadoop.apache.org"
Reply-To: user@hadoop.apache.org
Date: Mon, 9 Feb 2015 04:42:32 +0000
Subject: Re: Max Connect retries

That is the client connect retry at the IPC level. You can decrease the maximum number of retries by setting ipc.client.connect.max.retries.on.timeouts in core-site.xml.
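
For illustration, the setting could be written in core-site.xml roughly as follows. This is only a sketch: the value 3 is an assumed example, not a recommendation, and the property has to sit inside the file's existing <configuration> element.

```xml
<!-- core-site.xml fragment (illustrative only) -->
<!-- The value 3 is an assumption chosen for the example; the log in the
     quoted message shows the value currently in effect, maxRetries=45. -->
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>3</value>
</property>
```

The change presumably has to reach the configuration read by the processes that make the IPC connection (here, the tasks retrying against the failed node), so it may need to be distributed to every node and picked up by newly started jobs.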



Thanks

Xuan Gong

From: Telles Nobrega <tellesnobrega@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Saturday, February 7, 2015 at 8:37 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Max Connect retries

Hi, I changed my cluster config so a failed nodemanager can be detected in about 30 seconds. When I'm running a wordcount, the reduce gets stuck at 25% for quite a while, and the logs show nodes trying to connect to the failed node:

org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-telles-844fb3f0-dfd8-456d-89c3-1d7cfdbdcad2/10.3.2.99:49911. Already tried 28 time(s); maxRetries=45
2015-02-08 04:26:42,088 INFO [IPC Server handler 16 on 50037] org.apache.hadoop.mapred.TaskAttemptListenerImpl: MapCompletionEvents request from attempt_1423319128424_0025_r_000000_0. startIndex 24 maxEvents 10000
Is this the expected behaviour? Should I change max retries to a lower value? If so, which config is that?
Thanks
