From: "Joydeep Sen Sarma"
Subject: RE: ipc.client.timeout
Date: Thu, 13 Sep 2007 12:13:42 -0700

I would love to use a lower timeout. It seems that retries are either buggy or missing in some cases, which causes lots of failures. The cases I can see right now (0.13.1):

- namenode.complete: looks like it retries, but may not be idempotent?

  org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete write to file /user/facebook/profiles/binary/users_joined/_task_0018_r_000003_0/.part-00003.crc by DFSClient_task_0018_r_000003_0
      at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:353)

- namenode.addBlock: no retry policy (looking at DFSClient.java)
- namenode.mkdirs: no retry policy (also per DFSClient.java)

We see plenty of all of these with a lowered timeout. With a high timeout, we have seen very slow recovery from some failures (jobs would hang on submission).

I don't understand the fs protocol well enough. Any idea if these are fixable?

Thx,

Joydeep

-----Original Message-----
From: Devaraj Das [mailto:ddas@yahoo-inc.com]
Sent: Wednesday, September 05, 2007 1:00 AM
To: hadoop-user@lucene.apache.org
Subject: RE: ipc.client.timeout

This is to take care of cases where a particular server is too loaded to respond to client RPCs quickly enough. Setting the timeout to a large value ensures that RPCs won't time out that often, which potentially leads to fewer failures and retries (e.g., a map/reduce task kills itself when it fails to invoke an RPC on the tasktracker three times in a row).

> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
> Sent: Wednesday, September 05, 2007 12:26 PM
> To: hadoop-user@lucene.apache.org
> Subject: ipc.client.timeout
>
> The default is set to 60s. Many of my dfs -put commands would
> seem to hang, and lowering the timeout (to 1s) seems to
> have made things a whole lot better.
>
> General curiosity: isn't 60s just huge for an RPC timeout? (A
> web search indicates that nutch may be setting it to 10s,
> and even that seems fairly large.) Would love to get a
> backgrounder on why the default is set to so large a value.
>
> Thanks,
>
> Joydeep
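
To make the trade-off discussed in this thread concrete, here is a minimal sketch of a client-side wrapper that applies a short per-attempt timeout and retries only calls the caller has marked idempotent. It is not Hadoop code: the class RetrySketch, the method callWithRetry, and all of its parameters are hypothetical names made up for illustration, and the timeout is enforced with plain java.util.concurrent rather than Hadoop's IPC layer.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of Hadoop.
final class RetrySketch {
    private RetrySketch() {}

    /**
     * Runs rpc with a per-attempt timeout. A timed-out call is retried only
     * when the caller declares it idempotent; otherwise the first timeout is
     * surfaced, because the server may already have applied the request.
     * Failures other than a timeout propagate as ExecutionException.
     */
    static <T> T callWithRetry(Callable<T> rpc, long timeoutMs,
                               int maxAttempts, boolean idempotent) throws Exception {
        // Non-idempotent calls get exactly one attempt.
        int attempts = idempotent ? Math.max(1, maxAttempts) : 1;
        // Cached pool so a stuck attempt cannot block the next one.
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            TimeoutException last = null;
            for (int i = 0; i < attempts; i++) {
                Future<T> pending = pool.submit(rpc);
                try {
                    return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    pending.cancel(true); // interrupt the stuck attempt
                    last = e;
                }
            }
            throw last; // every permitted attempt timed out
        } finally {
            pool.shutdownNow(); // release worker threads
        }
    }
}
```

Under this sketch, a call such as mkdirs could be wrapped with idempotent set to true only if repeating it is known to be safe, while a call whose repeat may fail or double-apply (the concern raised above for complete) would be wrapped with idempotent set to false, so a short timeout surfaces one error instead of a confusing second failure.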