From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Wed, 2 May 2007 09:40:15 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-1263) retry logic when dfs exist or open fails temporarily, e.g because of timeout
Message-ID: <28975102.1178124015730.JavaMail.jira@brutus>
In-Reply-To: <7986637.1176787635420.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493139 ]

Doug Cutting commented on HADOOP-1263:
--------------------------------------

> FuzzyExponentialBackoffRetry

I think that's overkill. "Exponential backoff" is still the standard term, even when randomness is involved (e.g., http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff).

> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-1263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Christian Kunz
>         Assigned To: Hairong Kuang
>         Attachments: retry.patch
>
>
> Sometimes, when many (e.g. 1000+) map jobs start at about the same time and require supporting files from filecache, it happens that some map tasks fail because of rpc timeouts. With only the default number of 10 handlers on the namenode, the probability is high that the whole job fails (see Hadoop-1182). It is much better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if rpc clients did retry when encountering a timeout before throwing an exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>         at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>         at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>         at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
>
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
>         at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
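
For readers following the thread, here is a minimal sketch of what the issue asks for: retrying a call that fails with a SocketTimeoutException, waiting a randomized, exponentially growing, capped interval between attempts (the truncated exponential backoff the comment refers to). It is not the code in retry.patch. The class name BackoffRetry, the methods ceilingMs, delayMs and callWithRetry, and the base, cap and attempt values are all hypothetical, chosen only for illustration.

```java
import java.net.SocketTimeoutException;
import java.util.Random;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not the class from retry.patch. */
public class BackoffRetry {

    /** Largest wait allowed before retry number `attempt` (0-based): base * 2^attempt, truncated at capMs. */
    public static long ceilingMs(int attempt, long baseMs, long capMs) {
        // Clamp the shift so the doubling cannot overflow for large attempt counts.
        int shift = Math.max(0, Math.min(attempt, 20));
        return Math.min(capMs, baseMs << shift);
    }

    /** Random wait drawn uniformly from [0, ceilingMs], so simultaneous clients spread out. */
    public static long delayMs(int attempt, long baseMs, long capMs, Random rng) {
        long ceiling = ceilingMs(attempt, baseMs, capMs);
        return (long) (rng.nextDouble() * (ceiling + 1));
    }

    /** Runs `call`, retrying only on SocketTimeoutException, for at most maxAttempts calls in total. */
    public static <T> T callWithRetry(Callable<T> call, int maxAttempts,
                                      long baseMs, long capMs, Random rng) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (SocketTimeoutException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e; // out of attempts: surface the timeout to the caller
                }
                Thread.sleep(delayMs(attempt, baseMs, capMs, rng));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for an rpc such as exists() or open(): times out twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("timed out waiting for rpc response");
            }
            return "ok";
        }, 5, 10, 100, new Random());
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

The random draw is what keeps 1000+ tasks that timed out together from all retrying at the same instant; the cap keeps the wait bounded however many attempts are made. Other exceptions are deliberately not retried in this sketch.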