From: "Hairong Kuang (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Wed, 25 Apr 2007 16:33:15 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-1263) retry logic when dfs exist or open fails temporarily, e.g because of timeout
Message-ID: <12953577.1177543995524.JavaMail.jira@brutus>
In-Reply-To: <7986637.1176787635420.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491839 ]

Hairong Kuang commented on HADOOP-1263:
---------------------------------------

The annotation method proposed in HADOOP-601 to provide a general retry framework in RPC seems to be a simple solution, but since it is not implemented, for this jira I plan to implement the retry mechanism for ClientProtocol only, using the retry framework implemented in HADOOP-997. Here is what I plan to do:

1. Add an exponential-backoff policy to RetryPolicies.
2. Create a retry proxy for the dfs client using the following method-to-RetryPolicy map:
   * TRY-ONCE-THEN-FAIL: create, addBlock, complete
   * EXPONENTIAL-BACKOFF: open, setReplication, abandonBlock, abandonFileInProgress, reportBadBlocks, exists, isDir, getListing, getHints, renewLease, getStats, getDatanodeReport, getBlockSize, getEditLogSize
   * I have not yet decided which retry policy to use for (1) rename, delete, and mkdirs, because a retry following a successful operation on the server side will return false instead of true; and (2) setSafeMode, refreshNodes, rollEditLog, rollFsImage, finalizeUpgrade, and metaSave, because I still need time to read the code for these methods.

Any suggestion is welcome!

> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-1263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Christian Kunz
>         Assigned To: Hairong Kuang
>
> Sometimes, when many (e.g. 1000+) map jobs start at about the same time and require supporting files from the filecache, some map tasks fail because of RPC timeouts. With only the default number of 10 handlers on the namenode, the probability is high that the whole job fails (see HADOOP-1182). It is much better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if rpc clients retried when encountering a timeout before throwing an exception.
> Examples of exceptions:
>
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>         at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>         at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>         at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
>
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
>         at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
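The two-step plan in the comment above (an exponential-backoff policy, plus a retry proxy that maps each ClientProtocol method to a policy) can be sketched with java.lang.reflect.Proxy. This is a minimal, hypothetical illustration, not the actual HADOOP-997 framework in org.apache.hadoop.io.retry: the class name RetrySketch, the NameSystem demo interface, and the method allow-list are invented for the example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.net.SocketTimeoutException;
import java.util.Set;

// Hypothetical sketch of the retry-proxy idea: methods on the allow-list get
// EXPONENTIAL-BACKOFF on timeout; everything else is TRY-ONCE-THEN-FAIL.
public class RetrySketch {

    // Exponential backoff: the delay doubles on each failed attempt.
    static long backoffMillis(int attempt, long baseMillis) {
        return baseMillis << attempt; // base * 2^attempt
    }

    // Illustrative allow-list, following the comment's EXPONENTIAL-BACKOFF group.
    static final Set<String> BACKOFF_METHODS =
        Set.of("open", "exists", "isDir", "getListing", "getBlockSize");

    @SuppressWarnings("unchecked")
    static <T> T retryProxy(Class<T> iface, T target, int maxRetries, long baseMillis) {
        InvocationHandler handler = (proxy, method, args) -> {
            int attempt = 0;
            while (true) {
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    boolean retriable = e.getCause() instanceof SocketTimeoutException
                            && BACKOFF_METHODS.contains(method.getName())
                            && attempt < maxRetries;
                    if (!retriable) throw e.getCause(); // TRY-ONCE-THEN-FAIL
                    Thread.sleep(backoffMillis(attempt++, baseMillis));
                }
            }
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler);
    }

    // Hypothetical client interface standing in for ClientProtocol.
    public interface NameSystem {
        boolean exists(String path) throws Exception;
    }

    static int attemptsUsed = 0;

    // Demo: a target that times out twice and then succeeds, so the
    // proxied exists() call ends up succeeding on the third attempt.
    static boolean demoExists() {
        attemptsUsed = 0;
        NameSystem flaky = path -> {
            if (++attemptsUsed < 3)
                throw new SocketTimeoutException("timed out waiting for rpc response");
            return true;
        };
        NameSystem fs = retryProxy(NameSystem.class, flaky, 5, 1L);
        try {
            return fs.exists("/user/demo");
        } catch (Throwable t) {
            return false;
        }
    }
}
```

A real implementation would also have to address the open question in the comment: rename, delete, and mkdirs are not safe to retry blindly, because a retry after a server-side success returns false even though the operation happened.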