From: "Christian Kunz (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Fri, 30 Mar 2007 20:03:27 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-1182) DFS Scalability issue with filecache in large clusters
Message-ID: <20605074.1175310207674.JavaMail.jira@brutus>
In-Reply-To: <17169809.1175142685420.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/HADOOP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485688 ]

Christian Kunz commented on HADOOP-1182:
----------------------------------------

Increasing the number of handlers in the namenode server seems to help a lot: there are no more 'Call queue overflow discarding oldest call' WARN messages. CPU usage still climbs to a sustained 99.9% for a while at the beginning of the job. We will test more.

> DFS Scalability issue with filecache in large clusters
> ------------------------------------------------------
>
>                 Key: HADOOP-1182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1182
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.12.1
>            Reporter: Christian Kunz
>
> When using filecache to distribute supporting files for map/reduce applications in a 1000-node cluster, many map tasks fail because of timeouts. There was no such problem using a 200-node cluster for the same applications with comparable input data. Either the whole job fails because of too many map failures or, even worse, some map tasks hang indefinitely.
> java.net.SocketTimeoutException: timed out waiting for rpc response
>     at org.apache.hadoop.ipc.Client.call(Client.java:473)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>     at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>     at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>     at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>     at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>     at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>     at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>     at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>     at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>     at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>     at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
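
For readers wondering why a larger handler count would make the 'Call queue overflow discarding oldest call' warnings go away, here is a toy model in Java. It is only a sketch under stated assumptions, not Hadoop's actual ipc.Server code: the class name, the tick-based loop, and the parameters (arrivalsPerTick, maxQueue, ticks) are all hypothetical. The handler count mentioned in the comment is presumably the namenode's RPC handler setting (dfs.namenode.handler.count, as far as I recall), but that mapping is an assumption as well.

```java
import java.util.ArrayDeque;

// Toy model only: NOT Hadoop's ipc.Server. All names and numbers are hypothetical.
public class CallQueueSketch {

    /**
     * Each tick, arrivalsPerTick calls are queued; if the queue is already
     * full, the oldest queued call is discarded first. Then up to `handlers`
     * calls are served. Returns the total number of discarded calls.
     */
    static int simulate(int handlers, int arrivalsPerTick, int maxQueue, int ticks) {
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        int dropped = 0;
        int nextCallId = 0;
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < arrivalsPerTick; i++) {
                if (queue.size() >= maxQueue) {
                    queue.pollFirst(); // discard the oldest call to make room
                    dropped++;
                }
                queue.addLast(nextCallId++);
            }
            // Each handler serves at most one call per tick in this model.
            for (int h = 0; h < handlers && !queue.isEmpty(); h++) {
                queue.pollFirst();
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Same load, two different handler counts.
        System.out.println("10 handlers dropped: " + simulate(10, 40, 100, 50));
        System.out.println("40 handlers dropped: " + simulate(40, 40, 100, 50));
    }
}
```

In this model, once the handlers can drain calls as fast as they arrive, nothing is discarded; with too few handlers the queue fills and every further burst evicts older calls, whose callers would then wait until they time out. It says nothing about the remaining CPU spike at job start, which the comment reports is still present.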