From: "James Moore"
To: core-user@hadoop.apache.org
Subject: Re: Read timed out, Abandoning block blk_-5476242061384228962
Date: Mon, 12 May 2008 15:53:28 -0700

On Sun, May 11, 2008 at 9:45 PM, Dhruba Borthakur wrote:
> How much memory does your machine have and how many files does your
> HDFS have? One possibility is that the memory pressure of the
> map-reduce jobs causes more GC runs for the namenode process.

I'm using machines with 16G of memory, and there aren't that many files
yet - about 14,000 total, according to dfs -lsr /.
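For reference, that count came from something like this - the exact
lsr output format differs between Hadoop versions, so the grep is just
a sketch:

    bin/hadoop dfs -lsr / | wc -l              # rough count: files plus directories
    bin/hadoop dfs -lsr / | grep -vc '<dir>'   # files only, if your version tags dirs with <dir>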
The admin web page reports 1.32 TB used, but I'm assuming that includes
replication (set to 4), so it's closer to 330 GB of actual data. There
are 19 machines in the cluster (the plan was 20, but one didn't come up
initially and I haven't bothered to add it), and I'm currently not
running maps/reduces on the master.

Earlier, when I said I wasn't seeing the timeouts, I spoke too soon.
One of my last runs had 3 maps fail with this sort of error:

Failed to rename output with the exception:
java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:514)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient.delete(DFSClient.java:426)
        at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:149)
        at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:430)
        at org.apache.hadoop.mapred.JobTracker$TaskCommitQueue.run(JobTracker.java:2013)

This is somewhat painful, since these are Nutch fetch maps, which pull
data over HTTP from the sites being crawled. It doesn't look like a
permanent problem, though - it just means doing some crawls twice.

The default RPC timeout seems high enough (ipc.client.timeout == 60000,
i.e. 60 seconds). Is there something else I should look at? Watching
'top' during a run, the master never seemed to drop below 80% idle, so
it seems a little unlikely that the machine was unresponsive for a full
60 seconds.

Should I increase the memory for the datanode processes? They're at
-Xmx2000m right now. And is there a way to do that with the existing
scripts/configurations? Obviously I can do it by hand, but I'm not
seeing anything like the mapred.child.java.opts setting for the master.

--
James Moore | james@restphone.com
blog.restphone.com
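P.S. For anyone who hits this in the archives later: my reading is that
the timeout can be raised in conf/hadoop-site.xml via the
ipc.client.timeout property mentioned above (value in milliseconds) -
a sketch, not something I've verified yet:

    <property>
      <name>ipc.client.timeout</name>
      <value>120000</value>
      <description>Client RPC timeout; doubled from the 60000 ms default.</description>
    </property>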
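And my best guess on the datanode heap is conf/hadoop-env.sh:
HADOOP_HEAPSIZE sets the -Xmx (in MB) for all the daemons, and if your
hadoop-env.sh has the per-daemon *_OPTS hooks you can append flags for
just the datanode - again a sketch, since the hooks may not exist in
every version:

    # conf/hadoop-env.sh
    export HADOOP_HEAPSIZE=2000   # -Xmx in MB, applied to every Hadoop daemon

    # Per-daemon extra JVM flags, if this version supports them; a later
    # -Xmx on the command line wins, so this would override HADOOP_HEAPSIZE
    # for the datanode only.
    export HADOOP_DATANODE_OPTS="-Xmx3000m $HADOOP_DATANODE_OPTS"

Corrections welcome if there's a cleaner way to do it.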