From: "James Moore"
To: core-user@hadoop.apache.org
Subject: Re: Read timed out, Abandoning block blk_-5476242061384228962
Date: Mon, 12 May 2008 15:53:28 -0700

On Sun, May 11, 2008 at 9:45 PM, Dhruba Borthakur wrote:
> How much memory does your machine have and how many files does your
> HDFS have? One possibility is that the memory pressure of the
> map-reduce jobs causes more GC runs for the namenode process.

I'm using machines with 16G of memory, and there aren't that many files
yet - about 14,000 total, according to dfs -lsr /.
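For reference, that count came from something like this - the exact
lsr output format differs between Hadoop versions, so the grep is just
a sketch:

    bin/hadoop dfs -lsr / | wc -l              # rough count: files plus directories
    bin/hadoop dfs -lsr / | grep -vc '<dir>'   # files only, if your version tags dirs with <dir>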
The admin web page reports 1.32 TB used, but I'm assuming that includes
replication (set to 4), so it's closer to 330 GB of actual data. There
are 19 machines in the cluster (the plan was 20, but one didn't come up
initially and I haven't bothered to add it), and I'm currently not
running maps/reduces on the master.

Earlier, when I said I wasn't seeing the timeouts, I spoke too soon.
One of my last runs had 3 maps fail with this sort of error:

Failed to rename output with the exception:
java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:514)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient.delete(DFSClient.java:426)
        at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:149)
        at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:430)
        at org.apache.hadoop.mapred.JobTracker$TaskCommitQueue.run(JobTracker.java:2013)

This is somewhat painful, since these are Nutch fetch maps, which pull
data over HTTP from the sites being crawled. It doesn't look like a
permanent problem, though - it just means doing some crawls twice.

The default RPC timeout seems high enough (ipc.client.timeout == 60000,
i.e. 60 seconds). Is there something else I should look at? Watching
'top' during a run, the master never seemed to drop below 80% idle, so
it seems a little unlikely that the machine was unresponsive for a full
60 seconds.

Should I increase the memory for the datanode processes? They're at
-Xmx2000m right now. And is there a way to do that with the existing
scripts/configurations? Obviously I can do it by hand, but I'm not
seeing anything like the mapred.child.java.opts setting for the master.

--
James Moore | james@restphone.com
blog.restphone.com
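P.S. For anyone who hits this in the archives later: my reading is that
the timeout can be raised in conf/hadoop-site.xml via the
ipc.client.timeout property mentioned above (value in milliseconds) -
a sketch, not something I've verified yet:

    <property>
      <name>ipc.client.timeout</name>
      <value>120000</value>
      <description>Client RPC timeout; doubled from the 60000 ms default.</description>
    </property>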
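And my best guess on the datanode heap is conf/hadoop-env.sh:
HADOOP_HEAPSIZE sets the -Xmx (in MB) for all the daemons, and if your
hadoop-env.sh has the per-daemon *_OPTS hooks you can append flags for
just the datanode - again a sketch, since the hooks may not exist in
every version:

    # conf/hadoop-env.sh
    export HADOOP_HEAPSIZE=2000   # -Xmx in MB, applied to every Hadoop daemon

    # Per-daemon extra JVM flags, if this version supports them; a later
    # -Xmx on the command line wins, so this would override HADOOP_HEAPSIZE
    # for the datanode only.
    export HADOOP_DATANODE_OPTS="-Xmx3000m $HADOOP_DATANODE_OPTS"

Corrections welcome if there's a cleaner way to do it.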