Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
User-Agent: Microsoft-Entourage/11.4.0.080122
Date: Thu, 02 Apr 2009 13:32:44 -0700
Subject: Lost TaskTracker Errors
From: Bhupesh Bansal <bbansal@linkedin.com>
To: "core-user@hadoop.apache.org" <core-user@hadoop.apache.org>
Message-ID: <C5FA6EFC.1F588%bbansal@linkedin.com>
Thread-Topic: Lost TaskTracker Errors
Thread-Index: Acmz0i05a8TY8B/FEd6wgQAX8guM3A==
Mime-version: 1.0
Content-type: multipart/alternative;
	boundary="B_3321523964_305694748"

--B_3321523964_305694748
Content-type: text/plain;
	charset="ISO-8859-1"
Content-transfer-encoding: quoted-printable

Hey Folks,

For the last 2-3 days I have been seeing many of these errors popping up in
our Hadoop cluster:

Task attempt_200904011612_0025_m_000120_0 failed to report status for 604
seconds. Killing

The JobTracker logs don't have any more info, and the TaskTracker logs are
clean.

The failures occur with these symptoms:
1. Datanodes start timing out.
2. HDFS gets extremely slow (hdfs -ls takes about 2 minutes vs. 1 second in
normal mode).

The datanode logs on the failing tasktracker nodes are filled up with:
2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(172.16.216.64:50010,
storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075,
ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to
172.16.216.62:50010 got java.net.SocketTimeoutException: 480000 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689
remote=/172.16.216.62:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855)
        at java.lang.Thread.run(Thread.java:619)
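
To see whether the timeouts cluster around particular source or destination
nodes, warnings like the one above can be tallied per (source, destination)
pair. The following is only a minimal sketch: the regular expression is
inferred from this single log entry, and the function name
count_failed_transfers is hypothetical, not part of Hadoop.

```python
import re
from collections import Counter

# Pattern inferred from the single WARN entry quoted above (an assumption,
# not an official log format). \s+ tolerates line wraps inside an entry.
TRANSFER_FAILURE = re.compile(
    r"DatanodeRegistration\((?P<src>[\d.]+):\d+[^)]*\):\s*"
    r"Failed\s+to\s+transfer\s+(?P<block>blk_-?\d+_\d+)\s+to\s+"
    r"(?P<dst>[\d.]+):\d+\s+got\s+(?P<exc>[\w.$]+)"
)


def count_failed_transfers(log_text):
    """Count failed block transfers per (source IP, destination IP) pair."""
    return Counter(
        (m.group("src"), m.group("dst"))
        for m in TRANSFER_FAILURE.finditer(log_text)
    )


# The log entry from this message, used as sample input.
sample = (
    "2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode: "
    "DatanodeRegistration(172.16.216.64:50010, "
    "storageID=DS-707090154-172.16.216.64-50010-1223506297192, "
    "infoPort=50075, ipcPort=50020):Failed to transfer "
    "blk_-7774359493260170883_282858 to 172.16.216.62:50010 got "
    "java.net.SocketTimeoutException: 480000 millis timeout while "
    "waiting for channel to be ready for write."
)

counts = count_failed_transfers(sample)
for (src, dst), n in counts.most_common():
    print(f"{src} -> {dst}: {n} failed transfer(s)")
```

Running this over the full datanode log from each node would show whether a
few destinations account for most of the failures or whether they are spread
evenly across the cluster.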


We are running a 10-node cluster (hadoop-0.18.1) on dual quad-core boxes
(8G RAM) with these properties:
1. mapred.child.java.opts = -Xmx600M
2. mapred.tasktracker.map.tasks.maximum = 8
3. mapred.tasktracker.reduce.tasks.maximum = 4
4. dfs.datanode.handler.count = 10
5. dfs.datanode.du.reserved = 102400000
6. dfs.datanode.max.xcievers = 512
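
As a rough sanity check on the memory budget these settings imply, the
worst-case task heap per node can be worked out from the slot counts and the
child heap size. This is a sketch under stated assumptions: "8G RAM" is taken
to mean 8 GiB, every slot is assumed busy at once, and only the Java heap
limit is counted (actual process memory is higher than the heap limit alone).

```python
# Values taken from the property list above.
MAP_SLOTS = 8            # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS = 4         # mapred.tasktracker.reduce.tasks.maximum
CHILD_HEAP_MB = 600      # mapred.child.java.opts (600M max heap)
NODE_RAM_MB = 8 * 1024   # "8G RAM", assumed here to mean 8 GiB


def task_heap_budget_mb(map_slots, reduce_slots, child_heap_mb):
    """Worst-case total task heap when every slot runs a child JVM."""
    return (map_slots + reduce_slots) * child_heap_mb


task_heap = task_heap_budget_mb(MAP_SLOTS, REDUCE_SLOTS, CHILD_HEAP_MB)
headroom = NODE_RAM_MB - task_heap

print(f"Worst-case task heap: {task_heap} MB")
print(f"Left for DataNode, TaskTracker and OS: {headroom} MB")
```

With all 12 slots busy, the task heaps alone can claim 7200 MB of the 8192 MB
on each box, which leaves 992 MB for the DataNode, the TaskTracker, the OS,
and the page cache.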

The map jobs write a ton of data for each record. Would increasing
"dfs.datanode.handler.count" help in this case? What other configuration
changes can I try?


Best
Bhupesh


--B_3321523964_305694748--