Date: Thu, 02 Apr 2009 13:32:44 -0700
Subject: Lost TaskTracker Errors
From: Bhupesh Bansal
To: "core-user@hadoop.apache.org"

Hey Folks,

For the last 2-3 days I have been seeing many of these errors popping up in our Hadoop cluster:

Task attempt_200904011612_0025_m_000120_0 failed to report status for 604 seconds. Killing

The JobTracker logs don't have any more info, and the TaskTracker logs are clean.

The failures occurred with these symptoms:

1. Datanodes will start timing out
2. hdfs will get extremely slow (hdfs -ls will take like 2 mins vs. 1s in normal mode)

The datanode logs on the failing tasktracker nodes are filled with:

2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode: DatanodeRegistration(172.16.216.64:50010, storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075, ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to 172.16.216.62:50010 got java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689 remote=/172.16.216.62:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855)
        at java.lang.Thread.run(Thread.java:619)

We are running a 10-node cluster (hadoop-0.18.1) on dual quad-core boxes (8G RAM) with these properties:

1. mapred.child.java.opts = -Xmx600M
2. mapred.tasktracker.map.tasks.maximum = 8
3. mapred.tasktracker.reduce.tasks.maximum = 4
4. dfs.datanode.handler.count = 10
5. dfs.datanode.du.reserved = 102400000
6. dfs.datanode.max.xcievers = 512

The map jobs write a ton of data for each record. Will increasing "dfs.datanode.handler.count" help in this case? What other configuration changes can I try?

Best
Bhupesh
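
As a rough sanity check on the settings listed above, the sketch below adds up the heap ceiling the task slots could claim on one node if every slot were busy at once. It is only arithmetic on the numbers in this message; the class and method names are made up for illustration, and it does not count JVM overhead beyond -Xmx or the memory used by the DataNode and TaskTracker daemons themselves. It is not a diagnosis of the timeouts, just one thing worth ruling out.

```java
public class SlotHeapBudget {
    // Worst-case child JVM heap if every map and reduce slot is occupied at once.
    static long worstCaseChildHeapMb(int mapSlots, int reduceSlots, long childHeapMb) {
        return (long) (mapSlots + reduceSlots) * childHeapMb;
    }

    public static void main(String[] args) {
        int mapSlots = 8;          // mapred.tasktracker.map.tasks.maximum
        int reduceSlots = 4;       // mapred.tasktracker.reduce.tasks.maximum
        long childHeapMb = 600;    // mapred.child.java.opts = -Xmx600M
        long nodeRamMb = 8 * 1024; // 8G RAM per box, as stated above

        long heapMb = worstCaseChildHeapMb(mapSlots, reduceSlots, childHeapMb);
        long remainingMb = nodeRamMb - heapMb;

        System.out.println("Child JVM heap ceiling: " + heapMb + " MB");
        System.out.println("Left for DataNode, TaskTracker, OS: " + remainingMb + " MB");
    }
}
```

With the values above this comes to 7200 MB of potential task heap on an 8192 MB box, leaving about 992 MB for everything else, so it may be worth watching for swapping on the affected nodes while the tasks run.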