Return-Path: Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org Received: (qmail 96298 invoked from network); 30 Aug 2007 10:35:04 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 30 Aug 2007 10:35:03 -0000 Received: (qmail 62010 invoked by uid 500); 30 Aug 2007 10:34:58 -0000 Delivered-To: apmail-lucene-hadoop-dev-archive@lucene.apache.org Received: (qmail 61989 invoked by uid 500); 30 Aug 2007 10:34:58 -0000 Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hadoop-dev@lucene.apache.org Delivered-To: mailing list hadoop-dev@lucene.apache.org Received: (qmail 61980 invoked by uid 99); 30 Aug 2007 10:34:58 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 30 Aug 2007 03:34:58 -0700 X-ASF-Spam-Status: No, hits=-98.5 required=10.0 tests=ALL_TRUSTED,WEIRD_PORT X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 30 Aug 2007 10:34:50 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 94B5871418E for ; Thu, 30 Aug 2007 03:34:30 -0700 (PDT) Message-ID: <11438313.1188470070599.JavaMail.jira@brutus> Date: Thu, 30 Aug 2007 03:34:30 -0700 (PDT) From: "Devaraj Das (JIRA)" To: hadoop-dev@lucene.apache.org Subject: [jira] Updated: (HADOOP-1763) Too many lost task trackers - Job failures In-Reply-To: <7892284.1187823751671.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/HADOOP-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj Das updated HADOOP-1763: -------------------------------- Assignee: Devaraj Das Status: Patch Available (was: Open) > Too many lost task trackers - Job failures > ------------------------------------------ > > Key: HADOOP-1763 > URL: https://issues.apache.org/jira/browse/HADOOP-1763 > Project: Hadoop > Issue Type: Bug > Components: mapred > Environment: Version: 0.15.0-dev, r565628 > Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa > 1400 Node cluster > Reporter: Srikanth Kakani > Assignee: Devaraj Das > Priority: Blocker > Attachments: 1763.patch, 1763.try.patch > > > Steps to reproduce: > 1 .Run a map reduce application running more than 3000 mappers, each running longer than > 2. Observe the lost task trackers. > Observations: > 1. Most of the lost taskTracker messages correspond to maps that have already completed > 2. Based on the logs below the taskTracker is unable to connect to the job tracker and so the jobTracker deletes the job after 20 minutes > One example: > task_200708210155_0003_m_000000_0 node1 KILLED 0.00% 21-Aug-2007 09:39:09 Lost task tracker <-- Please note the time > Counters: > Map-Reduce Framework > Map input records 28,861 > Map output records 1,349,114 > Map input bytes 200,018,562 > Map output bytes 714,878,712 > Node 1 task tracker logs: > 2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200708210155_0003_m_000000_0 is done. <-- Please note the time > . > . > . > 2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060 > 2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060 > . > . > . > 2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.SocketTimeoutException: timed out waiting for rpc response > at org.apache.hadoop.ipc.Client.call(Client.java:472) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165) > at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source) > at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941) > at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840) > at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227) > at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911) > . > . > . > 2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'wm501219' with reponseId '5247 > 2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved RenitTrackerAction from JobTracker > 2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_000000_0 done; removing files. > 2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_002237_0 done; removing files. > Tasktracker is pretty active otherwise: > tracker_wm511293.inktomisearch.com:50050 wm511293.inktomisearch.com 1 6 3 > JobTracker logs: > 2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_0' to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511293.inktomisearch.com:50050' > . > 2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_1' to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511783.inktomisearch.com:50050' > . > 2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200708210155_0003_m_000000_0' has completed tip_200708210155_0003_m_000000 successfully. > 2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0' has completed succesfully > . > 2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already complete TIP tip_200708210155_0003_m_000000 has completed task task_200708210155_0003_m_000000_1 > . > 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_0: Lost task tracker > 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0' has been lost. > . > 2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200708210155_0003_m_000000_0' from 'tracker_wm511293.inktomisearch.com:50050' > . > 2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_1: Lost task tracker > 2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress: Recieved duplicate status update of 'KILLED' for 'task_200708210155_0003_m_000000_1' of TIP 'tip_200708210155_0003_m_000000' > Notes: > 1. I do not see the taskTracker dying during that period > 2. Is retry logic not accurate/agressive enough? (did something change recently, this behavior is more evident in 0.15) > 3. Inconsistencies with jobTracker logs? Lost task tracker detection bad? > 4. TaskTracker: > CPU usage: 9:10-9:20 50% > 9:20-9:40 0% > Network Usage: 6M incl dfs operations > 5. JobTracker > CPU udage: Avg: 9% > Network Usage: Negligible -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.