hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1763) Too many lost task trackers - Job failures
Date Tue, 04 Sep 2007 21:40:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524883
] 

Doug Cutting commented on HADOOP-1763:
--------------------------------------

> On a 8 cpu node, CPU wise, 60 threads can't do any more work than 20 threads can.

Not quite.  The ipc.Server reads requests asynchronously in the listener thread, but responses
are written synchronously in handler threads.  So handler threads are not just doing computation,
they're also doing i/o, writing responses.  If, e.g., responses are large, and/or clients
are slow to read them, then more handler threads will increase throughput.


> Too many lost task trackers - Job failures
> ------------------------------------------
>
>                 Key: HADOOP-1763
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1763
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>         Environment: Version: 0.15.0-dev, r565628
> Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
> 1400 Node cluster
>            Reporter: Srikanth Kakani
>            Assignee: Devaraj Das
>            Priority: Blocker
>         Attachments: 1763.patch, 1763.try.patch
>
>
> Steps to reproduce:
> 1 .Run a map reduce application running more than 3000 mappers, each running longer than
> 2. Observe the lost task trackers.
> Observations:
> 1. Most of the lost taskTracker messages correspond to maps that have already completed
> 2. Based on the logs below the taskTracker is unable to connect to the job tracker and
so the jobTracker deletes the job after 20 minutes
> One example:
> task_200708210155_0003_m_000000_0	node1	KILLED	0.00%		21-Aug-2007 09:39:09 	Lost task
tracker   <-- Please note the time
> Counters:
> Map-Reduce Framework
> 	Map input records 	28,861
> 	Map output records 	1,349,114
> 	Map input bytes 	200,018,562
> 	Map output bytes 	714,878,712
> Node 1 task tracker logs:
> 2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200708210155_0003_m_000000_0
is done. <-- Please note the time
> .
> .
> .
> 2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1)
on SocketListener0@0.0.0.0:50060
> 2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060
> .
> .
> .
> 2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception:
java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:472)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
>         at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
>         at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911)
> .
> .
> .
> 2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status'
to 'wm501219' with reponseId '5247
> 2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved RenitTrackerAction
from JobTracker
> 2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_000000_0
done; removing files.
> 2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_002237_0
done; removing files.
> Tasktracker is pretty active otherwise:
> tracker_wm511293.inktomisearch.com:50050	wm511293.inktomisearch.com	1	6	3
> JobTracker logs:
> 2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_0'
to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_1'
to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511783.inktomisearch.com:50050'
> .
> 2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200708210155_0003_m_000000_0'
has completed tip_200708210155_0003_m_000000 successfully.
> 2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0'
has completed succesfully
> .
> 2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already complete
TIP tip_200708210155_0003_m_000000 has completed task task_200708210155_0003_m_000000_1
> .
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_0:
Lost task tracker
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0'
has been lost.
> .
> 2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task
'task_200708210155_0003_m_000000_0' from 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_1:
Lost task tracker
> 2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress: Recieved duplicate
status update of 'KILLED' for 'task_200708210155_0003_m_000000_1' of TIP 'tip_200708210155_0003_m_000000'
> Notes:
> 1. I do  not see the taskTracker dying during that period
> 2. Is retry logic not accurate/agressive enough? (did something change recently, this
behavior is more evident in 0.15)
> 3. Inconsistencies with jobTracker logs? Lost task tracker detection bad?
> 4. TaskTracker:
>           CPU usage: 9:10-9:20 50%
>                                   9:20-9:40 0%
>           Network Usage: 6M incl dfs operations
> 5. JobTracker
>           CPU udage: Avg: 9%
>           Network Usage:  Negligible

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message