hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Lost tasktracker errors
Date Fri, 04 Jan 2013 14:34:38 GMT
Is there anything in the task tracker's logs?  Did the machines go down?
Are there full disks on those nodes?

--Bobby

On 1/4/13 5:52 AM, "Royston Sellman" <royston.sellman@googlemail.com>
wrote:

>I'm running a job over a 380 billion row 20 TB dataset which is computing
>sum(), max() etc. The job is running fine at around 3 million rows per
>second for several hours then grinding to a halt as it loses one after
>another of the tasktrackers.  We see a healthy mix of successful map and
>reduce attempts on the tasktracker...
>
> 
>
>2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041109_0 1.0%
>
>2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041105_0 1.0%
>
>2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041105_0 1.0%
>
>2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task
>attempt_201301031813_0001_m_041105_0 is done.
>
>2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker:
>reported
>output size for attempt_201301031813_0001_m_041105_0  was 111
>
>2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker:
>addFreeSlot : current free slots : 8
>
>2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041106_0 0.9884119%
>
>2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of
>tasks
>it ran: 1
>
>2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041103_0 0.9884134%
>
>2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041101_0 1.0%
>
>2013-01-03 23:41:43,193 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041101_0 1.0%
>
>2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: Task
>attempt_201301031813_0001_m_041101_0 is done.
>
>2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker:
>reported
>output size for attempt_201301031813_0001_m_041101_0  was 111
>
>2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker:
>addFreeSlot : current free slots : 9
>
>2013-01-03 23:41:43,303 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041109_0 1.0%
>
>2013-01-03 23:41:43,306 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041109_0 1.0%
>
>2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: Task
>attempt_201301031813_0001_m_041109_0 is done.
>
>2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker:
>reported
>output size for attempt_201301031813_0001_m_041109_0  was 111
>
>2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker:
>addFreeSlot : current free slots : 10
>
>2013-01-03 23:41:43,361 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>jvm_201301031813_0001_m_36690963 exited with exit code 0. Number of tasks
>it
>ran: 1
>
>2013-01-03 23:41:43,428 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041106_0 1.0%
>
>2013-01-03 23:41:43,432 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041106_0 1.0%
>
>2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: Task
>attempt_201301031813_0001_m_041106_0 is done.
>
>2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker:
>reported
>output size for attempt_201301031813_0001_m_041106_0  was 111
>
>2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker:
>addFreeSlot : current free slots : 11
>
>2013-01-03 23:41:43,457 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>jvm_201301031813_0001_m_-2095784622 exited with exit code 0. Number of
>tasks
>it ran: 1
>
>2013-01-03 23:41:43,595 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>jvm_201301031813_0001_m_1190449426 exited with exit code 0. Number of
>tasks
>it ran: 1
>
>2013-01-03 23:41:43,862 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041103_0 1.0%
>
>2013-01-03 23:41:43,866 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_m_041103_0 1.0%
>
>2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: Task
>attempt_201301031813_0001_m_041103_0 is done.
>
>2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker:
>reported
>output size for attempt_201301031813_0001_m_041103_0  was 111
>
>2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker:
>addFreeSlot : current free slots : 12
>
>2013-01-03 23:41:44,021 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>jvm_201301031813_0001_m_-505301168 exited with exit code 0. Number of
>tasks
>it ran: 1
>
>2013-01-03 23:41:45,539 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050235882% reduce > copy (40975 of
>271884 at 0.00 MB/s) >
>
> 
>
> 
>
>Then it seems to get stuck on reduce attempts, before exiting with a
>SIG_TERM (143) exit code...
>
> 
>
>2013-01-03 23:50:49,642 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:50:55,678 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:01,717 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:04,755 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:10,796 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:16,831 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:19,870 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:25,911 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:31,953 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:34,987 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:41,023 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:47,063 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:50,102 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:51:56,143 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:02,179 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:05,213 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:11,254 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:17,295 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:20,334 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:26,375 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:32,411 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:35,445 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:41,486 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:44,526 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:50,567 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:56,608 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:52:59,648 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:05,689 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:11,730 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:14,764 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:20,804 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:26,844 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:29,883 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:35,924 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:41,964 INFO org.apache.hadoop.mapred.TaskTracker:
>attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>271884 at 0.00 MB/s) >
>
>2013-01-03 23:53:42,616 INFO org.apache.hadoop.mapred.TaskTracker:
>Recieved
>ReinitTrackerAction from JobTracker
>
>2013-01-03 23:53:42,881 INFO org.apache.hadoop.util.ProcessTree: Killing
>process group6089 with signal TERM. Exit code 0
>
>2013-01-03 23:53:42,881 INFO org.apache.hadoop.mapred.TaskTracker:
>addFreeSlot : current free slots : 12
>
>2013-01-03 23:53:43,348 WARN
>org.apache.hadoop.mapred.DefaultTaskController:
>Exit code from task is : 143
>
>2013-01-03 23:53:43,348 INFO
>org.apache.hadoop.mapred.DefaultTaskController:
>Output from DefaultTaskController's launchTask follows:
>
>2013-01-03 23:53:43,348 INFO org.apache.hadoop.mapred.TaskController:
>
>2013-01-03 23:53:43,349 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>jvm_201301031813_0001_r_-2077112828 exited with exit code 143. Number of
>tasks it ran: 0
>
>2013-01-03 23:54:28,484 INFO
>org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup...
>
>java.lang.InterruptedException: sleep interrupted
>
>        at java.lang.Thread.sleep(Native Method)
>
>        at
>org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.r
>un
>(TrackerDistributedCacheManager.java:947)
>
> 
>
>Just around this time (actually a bit before) the JobTracker log reports
>the
>following:
>
>2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.JobTracker: Lost
>tracker 'tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005'
>
>2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.TaskInProgress:
>Error
>from attempt_201301031813_0001_m_000010_0: Lost task tracker:
>tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>
>2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>Error
>from attempt_201301031813_0001_m_000039_0: Lost task tracker:
>tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>
>2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>Error
>from attempt_201301031813_0001_m_000041_0: Lost task tracker:
>tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>
>2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>Error
>from attempt_201301031813_0001_m_000089_0: Lost task tracker:
>tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>
>2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>Error
>from attempt_201301031813_0001_m_000092_0: Lost task tracker:
>tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>
>Etc etc.
>
> 
>
>Can anyone help with this?
>
> 
>
>Royston
>
> 
>
> 
>
> 
>


Mime
View raw message