hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Royston Sellman <royston.sell...@googlemail.com>
Subject Re: Lost tasktracker errors
Date Fri, 04 Jan 2013 15:02:44 GMT
Hi Bobby,

Thanks for the response Bobby,

The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log" contained the log messages
included in our previous message. It seems to show a series of successful map attempts with
a few reduce attempts interspersed, then it gets to a point and only shows a series of reduce
attempts that appear to be stuck at the same level of progress, before outputting the 143
exit code and the interrupted sleep message at the end.

There is nothing in the tasktracker~.out files...

The machines did not go down but the affected TTs did not log anything till I got up in the
morning, saw the job had frozen, did stop-all.sh. Then the stalled TTs logged the shutdown.

The disks are not full (67% usage across 12 disks per worker).

It seems that the 143 exit code indicates that an external process has terminated our tasktracker
JVM. Is this correct?

If so, what would the likely suspects be that would terminate our tasktrackers? Is it possible
this is related to our operating system (Scientific Linux 6) and PAM limits?

We had already increased our hard limit on the number of open files for the "hdfs" user (that
launches hdfs and mapred daemons) to 32768 in the hope that this would solve the issue. Can
you see anything wrong with our security limits:

[hdfs@hd-37-03 hdfs]$ ulimit -aH
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191988
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Thanks for your help.

Royston

On 4 Jan 2013, at 14:34, Robert Evans <evans@yahoo-inc.com> wrote:

> Is there anything in the task tracker's logs?  Did the machines go down?
> Are there full disks on those nodes?
> 
> --Bobby
> 
> On 1/4/13 5:52 AM, "Royston Sellman" <royston.sellman@googlemail.com>
> wrote:
> 
>> I'm running a job over a 380 billion row 20 TB dataset which is computing
>> sum(), max() etc. The job is running fine at around 3 million rows per
>> second for several hours then grinding to a halt as it loses one after
>> another of the tasktrackers.  We see a healthy mix of successful map and
>> reduce attempts on the tasktracker...
>> 
>> 
>> 
>> 2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041109_0 1.0%
>> 
>> 2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041105_0 1.0%
>> 
>> 2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041105_0 1.0%
>> 
>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task
>> attempt_201301031813_0001_m_041105_0 is done.
>> 
>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker:
>> reported
>> output size for attempt_201301031813_0001_m_041105_0  was 111
>> 
>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker:
>> addFreeSlot : current free slots : 8
>> 
>> 2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041106_0 0.9884119%
>> 
>> 2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of
>> tasks
>> it ran: 1
>> 
>> 2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041103_0 0.9884134%
>> 
>> 2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041101_0 1.0%
>> 
>> 2013-01-03 23:41:43,193 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041101_0 1.0%
>> 
>> 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: Task
>> attempt_201301031813_0001_m_041101_0 is done.
>> 
>> 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker:
>> reported
>> output size for attempt_201301031813_0001_m_041101_0  was 111
>> 
>> 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker:
>> addFreeSlot : current free slots : 9
>> 
>> 2013-01-03 23:41:43,303 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041109_0 1.0%
>> 
>> 2013-01-03 23:41:43,306 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041109_0 1.0%
>> 
>> 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: Task
>> attempt_201301031813_0001_m_041109_0 is done.
>> 
>> 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker:
>> reported
>> output size for attempt_201301031813_0001_m_041109_0  was 111
>> 
>> 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker:
>> addFreeSlot : current free slots : 10
>> 
>> 2013-01-03 23:41:43,361 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201301031813_0001_m_36690963 exited with exit code 0. Number of tasks
>> it
>> ran: 1
>> 
>> 2013-01-03 23:41:43,428 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041106_0 1.0%
>> 
>> 2013-01-03 23:41:43,432 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041106_0 1.0%
>> 
>> 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: Task
>> attempt_201301031813_0001_m_041106_0 is done.
>> 
>> 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker:
>> reported
>> output size for attempt_201301031813_0001_m_041106_0  was 111
>> 
>> 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker:
>> addFreeSlot : current free slots : 11
>> 
>> 2013-01-03 23:41:43,457 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201301031813_0001_m_-2095784622 exited with exit code 0. Number of
>> tasks
>> it ran: 1
>> 
>> 2013-01-03 23:41:43,595 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201301031813_0001_m_1190449426 exited with exit code 0. Number of
>> tasks
>> it ran: 1
>> 
>> 2013-01-03 23:41:43,862 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041103_0 1.0%
>> 
>> 2013-01-03 23:41:43,866 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_m_041103_0 1.0%
>> 
>> 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: Task
>> attempt_201301031813_0001_m_041103_0 is done.
>> 
>> 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker:
>> reported
>> output size for attempt_201301031813_0001_m_041103_0  was 111
>> 
>> 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker:
>> addFreeSlot : current free slots : 12
>> 
>> 2013-01-03 23:41:44,021 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201301031813_0001_m_-505301168 exited with exit code 0. Number of
>> tasks
>> it ran: 1
>> 
>> 2013-01-03 23:41:45,539 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050235882% reduce > copy (40975 of
>> 271884 at 0.00 MB/s) >
>> 
>> 
>> 
>> 
>> 
>> Then it seems to get stuck on reduce attempts, before exiting with a
>> SIG_TERM (143) exit code...
>> 
>> 
>> 
>> 2013-01-03 23:50:49,642 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:50:55,678 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:01,717 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:04,755 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:10,796 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:16,831 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:19,870 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:25,911 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:31,953 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:34,987 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:41,023 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:47,063 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:50,102 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:51:56,143 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:02,179 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:05,213 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:11,254 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:17,295 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:20,334 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:26,375 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:32,411 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:35,445 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:41,486 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:44,526 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:50,567 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:56,608 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:52:59,648 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:05,689 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:11,730 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:14,764 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:20,804 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:26,844 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:29,883 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:35,924 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:41,964 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of
>> 271884 at 0.00 MB/s) >
>> 
>> 2013-01-03 23:53:42,616 INFO org.apache.hadoop.mapred.TaskTracker:
>> Recieved
>> ReinitTrackerAction from JobTracker
>> 
>> 2013-01-03 23:53:42,881 INFO org.apache.hadoop.util.ProcessTree: Killing
>> process group6089 with signal TERM. Exit code 0
>> 
>> 2013-01-03 23:53:42,881 INFO org.apache.hadoop.mapred.TaskTracker:
>> addFreeSlot : current free slots : 12
>> 
>> 2013-01-03 23:53:43,348 WARN
>> org.apache.hadoop.mapred.DefaultTaskController:
>> Exit code from task is : 143
>> 
>> 2013-01-03 23:53:43,348 INFO
>> org.apache.hadoop.mapred.DefaultTaskController:
>> Output from DefaultTaskController's launchTask follows:
>> 
>> 2013-01-03 23:53:43,348 INFO org.apache.hadoop.mapred.TaskController:
>> 
>> 2013-01-03 23:53:43,349 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201301031813_0001_r_-2077112828 exited with exit code 143. Number of
>> tasks it ran: 0
>> 
>> 2013-01-03 23:54:28,484 INFO
>> org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup...
>> 
>> java.lang.InterruptedException: sleep interrupted
>> 
>>       at java.lang.Thread.sleep(Native Method)
>> 
>>       at
>> org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.r
>> un
>> (TrackerDistributedCacheManager.java:947)
>> 
>> 
>> 
>> Just around this time (actually a bit before) the JobTracker log reports
>> the
>> following:
>> 
>> 2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.JobTracker: Lost
>> tracker 'tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005'
>> 
>> 2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error
>> from attempt_201301031813_0001_m_000010_0: Lost task tracker:
>> tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>> 
>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error
>> from attempt_201301031813_0001_m_000039_0: Lost task tracker:
>> tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>> 
>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error
>> from attempt_201301031813_0001_m_000041_0: Lost task tracker:
>> tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>> 
>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error
>> from attempt_201301031813_0001_m_000089_0: Lost task tracker:
>> tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>> 
>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error
>> from attempt_201301031813_0001_m_000092_0: Lost task tracker:
>> tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>> 
>> Etc etc.
>> 
>> 
>> 
>> Can anyone help with this?
>> 
>> 
>> 
>> Royston
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 


Mime
View raw message