hadoop-common-dev mailing list archives

From "Christian Kunz (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-1233) repeated job failures because of too many 'lost tasktrackers'
Date Mon, 09 Apr 2007 20:33:32 GMT
repeated job failures because of too many 'lost tasktrackers'
-------------------------------------------------------------

                 Key: HADOOP-1233
                 URL: https://issues.apache.org/jira/browse/HADOOP-1233
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.12.1
            Reporter: Christian Kunz


Several attempts to run large jobs (100,000+ map tasks, 1000+ reducers) on a 1000-node cluster
failed, mainly because too many tasktrackers were lost and eventually blacklisted.


Jobtracker log:
2007-04-09 00:16:55,180 INFO org.apache.hadoop.mapred.JobInProgress: TaskTracker at 'tracker_<host>' turned 'flaky'
2007-04-09 00:16:55,180 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0100_m_068220_0' from 'tracker_<host>:50050'
2007-04-09 00:16:55,294 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0100_m_000177_0: Lost task tracker
2007-04-09 00:16:55,294 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0100_m_000177_0' has been lost.
2007-04-09 00:16:55,294 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0100_m_001426_0: Lost task tracker
2007-04-09 00:16:55,294 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0100_m_001426_0' has been lost.
...
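
To gauge how many tasks were lost at once, the jobtracker log can be scanned for the "Lost task tracker" entries shown in the excerpt. The following is only a minimal sketch, not part of Hadoop: the function name `count_lost_tasks` and the regular expression are hypothetical, and it assumes each log entry sits on a single line in the real log file (the wrapping in this message comes from the mail client).

```python
import re
from collections import Counter

# Hypothetical pattern, modeled only on the log excerpt in this report:
# "<timestamp> INFO ...TaskInProgress: Error from <task id>: Lost task tracker"
LOST_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO "
    r"org\.apache\.hadoop\.mapred\.TaskInProgress: "
    r"Error from (?P<task>task_\d+_[mr]_\d+_\d+):\s+Lost task tracker"
)


def count_lost_tasks(lines):
    """Count 'Lost task tracker' errors per one-second timestamp bucket."""
    per_second = Counter()
    for line in lines:
        match = LOST_RE.match(line)
        if match:
            per_second[match.group("ts")] += 1
    return per_second


# Sample input copied from the excerpt, with wrapped lines rejoined.
sample = [
    "2007-04-09 00:16:55,294 INFO org.apache.hadoop.mapred.TaskInProgress: "
    "Error from task_0100_m_000177_0: Lost task tracker",
    "2007-04-09 00:16:55,294 INFO org.apache.hadoop.mapred.TaskInProgress: "
    "Task 'task_0100_m_000177_0' has been lost.",
    "2007-04-09 00:16:55,294 INFO org.apache.hadoop.mapred.TaskInProgress: "
    "Error from task_0100_m_001426_0: Lost task tracker",
]
counts = count_lost_tasks(sample)
```

A sudden spike in one bucket would suggest that a whole tasktracker was declared lost at that moment, rather than individual tasks failing independently.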

The few tasktracker logs I checked showed no exceptions at that time, but they started
logging many communication errors some time later (20 seconds up to a few minutes),
e.g.:
2007-04-09 00:17:16,788 WARN org.apache.hadoop.ipc.Server: handler output error
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:108)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:89)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:578)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

