hadoop-common-user mailing list archives

From Sagar Naik <sn...@attributor.com>
Subject Re: My tasktrackers keep getting lost...
Date Tue, 03 Feb 2009 04:38:08 GMT
Can you post the output from
hadoop-argus-<hostname>-jobtracker.out
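
A rough sketch of where that file usually lives, assuming the default
HADOOP_LOG_DIR under $HADOOP_HOME/logs (the exact file name depends on the
user the daemon runs as and on the host name):

  cd $HADOOP_HOME/logs
  ls -lt hadoop-*jobtracker*
  # the .out file captures the daemon's stdout/stderr (JVM aborts, thread
  # dumps); the .log file holds the regular log4j output
  tail -n 200 hadoop-*jobtracker*.out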

-Sagar

jason hadoop wrote:
> When I was at Attributor we experienced periodic odd XFS hangs that would
> freeze up the Hadoop server processes, resulting in them going away.
> Sometimes XFS would deadlock all writes to the log file and the server would
> freeze trying to log a message; we couldn't even jstack the JVM.
> We never got any traction on resolving the XFS deadlocks and simply rebooted
> the machines when the problem occurred.
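>
> (A minimal sketch of the usual thread-dump attempts, assuming the Sun JDK's
> jps/jstack are on the PATH; in the hangs described above even these would
> block:
>
>   PID=$(jps | awk '/TaskTracker/ {print $1}')
>   jstack $PID        # normal attach; hangs if the JVM is wedged
>   jstack -F $PID     # forced attach as a fallback
>   kill -QUIT $PID    # SIGQUIT makes HotSpot write a thread dump to stdout,
>                      # which ends up in the daemon's .out file
> )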
>
> On Mon, Feb 2, 2009 at 7:09 PM, Ian Soboroff <ian.soboroff@nist.gov> wrote:
>
>   
>> I hope someone can help me out.  I'm getting started with Hadoop,
>> have written the first part of my project (a custom InputFormat), and am
>> now using that to test out my cluster setup.
>>
>> I'm running 0.19.0.  I have five dual-core Linux workstations with most
>> of a 250GB disk available for playing, and am controlling things from my
>> Mac Pro.  (This is not the production cluster; that hasn't been
>> assembled yet.  This is just to get the code working and figure out the
>> bumps.)
>>
>> My test data is about 18GB of web pages, and the test app at the moment
>> just counts the number of web pages in each bundle file.  The map jobs
>> run just fine, but when it gets into the reduce, the TaskTrackers all
>> get "lost" to the JobTracker.  I can't see why, because the TaskTrackers
>> are all still running on the slaves.  Also, the jobdetails URL starts
>> returning an HTTP 500 error, although other links from that page still
>> work.
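>>
>> (For what it's worth, a quick way to reproduce that 500 from the shell; this
>> assumes the default JobTracker web port 50030, the master host named in the
>> logs below, and the job id from those logs:
>>
>>   curl -s -o /dev/null -w '%{http_code}\n' \
>>     "http://rogue:50030/jobdetails.jsp?jobid=job_200902021252_0002"
>>   # the stack trace behind the 500 should show up in the JobTracker log
>> )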
>>
>> I've tried going onto the slaves and manually restarting the
>> tasktrackers with hadoop-daemon.sh, and also turning on job restarting
>> in the site conf and then running stop-mapred/start-mapred.  The
>> trackers start up and try to clean up and get going again, but they then
>> just get lost again.
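>>
>> (Roughly the commands involved, assuming a standard $HADOOP_HOME layout;
>> mapred.jobtracker.restart.recover is my best guess at the 0.19 property
>> behind "job restarting":
>>
>>   # on each slave
>>   $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
>>   $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
>>
>>   # on the master, after setting mapred.jobtracker.restart.recover to true
>>   # in conf/hadoop-site.xml
>>   $HADOOP_HOME/bin/stop-mapred.sh
>>   $HADOOP_HOME/bin/start-mapred.sh
>> )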
>>
>> Here's some error output from the master jobtracker:
>>
>> 2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed
>> completed task 'attempt_200902021252_0002_r_000005_1' from
>> 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
>> 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker:
>> attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
>> 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching
>> task attempt_200902021252_0002_m_004592_1 timed out.
>> 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker:
>> attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
>> 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching
>> task attempt_200902021252_0002_m_004582_1 timed out.
>> 2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
>> 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/
>> 127.0.0.1:52769'; resending the previous 'lost' response
>> 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
>> 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/
>> 127.0.0.1:52808'; resending the previous 'lost' response
>> 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
>> 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/
>> 127.0.0.1:54464'; resending the previous 'lost' response
>> 2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
>> 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744';
>> resending the previous 'lost' response
>> 2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
>> 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/
>> 127.0.0.1:45749'; resending the previous 'lost' response
>> 2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 9
>> on 54311 caught: java.lang.NullPointerException
>>        at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
>>        at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
>>        at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
>>        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
>>        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)
>>
>> 2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status
>> from unknown Tracker : tracker_monocacy:localhost.localdomain/
>> 127.0.0.1:54464
>>
>> And from a slave:
>>
>> 2009-02-02 13:26:39,440 INFO
>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060,
>> dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID:
>> attempt_200902021252_0002_m_000111_0
>> 2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
>> exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on
>> local exception: null
>>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
>>        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
>>        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
>>        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678)
>>        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
>> Caused by: java.io.EOFException
>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>        at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
>>        at org.apache.hadoop.io.UTF8.readString(UTF8.java:203)
>>        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:230)
>>        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:171)
>>        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:219)
>>        at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
>>        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:506)
>>        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)
>>
>> 2009-02-02 13:41:40,165 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'rogue' with reponseId '835
>> 2009-02-02 13:41:40,177 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_200902021252_0002_r_000007_0: Task
>> attempt_200902021252_0002_r_000007_0 failed to report status for 1600
>> seconds. Killing!
>> 2009-02-02 13:41:40,287 INFO org.apache.hadoop.mapred.TaskTracker: Sent out
>> 6 bytes for reduce: 4 from map: attempt_200902021252_0002_m_000118_0 given
>> 6/2 from 24 with (-1, -1)
>> 2009-02-02 13:41:40,333 INFO
>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060,
>> dest: 129.6.101.12:39569, bytes: 6, op: MAPRED_SHUFFLE, cliID:
>> attempt_200902021252_0002_m_000118_0
>> 2009-02-02 13:41:40,337 INFO org.apache.hadoop.mapred.TaskTracker: Process
>> Thread Dump: lost task
>>
>> ... and then lots of stack traces follow.  I'm not sure which error output is most
>> helpful for this report.
>>
>> Anyway, I can see that the logs tell me that something went wrong on the
>> slave, and as a result the master has lost contact with the slave
>> tasktracker, which is still alive, just not running my job anymore.  Can
>> anyone tell me where to start looking?
>>
>> Thanks,
>> Ian
>>
>>
>>     
>
>   
