hadoop-common-dev mailing list archives

From "Christopher Gillett (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-3466) DataNode fails to deliver blocks, holds thousands of open socket connections
Date Thu, 29 May 2008 20:19:46 GMT
DataNode fails to deliver blocks, holds thousands of open socket connections
----------------------------------------------------------------------------

                 Key: HADOOP-3466
                 URL: https://issues.apache.org/jira/browse/HADOOP-3466
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.17.0
         Environment: Hadoop-0.17.0 (and 0.16.4), 25 node Linux grid (24 DataNodes, 1 master node)
            Reporter: Christopher Gillett
            Priority: Critical


A job that used to run correctly on our grid (under 0.15.0) now fails. The failure occurs after
the map phase is complete, about two-thirds of the way through the reduce phase. The job is
processing a modest amount of input data (approximately 220 GB).

When the error occurs, the nodes hosting DataNodes have literally thousands of open socket
connections on them, and the DataNode instances are holding large amounts of memory. Sometimes
the DataNodes crash or exit; other times they continue to run.
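For anyone trying to reproduce the observation, the socket count per DataNode can be sampled from /proc (a sketch assuming a Linux procfs; finding the DataNode PID with pgrep or jps is left to the reader):

```shell
# count_sockets PID - count open socket file descriptors held by a process,
# by resolving each /proc/PID/fd/* symlink and matching "socket:[inode]".
count_sockets() {
  local pid="$1" n=0 link
  for fd in /proc/"$pid"/fd/*; do
    link=$(readlink "$fd" 2>/dev/null) || continue
    case "$link" in socket:*) n=$((n + 1)) ;; esac
  done
  echo "$n"
}

# Example (DATANODE_PID is hypothetical; e.g. pgrep -f dfs.DataNode):
# count_sockets "$DATANODE_PID"
```

Run periodically while the job executes, this makes the leak visible long before the DataNode falls over.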

The error reported at the application level is:

08/05/27 11:30:08 INFO mapred.JobClient: map 100% reduce 89%
08/05/27 11:30:41 INFO mapred.JobClient: map 100% reduce 90%
08/05/27 11:32:45 INFO mapred.JobClient: map 100% reduce 86%
08/05/27 11:32:45 INFO mapred.JobClient: Task Id : task_200805271056_0001_r_000007_0, Status : FAILED
java.io.IOException: Could not get block locations. Aborting...
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1832)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1487)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1579)

I then discovered that one or more DataNode instances on the slave nodes are down (we run one
DataNode instance per machine). The cause of at least some of the DataNode failures is a JVM
internal error raised in a complete out-of-memory scenario (on a 4 GB, 4-way machine).
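As a back-of-envelope check (my arithmetic, not something the JVM reports directly): the traceback below shows a per-thread stack region of 0x881c4000 to 0x88215000, i.e. 0x51000 bytes or 324 KiB, and 4,538 DataXceiver-era threads. The stack reservations alone approach 1.5 GB of 32-bit address space, which is consistent with the native "out of swap space" failure:

```shell
# Approximate address space consumed by thread stacks alone:
# 0x88215000 - 0x881c4000 = 0x51000 bytes = 324 KiB per thread stack,
# multiplied by the 4,538 threads seen in the crash log.
threads=4538
stack_kib=324
echo "~$(( threads * stack_kib / 1024 )) MiB reserved for thread stacks"
```

That is roughly 1.4 GB of reservations before counting heap, native buffers, or the socket state itself.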

Watching the DataNodes run, I can see them consuming more and more memory. For those failures
that produce a JVM traceback, I see, in part (NOTE: the traceback below is from 0.16.4):
#
# java.lang.OutOfMemoryError: requested 16 bytes for CHeapObj-new. Out of swap space?
#
# Internal Error (414C4C4F434154494F4E0E494E4C494E450E4850500017), pid=4246, tid=2283883408
#
# Java VM: Java HotSpot(TM) Server VM (1.6.0_02-b05 mixed mode)
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x8a942000): JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@3f4f44" daemon [_thread_in_Java, id=15064]
Stack: [0x881c4000,0x88215000), sp=0x882139e0, free space=318k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x53b707]
V [libjvm.so+0x225fe1]
V [libjvm.so+0x16fdc5]
V [libjvm.so+0x22aef3]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
v blob 0xf4f235a7
J java.io.DataInputStream.readInt()I
j org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(Ljava/io/DataOutputStream;Ljava/io/DataInputStream;Ljava/io/DataOutputStream;Ljava/lang/String;Lorg/apache/hadoop/dfs/DataNode$Throttler;I)V+126
j org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(Ljava/io/DataInputStream;)V+746
j org.apache.hadoop.dfs.DataNode$DataXceiver.run()V+174
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
--------------- P R O C E S S ---------------
Java Threads: ( => current thread )
0x0ae3f400 JavaThread "process reaper" daemon [_thread_blocked, id=26870]
0x852e6000 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@e5dce1" daemon [_thread_in_vm, id=26869]
0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
0x852e5000 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@c40bf8" daemon [_thread_in_native, id=26768]
0x0956e000 JavaThread "PacketResponder 0 for Block blk_-2322514873363546651" daemon [_thread_blocked, id=26767]
0x852e4400 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@1ca61f9" daemon [_thread_in_native, id=26766]
0x09d3a400 JavaThread "PacketResponder 0 for Block blk_8926941945313450801" daemon [_thread_blocked, id=26764]
0x852e3c00 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@1e186d9" daemon [_thread_in_native, id=26763]
0x0953d000 JavaThread "PacketResponder 0 for Block blk_4785883052769066976" daemon [_thread_blocked, id=26762]
0xb13a5c00 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@13d62aa" daemon [_thread_in_native, id=26761]

The interesting part is that if I count the JavaThreads running
org.apache.hadoop.dfs.DataNode$DataXceiver, I see 4,538 (!) of them in the traceback, a
surprisingly high number.
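For reference, the count can be reproduced from a HotSpot crash log with a simple grep (a sketch; the hs_err file path in the usage comment is an assumption, substitute the actual log):

```shell
# count_xceivers FILE - count DataXceiver JavaThreads listed in a HotSpot
# crash log by matching the thread name printed in the process section.
count_xceivers() {
  grep -c 'DataNode\$DataXceiver' "$1"
}

# Example (hypothetical path): count_xceivers /tmp/hs_err_pid4246.log
```

The same pattern against a live thread dump (jstack output) works for a DataNode that has not yet crashed.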

Other DataNodes simply exit without panicking the JVM. In either failure mode, the last few
lines of the DataNode log file are apparently innocuous:

2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 got response for connect ack from downstream datanode with firstbadlink as
2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is
2008-05-27 11:31:48,268 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-2241766430103062484 src: /10.2.14.10:33626 dest: /10.2.14.10:50010
2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_313239508245918539 src: /10.2.14.24:37836 dest: /10.2.14.24:50010
2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_1684581399908730353 src: /10.2.14.16:51605 dest: /10.2.14.16:50010
2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
2008-05-27 11:31:49,509 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_2493969670086107736 src: /10.2.14.18:47557 dest: /10.2.14.18:50010
2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 got response for connect ack from downstream datanode with firstbadlink as
2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 forwarding connect ack to upstream firstbadlink is

Finally, the task-level output (in userlogs) doesn't reveal much either:

2008-05-27 11:38:30,724 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 34 map output(s)
2008-05-27 11:38:30,753 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001976_0 output from worker9.
2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Got 33 known map output location(s); scheduling...
2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Scheduled 1 of 33 known outputs (0 slow hosts and 32 dup hosts)
2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Copying task_200805271056_0001_m_001248_0 output from worker8.
2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 33 map output(s)
2008-05-27 11:38:31,752 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001248_0 output from worker8.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

