hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: 0.16.4 DataNode problem...
Date Tue, 27 May 2008 17:56:08 GMT

What does hadoop dfs -fsck show you?
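
(For reference, fsck runs as its own hadoop subcommand rather than through
"dfs"; a typical invocation, with the usual reporting flags, looks like:)

  # check the whole namespace; the flags add per-file block/location detail
  bin/hadoop fsck / -files -blocks -locations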


On 5/27/08 8:58 AM, "C G" <parallelguy@yahoo.com> wrote:

> Hi All:
>    
>   One of our applications fails on a reasonably small dataset (~200G input)
> while running 0.16.4.  Previously we were on 0.15.0 and the same application
> ran fine with the same dataset.
>    
>   A lengthy description follows, including log file output, etc.  The failure
> mode smells like a bug in 0.16.4, but I'm not 100% positive about that.
>    
>   My questions are:
>    
>   1.  Any known issues upgrading from 0.15.0 to 0.16.4?  Our code runs just
> fine over small datasets, but dies on these larger ones.  We followed the
> upgrade instructions in the wiki, etc.
>    
>   2.  Would an upgrade to 0.17.0 help resolve these problems?
>    
>   3.  Would a re-format/re-load of HDFS help correct these issues?  This is
> the option I'd most like to avoid, since I have 3T of data in HDFS and it
> would take days to dump it all and reload it.
>    
>   4.  Any other advice or help?
>    
>   I've been looking at this for the past few days and have been unable to
> make progress on solving it.  I would hate to have to fall back to 0.15.0
> (see above regarding 3T data reloads, not to mention being stuck on an old
> release).    Any help, thoughts, comments, etc., would be very welcome.
> Thanks!
>    
>   Description:
>   Following an upgrade from 0.15.0 to 0.16.4 (and after recompiling our apps,
> etc.), a job that used to run correctly on our grid now fails.  The failure
> occurs after the map phase is complete, about two-thirds of the way through
> the reduce phase.  The error reported at the application level is:
>    
>   08/05/27 11:30:08 INFO mapred.JobClient:  map 100% reduce 89%
> 08/05/27 11:30:41 INFO mapred.JobClient:  map 100% reduce 90%
> 08/05/27 11:32:45 INFO mapred.JobClient:  map 100% reduce 86%
> 08/05/27 11:32:45 INFO mapred.JobClient: Task Id : task_200805271056_0001_r_000007_0, Status : FAILED
> java.io.IOException: Could not get block locations. Aborting...
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1832)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1487)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1579)
>    
>    
>   I then discovered that one or more DataNode instances on the slave nodes
> were down (we run one DataNode instance per machine).  At least some of the
> DataNode failures are caused by a JVM internal error raised when the process
> runs completely out of memory (on a 4G, 4-way machine).
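>    
>   (For reference, the daemon heap ceiling comes from HADOOP_HEAPSIZE in
> conf/hadoop-env.sh; the value below is the stock default in MB, shown only to
> indicate where the limit is set, not what we necessarily run with:)
>    
>   # conf/hadoop-env.sh -- becomes -Xmx for each Hadoop daemon, DataNode included
>   export HADOOP_HEAPSIZE=1000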
>    
>   Watching the DataNodes run, I can see them consuming more and more memory.
> For those failures for which there is a JVM traceback, I see (in part):
>   #
> # java.lang.OutOfMemoryError: requested 16 bytes for CHeapObj-new. Out of swap space?
> #
> #  Internal Error (414C4C4F434154494F4E0E494E4C494E450E4850500017), pid=4246, tid=2283883408
> #
> # Java VM: Java HotSpot(TM) Server VM (1.6.0_02-b05 mixed mode)
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
>   ---------------  T H R E A D  ---------------
>   Current thread (0x8a942000):  JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@3f4f44" daemon [_thread_in_Java, id=15064]
>   Stack: [0x881c4000,0x88215000),  sp=0x882139e0,  free space=318k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x53b707]
> V  [libjvm.so+0x225fe1]
> V  [libjvm.so+0x16fdc5]
> V  [libjvm.so+0x22aef3]
>   Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> v  blob 0xf4f235a7
> J  java.io.DataInputStream.readInt()I
> j  org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(Ljava/io/DataOutputStream;Ljava/io/DataInputStream;Ljava/io/DataOutputStream;Ljava/lang/String;Lorg/apache/hadoop/dfs/DataNode$Throttler;I)V+126
> j  org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(Ljava/io/DataInputStream;)V+746
> j  org.apache.hadoop.dfs.DataNode$DataXceiver.run()V+174
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
>   ---------------  P R O C E S S  ---------------
>   Java Threads: ( => current thread )
>   0x0ae3f400 JavaThread "process reaper" daemon [_thread_blocked, id=26870]
>   0x852e6000 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@e5dce1" daemon [_thread_in_vm, id=26869]
>   0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
>   0x852e5000 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@c40bf8" daemon [_thread_in_native, id=26768]
>   0x0956e000 JavaThread "PacketResponder 0 for Block blk_-2322514873363546651" daemon [_thread_blocked, id=26767]
>   0x852e4400 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@1ca61f9" daemon [_thread_in_native, id=26766]
>   0x09d3a400 JavaThread "PacketResponder 0 for Block blk_8926941945313450801" daemon [_thread_blocked, id=26764]
>   0x852e3c00 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@1e186d9" daemon [_thread_in_native, id=26763]
>   0x0953d000 JavaThread "PacketResponder 0 for Block blk_4785883052769066976" daemon [_thread_blocked, id=26762]
>   0xb13a5c00 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@13d62aa" daemon [_thread_in_native, id=26761]
>    
>   The interesting part here is that if I count the JavaThreads running
> org.apache.hadoop.dfs.DataNode in the traceback, I see 4,538 (!) of them,
> far more than I expected.
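>    
>   (A rough way to watch this on a live DataNode, using the jstack tool that
> ships with the JDK; 4246 is the pid from the crash header above:)
>    
>   # count DataXceiver threads in the running DataNode process
>   jstack 4246 | grep -c DataXceiver
>    
>   As a back-of-the-envelope check: the "Stack:" line above spans about 324k
> per thread, so 4,538 threads need roughly 1.4G of address space for stacks
> alone.  On a 32-bit JVM (note the 0x8xxxxxxx thread addresses), that plus the
> Java heap and native allocations can plausibly exhaust the process address
> space, matching the "Out of swap space?" allocation failure.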
>    
>   Other DataNodes just exit without panicking the JVM.  In either failure
> mode, the last few lines of the DataNode log file are apparently innocuous:
>    
>   2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as
> 2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:48,268 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-2241766430103062484 src: /10.2.14.10:33626 dest: /10.2.14.10:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_313239508245918539 src: /10.2.14.24:37836 dest: /10.2.14.24:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_1684581399908730353 src: /10.2.14.16:51605 dest: /10.2.14.16:50010
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,509 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_2493969670086107736 src: /10.2.14.18:47557 dest: /10.2.14.18:50010
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 got response for connect ack  from downstream datanode with firstbadlink as
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 forwarding connect ack to upstream firstbadlink is
>    
>   Finally, the task-level output (in userlogs) doesn't reveal much either:
>    
>   2008-05-27 11:38:30,724 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 34 map output(s)
> 2008-05-27 11:38:30,753 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001976_0 output from worker9.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Got 33 known map output location(s); scheduling...
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Scheduled 1 of 33 known outputs (0 slow hosts and 32 dup hosts)
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Copying task_200805271056_0001_m_001248_0 output from worker8.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 33 map output(s)
> 2008-05-27 11:38:31,752 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001248_0 output from worker8.
>    