hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: Has anyone had hdfs block move synchronization failures with hadoop 0.15.0?
Date Tue, 04 Dec 2007 22:00:24 GMT

I would think that after an hour or so things are OK, but that might not 
have helped the job.

Raghu.

Jason Venner wrote:
> We have a small cluster of 9 machines on a shared Gig Switch (with a lot 
> of other machines).
> 
> The other day, while running a job, the reduce stalled when the map was 
> 99.99x% done.
> 7 of the 9 machines were idle, and 2 of the machines were each using 100% 
> of one CPU (1 job per machine).
> 
> So it appears that there was a synchronization failure, in that one 
> machine thought the transfer hadn't started and the other machine 
> thought it had.
> 
> We did have a momentary network outage on the switch during this job. We 
> tried stopping the Hadoop processes on the machines with the sending 
> failures, and after 10 minutes they went 'dead', but the job never resumed.
> 
> Looking into the log files of the spinning machines, they were endlessly 
> trying to start a block move to any of a set of other machines in the 
> cluster. The repeating log messages are shown below.
> 
> 2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_3105072074036734167 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
> 2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got java.net.SocketException: Broken pipe
>       at java.net.SocketOutputStream.socketWrite0(Native Method)
>       at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>       at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>       at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>       at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>       at java.io.DataOutputStream.write(DataOutputStream.java:90)
>       at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
>       at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
>       at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
>       at java.lang.Thread.run(Thread.java:619)
> 
> 
> -- On the machines that the transfers were targeted at, the following 
> was in the log file.
> 
> 2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_3105072074036734167 has already been started (though not completed), and thus cannot be created.
>       at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
>       at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
>       at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
>       at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
>       at java.lang.Thread.run(Thread.java:619)
> 

