hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Has anyone had hdfs block move synchronization failures with hadoop 0.15.0?
Date Tue, 04 Dec 2007 18:06:43 GMT
We have a small cluster of 9 machines on a gigabit switch (shared with a lot 
of other machines).

The other day, while running a job, the reduce phase stalled when the map 
phase was 99.99x% done.
7 of the 9 machines were idle, and 2 of the machines were each using 100% 
of 1 cpu (1 task per machine).

So it appears that there was a synchronization failure: one machine thought 
the transfer hadn't started, while the other machine thought it had.
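
That hypothesis can be sketched in miniature. The class below is a hypothetical simplification, not the real DataNode code: the receiver marks a block "in progress" and rejects any further attempt to create it, so if the cleanup step is never reached (say, because the connection died mid-transfer), every retry from the sender is rejected forever, which matches the endless retry loop described below.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the suspected livelock; names and structure are
// illustrative only and do not mirror org.apache.hadoop.dfs.DataNode.
public class BlockTransferSketch {
    private final Set<String> inProgress = new HashSet<>();

    // Receiver side: refuses a block whose transfer is already marked started.
    // Returns true if the transfer may begin, false if it is rejected.
    public boolean startReceive(String blockId) {
        return inProgress.add(blockId);
    }

    // The cleanup that, in this scenario, never ran after the outage.
    public void finishReceive(String blockId) {
        inProgress.remove(blockId);
    }

    public static void main(String[] args) {
        BlockTransferSketch receiver = new BlockTransferSketch();
        String blk = "blk_3105072074036734167";

        // First attempt starts, then the connection drops mid-transfer;
        // finishReceive() is never called, so the entry goes stale.
        System.out.println("first attempt started: " + receiver.startReceive(blk));

        // Every retry from the sender is now rejected -- the livelock.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("retry " + attempt + " accepted: " + receiver.startReceive(blk));
        }

        // Only explicit cleanup of the stale entry breaks the cycle.
        receiver.finishReceive(blk);
        System.out.println("after cleanup, retry accepted: " + receiver.startReceive(blk));
    }
}
```

Run directly, the first attempt prints `true`, the three retries print `false`, and only after the cleanup does a retry print `true` again.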

We did have a momentary network outage on the switch during this job. We 
tried stopping the hadoop processes on the machines with the sending 
failures; after 10 minutes they went 'dead', but the job never resumed.

Looking into the log files of the spinning machines, they were endlessly 
trying to start a block move to any of a set of other machines in the 
cluster. The shape of their repeated log messages is shown below.

2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_3105072074036734167 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got java.net.SocketException: Broken pipe
       at java.net.SocketOutputStream.socketWrite0(Native Method)
       at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
       at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
       at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
       at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
       at java.io.DataOutputStream.write(DataOutputStream.java:90)
       at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
       at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
       at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
       at java.lang.Thread.run(Thread.java:619)


-- On the machines that the transfers were targeted at, the following 
was in the log file.

2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_3105072074036734167 has already been started (though not completed), and thus cannot be created.
       at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
       at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
       at java.lang.Thread.run(Thread.java:619)

