hadoop-common-user mailing list archives

From "Hairong Kuang" <hair...@yahoo-inc.com>
Subject RE: Has anyone had hdfs block move synchronization failures with hadoop 0.15.0?
Date Tue, 04 Dec 2007 19:28:57 GMT
Hi Jason,

Could you please check the namenode log to see whether there are any
messages starting with "PendingReplicationMonitor timed out block"?
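A quick way to search for that message, as a minimal sketch (the log
path and the sample log line here are fabricated for illustration; the
real namenode log location depends on your installation, e.g.
$HADOOP_LOG_DIR):

```shell
# Write a fabricated sample namenode log line, then count matches.
# In practice, point grep at your actual namenode log file instead.
printf '%s\n' \
  'WARN org.apache.hadoop.dfs.FSNamesystem: PendingReplicationMonitor timed out block blk_3105072074036734167' \
  > /tmp/sample-namenode.log
grep -c "PendingReplicationMonitor timed out block" /tmp/sample-namenode.log
```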

Hairong

-----Original Message-----
From: Jason Venner [mailto:jason@attributor.com] 
Sent: Tuesday, December 04, 2007 10:07 AM
To: hadoop-user@lucene.apache.org
Subject: Has anyone had hdfs block move synchronization failures with
hadoop 0.15.0?

We have a small cluster of 9 machines on a shared gigabit switch
(shared with a lot of other machines).

The other day, while running a job, the reduce phase stalled when the
map phase was 99.99x% done. 7 of the 9 machines were idle, and 2 were
each using 100% of one CPU (one job per machine).

So it appears there was a synchronization failure: one machine thought
the transfer hadn't started, while the other thought it had.

We did have a momentary network outage on the switch during this job.
We tried stopping the hadoop processes on the machines with the
sending failures; after 10 minutes they went 'dead', but the job never
resumed.

Looking into the log files of the spinning machines, they were
endlessly trying to start a block move to any of a set of other
machines in the cluster. The repeating log messages looked like this:

2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_3105072074036734167 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got java.net.SocketException: Broken pipe
       at java.net.SocketOutputStream.socketWrite0(Native Method)
       at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
       at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
       at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
       at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
       at java.io.DataOutputStream.write(DataOutputStream.java:90)
       at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
       at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
       at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
       at java.lang.Thread.run(Thread.java:619)


On the machines the transfers were targeted at, the log files showed:

2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_3105072074036734167 has already been started (though not completed), and thus cannot be created.
       at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
       at org.apache.hadoop.dfs.DataNode$BlockReceiver.&lt;init&gt;(DataNode.java:1257)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
       at java.lang.Thread.run(Thread.java:619)

