hadoop-common-user mailing list archives

From: Jason Venner <ja...@attributor.com>
Subject: Re: Has anyone had hdfs block move synchronization failures with hadoop 0.15.0?
Date: Tue, 04 Dec 2007 19:46:53 GMT
Below are the entries from the datanode logs for that block from the job.
There are no 'timed out block' messages for that block number, but there are
some socket timeouts.

img47: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img47.log.2007-12-03:2007-12-03 13:41:08,520 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
img47: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img47.log.2007-12-03:2007-12-03 13:57:55,378 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
img47: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img47.log.2007-12-03:2007-12-03 14:10:05,737 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
img47: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img47.log.2007-12-03:2007-12-03 14:22:37,109 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
img49: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img49.log.2007-12-03:2007-12-03 15:44:04,088 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.
img58: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img58.log.2007-12-03:2007-12-03 13:29:04,158 INFO org.apache.hadoop.dfs.DataNode: Received block blk_3105072074036734167 from /10.50.30.100 and Read timed out
img52: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img52.log.2007-12-03:2007-12-03 13:30:25,935 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.101:50010 got java.net.SocketTimeoutException: connect timed out
img53: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img53.log.2007-12-03:2007-12-03 13:35:05,987 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
img53: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img53.log.2007-12-03:2007-12-03 14:14:37,950 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
img56: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img56.log.2007-12-03:2007-12-03 13:34:52,788 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
img56: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img56.log.2007-12-03:2007-12-03 13:39:25,020 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.101:50010 got java.net.SocketTimeoutException: connect timed out
img56: /data1/image_hadoop/hadoop-0.15.0/logs/hadoop-argus-datanode-img56.log.2007-12-03:2007-12-03 14:10:19,417 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to 10.50.30.99:50010 got java.net.SocketTimeoutException: connect timed out
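
For reference, the entries above are just the lines in each datanode log that
mention the block number. A rough Java sketch of an equivalent scan (nothing
from Hadoop itself; the class name is made up and the log directory is the one
from this install):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class FindBlockMentions {
    public static void main(String[] args) throws IOException {
        String blockId = "blk_3105072074036734167";
        File logDir = new File("/data1/image_hadoop/hadoop-0.15.0/logs");

        File[] logs = logDir.listFiles();
        if (logs == null) return; // directory missing or unreadable

        for (File log : logs) {
            if (!log.getName().contains("datanode")) continue;
            BufferedReader in = new BufferedReader(new FileReader(log));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    // print matches prefixed with the file name, like the entries above
                    if (line.contains(blockId)) {
                        System.out.println(log.getName() + ": " + line);
                    }
                }
            } finally {
                in.close();
            }
        }
    }
}

Running it on each node (or over copies of the logs) gives a listing like the
one above.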


Hairong Kuang wrote:
> Hi Jason,
>
> Could you please check the namenode log to see if there is any message
> starting with "PendingReplicationMonitor timed out block"?
>
> Hairong
>
> -----Original Message-----
> From: Jason Venner [mailto:jason@attributor.com] 
> Sent: Tuesday, December 04, 2007 10:07 AM
> To: hadoop-user@lucene.apache.org
> Subject: Has anyone had hdfs block move synchronization failures with
> hadoop 0.15.0?
>
> We have a small cluster of 9 machines on a shared Gig switch (with a lot
> of other machines).
>
> The other day, running a job, the reduce stalled when the map was
> 99.99x% done.
> 7 of the 9 machines were idle, and 2 of the machines were using 100% of
> 1 cpu (1 job per machine).
>
> So it appears that there was a synchronization failure, in that one
> machine thought the transfer hadn't started and the other machine
> thought it had.
>
> We did have a momentary network outage on the switch during this job. We
> tried stopping the hadoop processes on the machines with the sending
> failures, and after 10 minutes they went 'dead' but the job never
> resumed.
>
> Looking into the log files of the spinning machines, we saw them endlessly
> trying to start a block move to any of a set of other machines in the
> cluster. The shape of the repeated log messages is below.
>
> 2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting
> thread to transfer block blk_3105072074036734167 to
> [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
> 2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed to
> transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got
> java.net.SocketException: Broken pipe
>        at java.net.SocketOutputStream.socketWrite0(Native Method)
>        at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>        at
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
>        at
> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
>        at
> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
>        at java.lang.Thread.run(Thread.java:619)
>
>
> -- On the machines that the transfers were targeted to, the following
> was in the log file.
>
> 2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: 
> DataXceiver: java.io.IOException: Block blk_3105072074036734167 has
> already been started (though not completed), and thus cannot be created.
>        at
> org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
>        at
> org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
>        at
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
>        at
> org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
>        at java.lang.Thread.run(Thread.java:619)
>
>   
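
To make the failure mode concrete, here is a minimal, self-contained sketch of
the livelock described above. This is not the Hadoop code; the class and method
names are invented. The first, interrupted transfer leaves the block marked as
in progress on the receiver, so every retry from the sender is rejected with
the same "has already been started" error:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class BlockTransferRaceSketch {

    // stand-in for the receiver's record of blocks it is currently writing
    private static final Set<Long> blocksBeingWritten = new HashSet<Long>();

    // receiver side: refuses a block whose earlier, aborted write was never cleaned up
    static void receiveBlock(long blockId) throws IOException {
        if (!blocksBeingWritten.add(blockId)) {
            throw new IOException("Block blk_" + blockId
                + " has already been started (though not completed), "
                + "and thus cannot be created.");
        }
        // a real receiver would stream the data and then clear the entry;
        // if the connection drops here and no cleanup runs, the entry is left behind
    }

    // sender side: keeps retrying, which is the endless log pattern above
    static void transferWithRetries(long blockId, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                receiveBlock(blockId);
                System.out.println("transfer of blk_" + blockId + " succeeded");
                return;
            } catch (IOException e) {
                System.out.println("Failed to transfer blk_" + blockId
                    + " got " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long blockId = 3105072074036734167L;

        // first attempt starts and is then interrupted (say, by the network outage),
        // leaving the block marked as in progress on the receiver
        receiveBlock(blockId);

        // every later attempt is rejected, so the sender just keeps retrying
        transferWithRetries(blockId, 5);
    }
}

In the real cluster the retries never stop, which matches the repeating
"Failed to transfer" lines in the logs above.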
