hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: Has anyone had hdfs block move synchronization failures with hadoop 0.15.0?
Date Tue, 04 Dec 2007 22:10:48 GMT
Thanks. I thought this might have been caused by the network outage you 
mentioned. If this is repeatable, please file a jira with any details on 
how to reproduce it.

Raghu.

Jason Venner wrote:
> This failure seems to be repeatable with this job and this cluster.
> I reran it and had the same problem: 2 machines were unable to transfer 
> some blocks.
> 
> I have a mapper, a combiner and a reducer. My combiner results in about 
> a 4 to 1 reduction in data volumes.
> 
> This is the same job that shows the slow reducer transfer rates I asked 
> about earlier.
> 
> reduce > copy (643 of 789 at 0.12 MB/s) >
> reduce > copy (656 of 789 at 0.12 MB/s) >
> reduce > copy (644 of 789 at 0.12 MB/s) >
> reduce > copy (644 of 789 at 0.12 MB/s) >
> reduce > copy (656 of 789 at 0.12 MB/s) >
> reduce > copy (656 of 789 at 0.12 MB/s) >
> reduce > copy (643 of 789 at 0.12 MB/s) >
> reduce > copy (623 of 789 at 0.12 MB/s) >
> reduce > copy (621 of 789 at 0.12 MB/s) >
> 
> Raghu Angadi wrote:
>>
>> I would think that after an hour or so things are OK, but that might not 
>> have helped the job.
>>
>> Raghu.
>>
>> Jason Venner wrote:
>>> We have a small cluster of 9 machines on a shared gigabit switch (along 
>>> with a lot of other machines).
>>>
>>> The other day, while running a job, the reduce stalled when the map was 
>>> 99.99x% done.
>>> 7 of the 9 machines were idle, and 2 of the machines were using 100% of 
>>> 1 CPU (1 job per machine).
>>>
>>> So it appears that there was a synchronization failure, in that one 
>>> machine thought the transfer hadn't started and the other machine 
>>> thought it had.
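>>>
>>> To illustrate what I think is happening (a rough sketch only, not the 
>>> actual 0.15.0 code; the class and method names below are made up), the 
>>> receiving datanode seems to remember blocks it believes are still being 
>>> written and rejects any new attempt to create the same block, while the 
>>> sender only ever sees the connection drop:
>>>
>>> // Hypothetical, simplified receiver-side check (not the real FSDataset):
>>> class BlockRegistry {
>>>     // blocks this datanode believes are still being received
>>>     private final java.util.Map<Long, Long> ongoingCreates =
>>>         new java.util.HashMap<Long, Long>();
>>>
>>>     synchronized void startReceive(long blockId) throws java.io.IOException {
>>>         if (ongoingCreates.containsKey(blockId)) {
>>>             // If an earlier transfer died midway (say, during a network
>>>             // outage) and was never cleaned up, every later attempt for
>>>             // the same block fails here; the sender just sees the
>>>             // connection drop, and the transfer is retried forever.
>>>             throw new java.io.IOException("Block blk_" + blockId
>>>                 + " has already been started (though not completed),"
>>>                 + " and thus cannot be created.");
>>>         }
>>>         ongoingCreates.put(blockId, Long.valueOf(System.currentTimeMillis()));
>>>     }
>>>
>>>     synchronized void finishReceive(long blockId) {
>>>         ongoingCreates.remove(blockId);
>>>     }
>>> }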
>>>
>>> We did have a momentary network outage on the switch during this job. 
>>> We tried stopping the hadoop processes on the machines with the sending 
>>> failures; after 10 minutes they went 'dead', but the job never resumed.
>>>
>>> Looking into the log files of the spinning machines, I saw that they were 
>>> endlessly trying to start a block move to any of a set of other machines 
>>> in the cluster. The shape of their repeating log messages is below.
>>>
>>> 2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_3105072074036734167 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
>>> 2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got java.net.SocketException: Broken pipe
>>>       at java.net.SocketOutputStream.socketWrite0(Native Method)
>>>       at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>>>       at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>>       at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>       at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>>       at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>       at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
>>>       at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
>>>       at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
>>>       at java.lang.Thread.run(Thread.java:619)
>>>
>>>
>>> -- On the machines that the transfers were targeted to, the following 
>>> was in the log file.
>>>
>>> 2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_3105072074036734167 has already been started (though not completed), and thus cannot be created.
>>>       at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
>>>       at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
>>>       at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
>>>       at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
>>>       at java.lang.Thread.run(Thread.java:619)
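>>>
>>> On the sending side (again just a guess at the shape of it, with 
>>> illustrative names rather than the real DataNode code), the transfer 
>>> thread appears to log the failure and exit; nothing ever tells the target 
>>> to discard its half-started copy, so presumably the block stays 
>>> under-replicated and the same doomed transfer keeps getting scheduled:
>>>
>>> // Hypothetical sender-side transfer thread, not the actual implementation:
>>> class DataTransferTask implements Runnable {
>>>     private final long blockId;
>>>     private final String target;
>>>
>>>     DataTransferTask(long blockId, String target) {
>>>         this.blockId = blockId;
>>>         this.target = target;
>>>     }
>>>
>>>     public void run() {
>>>         try {
>>>             sendBlock();
>>>         } catch (java.io.IOException e) {
>>>             // logged and forgotten; nothing clears the target's stale
>>>             // half-started block, so the next attempt fails the same way
>>>             System.err.println("Failed to transfer blk_" + blockId
>>>                 + " to " + target + " got " + e);
>>>         }
>>>     }
>>>
>>>     private void sendBlock() throws java.io.IOException {
>>>         // would open a socket to the target datanode and stream the block;
>>>         // in the failing case the target drops the connection after its
>>>         // writeBlock throws, and the write here dies with a broken pipe
>>>         throw new java.net.SocketException("Broken pipe");
>>>     }
>>> }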
>>>
>>

