hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: Has anyone had hdfs block move synchronization failures with hadoop 0.15.0?
Date Tue, 04 Dec 2007 22:06:40 GMT
This failure seems to be repeatable with this job and this cluster. I reran it
and had the same problem: two machines were unable to transfer some blocks.

I have a mapper, a combiner, and a reducer. The combiner gives roughly a
4-to-1 reduction in data volume.
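
For reference, the job wiring looks roughly like the sketch below. This is a
minimal illustration against the old org.apache.hadoop.mapred API, written
from memory; MyJob, MyMapper, and MyReducer are placeholder names, not our
actual classes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJob.class);
    conf.setJobName("block-move-repro");   // placeholder job name

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(MyMapper.class);    // placeholder mapper
    // The combiner runs on the map output before the shuffle; this is
    // where the roughly 4-to-1 reduction in data volume comes from.
    conf.setCombinerClass(MyReducer.class);
    conf.setReducerClass(MyReducer.class);  // placeholder reducer

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
  }
}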

This is the same job with the slow reduce-copy transfer rates I asked about 
earlier:

reduce > copy (643 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (644 of 789 at 0.12 MB/s) >
reduce > copy (644 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (643 of 789 at 0.12 MB/s) >
reduce > copy (623 of 789 at 0.12 MB/s) >
reduce > copy (621 of 789 at 0.12 MB/s) >

Raghu Angadi wrote:
>
> I would think after an hour or so things are OK, but that might not 
> have helped the job.
>
> Raghu.
>
> Jason Venner wrote:
>> We have a small cluster of 9 machines on a shared gigabit switch (along 
>> with a lot of other machines).
>>
>> The other day, while running a job, the reduce stalled when the map was 
>> 99.99x% done.
>> 7 of the 9 machines were idle, and 2 of the machines were using 100% of 
>> 1 CPU (1 job per machine).
>>
>> So it appears there was a synchronization failure: one machine thought 
>> the transfer hadn't started, and the other machine thought it had.
>>
>> We did have a momentary network outage on the switch during this job. 
>> We tried stopping the hadoop processes on the machines with the 
>> sending failures, and after 10 minutes they went 'dead' but the job 
>> never resumed.
>>
>> Looking into the log files of the spinning machines, we saw that they 
>> were endlessly trying to start a block move to any of a set of other 
>> machines in the cluster. The shape of the repeated log messages is 
>> below.
>>
>> 2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting 
>> thread to transfer block blk_3105072074036734167 to 
>> [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
>> 2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed 
>> to transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got 
>> java.net.SocketException: Broken pipe
>>       at java.net.SocketOutputStream.socketWrite0(Native Method)
>>       at 
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>>       at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>       at 
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>       at 
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>       at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>       at 
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
>>       at 
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
>>       at 
>> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
>>       at java.lang.Thread.run(Thread.java:619)
>>
>>
>> On the machines that the transfers were targeted at, the following 
>> was in the log file:
>>
>> 2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: 
>> DataXceiver: java.io.IOException: Block blk_3105072074036734167 has 
>> already been started (though not completed), and thus cannot be created.
>>       at 
>> org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
>>       at 
>> org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
>>       at 
>> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
>>       at 
>> org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
>>       at java.lang.Thread.run(Thread.java:619)
>>
>
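
For what it's worth, the "has already been started" error looks like a
one-writer-per-block guard on the receiving datanode. The sketch below is
purely illustrative (it is not the actual org.apache.hadoop.dfs.FSDataset
code), but it shows how a stale in-progress entry left behind by an
interrupted transfer would cause every later attempt to re-send the same
block to be refused, which matches the endless retry loop we are seeing.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only -- NOT the actual FSDataset implementation.
class BlockWriteGuard {
  private final Set<Long> blocksBeingWritten = new HashSet<Long>();

  // Called when a datanode starts receiving a block.
  synchronized void startWriting(long blockId) throws IOException {
    if (!blocksBeingWritten.add(blockId)) {
      throw new IOException("Block blk_" + blockId
          + " has already been started (though not completed), "
          + "and thus cannot be created.");
    }
  }

  // Called when the block is fully received (or the write is abandoned).
  // If this never runs -- e.g. the connection died mid-transfer -- the
  // stale entry refuses all future attempts to write the same block.
  synchronized void finishWriting(long blockId) {
    blocksBeingWritten.remove(blockId);
  }
}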
