hadoop-common-user mailing list archives

From dave bayer <da...@cloudfactory.org>
Subject Re: time outs when accessing port 50010
Date Mon, 21 Dec 2009 19:57:27 GMT

On Nov 25, 2009, at 11:27 AM, David J. O'Dell wrote:

> I've intermittently seen the following errors on both of my  
> clusters, it happens when writing files.
> I was hoping this would go away with the new version but I see the  
> same behavior on both versions.
> The namenode logs don't show any problems, it's always on the client  
> and datanodes.

[leaving errors below for reference]

I've seen similar errors on my 0.19.2 cluster when the cluster is  
decently busy. I've traced this more or less to the host in question  
doing verification on its blocks, an operation which seems to take the  
datanode out for upwards of 500 seconds in some cases.

In 0.19.2, if you look at  
o.a.h.hdfs.server.datanode.FSDataset.FSVolumeSet, you will see that  
all methods are synchronized. All operations for the dataset on the  
node seem to drop through methods in this class, which in turn causes a  
backup when one thread holds the lock for a long time.
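
To make the effect concrete, here is a minimal, self-contained sketch of the general pattern. It is not the Hadoop code: CoarseVolumeSet, slowScan and getUsed are hypothetical names invented for illustration. Because every method is synchronized on the same object, a cheap call has to wait for the whole duration of a slow one.

```java
import java.util.concurrent.CountDownLatch;

public class CoarseLockDemo {
    // Hypothetical stand-in for a class whose methods are all synchronized.
    static class CoarseVolumeSet {
        private long used = 42;

        // Long-running operation that holds the object's monitor throughout.
        synchronized void slowScan(long millis, CountDownLatch started) {
            started.countDown();          // signal that the monitor is now held
            try {
                Thread.sleep(millis);     // simulate a lengthy scan
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        // Cheap read that still has to acquire the same monitor.
        synchronized long getUsed() {
            return used;
        }
    }

    // Returns how long (in ms) a cheap call waited while a scan held the lock.
    static long measureBlockedMillis(long scanMillis) {
        CoarseVolumeSet volumes = new CoarseVolumeSet();
        CountDownLatch started = new CountDownLatch(1);
        Thread scanner = new Thread(() -> volumes.slowScan(scanMillis, started));
        scanner.start();
        try {
            started.await();              // wait until the scanner owns the monitor
            long t0 = System.nanoTime();
            volumes.getUsed();            // blocks until slowScan returns
            long waited = (System.nanoTime() - t0) / 1_000_000;
            scanner.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cheap call waited ~" + measureBlockedMillis(500) + " ms");
    }
}
```

With a 500 ms scan the cheap call should report a wait close to 500 ms, which is the same shape of stall as a client timing out while the datanode is busy elsewhere.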

You can grab a few jstacks and use a dump analyzer (like  
https://tda.dev.java.net/) to poke through them to see if you have the  
same behavior.
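
If you would rather not install an analyzer, a rough first pass can be scripted. The sketch below assumes the usual HotSpot thread dump convention where a thread queued on a monitor shows a line of the form "- waiting to lock <0x...> (a some.Class)". The class BlockedThreadCounter and its method countWaiters are names made up for this example, and it only counts waiters per monitor; it does not tell you which thread holds the lock.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockedThreadCounter {
    // Matches lines such as:  - waiting to lock <0x...> (a some.Class)
    private static final Pattern WAITING =
        Pattern.compile("- waiting to lock <(0x[0-9a-fA-F]+)> \\(a ([^)]+)\\)");

    // Maps "class@address" of each contended monitor to its number of waiting threads.
    static Map<String, Integer> countWaiters(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = WAITING.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(2) + "@" + m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BlockedThreadCounter <jstack-output-file>");
            return;
        }
        // Read a saved jstack dump and print one line per contended monitor.
        String dump = new String(Files.readAllBytes(Paths.get(args[0])),
                                 StandardCharsets.UTF_8);
        countWaiters(dump).forEach((lock, n) ->
            System.out.println(n + " thread(s) waiting on " + lock));
    }
}
```

Save the output of jstack for the datanode process to a file and pass that file as the only argument. If many threads are queued on a single FSVolumeSet monitor across several dumps, you are probably seeing the same thing I am.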

I have not spent enough time digging into this to understand whether  
the whole dataset really needs to be locked during the operation or if  
the locks could be moved closer to the FSDir operations.
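
For what it is worth, the alternative I have in mind looks roughly like the sketch below. This is only an illustration of finer-grained locking in general, not a proposed patch: Dir, slowScan and getUsed are hypothetical names, and I have not checked whether the real code could safely be split this way. Each directory object carries its own monitor, so a long scan of one does not hold up a read of another.

```java
import java.util.concurrent.CountDownLatch;

public class PerDirLockDemo {
    // Hypothetical directory object guarded by its own monitor.
    static class Dir {
        private long used = 1;

        // Long-running operation; holds only this directory's monitor.
        synchronized void slowScan(long millis, CountDownLatch started) {
            started.countDown();          // signal that this monitor is now held
            try {
                Thread.sleep(millis);     // simulate a lengthy scan
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        synchronized long getUsed() {
            return used;
        }
    }

    // True if a read of dir B completed while dir A was still being scanned.
    static boolean readCompletesDuringScan(long scanMillis) {
        Dir a = new Dir();
        Dir b = new Dir();
        CountDownLatch started = new CountDownLatch(1);
        Thread scanner = new Thread(() -> a.slowScan(scanMillis, started));
        scanner.start();
        try {
            started.await();              // scanner now owns a's monitor
            b.getUsed();                  // uses b's monitor, so it does not wait
            boolean stillScanning = scanner.isAlive();
            scanner.join();
            return stillScanning;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read finished during scan: " + readCompletesDuringScan(500));
    }
}
```

The catch is that any operation needing a consistent view across all directories would then need its own coordination, which is exactly the part I have not dug into.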

dave bayer

Original log clips included here:
> Client log:
> 09/11/25 10:54:15 INFO hdfs.DFSClient: Exception in  
> createBlockOutputStream java.net.SocketTimeoutException: 69000  
> millis timeout while waiting for channel to be ready for read. ch :  
> java.nio.channels.SocketChannel[connected local=/  
> remote=/]
> 09/11/25 10:54:15 INFO hdfs.DFSClient: Abandoning block  
> blk_-105422935413230449_22608
> 09/11/25 10:54:15 INFO hdfs.DFSClient: Waiting to find target node:  
> Datanode log:
> 2009-11-25 10:54:51,170 ERROR  
> org.apache.hadoop.hdfs.server.datanode.DataNode:  
> DatanodeRegistration(,  
> storageID=DS-1401408597-,  
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 120000 millis timeout while waiting  
> for channel to be ready for connect. ch :  
> java.nio.channels.SocketChannel[connection-pending remote=/ 
>       at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:282)
>       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>       at java.lang.Thread.run(Thread.java:619)
