hadoop-hdfs-issues mailing list archives

From "ryan rawson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-915) Hung DN stalls write pipeline for far longer than its timeout
Date Mon, 22 Mar 2010 04:12:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848029#action_12848029 ]

ryan rawson commented on HDFS-915:
----------------------------------

I hit this while doing a 3 TB distcp from an hftp:// URL to the target cluster. It was running
Hadoop 0.20.1+169.56, with the HDFS-200 and HDFS-826 patches applied as well.

Nothing super interesting in the logs, just:

java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.10.21.27:50010 remote=/10.10.21.17:33970]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:401)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)

along with errors in my distcp job every now and again.

> Hung DN stalls write pipeline for far longer than its timeout
> -------------------------------------------------------------
>
>                 Key: HDFS-915
>                 URL: https://issues.apache.org/jira/browse/HDFS-915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.1
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: local-dn.log
>
>
> After running kill -STOP on the datanode in the middle of a write pipeline, the client
> takes far longer to recover than it should. The ResponseProcessor times out in the correct
> interval, but doesn't interrupt the DataStreamer, which appears not to be subject to the same
> timeout. The client only recovers once the OS actually declares the TCP stream dead, which
> can take a very long time.
> I've experienced this on 0.20.1, but haven't tried it yet on trunk or 0.21.
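
For readers who want to see the failure mode from the description in isolation, here is a minimal sketch. It is not the DFSClient code: the class and method names (PipelineTimeoutSketch, streamerRecovers) are made up for illustration, and a CountDownLatch that is never released stands in for the datanode stopped with kill -STOP. One thread plays the DataStreamer, blocked with no timeout of its own; the other plays the ResponseProcessor, which times out on schedule and may or may not interrupt the streamer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the client's two pipeline threads; not HDFS code.
public class PipelineTimeoutSketch {

    /**
     * Returns true if the "streamer" thread unblocks within waitMillis
     * after the "responder" thread has hit its ack timeout.
     */
    static boolean streamerRecovers(boolean interruptOnTimeout,
                                    long ackTimeoutMillis,
                                    long waitMillis) {
        // Never counted down: plays the role of the hung datanode.
        CountDownLatch hungPeer = new CountDownLatch(1);
        CountDownLatch streamerDone = new CountDownLatch(1);

        Thread streamer = new Thread(() -> {
            try {
                hungPeer.await(); // blocking "write" with no timeout of its own
            } catch (InterruptedException e) {
                // Woken by the responder; recovery could start here.
            } finally {
                streamerDone.countDown();
            }
        }, "streamer");
        streamer.setDaemon(true); // let the JVM exit even if it stays blocked

        Thread responder = new Thread(() -> {
            try {
                // Timed wait for an "ack" that never arrives.
                boolean acked = hungPeer.await(ackTimeoutMillis, TimeUnit.MILLISECONDS);
                if (!acked && interruptOnTimeout) {
                    streamer.interrupt(); // the step the description says is missing
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "responder");
        responder.setDaemon(true);

        streamer.start();
        responder.start();
        try {
            responder.join();
            return streamerDone.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("no interrupt:   recovered=" + streamerRecovers(false, 100, 300));
        System.out.println("with interrupt: recovered=" + streamerRecovers(true, 100, 300));
    }
}
```

Without the interrupt the streamer stays blocked after the responder has already timed out, which mirrors the behavior reported above; with the interrupt it unblocks right away. Real socket writes do not necessarily respond to Thread.interrupt() the way a latch does, so this only illustrates the missing hand-off between the two threads, not a fix.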

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

