hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-915) Hung DN stalls write pipeline for far longer than its timeout
Date Thu, 18 Feb 2010 01:41:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835087#action_12835087 ]

Todd Lipcon commented on HDFS-915:

bq. This was important because the client had to correctly detect which datanode in the pipeline
was dead. I am unable to recollect that scenario as of now

I think these issues were identified and fixed by HDFS-793 and HDFS-101, no?

bq. But the better way to solve this issue (as done in trunk) is for the client to send a
ping message periodically

I agree - the pipeline heartbeats should ideally originate from the clients. In the current
implementation, though, they originate from the last node in the pipeline, so at least we
are detecting that each ResponseProcessor is alive. Since DataStreamer will interrupt ResponseProcessor
in the case of an error, we're also indirectly verifying that the DataStreamers are alive. So while
an end-to-end heartbeat would be a little better, the current mechanism should also work.

If the heartbeat doesn't arrive at a node in the pipeline, that node's reader times out. The
resulting IOException makes the ResponseProcessor shut down, but the ResponseProcessor does
not interrupt the DataStreamer; instead we have to wait out the much longer write timeout.
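The gap described above can be sketched with two plain threads. This is a minimal illustration, not the actual DFSClient code: the class and method names (PipelineInterruptSketch, readAckWithTimeout) are hypothetical stand-ins, and the read timeout is simulated with a thrown SocketTimeoutException rather than a real socket.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class PipelineInterruptSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch streamerDone = new CountDownLatch(1);

        // Stand-in for DataStreamer: blocked, subject only to the
        // much longer write timeout.
        Thread streamer = new Thread(() -> {
            try {
                Thread.sleep(TimeUnit.MINUTES.toMillis(8));
                System.out.println("streamer: write timeout elapsed");
            } catch (InterruptedException e) {
                // With the interrupt in place, recovery starts right away
                // instead of waiting out the write timeout.
                System.out.println("streamer: interrupted, starting pipeline recovery");
            } finally {
                streamerDone.countDown();
            }
        });
        streamer.start();

        // Stand-in for ResponseProcessor: its ack read hits the (short)
        // read timeout. The missing step in the current code is the
        // interrupt below - today it only shuts itself down.
        Thread responder = new Thread(() -> {
            try {
                readAckWithTimeout();
            } catch (SocketTimeoutException e) {
                System.out.println("responder: heartbeat missed, interrupting streamer");
                streamer.interrupt();
            }
        });
        responder.start();

        streamerDone.await();
    }

    // Simulates a blocking ack read that exceeds its read timeout.
    private static void readAckWithTimeout() throws SocketTimeoutException {
        try {
            Thread.sleep(100); // pretend this is the read-timeout interval
        } catch (InterruptedException ignored) { }
        throw new SocketTimeoutException("no heartbeat within read timeout");
    }
}
```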

> Hung DN stalls write pipeline for far longer than its timeout
> -------------------------------------------------------------
>                 Key: HDFS-915
>                 URL: https://issues.apache.org/jira/browse/HDFS-915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.1
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: local-dn.log
> After running kill -STOP on the datanode in the middle of a write pipeline, the client
> takes far longer to recover than it should. The ResponseProcessor times out in the correct
> interval, but doesn't interrupt the DataStreamer, which appears to not be subject to the same
> timeout. The client only recovers once the OS actually declares the TCP stream dead, which
> can take a very long time.
> I've experienced this on 0.20.1, haven't tried it yet on trunk or 0.21.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
