hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
Date Wed, 16 Jun 2010 18:44:31 GMT

     [ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-101:

    Attachment: hdfs-101-branch-0.20-append-cdh3.txt

Hey Nicolas,

I just compared our two patches side by side. The one I've been testing with (and which made a noticeable
improvement in recovery detecting the correct down node during cluster failure testing) is attached.
Here are a few differences I noticed (though some may just be because the diffs are against different
branches):
- Looks like your patch doesn't maintain wire compat when mirrorError is true, since it constructs
a "replies" list with only 2 elements (not based on the number of downstream nodes)
- When receiving packets in BlockReceiver, I am explicitly forwarding HEART_BEAT packets, whereas
it looks like you're not checking for them. Have you verified, by leaving a connection open
with no data flowing, that heartbeats are handled properly in BlockReceiver?
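The wire-compat point above can be sketched in a few lines of Java. This is not the actual patch code; the class and method names (AckSketch, buildReplies, Status) are illustrative. The idea is that even when mirrorError is set, the ack must carry one status per node in the pipeline, so the client can index into the replies and identify which position failed:

```java
import java.util.Arrays;

public class AckSketch {
    enum Status { SUCCESS, ERROR }

    // Build the replies array for a pipeline ack. numDownstream is the number
    // of datanodes downstream of this one. On mirrorError, every downstream
    // slot is marked ERROR rather than the list being truncated to 2 entries,
    // which would break clients that index replies by pipeline position.
    static Status[] buildReplies(int numDownstream, boolean mirrorError) {
        Status[] replies = new Status[numDownstream + 1];
        replies[0] = Status.SUCCESS;            // this node's own status
        Arrays.fill(replies, 1, replies.length,
                    mirrorError ? Status.ERROR : Status.SUCCESS);
        return replies;
    }

    public static void main(String[] args) {
        // With 2 downstream nodes and a mirror failure the client still sees
        // 3 statuses and can pinpoint the first failing position.
        System.out.println(Arrays.toString(buildReplies(2, true)));
        // prints [SUCCESS, ERROR, ERROR]
    }
}
```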

> DFS write pipeline : DFSClient sometimes does not detect second datanode failure 
> ---------------------------------------------------------------------------------
>                 Key: HDFS-101
>                 URL: https://issues.apache.org/jira/browse/HDFS-101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20-append, 0.20.1
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.20.2, 0.21.0
>         Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, detectDownDN2.patch,
detectDownDN3-0.20-yahoo.patch, detectDownDN3-0.20.patch, detectDownDN3.patch, hdfs-101-branch-0.20-append-cdh3.txt,
hdfs-101.tar.gz, HDFS-101_20-append.patch, pipelineHeartbeat.patch, pipelineHeartbeat_yahoo.patch
> When the first datanode's write to the second datanode fails or times out, DFSClient ends
up marking the first datanode as the bad one and removes it from the pipeline. A similar problem
exists on the DataNode side and was fixed in HADOOP-3339. From HADOOP-3339: 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of DFSClient)
interrupt() the 'responder' thread. But interrupting is a pretty coarse control. We don't
know what state the responder is in and interrupting has different effects depending on responder
state. To fix this properly we need to redesign how we handle these interactions."
> When the first datanode closes its socket to DFSClient, DFSClient should properly read
all the data left in the socket. Also, the DataNode's closing of the socket should not result
in a TCP reset; otherwise I think DFSClient will not be able to read from the socket.
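The draining requirement in the description can be sketched as follows. This is a minimal illustration, not DFSClient code: the reader must consume everything still buffered on the stream until EOF, because if either side closes a socket while unread inbound data remains, the OS may send a TCP RST, making any data still in flight unreadable on the peer's side.

```java
import java.io.IOException;
import java.io.InputStream;

public class SocketDrain {
    // Read and discard everything remaining on the stream until EOF,
    // returning the number of bytes drained. Calling this before close()
    // avoids discarding unread inbound data, which can trigger a TCP reset.
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }
}
```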

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
