hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
Date Fri, 18 Dec 2009 00:06:18 GMT

     [ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-101:
-----------------------------

    Attachment: hdfs-101.tar.gz

Hi Hairong,

This doesn't seem to fix the issue for me. Here's the test setup:

- The cluster has 3 DNs, running on EC2.
- On the DN at 10.251.43.82, dfs.data.dir points at two volumes, one of which has only 150M
remaining. The other nodes have plenty of disk space. I've set dfs.du.reserved to 0 on the
small-disk volume.
- From a separate node, I try to upload a 100M file. As soon as the upload starts, I issue
a "dd" command on 10.251.43.82 to fill up the small volume, causing the disk to fill while
the local DN is writing to it.
- If this patch works correctly, the client should identify .82 as the problem node and recover
the pipeline to the other two nodes in the cluster. It should never eject a different node
from the pipeline.

The write did succeed, but the client incorrectly decided .148 was the bad datanode even
though .82 was the node that experienced the failure. I've attached the logs from the
perspective of the writer as well as the 3 DNs in the pipeline.

Although it didn't show up in this particular capture, I also saw the following exception
in a few tests:

2009-12-17 18:36:31,622 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception for block blk_6564765548548790058_1354
java.util.NoSuchElementException
        at java.util.LinkedList.remove(LinkedList.java:788)
        at java.util.LinkedList.removeFirst(LinkedList.java:134)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2435)
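A minimal sketch of the failure mode that trace suggests (assuming the responder is pulling acks off DFSOutputStream's LinkedList-backed ack queue, which is not shown here): LinkedList.removeFirst() throws NoSuchElementException rather than returning null when the list is empty, so the responder racing with something that drains the queue (e.g. pipeline recovery) would die with exactly this exception:

```java
import java.util.LinkedList;
import java.util.NoSuchElementException;

// Sketch only: AckQueueSketch stands in for DFSOutputStream's ack queue; the
// real DFSClient code paths are more involved. This just demonstrates that
// removeFirst() on an emptied LinkedList throws NoSuchElementException.
public class AckQueueSketch {
    public static void main(String[] args) {
        LinkedList<Long> ackQueue = new LinkedList<Long>();
        ackQueue.add(1354L);
        ackQueue.removeFirst();     // queue drained, e.g. by pipeline recovery

        boolean threw = false;
        try {
            ackQueue.removeFirst(); // the responder's racing removal
        } catch (NoSuchElementException e) {
            threw = true;           // matches LinkedList.removeFirst in the trace
        }
        System.out.println("empty removeFirst threw: " + threw);
    }
}
```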


I'll make a local edit to fix the log message for "Expecting seqno" to actually print the
expected seqno, and try to rerun the tests, as that might help figure out the issue.
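Roughly what I mean by the log fix (a hypothetical sketch; the method and message wording here are illustrative, not the actual DFSClient code): print both the expected and the received seqno so a mismatch is actually debuggable from the log:

```java
// Illustrative only: shows the shape of the improved "Expecting seqno"
// message, with both sides of the mismatch included.
public class SeqnoLogSketch {
    static String mismatchMessage(long expected, long received) {
        return "ResponseProcessor: Expecting seqno " + expected
             + " but received " + received;
    }

    public static void main(String[] args) {
        System.out.println(mismatchMessage(42L, 41L));
    }
}
```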

> DFS write pipeline : DFSClient sometimes does not detect second datanode failure 
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-101
>                 URL: https://issues.apache.org/jira/browse/HDFS-101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: detectDownDN-0.20.patch, detectDownDN.patch, detectDownDN1.patch,
hdfs-101.tar.gz
>
>
> When the first datanode's write to second datanode fails or times out DFSClient ends
up marking first datanode as the bad one and removes it from the pipeline. Similar problem
exists on DataNode as well and it is fixed in HADOOP-3339. From HADOOP-3339 : 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of DFSClient)
interrupt() the 'responder' thread. But interrupting is a pretty coarse control. We don't
know what state the responder is in and interrupting has different effects depending on responder
state. To fix this properly we need to redesign how we handle these interactions."
> When the first datanode closes its socket from DFSClient, DFSClient should properly read
all the data left in the socket. Also, DataNode's closing of the socket should not result
in a TCP reset; otherwise I think DFSClient will not be able to read from the socket.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

