hadoop-hdfs-issues mailing list archives

From "chackaravarthy (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2891) Some times first DataNode detected as bad when we power off for the second DataNode.
Date Sun, 01 Apr 2012 09:26:37 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243686#comment-13243686 ]

chackaravarthy commented on HDFS-2891:
--------------------------------------

Hi Uma,

This problem exists in branch-1.0 also.

*This can happen in the following scenario:*
{quote}
1. Consider the pipeline [ DN1 -> DN2 -> DN3 ].
2. Create one file and get the stream.
3. Write some bytes using that stream and call sync.
4. Keep the stream open.
5. Now unplug the network / power off / bring the ethernet down on the DN2 machine.
{quote}

*The explanation is as follows:*

*Consider the case when the caller is not writing any data, the DataStreamer & ResponseProcessor
threads are running, and the DN2 machine's ethernet is down:*

{quote}
	1. At *time t1*, the ResponseProcessor starts reading the ack from DN1 [timeout is 69 secs].

	2. But in DN1, the PacketResponder has not yet started reading the ack. It will be waiting on
the ackQueue until one packet arrives.

	3. Only after time *t1 + 34.5* secs will the DataStreamer stream a HEART_BEAT packet to DN1
[if there is no data packet, the DataStreamer sends a HEART_BEAT packet after waiting for half
of the timeout value].

	4. Only then does the DataXceiver receive the packet and put it in the ackQueue on the DN side.

	5. At *time t2*, once the packet is enqueued in the ackQueue, the PacketResponder starts reading
the ack from DN2 [timeout is 66 secs].

	6. As the DN2 machine's ethernet is down, the PacketResponder in DN1 never gets the reply.

	7. But the PacketResponder will get a timeout only after *t2 + 66* secs.

	8. Hence the ResponseProcessor gets a SocketTimeoutException earlier than the PacketResponder.

{quote}

		*t2 - t1 >= 34.5 secs*   [even if it is just greater than 3 secs, the reported scenario can
happen, since the difference between the two timeouts is 69 - 66 = 3 secs]

	So, the DFSClient gets a SocketTimeoutException before DN1 does.
	Hence the DFSClient detects DN1 as the bad datanode [which is up] and does not detect DN2 as the
bad datanode [which is down].
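A quick back-of-the-envelope check of the timeline above (plain Python, values taken from this comment; t1 normalized to 0):

```python
# Timeline of the two ack-read timeouts described above (all values in seconds).
READ_TIMEOUT_CLIENT = 69.0   # DFSClient ResponseProcessor socket timeout
READ_TIMEOUT_DN = 66.0       # DN1 PacketResponder socket timeout
HEARTBEAT_DELAY = 34.5       # DataStreamer sends HEART_BEAT after timeout/2

t1 = 0.0                          # ResponseProcessor starts reading the ack from DN1
t2 = t1 + HEARTBEAT_DELAY         # PacketResponder starts reading the ack from DN2

client_times_out_at = t1 + READ_TIMEOUT_CLIENT   # = 69.0
dn1_times_out_at = t2 + READ_TIMEOUT_DN          # = 100.5

# The client fires first whenever t2 - t1 > 69 - 66 = 3 secs.
assert client_times_out_at < dn1_times_out_at
print(client_times_out_at, dn1_times_out_at)     # 69.0 100.5
```

So the client's timeout fires 31.5 secs before DN1's, and the client blames DN1.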


*Reason :*

In the DataNode, the PacketResponder starts reading the ack only after it receives one packet
[either a data or a heartbeat packet].
But in the DFSClient, the ResponseProcessor starts reading the ack before any packet is sent.
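A toy simulation of this asymmetry (plain Python threads, not Hadoop code; the timeouts are scaled down to 10 ms per "sec" so it runs quickly):

```python
import queue
import threading
import time

# Toy model: the client-side ResponseProcessor starts its read-timeout clock
# immediately, while the DN-side PacketResponder first blocks on an empty
# ackQueue and only starts its clock once a packet (data or heartbeat) arrives.
SCALE = 0.01                     # 1 "sec" from the comment = 10 ms here
CLIENT_TIMEOUT = 69 * SCALE      # DFSClient ResponseProcessor
DN_TIMEOUT = 66 * SCALE          # DN1 PacketResponder
HEARTBEAT_DELAY = 34.5 * SCALE   # DataStreamer sends HEART_BEAT at timeout/2

ack_queue = queue.Queue()
events = []                      # (who, elapsed secs) in the order timeouts fire
start = time.monotonic()

def response_processor():
    # Client side: the read-timeout clock starts right away.
    time.sleep(CLIENT_TIMEOUT)   # no ack ever arrives -> SocketTimeoutException
    events.append(("client_timeout", time.monotonic() - start))

def packet_responder():
    # DN1 side: waits on the ackQueue until a packet arrives...
    ack_queue.get()
    # ...and only then starts its own (doomed) read of DN2's ack.
    time.sleep(DN_TIMEOUT)
    events.append(("dn1_timeout", time.monotonic() - start))

threads = [threading.Thread(target=response_processor),
           threading.Thread(target=packet_responder)]
for t in threads:
    t.start()
time.sleep(HEARTBEAT_DELAY)      # DataStreamer finally sends a HEART_BEAT packet
ack_queue.put("HEART_BEAT")
for t in threads:
    t.join()

print(events[0][0])              # "client_timeout": the client gives up before DN1
```

The client's timeout fires at ~69 "secs" while DN1's fires at ~34.5 + 66 = 100.5 "secs", reproducing the order of failures described above.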

                
> Some times first DataNode detected as bad when we power off for the second DataNode.
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-2891
>                 URL: https://issues.apache.org/jira/browse/HDFS-2891
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 1.1.0
>            Reporter: Uma Maheswara Rao G
>
> In one of my clusters, observed this situation.
> This issue looks to be due to a timeout in the ResponseProcessor at the client side; it is marking
the first DataNode as bad.
> This happens in version 20.2. This can be there in branch-1 as well and will check for
trunk.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
