hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10490) Client may never recovery replica after a timeout during sending packet
Date Mon, 13 Jun 2016 17:52:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327836#comment-15327836 ]

Kihwal Lee commented on HDFS-10490:

{{BlockReceiver}} flushes after writing each packet locally, so the reported issue can happen
in two cases:
1) a datanode in the pipeline relayed the first packet downstream, but the local write hung
or somehow got stuck on flush.  The client would receive one ack if this node is not the last
node in the pipeline. The second packet won't get through since this node is stuck.  If a
new node is added during the recovery, it will try to transfer the first packet.
2) a datanode in the pipeline got stuck on sending the first packet downstream.  The client
won't receive any ack.  No actual data will be copied during recovery.

Also, for simple pipeline recovery without adding any node, {{stopWriter()}} will cause {{IOUtils.closeStream()}}
to be called against the active {{BlockReceiver}} instance, so both checksum and data output
will be flushed and closed. However,  {{transferReplicaForPipelineRecovery()}} does not take
care of the active writer.

If an rbw copy failed in case 1), it was not a good node to include anyway.  Before HDFS-9106,
a single transfer failure would cause permanent failure. So if this was the cause, it could
have survived with HDFS-9106.  

If 2) was the case and the stuck node was the 1st node in the pipeline, the recovery can be
tricky. As stated in the description, the connections downstream might still be up and the
header might not have been flushed on the remaining "healthy" nodes. But normally, timeout
causes a connection to break and {{closeStream()}} to be called. I see you had to short-circuit
{{close()}} to artificially keep the connection open in the test case.

I can think of several potential solutions to this case.
1) The approach taken by the current patch. Flush the meta file after the header is written.
2) Revisit the design of {{transferReplicaForPipelineRecovery()}} and {{waitForMinLength()}}.
 Make it stop the active writer if possible.
3) Since no packet has been acked, the state of datanodes is uncertain to the client. Treat
it like block output stream creation failure. I.e. do {{abandonBlock()}} and retry with the
suspected bad node excluded.
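
Option 1) can be illustrated with plain java.io, outside of HDFS. The following is a minimal sketch, not the patch itself: the class {{MetaHeaderFlushDemo}}, the helper {{writeHeader}}, the header values and the 7-byte layout (2-byte version, 1-byte checksum type, 4-byte chunk size) are assumptions made for illustration. It shows that a header written through a {{BufferedOutputStream}} does not reach the file until it is flushed.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class MetaHeaderFlushDemo {
    /** Writes an illustrative header; the layout is assumed, not taken from HDFS source. */
    static void writeHeader(DataOutputStream out) throws IOException {
        out.writeShort(1);   // version (hypothetical value)
        out.writeByte(2);    // checksum type id (hypothetical value)
        out.writeInt(4096);  // bytes per checksum chunk
    }

    /** Returns {file length before flush, file length after flush}. */
    static long[] lengthsAroundFlush() throws IOException {
        File meta = File.createTempFile("blk_demo", ".meta");
        meta.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(meta)))) {
            writeHeader(out);
            long before = meta.length(); // header still sits in the in-memory buffer
            out.flush();                 // what option 1) proposes doing right away
            long after = meta.length();  // header is now visible to other readers
            return new long[] {before, after};
        }
    }

    public static void main(String[] args) throws IOException {
        long[] lengths = lengthsAroundFlush();
        System.out.println("before flush: " + lengths[0] + ", after flush: " + lengths[1]);
    }
}
```

Any reader that opens the file between the write and the flush sees an empty file, which is the window that option 1) closes.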

1) will address most cases, but 3) (a sledgehammer approach) may be the surest way. 
2) has a bigger impact and may need to be considered in a separate jira.  As for the patch,
{{closedInTest}} doesn't seem to serve any purpose.
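
Option 3) amounts to a retry loop that treats "no packet acked yet" like a failed block creation. The sketch below is only a toy model under stated assumptions: the {{BlockAllocator}} and {{FirstPacketSender}} interfaces, their methods and the datanode names are hypothetical and are not the real client or NameNode API. Only the idea of abandoning the block and retrying with the suspected bad node excluded comes from the list above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AbandonAndRetryDemo {
    /** Hypothetical view of the allocation calls a client would need. */
    interface BlockAllocator {
        List<String> allocate(Set<String> excluded) throws IOException;
        void abandon(List<String> pipeline) throws IOException;
    }

    /** Hypothetical first-packet attempt: returns the suspected bad node, or null once acked. */
    interface FirstPacketSender {
        String send(List<String> pipeline);
    }

    static List<String> writeFirstPacket(BlockAllocator alloc, FirstPacketSender sender,
                                         int maxRetries) throws IOException {
        Set<String> excluded = new HashSet<>();
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            List<String> pipeline = alloc.allocate(excluded);
            String badNode = sender.send(pipeline);
            if (badNode == null) {
                return pipeline;     // first packet acked, normal writing can continue
            }
            alloc.abandon(pipeline); // nothing was acked, so there is nothing to recover
            excluded.add(badNode);   // do not pick the suspected node again
        }
        throw new IOException("no usable pipeline after " + (maxRetries + 1) + " attempts");
    }

    /** Toy cluster of four nodes in which dn1 is the stuck node. */
    static List<String> demo() throws IOException {
        final List<String> cluster = Arrays.asList("dn1", "dn2", "dn3", "dn4");
        BlockAllocator alloc = new BlockAllocator() {
            @Override public List<String> allocate(Set<String> excluded) {
                List<String> picked = new ArrayList<>();
                for (String dn : cluster) {
                    if (!excluded.contains(dn) && picked.size() < 3) {
                        picked.add(dn);
                    }
                }
                return picked;
            }
            @Override public void abandon(List<String> pipeline) {
                // nothing to clean up in the toy model
            }
        };
        FirstPacketSender sender = pipeline -> pipeline.contains("dn1") ? "dn1" : null;
        return writeFirstPacket(alloc, sender, 2);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

In the toy run the first pipeline contains dn1 and is abandoned, and the second attempt succeeds with the remaining three nodes. No replica transfer is involved, so the empty meta file is never read.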

> Client may never recovery replica after a timeout during sending packet
> -----------------------------------------------------------------------
>                 Key: HDFS-10490
>                 URL: https://issues.apache.org/jira/browse/HDFS-10490
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: He Tianyi
>         Attachments: HDFS-10490.0001.patch, HDFS-10490.patch
> For a newly created replica, a meta file is created in the constructor of {{BlockReceiver}}
(for {{WRITE_BLOCK}} op). Its header will be written lazily (buffered in memory first by {{BufferedOutputStream}}).

> If the following packets fail to be delivered (e.g. under extreme network conditions), the header
may never get flushed until the stream is closed. 
> However, {{BlockReceiver}} will not call close until block receiving is finished or an exception
is encountered. Also, under extreme network conditions, both RST & FIN may not be delivered in time.

> In this case, if the client tries to initiate a {{transferBlock}} to a new datanode (in
{{addDatanode2ExistingPipeline}}), the existing datanode will see an empty meta file if its {{BlockReceiver}}
did not close in time. 
> Then, after HDFS-3429, a default {{DataChecksum}} (NULL, 512) will be used during transfer.
So when the client then tries to recover the pipeline after the transfer completes, it may encounter
the following exception:
> {noformat}
> java.io.IOException: Client requested checksum DataChecksum(type=CRC32C, chunkSize=4096)
when appending to an existing block with different chunk size: DataChecksum(type=NULL, chunkSize=512)
>         at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.createStreams(ReplicaInPipeline.java:230)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:226)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:798)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:76)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:243)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This will repeat until the datanode replacement policy is exhausted.
> Also note that, with bad luck (as in my case), 20k clients may all be doing this. It is to some
extent a DDoS attack on the NameNode (because of getAdditionalDataNode calls).
> I suggest we flush immediately after the header is written, so that nobody sees
an empty meta file.
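
The failure sequence in the description can be mimicked without HDFS. This is a hedged sketch: {{Checksum}}, {{readOrDefault}}, {{checkAppend}}, the 7-byte header length and the type id mapping are hypothetical stand-ins, not the real {{DataChecksum}} or {{ReplicaInPipeline}} code. Only the fallback to (NULL, 512) for an empty meta file and the resulting mismatch with the client's (CRC32C, 4096) are taken from the report.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

public class EmptyMetaDemo {
    static final int HEADER_SIZE = 7; // assumed: 2-byte version + 1-byte type + 4-byte chunk size

    /** Hypothetical stand-in for a checksum descriptor; not the real DataChecksum class. */
    static final class Checksum {
        final String type;
        final int chunkSize;
        Checksum(String type, int chunkSize) {
            this.type = type;
            this.chunkSize = chunkSize;
        }
        @Override public String toString() {
            return "Checksum(type=" + type + ", chunkSize=" + chunkSize + ")";
        }
    }

    /** Reads the header if it reached the disk, else falls back to the default (NULL, 512). */
    static Checksum readOrDefault(byte[] metaBytes) {
        if (metaBytes.length < HEADER_SIZE) {
            return new Checksum("NULL", 512); // header was never flushed
        }
        ByteBuffer buf = ByteBuffer.wrap(metaBytes);
        buf.getShort();               // skip version
        int typeId = buf.get();       // hypothetical mapping: 2 means CRC32C here
        int chunkSize = buf.getInt();
        return new Checksum(typeId == 2 ? "CRC32C" : "NULL", chunkSize);
    }

    /** Rejects an append whose chunk size differs from what the replica reports. */
    static void checkAppend(Checksum requested, Checksum replica) throws IOException {
        if (requested.chunkSize != replica.chunkSize) {
            throw new IOException("Client requested " + requested
                + " when appending to an existing block with different chunk size: " + replica);
        }
    }

    public static void main(String[] args) {
        Checksum client = new Checksum("CRC32C", 4096);
        try {
            checkAppend(client, readOrDefault(new byte[0])); // transfer saw an empty meta file
            System.out.println("append accepted");
        } catch (IOException e) {
            System.out.println("append rejected: " + e.getMessage());
        }
    }
}
```

With a complete header in the byte array, {{readOrDefault}} returns the client's own chunk size and {{checkAppend}} passes, which is the situation the proposed immediate flush is meant to guarantee.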

This message was sent by Atlassian JIRA
