From: "Kihwal Lee (JIRA)"
To: hdfs-dev@hadoop.apache.org
Date: Thu, 25 Jul 2013 14:45:49 +0000 (UTC)
Subject: [jira] [Created] (HDFS-5032) Write pipeline failures caused by slow or busy disk may not be handled properly.

Kihwal Lee created HDFS-5032:
--------------------------------

             Summary: Write pipeline failures caused by slow or busy disk may not be handled properly.
                 Key: HDFS-5032
                 URL: https://issues.apache.org/jira/browse/HDFS-5032
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.23.9, 2.1.0-beta
            Reporter: Kihwal Lee

Here is a scenario I recently encountered on an HBase cluster.

The disk on the 1st datanode in a write pipeline became extremely busy for many minutes, which slowed block writes on that disk. The 2nd datanode's socket read from the 1st datanode timed out after 60 seconds and it disconnected, triggering a block recovery. The problem was that the 1st datanode had not yet written the last packet, while the downstream nodes had, and an ACK had already been sent back to the client. For this reason, the block recovery was issued up to the ACKed size.

During the recovery, the 1st datanode was told to do copyBlock(). Since it did not have enough data on disk, it waited in waitForMinLength(), which did not help, so the command failed. The connection to the copy target had already been established, but the target never received any data. The data packet was eventually written, but too late for the copyBlock() call. The destination node for the copy had block metadata in memory but no file created on disk, so when the client contacted this node for block recovery, it failed too.

There are a few problems:
- The faulty (slow) node was not detected correctly; instead, the 2nd DN was excluded. The 1st DN's packet responder could have done a better job here, since it had no outstanding ACKs to receive. Alternatively, the 2nd DN could have hinted to the 1st DN about what happened.
- copyBlock() could probably wait longer than 3 seconds in waitForMinLength(). Or it could check the on-disk size up front and fail early, before even trying to establish a connection to the target. A sketch of this fail-early check follows below.
- Failed targets in a block write/copy should clean up the record or make it recoverable.
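Below is a minimal sketch, in Java, of the fail-early check suggested in the second bullet: verify the on-disk replica length before opening any connection to the copy target, and only poll up to a configurable timeout. This is not actual HDFS code; the ReplicaOnDisk interface, getOnDiskLength(), the timeout values, and the poll interval are assumptions for illustration. Only the waitForMinLength() and copyBlock() names come from the report above.

import java.io.IOException;

public class CopyBlockSketch {

  /** Hypothetical view of a replica's on-disk state (not an HDFS type). */
  interface ReplicaOnDisk {
    /** Bytes durably written to disk so far. */
    long getOnDiskLength() throws IOException;
  }

  /**
   * Poll until at least minLength bytes are on disk, giving up after
   * timeoutMs. The real waitForMinLength() reportedly waits only about
   * 3 seconds; a longer, configurable timeout is one of the fixes
   * suggested above.
   */
  static boolean waitForMinLength(ReplicaOnDisk replica, long minLength,
                                  long timeoutMs) throws IOException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (replica.getOnDiskLength() < minLength) {
      if (System.currentTimeMillis() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(100); // simple poll; real code could wait on flush events
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  /** Fail-early variant of copyBlock(): check length before any network I/O. */
  static void copyBlock(ReplicaOnDisk replica, long ackedLength)
      throws IOException {
    // Checking the on-disk size first means no connection is ever opened
    // to the copy target (and no orphaned in-memory block metadata is left
    // there) when the source cannot serve the ACKed length.
    if (!waitForMinLength(replica, ackedLength, 30_000L)) {
      throw new IOException("On-disk replica shorter than ACKed length "
          + ackedLength + "; failing before contacting the copy target");
    }
    // ... only now connect to the target and stream the block ...
  }
}

The point of this ordering is that a source datanode that cannot serve the ACKed length never causes the target to create in-memory block metadata with no file on disk, which is exactly the unrecoverable state described in the third bullet.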