From: "Thanh Do (JIRA)"
To: hdfs-dev@hadoop.apache.org
Date: Thu, 17 Jun 2010 02:03:23 -0400 (EDT)
Subject: [jira] Created: (HDFS-1239) All datanodes are bad in 2nd phase
Message-ID: <1325596.49901276754603437.JavaMail.jira@thor>

All datanodes are bad in 2nd phase
----------------------------------

Key: HDFS-1239
URL: https://issues.apache.org/jira/browse/HDFS-1239
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Setups:
number of datanodes = 2
replication factor = 2
Type of failure: transient fault (a Java I/O call throws an exception or returns false)
Number of failures = 2
when/where failures happen = during the 2nd phase of the pipeline, one at each datanode when it tries to perform I/O (e.g. dataoutputstream.flush())

- Details:
This is similar to HDFS-1237. In this case, node1 throws an exception, which makes the client create a pipeline with only node2 and then retry the whole operation, which hits another failure. At this point the client considers all datanodes bad and never retries the whole operation again (i.e. it never asks the namenode for a new set of datanodes). In HDFS-1237, the bug is due to a permanent disk fault. In this case, it is due to a transient error.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and Haryadi Gunawi (haryadi@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
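
The failure sequence described under Details can be sketched as a small, self-contained simulation. This is only an illustration under stated assumptions, not the HDFS client code: the class PipelineRecoverySketch, the FakeDatanode type, and both write methods are hypothetical names invented here. The first method mimics the reported behaviour (drop each failing node, give up when the pipeline is empty); the second shows, as one possible alternative, retrying with a fresh node set, where the simulated "namenode" simply hands back the same two nodes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation; none of these names come from the HDFS source.
class PipelineRecoverySketch {

    /** Simulated datanode whose first `remainingFailures` writes throw. */
    static class FakeDatanode {
        final String name;
        int remainingFailures;

        FakeDatanode(String name, int transientFailures) {
            this.name = name;
            this.remainingFailures = transientFailures;
        }

        void write() throws IOException {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new IOException("transient I/O fault on " + name);
            }
        }
    }

    /** Returns the first node whose write fails, or null if all writes succeed. */
    static FakeDatanode firstFailure(List<FakeDatanode> live) {
        for (FakeDatanode dn : live) {
            try {
                dn.write();
            } catch (IOException e) {
                return dn;
            }
        }
        return null;
    }

    /** Reported behaviour: drop each failing node, fail once none are left. */
    static boolean writeWithoutRefetch(List<FakeDatanode> pipeline) {
        List<FakeDatanode> live = new ArrayList<>(pipeline);
        while (!live.isEmpty()) {
            FakeDatanode failed = firstFailure(live);
            if (failed == null) {
                return true; // every remaining node accepted the write
            }
            live.remove(failed); // rebuild the pipeline without the bad node
        }
        return false; // "all datanodes are bad"
    }

    /**
     * Assumed alternative: when the pipeline empties, start over with a fresh
     * node set (here, the same nodes) up to maxRefetches additional times.
     */
    static boolean writeWithRefetch(List<FakeDatanode> pipeline, int maxRefetches) {
        for (int attempt = 0; attempt <= maxRefetches; attempt++) {
            if (writeWithoutRefetch(pipeline)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<FakeDatanode> a = Arrays.asList(
                new FakeDatanode("node1", 1), new FakeDatanode("node2", 1));
        System.out.println("without refetch: " + writeWithoutRefetch(a));

        List<FakeDatanode> b = Arrays.asList(
                new FakeDatanode("node1", 1), new FakeDatanode("node2", 1));
        System.out.println("with one refetch: " + writeWithRefetch(b, 1));
    }
}
```

With two nodes that each fail exactly once, the first method gives up even though both faults have already cleared, while a single refetch lets the write succeed. Whether a refetch is the right remedy in the real client is not something this sketch can establish.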