From: Todd Lipcon
Date: Wed, 23 Nov 2011 15:37:48 -0800
Subject: Re: Blocks are getting corrupted under very high load
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org

On Wed, Nov 23, 2011 at 1:23 AM, Uma Maheswara Rao G wrote:
> Yes, Todd, the block after restart is small and its genstamp is also lesser.
> Here a complete machine reboot happened. The boards are configured such that if a task gets no CPU cycles for 480 secs, the machine reboots itself:
> kernel.hung_task_timeout_secs = 480 sec.

So it sounds like the following happened:
- while writing the file, the pipeline got reduced down to 1 node due to timeouts from the other two
- soon thereafter (before more replicas were made), that last replica kernel-panicked without syncing the data
- on reboot, the filesystem lost some edits from its ext3 journal, and the block got moved back into the RBW directory with truncated data
- HDFS did "the right thing" - at least what the algorithms say it should do - because it had gotten a commitment for a later replica

If you have a build which includes HDFS-1539, you could consider setting dfs.datanode.synconclose to true, which would have prevented this problem.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera
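For reference, the workaround mentioned in the thread would be a datanode-side setting in hdfs-site.xml. This is a sketch only - it assumes a build that actually includes HDFS-1539, and uses the property name exactly as given above:

```xml
<!-- hdfs-site.xml (datanode): fsync block files when a replica is
     finalized, so completed blocks survive a kernel panic or power
     loss before the OS flushes its page cache / ext3 journal.
     Trade-off: extra write latency on block close. -->
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>
```

The default is false, since syncing every block on close adds I/O cost; it is mainly worth enabling on clusters where whole-machine crashes (like the watchdog reboot described above) are a realistic failure mode.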