From: Adam Phelps
Date: Thu, 24 Mar 2011 10:47:46 -0700
To: hdfs-user@hadoop.apache.org
Subject: Re: Datanode won't start with bad disk

For reference, this is running Hadoop 0.20.2 from the CDH3B4 distribution.

- Adam

On 3/24/11 10:30 AM, Adam Phelps wrote:
> We have a bad disk on one of our datanode machines. We have
> dfs.datanode.failed.volumes.tolerated set to 2 and didn't see any
> problem while the DataNode process was running, but we hit a problem
> when we needed to restart the DataNode process:
>
> 2011-03-24 16:50:20,071 WARN org.apache.hadoop.util.DiskChecker:
> Incorrect permissions were set on /var/lib/stats/hdfs/4, expected:
> rwxr-xr-x, while actual: ---------. Fixing...
> 2011-03-24 16:50:20,089 INFO org.apache.hadoop.util.NativeCodeLoader:
> Loaded the native-hadoop library
> 2011-03-24 16:50:20,091 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: EPERM: Operation not
> permitted
>
> In this case /var/lib/stats/hdfs/4 is the mount point for the bad disk.
> It gets that permission error because we have the mount directory set to
> be immutable:
>
> root@s3:/var/log/hadoop# lsattr /var/lib/stats/hdfs/
> ------------------- /var/lib/stats/hdfs/2
> ----i------------e- /var/lib/stats/hdfs/4
> ------------------- /var/lib/stats/hdfs/3
> ------------------- /var/lib/stats/hdfs/1
>
> We set it that way because we'd previously seen HDFS just write to the
> local disk when a disk couldn't be mounted.
>
> HDFS is supposed to be able to handle a failed disk, but it doesn't seem
> to be doing the right thing in this case. Is this a known problem, or is
> there some other way we should be configuring things to allow the
> DataNode to come up in this situation?
>
> (Clearly we can remove the mount point from hdfs-site.xml, but that
> doesn't feel like the correct solution.)
>
> Thanks
> - Adam
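
To make the expectation concrete, here is a minimal sketch of the startup behavior I would expect from dfs.datanode.failed.volumes.tolerated. It is written in Python purely for illustration and is not Hadoop's actual code: the function name usable_volumes and the os.access-based check are assumptions of mine, standing in for whatever the DataNode really does when it validates its data directories. The idea is only that a directory which fails the check gets counted as a failed volume, and startup is aborted only when the count exceeds the tolerated number.

```python
import os
from typing import List, Tuple


def usable_volumes(data_dirs: List[str], tolerated: int) -> Tuple[List[str], List[str]]:
    """Split data_dirs into (usable, failed) lists.

    Hypothetical helper: raises RuntimeError only when more volumes
    have failed than the tolerated count allows.
    """
    usable: List[str] = []
    failed: List[str] = []
    for d in data_dirs:
        # Treat a volume as usable only if it is a directory that the
        # current user can read, write, and traverse.
        if os.path.isdir(d) and os.access(d, os.R_OK | os.W_OK | os.X_OK):
            usable.append(d)
        else:
            # Record the failure instead of aborting startup right away.
            failed.append(d)
    if len(failed) > tolerated:
        raise RuntimeError(
            "%d volumes failed, only %d tolerated: %s"
            % (len(failed), tolerated, ", ".join(failed))
        )
    return usable, failed
```

With the four directories listed above and a tolerated count of 2, a check along these lines would report /var/lib/stats/hdfs/4 as failed and still let the process come up on the remaining three, which is the behavior I was hoping for on restart.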