Date: Thu, 24 Mar 2011 10:47:46 -0700
From: Adam Phelps <amp@opendns.com>
To: hdfs-user@hadoop.apache.org
Subject: Re: Datanode won't start with bad disk

For reference, this is running Hadoop 0.20.2 from the CDH3B4 distribution.

- Adam

On 3/24/11 10:30 AM, Adam Phelps wrote:
> We have a bad disk on one of our datanode machines. We have
> dfs.datanode.failed.volumes.tolerated set to 2 and didn't see any
> problem while the DataNode process was running, but we hit a problem
> when we needed to restart the DataNode process:
>
> 2011-03-24 16:50:20,071 WARN org.apache.hadoop.util.DiskChecker:
> Incorrect permissions were set on /var/lib/stats/hdfs/4, expected:
> rwxr-xr-x, while actual: ---------. Fixing...
> 2011-03-24 16:50:20,089 INFO org.apache.hadoop.util.NativeCodeLoader:
> Loaded the native-hadoop library
> 2011-03-24 16:50:20,091 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: EPERM: Operation not
> permitted
>
> In this case /var/lib/stats/hdfs/4 is the mount point for the bad disk.
> It gets that permission error because we have the mount directory set to
> be immutable:
>
> root@s3:/var/log/hadoop# lsattr /var/lib/stats/hdfs/
> ------------------- /var/lib/stats/hdfs/2
> ----i------------e- /var/lib/stats/hdfs/4
> ------------------- /var/lib/stats/hdfs/3
> ------------------- /var/lib/stats/hdfs/1
>
> We set it that way because we'd previously seen HDFS just write to the
> local disk when a disk couldn't be mounted.
>
> HDFS is supposed to be able to handle a failed disk, but it doesn't seem
> to be doing the right thing in this case. Is this a known problem, or is
> there some other way we should be configuring things to allow the
> DataNode to come up in this situation?
>
> (Clearly we can remove the mount point from hdfs-site.xml, but that
> doesn't feel like the correct solution.)
>
> Thanks
> - Adam
>
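
For anyone trying to reproduce this, the setup described in the quoted message can be sketched roughly as the hdfs-site.xml fragment below. Only the value 2 for dfs.datanode.failed.volumes.tolerated and the four directory paths come from the message itself. The use of dfs.data.dir (the 0.20.x property name for DataNode storage directories), the order of the entries, and the assumption that the mount points are listed directly as data directories are illustrative guesses, not a copy of the actual file.

```xml
<!-- hdfs-site.xml: illustrative sketch, not the actual file from this cluster -->
<configuration>
  <property>
    <!-- Assumed: each mount point is listed directly as a data directory -->
    <name>dfs.data.dir</name>
    <value>/var/lib/stats/hdfs/1,/var/lib/stats/hdfs/2,/var/lib/stats/hdfs/3,/var/lib/stats/hdfs/4</value>
  </property>
  <property>
    <!-- From the message: up to 2 failed volumes should be tolerated -->
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>2</value>
  </property>
</configuration>
```

With a layout like this, the workaround mentioned in the message would amount to dropping /var/lib/stats/hdfs/4 from the dfs.data.dir list until the disk is replaced.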