Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Received-SPF: neutral (herse.apache.org: local policy)
Message-ID: <4570B6F9.50700@yahoo-inc.com>
Date: Fri, 01 Dec 2006 15:12:57 -0800
From: Raghu Angadi <rangadi@yahoo-inc.com>
User-Agent: Thunderbird 1.5.0.8 (Windows/20061025)
MIME-Version: 1.0
To: hadoop-dev@lucene.apache.org
Subject: Re: minor change in dataNode handling of multiple directories.
References: <456E284D.3040805@yahoo-inc.com>
 <1bf79d3e0611291803h2b5e3b5dq2992255a00f7f92f@mail.gmail.com>
 <456F1E37.3070706@yahoo-inc.com> <456F2843.4010807@yahoo-inc.com>
In-Reply-To: <456F2843.4010807@yahoo-inc.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit


Does anyone have a config where some data directories don't exist at
all? The current datanode does not work in that case: it throws an
IOException. The current code only tolerates a directory that exists but
cannot be locked. Yes, we could decide not to throw the exception if the
directory does not exist.
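
A rough sketch of the behavior described above, for illustration only. It
is not the actual datanode code: the class name, method names, and the
lock file name "sketch.lock" are all hypothetical, and the rule that at
least one directory must be lockable is an assumption drawn from the
"any one of them" wording in the original proposal.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.util.ArrayList;
import java.util.List;

class DataDirLockSketch {

    // Hypothetical lock file name, not taken from the Hadoop source.
    static final String LOCK_FILE = "sketch.lock";

    // Returns the configured directories that could be locked.
    // A missing directory is an error; a directory that exists but is
    // already locked is silently skipped. Acquired locks are appended to
    // heldLocks so the caller can release them later.
    static List<File> usableDirs(List<File> configured, List<FileLock> heldLocks)
            throws IOException {
        List<File> usable = new ArrayList<>();
        for (File dir : configured) {
            if (!dir.isDirectory()) {
                throw new IOException("Data directory does not exist: " + dir);
            }
            FileLock lock = tryLock(new File(dir, LOCK_FILE));
            if (lock != null) {
                heldLocks.add(lock);
                usable.add(dir);
            }
        }
        if (usable.isEmpty()) {
            // Assumption: at least one directory must be usable.
            throw new IOException("Could not lock any data directory");
        }
        return usable;
    }

    // Tries to take an exclusive lock on the file, creating it if needed.
    // Returns null if the lock is held elsewhere (another process or this JVM).
    static FileLock tryLock(File lockFile) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(lockFile, "rws");
        try {
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                raf.close();
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            raf.close();
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: DataDirLockSketch <dir> [<dir> ...]");
            return;
        }
        List<File> dirs = new ArrayList<>();
        for (String arg : args) {
            dirs.add(new File(arg));
        }
        System.out.println("Usable: " + usableDirs(dirs, new ArrayList<>()));
    }
}
```

Not throwing for a missing directory would amount to replacing the first
throw with a skip, so that a missing directory is treated like one that
could not be locked.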

For now I am just going to keep the same behavior as before.

Raghu.

> Konstantin Shvachko wrote:
>> Good point.
>> I think we should document it (Javadoc?) making it a feature rather 
>> than a side effect.
>>
>> Bryan A. P. Pendleton wrote:
>>
>>> I would prefer this proposal not be implemented. The current way things
>>> work makes it possible to configure, centrally, a list of all
>>> directories that _could_ be used for storage. Since there's no easy way
>>> to do per-node configurations (nor would it be desirable, IMO, in this
>>> case), the directories config ends up being the list of all possibly
>>> usable directories. Many of my cluster nodes are configured using
>>> "rocksclusters": they will have a uniform set of mounts created, one for
>>> each physical drive, at boot/re-install. If I specify in my config the
>>> list of all directories up to the maximum number of drives a machine
>>> will ever have, then I get easy drop-in use, regardless of variations in
>>> nodes in the cluster. I have been relying on the current behavior to
>>> keep me sane.
>>>
>>> OTOH, I wouldn't oppose making this the default behavior, with a
>>> configuration param that would set things back to the old behavior.
>>>
>>> On 11/29/06, Raghu Angadi <rangadi@yahoo-inc.com> wrote:
>>>
>>>>
>>>>
>>>> As part of the "Version upgrade" related changes, I am thinking of
>>>> strictly requiring that the datanode be able to lock _all_ the
>>>> configured directories instead of any one of them.
>>>>
>>>> Currently, if multiple data directories are specified for a datanode, it
>>>> tries to lock a file in each of the directories. If it fails to lock
>>>> some of the directories, it will use the directories that it could lock.
>>>> It looks like this flexibility was included mainly for convenience in
>>>> the config file.
>>>>
>>>> This might not affect anyone; let us know your opinions.
>>>>
>>>> Note that all directories have the same storage id. So each individual
>>>> directory is not complete by itself but a part of one storage.
>>>>
>>>> Raghu.
>>>
>