Message-ID: <480DDA8F.4020505@apache.org>
Date: Tue, 22 Apr 2008 13:31:11 +0100
From: Steve Loughran <stevel@apache.org>
To: core-user@hadoop.apache.org
Subject: Re: jar files on NFS instead of DistributedCache

Joydeep Sen Sarma wrote:
> as opposed to 200 boxes all not being able to talk to the namenode? or the jobtracker?
>
> i think this is a topic that requires a little nuance. if there's a small cluster and a reliable (netapp) filer, then getting jars off the filer seems like a good alternative to consider. in 8 months of all of our users submitting streaming scripts from shared nfs mounts - aside from occasional auto-mounter issues (that are really operator error for the most part) - there have been no nfs issues.
>
> in that same time, we have had numerous problems with hdfs and/or map-reduce daemons going into spasms and killing tons of tasks (because of timeouts). this is not my opinion - it's empirical evidence. the upside for me, as an administrator, has been avoiding all the questions around the jobcache and the like (that this list is peppered with).
>
> while i don't know how many of the users on this list have access to a reliable/fast nfs server, i would bet a majority of them have small-ish clusters. and to just say nfs should be ruled out as a useful tool for such environments is a little unfair to people looking for sound advice.
>
> this is not to say that this is the right solution for large clusters, or for those trying to run nfs servers on linux (which, last i heard, has a notoriously bad nfs server). (perhaps open-solaris is a better option.)
>
> 'fair-and-balanced' :-)
>
> Joydeep
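(For anyone joining the thread late: the two mechanisms being compared look roughly like this. A minimal sketch against the DistributedCache API of the day; the paths and class name are illustrative, not from any real job.)

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class JarShippingSketch {

    public static void configure(JobConf conf) throws java.io.IOException {
        // Approach 1: DistributedCache. The jar lives in HDFS; the
        // framework copies it to each tasktracker's local disk and
        // puts it on the task classpath. No shared filesystem needed.
        DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), conf);

        // Approach 2: shared NFS mount. Every node already sees the
        // same path, so nothing is shipped at all - a streaming job
        // just names the script directly, e.g.
        //   hadoop jar hadoop-streaming.jar \
        //     -mapper /net/filer/scripts/mapper.py ...
        // The trade-off: every task now depends on the filer and the
        // automounter being healthy at the moment it starts.
    }
}

DistributedCache trades an extra copy per node for independence from any shared filesystem; the NFS path trades that copy for a runtime dependency on the filer.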
OK. NFS (and Samba) can be made to work in a well-managed environment where:

- you can set the boxes up so their clocks are synced off the same NTP server(s) and their timezone settings are in sync
- you have RAID storage for the NFS data
- you aren't too worried about locking
- you aren't too worried about someone getting a laptop onto the network - or, if they have that access, there are other things that someone would be interested in.

I will point you at some slides of some work I did long ago:

http://people.apache.org/~stevel/slides/when_web_services_go_bad.pdf

Here we were using NetApp behind the scenes, and got burned by the fact that even though the base protocol worked, and the clocks were in sync, the filestore was running in GMT0 and the hosts were running in PST, 8 hours adrift, so any file written appeared to be 8 hours old the moment it was created. When the half-hourly purge-all-old-data action kicked in, apparently out-of-date rendered content could get deleted before it had been used. That wasn't something that showed up during development, or even staging, but only on the production site, during our most-realistic-we-even-simulate-pauses tests. The core functional tests didn't pick it up, as they didn't simulate a delay between render and GET.

As a result:

1. I don't trust any remote filestore any more. It's not just a point of failure, it's a point of configuration trouble.

2. ant -diagnostics now checks that the temp dir's clock is in sync with the local machine, and even that the dir is writeable.

That's not a direct critique of NFS, more an observation that things out there can catch you out unawares. For example, if you are using Ant to build and copy the files, you'd better turn off timestamp checking in case those clocks are wrong; you also need to handle the problem of a slow copy stamping on earlier versions of the artifacts.

If the issue is how to make access to HDFS easier for users, that may be a better area to focus on.

-- 
Steve Loughran
http://www.1060.org/blogxter/publish/5
Author: Ant in Action
http://antbook.org/
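To make the timezone failure above concrete, here is a minimal sketch of an age-based purge going wrong under clock skew. The half-hour threshold, directory handling, and class name are assumptions for illustration, not details from the incident:

import java.io.File;

public class PurgeSketch {

    // Half-hourly purge window, as in the incident described above.
    static final long MAX_AGE_MS = 30L * 60 * 1000;

    public static void purge(File dir) {
        long now = System.currentTimeMillis();  // this host's clock
        File[] files = dir.listFiles();
        if (files == null) {
            return;  // not a directory, or it could not be read
        }
        for (File f : files) {
            // lastModified() reflects the filer's clock. An 8-hour
            // skew inflates the apparent age by 28,800,000 ms, so a
            // file written seconds ago already looks "old".
            long apparentAge = now - f.lastModified();
            if (apparentAge > MAX_AGE_MS) {
                f.delete();
            }
        }
    }
}

The cheap defence - and roughly what the ant -diagnostics temp-dir check does - is to write a probe file and compare its timestamp against the local clock before trusting any age-based decision.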