Message-ID: <480DDA8F.4020505@apache.org>
Date: Tue, 22 Apr 2008 13:31:11 +0100
From: Steve Loughran <stevel@apache.org>
To: core-user@hadoop.apache.org
Subject: Re: jar files on NFS instead of DistributedCache

Joydeep Sen Sarma wrote:
> as opposed to 200 boxes all not being able to talk to the namenode? or the jobtracker?
>
> i think this is a topic that requires a little nuance. if there's a small cluster and a reliable (netapp) filer, then getting jars off the filer seems like a good alternative to consider. in 8 months of all of our users submitting streaming scripts from shared nfs mounts - aside from occasional auto-mounter issues (that are really operator error for the most part) - there have been no nfs issues.
>
> in that same time, we have had numerous problems with hdfs and/or map-reduce daemons going into spasms and killing tons of tasks (because of timeouts). this is not my opinion - it's empirical evidence. the upside for me, as an administrator, has been avoiding all the questions around the jobcache and the like (that this list is peppered with).
>
> while i don't know how many of the users on this list have access to a reliable/fast nfs server, i would bet a majority of them have small-ish clusters. and to just say nfs should be ruled out as a useful tool for such environments is a little unfair to people looking for sound advice.
>
> this is not to say that this is the right solution for large clusters, or for those trying to run nfs servers on linux (which, last i heard, has a notoriously bad nfs server). (perhaps open-solaris is a better option.)
>
> 'fair-and-balanced' :-)
>
> Joydeep
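(For anyone joining the thread late: the two mechanisms being compared look roughly like this. A minimal sketch against the DistributedCache API of the day; the paths and class name are illustrative, not from any real job.)

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class JarShippingSketch {

    public static void configure(JobConf conf) throws java.io.IOException {
        // Approach 1: DistributedCache. The jar lives in HDFS; the
        // framework copies it to each tasktracker's local disk and
        // puts it on the task classpath. No shared filesystem needed.
        DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), conf);

        // Approach 2: shared NFS mount. Every node already sees the
        // same path, so nothing is shipped at all - a streaming job
        // just names the script directly, e.g.
        //   hadoop jar hadoop-streaming.jar \
        //     -mapper /net/filer/scripts/mapper.py ...
        // The trade-off: every task now depends on the filer and the
        // automounter being healthy at the moment it starts.
    }
}

DistributedCache trades an extra copy per node for independence from any shared filesystem; the NFS path trades that copy for a runtime dependency on the filer.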
OK. NFS (and Samba) can be made to work in a well-managed environment where:

- you can set the boxes up so their clocks are synced off the same NTP server(s) and their timezone settings are in sync
- you have RAID storage for the NFS data
- you aren't too worried about locking
- you aren't too worried about someone getting a laptop onto the network - or, if they have that access, there are other things that someone would be interested in.

I will point you at some slides of some work I did long ago:

http://people.apache.org/~stevel/slides/when_web_services_go_bad.pdf

Here we were using NetApp behind the scenes, and got burned by the fact that even though the base protocol worked, and the clocks were in sync, the filestore was running in GMT0 and the hosts were running in PST, 8 hours adrift, so any file written appeared to be 8 hours old the moment it was created. When the half-hourly purge-all-old-data action kicked in, apparently out-of-date rendered content could get deleted before it had been used. That wasn't something that showed up during development, or even staging, but only on the production site, during our most-realistic-we-even-simulate-pauses tests. The core functional tests didn't pick it up, as they didn't simulate a delay between render and GET.

As a result:

1. I don't trust any remote filestore any more. It's not just a point of failure, it's a point of configuration trouble.

2. ant -diagnostics now checks that the temp dir's clock is in sync with the local machine, and even that the dir is writeable.

That's not a direct critique of NFS, more an observation that things out there can catch you out unawares. For example, if you are using Ant to build and copy the files, you'd better turn off timestamp checking in case those clocks are wrong; you also need to handle the problem of a slow copy stamping on earlier versions of the artifacts.

If the issue is how to make access to HDFS easier for users, that may be a better area to focus on.

-- 
Steve Loughran
http://www.1060.org/blogxter/publish/5
Author: Ant in Action
http://antbook.org/
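To make the timezone failure above concrete, here is a minimal sketch of an age-based purge going wrong under clock skew. The half-hour threshold, directory handling, and class name are assumptions for illustration, not details from the incident:

import java.io.File;

public class PurgeSketch {

    // Half-hourly purge window, as in the incident described above.
    static final long MAX_AGE_MS = 30L * 60 * 1000;

    public static void purge(File dir) {
        long now = System.currentTimeMillis();  // this host's clock
        File[] files = dir.listFiles();
        if (files == null) {
            return;  // not a directory, or it could not be read
        }
        for (File f : files) {
            // lastModified() reflects the filer's clock. An 8-hour
            // skew inflates the apparent age by 28,800,000 ms, so a
            // file written seconds ago already looks "old".
            long apparentAge = now - f.lastModified();
            if (apparentAge > MAX_AGE_MS) {
                f.delete();
            }
        }
    }
}

The cheap defence - and roughly what the ant -diagnostics temp-dir check does - is to write a probe file and compare its timestamp against the local clock before trusting any age-based decision.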