hadoop-common-user mailing list archives

From Brian Vargas <br...@ardvaark.net>
Subject Re: Hadoop On Cluster
Date Wed, 23 Sep 2009 20:14:39 GMT

Thanks for the advice.  I'll keep that in mind as we grow.  At present,
the cluster is only six machines, and for these small (2K-ish) scripts,
NFS has worked flawlessly, with the caveats mentioned.  When it starts
failing, or if I need to move beyond these very small sizes, I'll
certainly use the Distributed Cache.
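
For reference, the switch looks roughly like this with the 0.20 mapred
API; the path and class name below are placeholders, not an actual job:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class ScriptJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ScriptJob.class);
    // The script must already live in HDFS; the framework copies it
    // to each node's local disk once per job, not once per task.
    DistributedCache.addCacheFile(
        new URI("/shared/scripts/transform.sh#transform.sh"), conf);
    // Symlink the file into each task's working directory so the
    // task can invoke it by the short name after the '#'.
    DistributedCache.createSymlink(conf);
    // ... configure mapper/reducer and submit as usual ...
  }
}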


Jeff Hammerbacher wrote:
> Hey Brian,
> Having tried and failed to use NFS to store shared resources for a large
> Hadoop cluster, I feel the need to say: you may want to reconsider that
> strategy as your cluster grows. NFS mounts can be quite flaky at scale, as
> Ted mentions. As Allen mentions, the Distributed Cache is intended to allow
> access to shared resources on the cluster; see
> http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html#DistributedCache
> for more information.
> Later,
> Jeff
> On Wed, Sep 23, 2009 at 10:19 AM, Allen Wittenauer
> <awittenauer@linkedin.com> wrote:
>> On 9/23/09 10:09 AM, "Brian Vargas" <brian@ardvaark.net> wrote:
>>> Although it can be quite useful to store small shared resources on an
>>> NFS mount.  For example, I find it easier to store various scripts
>>> called by a streaming job on NFS rather than distributing them from the
>>> command-line.
>>> Of course, then you have to be sure they don't change out from under the
>>> running jobs.  Tradeoffs.  :-)
>> You should probably look into distributed cache archives.  This
>> eliminates the NFS bottleneck, avoids the 'magically changing file'
>> problem, and allows you to use different versions with different job
>> submissions such that you can test changes on the fly without having
>> to redeploy.
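
To illustrate Allen's versioning point, here is a rough sketch with the
same 0.20 API (the archive name and path are hypothetical): each job
pins an explicit, immutable bundle, so two submissions can run
different script versions side by side.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class VersionedScripts {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(VersionedScripts.class);
    // Uploading scripts-v2.tgz later cannot change a job already
    // running against scripts-v1.tgz, unlike a shared NFS path.
    DistributedCache.addCacheArchive(
        new URI("/deploy/scripts-v1.tgz#scripts"), conf);
    DistributedCache.createSymlink(conf);
    // The framework unpacks the archive on each node; tasks see its
    // contents under ./scripts/ in their working directory.
    // ... configure and submit as usual ...
  }
}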
