hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Most effective way to use a lot of shared libraries?
Date Mon, 12 Apr 2010 18:34:21 GMT
Hey Keith,

The way we (LHC) approach a similar problem (not using Hadoop, but basically the same thing)
is to distribute the common software everywhere (either through a shared file system or an
RPM installed as part of the base image), and allow users to fly in changed code
with the job.

So, package foo-3.5.6 might be installed as an RPM and have 500 shared libraries.  If a user
wants their own version of libBar.so.2, then it gets submitted along with the job.  As long
as the job's working environment is set to prefer user-provided libraries over the base install
ones - usually by mucking with LD_LIBRARY_PATH - then you only have to carry along your changes
with the job.
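As a minimal sketch of that override trick: the directory and library names below (userlibs, /opt/foo-3.5.6/lib, libBar.so.2) are illustrative, following the foo-3.5.6 example, and the user-local directory is assumed to be unpacked into the job's working directory.

```shell
# Assumed layout: base install under /opt/foo-3.5.6/lib,
# user-supplied overrides shipped with the job into ./userlibs.
mkdir -p userlibs
# Stand-in for the user's replacement library.
touch userlibs/libBar.so.2

# Prepend the job-local directory so the dynamic linker prefers
# user-provided libraries over the base-install copies.
export LD_LIBRARY_PATH="$PWD/userlibs:/opt/foo-3.5.6/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

Because the dynamic linker searches LD_LIBRARY_PATH left to right, only the changed libraries need to travel with the job; everything else resolves to the base install.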

Mind you, there are tradeoffs here:
1) If you use NFS for sharing your code to the worker nodes, then you now have a single point of failure (SPOF).
2) If you have the RPMs installed on the worker nodes as part of the base image, you now have
a giant headache in terms of system administration if the code changes every week.

Because of the large size of our releases (a few gigabytes per complete version...), we use
an NFS server.  However, CERN has been working on a caching FUSE file system in CernVM that
uses HTTP and HTTP caches to download libraries only on demand (see CernVM or, for earlier
work, GROW-FS).
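For the -files approach in the quoted question below, the dependency list doesn't have to be typed out by hand; it can be assembled from ldd output. A rough sketch, using /bin/ls as a stand-in for the actual pipes binary (the hadoop command line shown in the comment is illustrative, not a tested invocation):

```shell
# Enumerate the binary's resolved shared-library dependencies with ldd,
# keeping only lines whose third field is an absolute path, and join
# them into one comma-separated list suitable for a -files argument.
ldd /bin/ls | awk '$3 ~ /^\// {print $3}' | sort -u > libs.txt
FILES=$(paste -sd, libs.txt)
echo "$FILES"
# The job might then be launched along these lines (illustrative):
#   hadoop pipes -files "$FILES" ...
```

Note that ldd only shows link-time dependencies; libraries loaded at runtime via dlopen() would still have to be added by hand.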

On Apr 12, 2010, at 1:15 PM, Keith Wiley wrote:

> I am having partial success chipping away at the shared library dependencies of my hadoop
job by submitting them to the distributed cache with the -files option.  When I add another
library to the -files list, it seems to work in that the run no longer fails on that library,
but rather fails on another library instead, one I haven't added via -files yet, so I can
envision completing this process, but...
> I am just curious whether this is the correct way to run a job that depends on upwards
of forty shared libraries.  I don't really know which ones will be touched during a given
run of course.  All I know is that an 'ldd' dump on the binary (this is a C++ pipes job) suggests
as many possible dependencies.
> Should I really copy forty .so files to my HDFS cluster and then reference them in an
enormously long -files option when running the job... or am I not approaching this problem
correctly?  Is there a preferable alternative way of doing this?
> Thanks.
> ________________________________________________________________________________
> Keith Wiley               kwiley@keithwiley.com               www.keithwiley.com
> "Yet mark his perfect self-contentment, and hence learn his lesson, that to be
> self-contented is to be vile and ignorant, and that to aspire is better than to
> be blindly and impotently happy."
>  -- Edwin A. Abbott, Flatland
> ________________________________________________________________________________
