hadoop-mapreduce-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: Adding entries to classpath
Date Wed, 11 Aug 2010 16:30:47 GMT
Moving to mapreduce-user@, bcc common-user@.

Why do you need to create a single top-level jar? Just register each
of your jars and put each in the distributed cache... however, 150
jars is a lot. Is there a way you can decrease that? I'm not sure how
you do this in Pig, but in MR you have the ability to add a jar in
the DC to the classpath of the child
(DistributedCache.addFileToClassPath).
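For what it's worth, addFileToClassPath is essentially bookkeeping on the job Configuration: it records the file under mapred.job.classpath.files and also adds it to mapred.cache.files (which is why classpath entries must come from the cache files). A rough, self-contained sketch of that bookkeeping, with the Configuration stubbed as a plain map and the separators assumed from the 0.20-era source:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of what DistributedCache.addFileToClassPath does to the job
// Configuration. The property names are the real ones from this thread;
// the Map stands in for org.apache.hadoop.conf.Configuration, and the
// separators (path.separator for classpath entries, comma for cache
// files) are an assumption based on the 0.20-era source.
public class CachePathSketch {
    static final String CACHE_FILES = "mapred.cache.files";
    static final String CLASSPATH_FILES = "mapred.job.classpath.files";

    public static void addFileToClassPath(String file, Map<String, String> conf) {
        // Record the file as a child-classpath entry...
        append(conf, CLASSPATH_FILES, file, System.getProperty("path.separator"));
        // ...and also as a cache file, so it gets localized on the nodes.
        append(conf, CACHE_FILES, file, ",");
    }

    private static void append(Map<String, String> conf, String key,
                               String value, String sep) {
        String cur = conf.get(key);
        conf.put(key, cur == null ? value : cur + sep + value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addFileToClassPath("hdfs:///libs/x.jar", conf);
        addFileToClassPath("hdfs:///libs/y.jar", conf);
        System.out.println(conf.get(CACHE_FILES));
        System.out.println(conf.get(CLASSPATH_FILES));
    }
}
```

Since both properties are updated together, anything listed in mapred.job.classpath.files is guaranteed to be localized via mapred.cache.files — which is exactly the constraint Sanjay runs into below when trying to use a zip archive instead.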

Hope that helps.

Arun

On Aug 11, 2010, at 12:48 AM, Kaluskar, Sanjay wrote:

> I am using Hadoop indirectly through Pig, and some of the UDFs
> (defined by me) need other jars at runtime (around 150), some of
> which have conflicting resource names. Hence, trying to unpack all
> of them and repack into a single jar doesn't work. My solution is to
> create a single top-level jar that names all the dependencies in
> Class-Path in the MANIFEST.MF. This is also simpler from a user's
> point of view. Of course this requires the top-level jar and all the
> dependencies to be created with a certain directory structure that I
> can control. Currently, I have a root directory which contains the
> top-level jar and a directory called lib; all the dependencies are
> in lib, and the top-level jar names the dependencies as lib/x.jar,
> lib/y.jar, etc. I package all of this as a single zip file for easy
> installation.
>
> Just to be clear, this is the dir structure:
>
> root dir
>    |
>    |--- top-level.jar
>    |--- lib
>            |--- x.jar
>            |--- y.jar
>
> I can't register top-level.jar in my Pig script (the recommended
> approach) because Pig then unpacks & repackages everything into a
> single jar instead of including the jar on the classpath. I can't
> use the distributed cache because if I specify top-level.jar and lib
> separately in mapred.cache.files, the relative directory locations
> aren't preserved. If I use the mapred.cache.archives option and
> specify the zip file, I can't add the top-level jar to the classpath
> (because the entries in mapred.job.classpath.files must come from
> mapred.cache.files).
>
> If mapred.child.java.opts also allowed java.class.path to be
> augmented (similar to java.library.path, which I am using for native
> libs that I store in another dir parallel to lib), it would have
> solved my problem: I could have specified the zip in
> mapred.cache.archives and added the jar to the classpath. Right now
> I can't see any solution other than using a shared file system and
> adding top-level.jar to HADOOP_CLASSPATH. This works because I am
> using a small cluster that has a shared file system, but clearly
> it's not always feasible (and of course, it's modifying Hadoop's
> environment).
>
> Please suggest any alternatives you can think of.
>
> Thanks,
> -sanjay
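For reference, the manifest mechanism Sanjay describes uses space-separated relative URLs in the Class-Path header; with the layout above, top-level.jar's MANIFEST.MF would carry something like (jar names from his example):

```
Manifest-Version: 1.0
Class-Path: lib/x.jar lib/y.jar
```

The JVM resolves these entries relative to the directory containing top-level.jar, which is why the relative locations of top-level.jar and lib/ must be preserved on the task nodes for this approach to work.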

