hadoop-common-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Best practice for using third party libraries in MapReduce Jobs?
Date Thu, 04 Dec 2008 07:44:42 GMT
Exactly.  I'm no expert on Maven either, but I like its convenience for
classpath handling.

Attached are my scripts:
- hadoop-installer lets me install different versions of Hadoop into my
local repo
- the pom has an assembly plugin (change mainClass and packageName to
match your target)
- the assembly does the packaging; run it with:
 - mvn assembly:assembly -Dmaven.test.skip=true
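
For a rough idea, the plugin setup looks something like the following in
the pom (a minimal sketch, not my attached files: mainClass is a
placeholder, and the built-in jar-with-dependencies descriptorRef is an
assumption rather than a custom assembly descriptor):

```xml
<!-- Minimal sketch of a maven-assembly-plugin configuration.
     com.example.MyJobClass is a placeholder; jar-with-dependencies
     is the plugin's built-in descriptor that unpacks all dependencies
     into a single jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.example.MyJobClass</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
```

With something like that in place, "mvn assembly:assembly
-Dmaven.test.skip=true" produces a single
target/&lt;artifact&gt;-jar-with-dependencies.jar that can be passed
straight to "hadoop jar".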

The way I work: I manage all dependencies in the pom and use "mvn
eclipse:eclipse" to keep the Eclipse build path correct.  Then I run
everything in Eclipse with small input files until I am happy that it
works, build the jar with dependencies, and copy it up to EC2 to run on
the cluster.  It might not be the best way, but it seems fairly
efficient for me.



On Wed, Dec 3, 2008 at 10:42 PM, Scott Whitecross <scott@dataxu.com> wrote:
> Thanks Tim.
> We use Maven, though I'm not an expert on it.  So you are using Maven
> to collect the dependencies and package them into one large jar?  You
> unjar the contents of each dependency and bundle those with your code, I'm assuming?
> On Dec 3, 2008, at 9:25 AM, tim robertson wrote:
>> Can't answer your question exactly, but can let you know what I do.
>> I build all dependencies into 1 jar, and by using Maven for my build
>> environment, when I assemble my jar, I am 100% sure all my
>> dependencies are collected together.  This is working very nicely for
>> me and I have used the same scripts for around 20 different jars that
>> I run on EC2 - each had different dependencies which would have been a
>> pain to manage separately, but Maven simplifies this massively.
>> If you are a Maven user, let me know and I can share my Maven config
>> for assembly etc...
>> Cheers,
>> Tim
>> On Wed, Dec 3, 2008 at 3:19 PM, Scott Whitecross <scott@dataxu.com> wrote:
>>> What's the best way to use third party libraries with Hadoop?  For
>>> example, I want to run a job with a jar file containing the job and
>>> also extra libraries.  I noticed a couple of solutions in a search,
>>> but I'm hoping for something better:
>>> - Merge the third party jar libraries into the job jar
>>> - Distribute the third party libraries across the cluster on each
>>> local box's classpath
>>> What I'd really like is a way to add an extra option to the hadoop jar
>>> command, such as: hadoop/bin/hadoop jar myJar.jar myJobClass -classpath
>>> thirdpartyjar1.jar:jar2.jar:etc  args
>>> Anything exist like this?
