hadoop-mapreduce-user mailing list archives

From Friso van Vollenhoven <fvanvollenho...@xebia.com>
Subject Re: Distributing our jars to all machines in a cluster
Date Wed, 16 Nov 2011 16:51:10 GMT
Are you using Maven's default jar-with-dependencies assembly? That layout works too, but it will eventually give you problems when different dependencies contain classes with the same package and name.

Java jar files are regular ZIP files. They can contain duplicate entries. I don't know whether
your packaging creates duplicates in them, but if it does, it could be the cause of your problem.

Try checking your jar for duplicate entries, e.g. a duplicated license dir under META-INF (something like: unzip -l
<your-jar-name>.jar | awk '{print $4}' | sort | uniq -d)
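The same check can be scripted. A minimal sketch using Python's zipfile module (jar files are ordinary ZIP archives, so it can read them directly; the function name and jar path are illustrative):

```python
# Sketch: list entries that appear more than once in a jar.
import zipfile
from collections import Counter

def duplicate_entries(jar_path):
    """Return entry names that occur more than once in the archive."""
    with zipfile.ZipFile(jar_path) as jar:
        counts = Counter(jar.namelist())
    return sorted(name for name, n in counts.items() if n > 1)
```

An empty result means duplicate entries are not the cause, and the problem is more likely something else (e.g. a file/directory case clash on the local filesystem).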


On 16 nov. 2011, at 17:33, Something Something wrote:

Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven I get this:

Mkdirs failed to create /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license

Do you recall coming across this?  Our 'all-in-one' jar is not exactly how you have described
it.  It doesn't contain any JARs, but it has all the classes from all the dependent JARs.

On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven <fvanvollenhoven@xebia.com> wrote:
We usually package our jobs as a single jar that contains a /lib directory holding all the
other jars that the job code depends on. Hadoop understands this layout when run as
'hadoop jar'. So the jar layout would be something like:
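Any build tool can produce this layout (job classes and the manifest at the top level, dependency jars nested under lib/). Purely as an illustration of the structure, a hedged sketch that assembles such a jar with Python's zipfile — all class and jar names below are made up:

```python
# Illustrative sketch of the 'hadoop jar' layout described above:
# job classes at the top level, dependency jars under lib/, and a
# manifest naming the main class.
import zipfile

def build_job_jar(jar_path, classes, dep_jar_paths, main_class):
    """classes: mapping of entry name -> class file bytes."""
    with zipfile.ZipFile(jar_path, "w") as jar:
        jar.writestr("META-INF/MANIFEST.MF",
                     "Manifest-Version: 1.0\nMain-Class: {}\n".format(main_class))
        for name, data in classes.items():   # e.g. com/example/MyJob.class
            jar.writestr(name, data)
        for dep in dep_jar_paths:            # nested dependency jars go under lib/
            jar.write(dep, "lib/" + dep.rsplit("/", 1)[-1])
```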


If you use Maven or some other build tool with dependency management, you can usually produce
this jar as part of your build. We also have Maven write the main class to the manifest, such
that there is no need to type it. So for us, submitting a job looks like:
hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN

Then Hadoop will take care of submitting and distributing, etc. Of course you pay the penalty
of always sending all of your dependencies over the wire (the job jar gets replicated to 10
machines by default). Pre-distributing sounds tedious and error prone to me. What if you have
different jobs that require different versions of the same dependency?


On 16 nov. 2011, at 15:42, Something Something wrote:

Bejoy - Thanks for the reply.  The '-libjars' is not working for me with 'hadoop jar'.  Also,
as per the documentation (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):

Generic Options

The following options are supported by dfsadmin, fs, fsck, job and fetchdt.

Does it work for you?  If it does, please let me know.  "Pre-distributing" definitely works,
but is that the best way?  If you have a big cluster and jars change often, it will be tedious and error prone.

Also, how does Pig do it?  We update Pig UDFs often and put them only on the 'client' machine
(machine that starts the Pig job) and the UDF becomes available to all machines in the cluster
- automagically!  Is Pig doing the pre-distributing for us?

Thanks for your patience & help with our questions.

On Wed, Nov 16, 2011 at 6:29 AM, Something Something <mailinglists19@gmail.com> wrote:
Hmm... there must be a different way 'cause we don't need to do that to run Pig jobs.

On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <daan.gerits@gmail.com> wrote:
There might be different ways but currently we are storing our jars onto HDFS and register
them from there. They will be copied to the machine once the job starts. Is that an option?


On 16 Nov 2011, at 07:24, Something Something wrote:

> Until now we were manually copying our Jars to all machines in a Hadoop
> cluster.  This used to work until our cluster size was small.  Now our
> cluster is getting bigger.  What's the best way to start a Hadoop Job that
> automatically distributes the Jar to all machines in a cluster?
> I read the doc at:
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
> Would -libjars do the trick?  But we need to use 'hadoop job' for that,
> right?  Until now, we were using 'hadoop jar' to start all our jobs.
> Needless to say, we are getting our feet wet with Hadoop, so appreciate
> your help with our dumb questions.
> Thanks.
> PS:  We use Pig a lot, which automatically does this, so there must be a
> clean way to do this.
