From: John Armstrong <john.armstr...@ccri.com>
Subject: Problems adding JARs to distributed classpath in Hadoop 0.20.2
Date: Thu, 26 May 2011 14:45:28 GMT

Hi, everybody.

I'm running into some difficulties getting needed libraries to map/reduce
tasks using the distributed cache.

I'm using Hadoop 0.20.2, which as far as I can tell is a hard requirement
from the client, so more current versions are not really viable options.

The code I've inherited is Java, which sets up and runs the MR job. 
There's currently some nontrivial pre- and post-processing, so it will be a
large refactoring before I can just run bare MR jobs rather than starting
them through Java.

Further complicating matters: in practice the Java jobs are launched by
Oozie, which of course does so by wrapping each one in a MR shell.  The
upshot is that I don't have any control over which "local" filesystem the
Java job is run from, though if local files are absolutely needed I can
make my Java wrappers copy stuff back from HDFS to the Java job's local
filesystem.

So here's the problem:

Mappers and/or reducers need class Needed, which is contained in
needed-1.0.jar, which is in HDFS:

Java program executes:

Inspecting the Job object I find the file has been added to the cache
files as expected:
    job.conf.overlay[...] = mapred.cache.files ->
    job.conf.properties[...] = mapred.cache.files ->
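
A possibly relevant detail: an entry under mapred.cache.files only shows
that the file will be shipped to the task nodes. In the 0.20 line,
DistributedCache.addFileToClassPath appears to record the path under a second
property as well, mapred.job.classpath.files, joined with the JVM's
path.separator, and that second list seems to be what decides whether a cached
file lands on the task classpath. Assuming that reading is right, a minimal
stand-alone sketch of the check follows. java.util.Properties stands in for
Hadoop's Configuration, and the class name, method names, namenode address,
and jar location are all hypothetical (the real HDFS path is not shown here).

```java
import java.net.URI;
import java.util.Properties;
import java.util.regex.Pattern;

/** Stand-alone sketch; Properties stands in for Hadoop's Configuration. */
class CacheClasspathCheck {
    static final String CACHE_FILES = "mapred.cache.files";
    // Assumed property name for the classpath list in the 0.20 line.
    static final String CLASSPATH_FILES = "mapred.job.classpath.files";

    /** True if any comma-separated cache URI has exactly this path. */
    static boolean inCacheFiles(Properties conf, String jarPath) {
        for (String entry : conf.getProperty(CACHE_FILES, "").split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Compare on the path component only, ignoring scheme and host.
            if (jarPath.equals(URI.create(trimmed).getPath())) {
                return true;
            }
        }
        return false;
    }

    /** True if the path.separator-joined classpath list names this path. */
    static boolean inClasspathFiles(Properties conf, String jarPath) {
        String sep = System.getProperty("path.separator");
        String value = conf.getProperty(CLASSPATH_FILES, "");
        for (String entry : value.split(Pattern.quote(sep))) {
            if (jarPath.equals(entry.trim())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical values, for illustration only.
        String jar = "/hypothetical/lib/needed-1.0.jar";
        Properties conf = new Properties();
        conf.setProperty(CACHE_FILES, "hdfs://namenode:8020" + jar);

        System.out.println("in cache files:     " + inCacheFiles(conf, jar));
        System.out.println("in classpath files: " + inClasspathFiles(conf, jar));
    }
}
```

If the jar shows up in the first list but not the second, that would suggest
it was added with addCacheFile (or the classpath entry was set on a different
Configuration object than the one the job actually uses).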

And the class seems to show up in the internal ClassLoader:
    job.conf.classLoader.classes[...] = "class my.class.package.Needed"

though this may just be inherited from the ClassLoader of the Java process
itself (which also uses Needed).
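
One way to tell those two cases apart would be to ask the class itself which
loader defined it and where its bytes came from. The sketch below uses only
standard JDK calls (Class.getClassLoader and
ProtectionDomain.getCodeSource); ClassOrigin and describe are made-up names,
and String.class is only a placeholder for Needed.class.

```java
import java.security.CodeSource;

/** Reports which ClassLoader defined a class and where it was loaded from. */
class ClassOrigin {
    static String describe(Class<?> c) {
        // A null loader means the class came from the bootstrap loader.
        ClassLoader loader = c.getClassLoader();
        String loaderName = (loader == null) ? "bootstrap loader" : loader.toString();

        // The code source, when present, points at the jar or directory.
        CodeSource source = c.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "unknown location"
                : source.getLocation().toString();

        return c.getName() + " defined by " + loaderName + " from " + location;
    }

    public static void main(String[] args) {
        // Placeholder class; in the driver this would be Needed.class.
        System.out.println(describe(String.class));
        System.out.println(describe(ClassOrigin.class));
    }
}
```

A location on the driver's own local classpath, rather than anything derived
from the cache entry, would indicate the class was simply inherited from the
launching JVM and says nothing about what the task JVMs will see.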

And yet as soon as I get into the mapreduce job itself I start getting:

    2011-05-25 17:22:56,080  INFO JobClient - Task Id : attempt_201105251330_0037_r_000043_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException:

Up until this point we've run things by keeping a directory on each node
that contains all the libraries we need and including that directory in the
Hadoop classpath. We have no such control in the deployment scenario, so our
program has to hand the needed libraries to the map and reduce nodes via the
distributed cache classpath.

Thanks in advance for any insight or assistance you can offer.

