hadoop-general mailing list archives

From Christopher L Tubbs II <ctubb...@gmail.com>
Subject Re: mapreduce classpath issue
Date Sat, 01 Aug 2009 21:58:43 GMT
I previously sent the message below requesting assistance with
classpath issues. I appear to have partially resolved it: the libjars
that I specify are added to the classpath for the mapper tasks,
but NOT for the run() method of the Tool.

I can work around this by setting the HADOOP_CLASSPATH environment
variable to the identical jar listing, but this seems a bit
clunky for a quick MapReduce job.
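
For concreteness, the doubled-up invocation looks roughly like this
(the jar and class names are placeholders; note that HADOOP_CLASSPATH
is colon-separated while libjars is comma-separated):

  HADOOP_CLASSPATH=x.jar:y.jar:z.jar hadoop jar myjob.jar \
      my.package.MyTool -libjars x.jar,y.jar,z.jar input output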

Shouldn't libjars also place the jars on the hadoop classpath for the
thread that executes run()?

Right now, to avoid setting the libjars twice, I use (syntax may not be
exactly right, I'm going by memory):
  job.setInputFormat(conf.getClassByName("my.package.Class")
      .asSubclass(org.apache.hadoop.mapred.InputFormat.class));

However, in my InputFormat class, I also have static methods that add
information to the conf for the mapper task, in the same way that
FileInputFormat and FileOutputFormat do, so as to abstract away the
implementation details of those configuration settings.
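
To illustrate, the helpers look roughly like this (the class name and
conf key are made up, and the actual InputFormat plumbing is omitted):

  import org.apache.hadoop.mapred.JobConf;

  public class MyInputFormat {
    private static final String TABLE_KEY = "my.inputformat.table.name";

    // Mirrors FileInputFormat.setInputPaths(): callers configure the
    // format through static helpers instead of touching raw conf keys.
    public static void setTableName(JobConf conf, String table) {
      conf.set(TABLE_KEY, table);
    }

    public static String getTableName(JobConf conf) {
      return conf.get(TABLE_KEY);
    }
  }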

So, while I can set the input format / output format by class name, I
cannot access these static methods in the run() thread unless I use
reflection.
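
Concretely, the reflective version looks something like this inside
run() (again, the names come from the hypothetical sketch above):

  // The libjars aren't on this thread's classpath, but the
  // Configuration's class loader can still resolve the class.
  Class<?> fmt = getConf().getClassByName("my.package.MyInputFormat");
  job.setInputFormat(
      fmt.asSubclass(org.apache.hadoop.mapred.InputFormat.class));

  // Invoke the static helper reflectively, since a direct call would
  // require the class on this thread's classpath:
  fmt.getMethod("setTableName", JobConf.class, String.class)
     .invoke(null, job, "mytable");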

It seems very odd that libjars only works for the mapper/reducer tasks
and not for the Tool.run() method. The two workarounds I've found are,
as I've stated above:
 1. duplicating the libjars listing in the HADOOP_CLASSPATH environment
variable for the Tool thread, and
 2. grabbing the class via conf.getClassByName() and using reflection
to call the static methods that configure the class's options.

On a side note, it also seems very clumsy that the Tool implementation
has to extend Configured so that you can create the JobConf from the
Configuration retrieved via getConf(), just so libjars works at all.
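
For reference, the pattern I mean looks like this (a bare skeleton;
the real job setup is elided):

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // getConf() returns the Configuration that GenericOptionsParser
      // populated from -libjars; the JobConf must be created from it,
      // or those settings never reach the job.
      JobConf job = new JobConf(getConf(), MyTool.class);
      // ... input/output formats, mapper, reducer, paths ...
      JobClient.runJob(job);
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new MyTool(), args));
    }
  }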

Any thoughts or suggestions for an easier way?


Christopher L Tubbs II wrote:
> So, I've been having a classpath issue with MapReduce jobs on Mac OS,
> running version 0.20.0.
> 
> The basic idea is that I want to add a jar as a dependency to a
> MapReduce job. I know I can add the jar to HADOOP_CLASSPATH, but I do
> not have access to hadoop-env.sh (administrator restricted).
> 
> I have been using the "hadoop jar" command to execute a class with a
> main() contained within my jar. My class extends Configured and
> implements Tool. The main() calls ToolRunner.run() to launch the job,
> and I specify -libjars on the command line. From the source, I can see
> that the GenericOptionsParser strips the "-libjars x.jar,y.jar,z.jar"
> arguments before they are passed to my class. I then get errors saying
> that classes contained within my library jars cannot be found.
> 
> From within run(), my debugging code demonstrates that the jar files are
> readable, and I can getResource() to verify that the classes are
> available in those jars. However, I still get errors, because it doesn't
> seem that the JVM ever uses the contextClassLoader that the
> resources were added to.
> 
> Perhaps this is a bug with Mac OS's JVM 1.6, or am I doing something
> wrong? Everything works fine on Linux with an identical setup.
> 
> 
> -
> Chris
