hadoop-mapreduce-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: Too large class path for map reduce jobs
Date Wed, 06 Oct 2010 09:57:15 GMT
Hi Tom,

  that's exactly it. Thanks! I don't think that I can comment on the
issues in Jira so I will do it here.

  Playing tricks with class paths and deviating from the default class
loading delegation has never been anything but short-term relief. Fixing
things by imposing a "better" order on class path entries will not work
when people actually use child loaders (as the parent wins), as we do.
It can also easily lead to very confusing situations, because an earlier
part of the class path is incomplete and picks up other types from a
later part, and so on. No good.

  Child loaders are good for module separation but should not be used to
"hide" type visibility from the parent. That almost certainly leads to a
class loader constraint violation once you lose control (which usually
happens earlier than expected).

  The suggestion to reduce the job class path to the required minimum is
the most practical approach. There is some gray area there, of course,
and it will not be feasible to reach the absolutely minimal set of types,
but something reasonable should be possible, i.e. the Hadoop core that
suffices to run the job. Certainly jetty & co are not required for job
execution (btw. I "hacked" 0.20.2 to remove anything in "server/" from
the classpath before setting the job class path).
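
  For illustration only, the kind of filtering described above could look
roughly like the sketch below. It is not the actual change made to
0.20.2; the class name ClassPathFilter, the method name, and the rule
"drop every entry whose path contains a server/ directory" are
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: removes class path entries
// that live under a "server/" directory before the path is handed to a task.
final class ClassPathFilter {

    // classPath: entries joined by sep (":" on Unix, ";" on Windows).
    static String withoutServerEntries(String classPath, String sep) {
        List<String> kept = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(sep))) {
            if (entry.isEmpty()) {
                continue; // skip empty segments such as "a::b"
            }
            // Normalize separators so the check also works for Windows paths.
            String normalized = entry.replace('\\', '/');
            if (normalized.contains("/server/")) {
                continue; // assumed location of jetty & co
            }
            kept.add(entry);
        }
        return String.join(sep, kept);
    }
}
```

  With the input "/h/lib/a.jar:/h/lib/server/jetty.jar:/h/conf" and ":" as
separator, only the first and last entries remain.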

  I would suggest to:

  a) introduce a HADOOP_JOB_CLASSPATH variable that, if set, holds the
additional class path that is appended to the "core" class path (as
described above). If it is not set, preserve today's behavior for
compatibility.
  b) not get into custom child loaders for jobs as part of Hadoop M/R.
It's non-trivial to get right and feels beyond scope.
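
  A minimal sketch of the rule in a), under stated assumptions: the class
JobClassPath and its method are hypothetical, HADOOP_JOB_CLASSPATH is the
proposed (not existing) variable, and how the "core" class path is
determined is left to the caller.

```java
// Hypothetical helper, not part of Hadoop: picks the class path for a task.
final class JobClassPath {

    // coreClassPath:      the reduced Hadoop "core" class path (assumed given)
    // jobClassPath:       value of the proposed HADOOP_JOB_CLASSPATH, may be null
    // currentVmClassPath: class path of the running VM (today's behavior)
    static String resolve(String coreClassPath, String jobClassPath,
                          String currentVmClassPath, String sep) {
        if (jobClassPath == null || jobClassPath.isEmpty()) {
            // Variable not set: keep today's behavior for compatibility.
            return currentVmClassPath;
        }
        // Variable set: core class path plus the additional entries.
        return coreClassPath + sep + jobClassPath;
    }
}
```

  A caller could pass System.getenv("HADOOP_JOB_CLASSPATH") as the second
argument, System.getProperty("java.class.path") as the third, and
java.io.File.pathSeparator as the separator.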

  I wouldn't mind helping btw.


On Tue, 2010-10-05 at 15:59 -0700, Tom White wrote:

> Hi Henning,
> I don't know if you've seen
> https://issues.apache.org/jira/browse/MAPREDUCE-1938 and
> https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have
> discussion about this issue.
> Cheers
> Tom
> On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
> > Short update on the issue:
> >
> > I tried to find a way to separate class path configurations by modifying the
> > scripts in HADOOP_HOME/bin but found that TaskRunner actually copies the
> > class path setting from the parent process when starting a local task so
> > that I do not see a way of having less on a job's classpath without
> > modifying Hadoop.
> >
> > As that will present a real issue when running our jobs on Hadoop I would
> > like to propose to change TaskRunner so that it sets a class path
> > specifically for M/R tasks. That class path could be defined in the scripts
> > (as for the other processes) using a particular environment variable (e.g.
> > HADOOP_JOB_CLASSPATH). It could default to the current VM's class path,
> > preserving today's behavior.
> >
> > Is it ok to enter this as an issue?
> >
> > Thanks,
> >   Henning
> >
> >
> > Am Freitag, den 17.09.2010, 16:01 +0000 schrieb Allen Wittenauer:
> >
> > On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote:
> >
> >> When running map reduce tasks in Hadoop I run into classpath issues.
> >> Contrary to previous posts, my problem is not that I am missing classes on
> >> the Task's class path (we have a perfect solution for that) but rather find
> >> too many (e.g. ECJ classes or jetty).
> >
> > The fact that you mention:
> >
> >> The libs in HADOOP_HOME/lib seem to contain everything needed to run
> >> anything in Hadoop which is, I assume, much more than is needed to run a map
> >> reduce task.
> >
> > hints that your perfect solution is to throw all your custom stuff in lib.
> > If so, that's a huge mistake.  Use distributed cache instead.
> >
