hadoop-mapreduce-user mailing list archives

From Alejandro Abdelnur <t...@cloudera.com>
Subject Re: Too large class path for map reduce jobs
Date Wed, 06 Oct 2010 10:28:43 GMT
1. The classloader business can be done right. It could actually be done as
spec-ed for servlet web-apps.

2. If the issue is strictly 'too large classpath', then a simpler solution
would be to soft-link all JARs into the current directory and create the
classpath from the JAR names only (no paths). Note that the soft-linking
business is already supported by the DistributedCache. So the changes would
be mostly in the TaskTracker (TT) to create the JAR-names-only classpath
before starting the child.
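The soft-link trick can be simulated outside Hadoop. A minimal sketch, with a made-up lib directory and two dummy JARs standing in for a real installation:

```shell
#!/bin/sh
# Sketch of the names-only classpath idea (hypothetical paths): soft-link
# every JAR from a deep lib directory into the task's working directory,
# then build the classpath from the bare JAR names.
set -e
workdir=$(mktemp -d)
libdir="$workdir/long/install/path/lib"
mkdir -p "$libdir" "$workdir/cwd"
touch "$libdir/a.jar" "$libdir/b.jar"   # stand-ins for real Hadoop JARs

cd "$workdir/cwd"
CP=""
for jar in "$libdir"/*.jar; do
  ln -s "$jar" .                        # soft-link into the current directory
  CP="${CP:+$CP:}$(basename "$jar")"    # append bare name, colon-separated
done
echo "$CP"
rm -rf "$workdir"
```

The resulting classpath stays short no matter how deep the real install path is, which is the whole point of the trick.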


On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm <henning.blohm@zfabrik.de> wrote:

>  Hi Tom,
>   that's exactly it. Thanks! I don't think that I can comment on the issues
> in Jira so I will do it here.
>   Tricking with class paths and deviating from the default class loading
> delegation has never been anything but a short-term relief. Fixing things
> by imposing a "better" order of entries on the class path will not work when
> people actually use child loaders (as the parent wins) - like we do. It may
> also easily lead to very confusing situations because an earlier part of
> the class path is incomplete and picks up other stuff from a later part...
> no good.
>   Child loaders are good for module separation but should not be used to
> "hide" type visibility from the parent. That almost certainly leads to class
> loader constraint violations - once you lose control (which is usually earlier
> than expected).
>   The suggestion to reduce the Job class path to the required minimum is
> the most practical approach. There is some gray area there of course and it
> will not be feasible to reach the absolute minimal set of types there - but
> something reasonable, i.e. the hadoop core that suffices to run the job.
> Certainly jetty & co are not required for job execution (btw. I "hacked"
> 0.20.2 to remove anything in "server/" from the classpath before setting the
> job class path).
>   I would suggest
>   a) introducing some HADOOP_JOB_CLASSPATH var that, if set, is the
> additional classpath, added to the "core" classpath (as described above); if
> not set, today's behavior is preserved for compatibility, and
>   b) not getting into custom child loaders for jobs as part of Hadoop M/R.
> It's non-trivial to get right and feels beyond scope.
>   I wouldn't mind helping btw.
> Thanks,
>   Henning
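Suggestion a) above could look something like this in a launcher script. This is a hedged sketch, not actual Hadoop code; the JAR names are made up to stand in for the real core and full classpaths:

```shell
#!/bin/sh
# Sketch of composing a job classpath from a minimal core set plus an
# optional HADOOP_JOB_CLASSPATH, falling back to today's full classpath
# when the variable is unset (preserving current behavior).
CORE_CLASSPATH="hadoop-core.jar"                    # stand-in for the minimal core
FULL_CLASSPATH="hadoop-core.jar:jetty.jar:ecj.jar"  # stand-in for today's everything

HADOOP_JOB_CLASSPATH="job-lib.jar"                  # example: user adds one extra JAR

if [ -n "$HADOOP_JOB_CLASSPATH" ]; then
  JOB_CLASSPATH="$CORE_CLASSPATH:$HADOOP_JOB_CLASSPATH"
else
  JOB_CLASSPATH="$FULL_CLASSPATH"                   # compatibility: old behavior
fi
echo "$JOB_CLASSPATH"
```

With the variable unset, the `else` branch keeps the current all-inclusive classpath, so existing jobs would be unaffected.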
> On Tue, 2010-10-05 at 15:59 -0700, Tom White wrote:
> Hi Henning,
> I don't know if you've seen https://issues.apache.org/jira/browse/MAPREDUCE-1938
> and https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have
> discussion about this issue.
> Cheers
> Tom
> On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
> > Short update on the issue:
> >
> > I tried to find a way to separate class path configurations by modifying the
> > scripts in HADOOP_HOME/bin but found that TaskRunner actually copies the
> > class path setting from the parent process when starting a local task so
> > that I do not see a way of having less on a job's classpath without
> > modifying Hadoop.
> >
> > As that will present a real issue when running our jobs on Hadoop I would
> > like to propose to change TaskRunner so that it sets a class path
> > specifically for M/R tasks. That class path could be defined in the scripts
> > (as for the other processes) using a particular environment variable (e.g.
> > HADOOP_JOB_CLASSPATH). It could default to the current VM's class path,
> > preserving today's behavior.
> >
> > Is it ok to enter this as an issue?
> >
> > Thanks,
> >   Henning
> >
> >
> > On Friday, 17.09.2010, at 16:01 +0000, Allen Wittenauer wrote:
> >
> > On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote:
> >
> >> When running map reduce tasks in Hadoop I run into classpath issues.
> >> Contrary to previous posts, my problem is not that I am missing classes on
> >> the Task's class path (we have a perfect solution for that) but rather find
> >> too many (e.g. ECJ classes or jetty).
> >
> > The fact that you mention:
> >
> >> The libs in HADOOP_HOME/lib seem to contain everything needed to run
> >> anything in Hadoop which is, I assume, much more than is needed to run a map
> >> reduce task.
> >
> > hints that your perfect solution is to throw all your custom stuff in lib.
> > If so, that's a huge mistake.  Use distributed cache instead.
> >
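Allen's pointer to the distributed cache typically means shipping job-only JARs with the standard `-libjars` option (handled by GenericOptionsParser) rather than dropping them into HADOOP_HOME/lib. A sketch with hypothetical JAR and class names; the command is only printed here, since actually running it needs a live cluster:

```shell
#!/bin/sh
# Hypothetical invocation: -libjars puts the listed JARs into the
# distributed cache and onto the task classpath for this job only,
# instead of bloating every daemon's classpath via HADOOP_HOME/lib.
cmd="hadoop jar wordcount.jar org.example.WordCount -libjars my-deps.jar in/ out/"
echo "$cmd"
```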
