hadoop-mapreduce-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: Too large class path for map reduce jobs
Date Thu, 07 Oct 2010 07:43:08 GMT
So that's actually another issue, right? Besides splitting the classpath
into those three groups, you want the TT to create soft-links on demand
to simplify the computation of classpath string. Is that right?

But it's the TT that actually starts the job VM. Why does it matter what
the string actually looks like, as long as it has the right content? 

Thanks,
  Henning
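A minimal sketch of the soft-linking idea discussed in this thread, assuming Java 7+ (java.nio.file) and a platform that supports symbolic links. The class ShortClasspath and the method linkAndBuild are hypothetical names for illustration; this is not the TaskTracker's actual code. It links each job JAR into the task's work dir and returns a classpath made of bare file names plus ".":

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ShortClasspath {

    /**
     * Hypothetical helper (not a Hadoop API): symlinks every job JAR into
     * workDir and returns a classpath of bare JAR names followed by ".".
     */
    public static String linkAndBuild(Path workDir, List<Path> jobJars) throws IOException {
        List<String> entries = new ArrayList<>();
        for (Path jar : jobJars) {
            Path link = workDir.resolve(jar.getFileName());
            // Remove a stale link with the same name before re-creating it.
            Files.deleteIfExists(link);
            Files.createSymbolicLink(link, jar.toAbsolutePath());
            entries.add(jar.getFileName().toString());
        }
        // The work dir itself stays on the classpath as the current directory.
        entries.add(".");
        return String.join(File.pathSeparator, entries);
    }
}
```

The relative entries only resolve if the child JVM is started with the work dir as its current directory. Note also that two JARs with the same file name in different cache directories would collide in this sketch; the later one silently replaces the earlier link.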

On Thu, 2010-10-07 at 13:22 +0800, Alejandro Abdelnur wrote:
> [sent too soon]
> 
> 
> The first CP shown is what the CP of a task looks like today. If we
> change it to pick up all the job JARs from the current dir, then the
> classpath will be much shorter (second CP shown). We can easily achieve
> this by soft-linking the job JARs in the work dir of the task.
> 
> 
> Alejandro
> 
> 
> On Thu, Oct 7, 2010 at 1:02 PM, Alejandro Abdelnur <tucu@cloudera.com>
> wrote:
> 
>         Fragmentation of Hadoop classpaths is another issue: hadoop
>         should differentiate the CP in 3:
>         
>         
>         
>         1. client CP: what is needed to submit a job (only the nachos)
>         2. server CP (JT/NN/TT/DD): what is needed to run the cluster
>         (the whole enchilada)
>         3. job CP: what is needed to run a job (some of the enchilada)
>         
>         
>         But I'm not trying to get into that here. What I'm suggesting
>         is:
>         
>         
>         
>         
>         -----
>         # Hadoop JARs:
>         
>         
>         /Users/tucu/dev-apps/hadoop/conf
>         /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar
>         /Users/tucu/dev-apps/hadoop/bin/..
>         /Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar
>         /Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar
>         
>         
>         ..... (about 30 jars from hadoop lib/ )
>         
>         
>         /Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar
>         
>         
>         # Job JARs (for a job with only 2 JARs):
>         
>         
>         /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/0000003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar
>         /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
>         /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_000000_0/work
>         -----
>         
>         
>         
>         
>         What I'm suggesting is that the latter group, the job JARs,
>         be soft-linked (by the TT) into the working directory; then
>         their classpath is just:
>         
>         
>         -----
>         java-launcher.jar
>         oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
>         .
>         -----
>         
>         
>         
>         
>         
>         Alejandro
>         
>         
>         On Wed, Oct 6, 2010 at 7:57 PM, Henning Blohm
>         <henning.blohm@zfabrik.de> wrote:
>         
>                 Hi Alejandro,
>                 
>                    yes, it can of course be done right (sorry if my
>                 wording seemed to imply otherwise). Just saying that I
>                 think that Hadoop M/R should not go into that class
>                 loader / module separation business. It's one Job, one
>                 VM, right? So the problem is to assign just the stuff
>                 needed to let the Job do its business without becoming
>                 an obstacle.
>                 
>                   Must admit I didn't understand your proposal 2. How
>                 would that remove (e.g.) jetty libs from the job's
>                 classpath?
>                 
>                 Thanks,
>                   Henning
>                 
>                 On Wed, 2010-10-06 at 18:28 +0800, Alejandro Abdelnur wrote:
>                 
>                 
>                 
>                 
>                 > 1. Classloader business can be done right. Actually
>                 > it could be done as spec-ed for servlet web-apps. 
>                 > 
>                 > 
>                 > 2. If the issue is strictly 'too large classpath',
>                 > then a simpler solution would be to soft-link all
>                 > JARs to the current directory and create the
>                 > classpath with the JAR names only (no path). Note
>                 > that the soft-linking business is already supported
>                 > by the DistributedCache. So the changes would be
>                 > mostly in the TT to create the JAR names only
>                 > classpath before starting the child.
>                 > 
>                 > 
>                 > Alejandro
>                 > 
>                 > 
>                 > On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm
>                 > <henning.blohm@zfabrik.de> wrote:
>                 > 
>                 >         Hi Tom,
>                 >         
>                 >           that's exactly it. Thanks! I don't think
>                 >         that I can comment on the issues in Jira so
>                 >         I will do it here.
>                 >         
>                 >           Tricking with class paths and deviating
>                 >         from the default class loading delegation
>                 >         has never been anything but a short-term
>                 >         relief. Fixing things by imposing a
>                 >         "better" order of stuff on the class path
>                 >         will not work when people do actually use
>                 >         child loaders (as the parent wins) - like we
>                 >         do. Also it may easily lead to very
>                 >         confusing situations because the former part
>                 >         of the class path is not complete and gets
>                 >         other stuff from a latter part etc. etc....
>                 >         no good.
>                 >         
>                 >           Child loaders are good for module
>                 >         separation but should not be used to "hide"
>                 >         type visibility from the parent. That almost
>                 >         certainly leads to a Class Loader Constraint
>                 >         Violation - once you lose control (which is
>                 >         usually earlier than expected).
>                 >         
>                 >           The suggestion to reduce the Job class
>                 >         path to the required minimum is the most
>                 >         practical approach. There is some gray area
>                 >         there of course and it will not be feasible
>                 >         to reach the absolute minimal set of types
>                 >         there - but something reasonable, i.e. the
>                 >         hadoop core that suffices to run the job.
>                 >         Certainly jetty & co are not required for
>                 >         job execution (btw. I "hacked" 0.20.2 to
>                 >         remove anything in "server/" from the
>                 >         classpath before setting the job class
>                 >         path).
>                 >         
>                 >           I would suggest to 
>                 >         
>                 >           a) introduce some HADOOP_JOB_CLASSPATH var
>                 >         that, if set, is the additional classpath,
>                 >         added to the "core" classpath (as described
>                 >         above). If not set, for compatibility,
>                 >         preserve today's behavior.
>                 >           b) not getting into custom child loaders
>                 >         for jobs as part of hadoop M/R. It's
>                 >         non-trivial to get it right and feels to be
>                 >         beyond scope.
>                 >         
>                 >           I wouldn't mind helping btw.
>                 >         
>                 >         Thanks,
>                 >           Henning 
>                 >         
>                 >         
>                 >         
>                 >         
>                 >         On Tue, 2010-10-05 at 15:59 -0700, Tom White
>                 >         wrote: 
>                 >         
>                 >         > Hi Henning,
>                 >         > 
>                 >         > I don't know if you've seen
>                 >         > https://issues.apache.org/jira/browse/MAPREDUCE-1938 and
>                 >         > https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have
>                 >         > discussion about this issue.
>                 >         > 
>                 >         > Cheers
>                 >         > Tom
>                 >         > 
>                 >         > On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
>                 >         > > Short update on the issue:
>                 >         > >
>                 >         > > I tried to find a way to separate class path configurations by modifying the
>                 >         > > scripts in HADOOP_HOME/bin but found that TaskRunner actually copies the
>                 >         > > class path setting from the parent process when starting a local task so
>                 >         > > that I do not see a way of having less on a job's classpath without
>                 >         > > modifying Hadoop.
>                 >         > >
>                 >         > > As that will present a real issue when running our jobs on Hadoop I would
>                 >         > > like to propose to change TaskRunner so that it sets a class path
>                 >         > > specifically for M/R tasks. That class path could be defined in the scripts
>                 >         > > (as for the other processes) using a particular environment variable (e.g.
>                 >         > > HADOOP_JOB_CLASSPATH). It could default to the current VM's class path,
>                 >         > > preserving today's behavior.
>                 >         > >
>                 >         > > Is it ok to enter this as an issue?
>                 >         > >
>                 >         > > Thanks,
>                 >         > >   Henning
>                 >         > >
>                 >         > >
>                 >         > > On Fri, 2010-09-17 at 16:01 +0000, Allen Wittenauer wrote:
>                 >         > >
>                 >         > > On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote:
>                 >         > >
>                 >         > >> When running map reduce tasks in Hadoop I run into classpath issues.
>                 >         > >> Contrary to previous posts, my problem is not that I am missing classes on
>                 >         > >> the Task's class path (we have a perfect solution for that) but rather find
>                 >         > >> too many (e.g. ECJ classes or jetty).
>                 >         > >
>                 >         > > The fact that you mention:
>                 >         > >
>                 >         > >> The libs in HADOOP_HOME/lib seem to contain everything needed to run
>                 >         > >> anything in Hadoop which is, I assume, much more than is needed to run a map
>                 >         > >> reduce task.
>                 >         > >
>                 >         > > hints that your perfect solution is to throw all your custom stuff in lib.
>                 >         > > If so, that's a huge mistake.  Use distributed cache instead.
>                 >         > >
>                 >         
>                 >         
>                 >         
>                 > 
>                 > 
>                 > 
>         
>         
>         
> 
> 
> 
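For illustration only, proposal (a) quoted above (a HADOOP_JOB_CLASSPATH variable with a backward-compatible default) could be sketched roughly as follows. The class JobClasspathResolver, the method resolve, and its parameters are hypothetical; the thread does not show how TaskRunner would actually implement this, and the "core" classpath is simply passed in as a string here.

```java
import java.io.File;
import java.util.Map;

public class JobClasspathResolver {

    // Variable name taken from the proposal in this thread.
    static final String JOB_CP_VAR = "HADOOP_JOB_CLASSPATH";

    /**
     * Hypothetical helper (not a Hadoop API).
     *
     * @param env             environment of the process launching the task
     * @param coreClasspath   the reduced "core" classpath that suffices to run a job
     * @param parentClasspath the launching VM's full classpath (today's behavior)
     */
    public static String resolve(Map<String, String> env,
                                 String coreClasspath,
                                 String parentClasspath) {
        String extra = env.get(JOB_CP_VAR);
        if (extra == null || extra.isEmpty()) {
            // Variable not set: keep today's behavior and inherit everything.
            return parentClasspath;
        }
        // Variable set: core classpath plus the job-specific additions.
        return coreClasspath + File.pathSeparator + extra;
    }
}
```

A caller might pass System.getenv() as env and System.getProperty("java.class.path") as parentClasspath, so that an unset variable reproduces the inherited classpath unchanged.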