hadoop-mapreduce-user mailing list archives

From Alejandro Abdelnur <t...@cloudera.com>
Subject Re: Too large class path for map reduce jobs
Date Thu, 07 Oct 2010 05:22:57 GMT
[sent too soon]

The first CP shown is what a task's CP looks like today. If we change it to
pick up all the job JARs from the current dir, the classpath becomes much
shorter (the second CP shown). We can easily achieve this by soft-linking the
job JARs into the work dir of the task.
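
A minimal sketch of the soft-linking idea, not actual TaskTracker code. The
class ShortClasspath, its methods, and the demo file names are hypothetical;
it assumes a filesystem that supports symbolic links and that the child JVM
is started with the work dir as its current directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Hadoop. */
class ShortClasspath {

    /**
     * Soft-links each JAR into workDir under its bare file name and returns
     * a classpath made of those names plus ".". The result is only valid
     * when workDir is the current directory of the child JVM.
     */
    static String linkJars(Path workDir, List<Path> jobJars) {
        List<String> entries = new ArrayList<>();
        try {
            for (Path jar : jobJars) {
                Path link = workDir.resolve(jar.getFileName());
                // Throws FileAlreadyExistsException if two JARs share a name;
                // a real implementation would need a policy for that case.
                Files.createSymbolicLink(link, jar.toAbsolutePath());
                entries.add(jar.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        entries.add(".");
        return String.join(File.pathSeparator, entries);
    }

    /** Builds a throwaway layout with two empty "JARs" and links them. */
    static String demo() {
        try {
            Path cache = Files.createTempDirectory("distcache");
            Path work = Files.createTempDirectory("work");
            Path a = Files.createFile(cache.resolve("java-launcher.jar"));
            Path b = Files.createFile(cache.resolve("oozie-examples.jar"));
            return linkJars(work, Arrays.asList(a, b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The returned string contains only JAR names, so its length no longer depends
on how deep the distcache paths are. Name collisions between JARs are the
main thing this sketch does not handle.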

Alejandro

On Thu, Oct 7, 2010 at 1:02 PM, Alejandro Abdelnur <tucu@cloudera.com> wrote:

> Fragmentation of Hadoop classpaths is another issue: Hadoop should
> differentiate three CPs:
>
> 1. client CP: what is needed to submit a job (only the nachos)
> 2. server CP (JT/NN/TT/DD): what is needed to run the cluster (the whole
> enchilada)
> 3. job CP: what is needed to run a job (some of the enchilada)
>
> But I'm not trying to get into that here. What I'm suggesting is:
>
>
> -----
> # Hadoop JARs:
>
> /Users/tucu/dev-apps/hadoop/conf
> /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar
> /Users/tucu/dev-apps/hadoop/bin/..
> /Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar
> /Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar
>
> ..... (about 30 jars from hadoop lib/ )
>
> /Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar
>
> # Job JARs (for a job with only 2 JARs):
>
>
> /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/0000003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar
>
> /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
>
> /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_000000_0/work
> -----
>
>
> What I'm suggesting is that the latter group, the job JARs, be soft-linked
> (by the TT) into the working directory, so that their classpath is just:
>
> -----
> java-launcher.jar
> oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
> .
> -----
>
>
> Alejandro
>
> On Wed, Oct 6, 2010 at 7:57 PM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
>
>>  Hi Alejandro,
>>
>>    yes, it can of course be done right (sorry if my wording seemed to
>> imply otherwise). Just saying that I think that Hadoop M/R should not go
>> into that class loader / module separation business. It's one Job, one VM,
>> right? So the problem is to assign just the stuff needed to let the Job do
>> its business without becoming an obstacle.
>>
>>   Must admit I didn't understand your proposal 2. How would that remove
>> (e.g.) jetty libs from the job's classpath?
>>
>> Thanks,
>>   Henning
>>
>> On Wednesday, 06.10.2010, at 18:28 +0800, Alejandro Abdelnur wrote:
>>
>>  1. Classloader business can be done right. Actually it could be done as
>> spec-ed for servlet web-apps.
>>
>>
>>
>>  2. If the issue is strictly 'too large classpath', then a simpler
>> solution would be to soft-link all JARs to the current directory and create
>> the classpath with the JAR names only (no path). Note that the soft-linking
>> business is already supported by the DistributedCache. So the changes would
>> be mostly in the TT to create the JAR names only classpath before starting
>> the child.
>>
>>
>>
>>  Alejandro
>>
>>
>>
>>  On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm <henning.blohm@zfabrik.de>
>> wrote:
>>
>>  Hi Tom,
>>
>>   that's exactly it. Thanks! I don't think that I can comment on the
>> issues in Jira so I will do it here.
>>
>>   Tricking with class paths and deviating from the default class loading
>> delegation has never been anything but a short-term relief. Fixing things
>> by imposing a "better" order of stuff on the class path will not work when
>> people do actually use child loaders (as the parent wins) - like we do. Also
>> it may easily lead to very confusing situations because the former part of
>> the class path is not complete and gets other stuff from a latter part etc.
>> etc.... no good.
>>
>>   Child loaders are good for module separation but should not be used to
>> "hide" type visibility from the parent. That almost certainly leads to a
>> Class Loader Constraint Violation - once you lose control (which is usually
>> earlier than expected).
>>
>>   The suggestion to reduce the Job class path to the required minimum is
>> the most practical approach. There is some gray area there of course and it
>> will not be feasible to reach the absolute minimal set of types there - but
>> something reasonable, i.e. the hadoop core that suffices to run the job.
>> Certainly jetty & co are not required for job execution (btw. I "hacked"
>> 0.20.2 to remove anything in "server/" from the classpath before setting the
>> job class path).
>>
>>   I would suggest to
>>
>>   a) introduce some HADOOP_JOB_CLASSPATH var that, if set, is the
>> additional classpath, added to the "core" classpath (as described above). If
>> not set, for compatibility, preserve today's behavior.
>>   b) not get into custom child loaders for jobs as part of hadoop M/R.
>> It's non-trivial to get it right and feels beyond scope.
>>
>>   I wouldn't mind helping btw.
>>
>> Thanks,
>>   Henning
>>
>>
>>
>>
>>
>> On Tue, 2010-10-05 at 15:59 -0700, Tom White wrote:
>>
>> Hi Henning,
>>
>> I don't know if you've seen https://issues.apache.org/jira/browse/MAPREDUCE-1938 and
>> https://issues.apache.org/jira/browse/MAPREDUCE-1700, which have
>> discussion about this issue.
>>
>> Cheers
>> Tom
>>
>> On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
>> > Short update on the issue:
>> >
>> > I tried to find a way to separate class path configurations by modifying the
>> > scripts in HADOOP_HOME/bin but found that TaskRunner actually copies the
>> > class path setting from the parent process when starting a local task so
>> > that I do not see a way of having less on a job's classpath without
>> > modifying Hadoop.
>> >
>> > As that will present a real issue when running our jobs on Hadoop I would
>> > like to propose to change TaskRunner so that it sets a class path
>> > specifically for M/R tasks. That class path could be defined in the scripts
>> > (as for the other processes) using a particular environment variable (e.g.
>> > HADOOP_JOB_CLASSPATH). It could default to the current VM's class path,
>> > preserving today's behavior.
>> >
>> > Is it ok to enter this as an issue?
>> >
>> > Thanks,
>> >   Henning
>> >
>> >
>> > On Friday, 17.09.2010, at 16:01 +0000, Allen Wittenauer wrote:
>> >
>> > On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote:
>> >
>> >> When running map reduce tasks in Hadoop I run into classpath issues.
>> >> Contrary to previous posts, my problem is not that I am missing classes on
>> >> the Task's class path (we have a perfect solution for that) but rather find
>> >> too many (e.g. ECJ classes or jetty).
>> >
>> > The fact that you mention:
>> >
>> >> The libs in HADOOP_HOME/lib seem to contain everything needed to run
>> >> anything in Hadoop which is, I assume, much more than is needed to run a
>> >> map reduce task.
>> >
>> > hints that your perfect solution is to throw all your custom stuff in lib.
>> > If so, that's a huge mistake.  Use distributed cache instead.
>> >
>>
>>
>>
>>
>>
>>
>
