From: Alejandro Abdelnur
Date: Wed, 6 Oct 2010 18:28:43 +0800
Subject: Re: Too large class path for map reduce jobs
To: mapreduce-user@hadoop.apache.org

1. Classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps.

2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT to create the JAR-names-only classpath before starting the child.
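A rough sketch of the soft-link idea (illustrative only - the class, method, and JAR paths below are made up, not actual TaskTracker code):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.StringJoiner;

public class RelativeClasspathSketch {

    // Symlink every JAR into the task's working directory and return a
    // classpath made of bare JAR names, so its length no longer depends
    // on where the JARs actually live on the node.
    static String linkJarsAndBuildClasspath(File workDir, File... jars) throws IOException {
        StringJoiner cp = new StringJoiner(File.pathSeparator);
        for (File jar : jars) {
            Path link = Paths.get(workDir.getPath(), jar.getName());
            if (!Files.exists(link)) {
                Files.createSymbolicLink(link, jar.toPath().toAbsolutePath());
            }
            cp.add(jar.getName()); // name only, no path
        }
        return cp.toString();
    }

    public static void main(String[] args) throws IOException {
        String cp = linkJarsAndBuildClasspath(new File("."),
                new File("/opt/hadoop/lib/commons-logging-1.0.4.jar"),
                new File("/opt/hadoop/lib/xmlenc-0.52.jar"));
        System.out.println("-classpath " + cp);
    }
}

In real life the DistributedCache already does the symlinking; the missing piece would only be emitting the name-only classpath when the TT builds the child's java command line.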

Alejandro

On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
Hi Tom,

That's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here.

Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a "better" order of stuff on the class path will not work when people actually use child loaders (where the parent wins) - like we do. It may also easily lead to very confusing situations because an earlier part of the class path is incomplete and picks up other stuff from a later part, etc.... no good.

Child loaders are good for module separation but should not be used to "hide" type visibility from the parent. That almost certainly leads to class loader constraint violations once you lose control (which usually happens earlier than expected).
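To make the delegation point concrete, a minimal child loader that sticks to the default parent-first delegation could look like this (hypothetical example, not Hadoop code):

import java.net.URL;
import java.net.URLClassLoader;

// A child loader used only to *add* module-private JARs. Because it keeps
// URLClassLoader's default parent-first delegation, any type the parent
// already knows always wins; the child never shadows parent types, which is
// what avoids loader-constraint violations when objects cross the module
// boundary.
public class ModuleClassLoader extends URLClassLoader {
    public ModuleClassLoader(URL[] moduleJars, ClassLoader parent) {
        super(moduleJars, parent); // parent-first is the default behavior
    }
}

A child-first variant would override loadClass() to consult moduleJars before delegating - convenient for hiding types, but exactly the setup that tends to blow up once instances of the "hidden" types leak across the boundary.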

The suggestion to reduce the job class path to the required minimum is the most practical approach. There is some gray area there, of course, and it will not be feasible to reach the absolute minimal set of types - but something reasonable, i.e. the Hadoop core that suffices to run the job. Certainly Jetty & co. are not required for job execution (btw, I "hacked" 0.20.2 to remove anything in "server/" from the classpath before setting the job class path).
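Something along these lines would do that kind of filtering (illustrative only, with made-up paths; not the actual 0.20.2 patch):

import java.io.File;
import java.util.StringJoiner;

public class ClasspathFilterSketch {

    // Drop every classpath entry that lives under a "server/" directory
    // before handing the classpath to the task child JVM.
    static String stripServerEntries(String classpath) {
        StringJoiner kept = new StringJoiner(File.pathSeparator);
        for (String entry : classpath.split(File.pathSeparator)) {
            if (!entry.contains("server" + File.separator)) {
                kept.add(entry);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String cp = "/opt/hadoop/hadoop-0.20.2-core.jar"
                + File.pathSeparator + "/opt/hadoop/lib/server/jetty-6.1.14.jar"
                + File.pathSeparator + "/opt/hadoop/lib/log4j-1.2.15.jar";
        // Prints the classpath without the Jetty entry.
        System.out.println(stripServerEntries(cp));
    }
}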

I would suggest to:

a) introduce some HADOOP_JOB_CLASSPATH var that, if set, is the additional classpath, added to the "core" classpath (as described above). If not set, for compatibility, preserve today's behavior (see the sketch below).
b) not getting into custom child loaders for jobs as part of Hadoop M/R. It's non-trivial to get right and feels beyond scope.
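For (a), the child classpath assembly in TaskRunner could honor such a variable roughly like this (a sketch assuming the variable is exported to the TT environment; names and paths are illustrative, not a patch):

import java.io.File;

public class JobClasspathSketch {

    // Classpath for the task child JVM: if HADOOP_JOB_CLASSPATH is set, use
    // the minimal core classpath plus that value; otherwise keep today's
    // behavior of inheriting the parent JVM's full classpath.
    static String childClasspath(String coreClasspath) {
        String jobCp = System.getenv("HADOOP_JOB_CLASSPATH");
        if (jobCp == null || jobCp.isEmpty()) {
            return System.getProperty("java.class.path"); // current behavior
        }
        return coreClasspath + File.pathSeparator + jobCp;
    }

    public static void main(String[] args) {
        System.out.println(childClasspath("/opt/hadoop/hadoop-0.20.2-core.jar"));
    }
}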

I wouldn't mind helping, btw.

Thanks,
  Henning



On Tue, 2010-10-05 at 15:59 -0700, Tom White wrote:
Hi Henning,

I don't know if you've seen
https://issues.apache.org/jira/browse/MAPREDUCE-1938 and
https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have
discussion about this issue.

Cheers
Tom

On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
> Short update on the issue:
>
> I tried to find a way to separate class path configurations by modifying the
> scripts in HADOOP_HOME/bin but found that TaskRunner actually copies the
> class path setting from the parent process when starting a local task so
> that I do not see a way of having less on a job's classpath without
> modifying Hadoop.
>
> As that will present a real issue when running our jobs on Hadoop I would
> like to propose to change TaskRunner so that it sets a class path
> specifically for M/R tasks. That class path could be defined in the scripts
> (as for the other processes) using a particular environment variable (e.g.
> HADOOP_JOB_CLASSPATH). It could default to the current VM's class path,
> preserving today's behavior.
>
> Is it ok to enter this as an issue?
>
> Thanks,
>   Henning
>
>
> On Friday, 17.09.2010 at 16:01 +0000, Allen Wittenauer wrote:
>
> On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote:
>
>> When running map reduce tasks in Hadoop I run into classpath issues.
>> Contrary to previous posts, my problem is not that I am missing classes on
>> the Task's class path (we have a perfect solution for that) but rather find
>> too many (e.g. ECJ classes or jetty).
>
> The fact that you mention:
>
>> The libs in HADOOP_HOME/lib seem to contain everything needed to run
>> anything in Hadoop which is, I assume, much more than is needed to run a map
>> reduce task.
>
> hints that your perfect solution is to throw all your custom stuff in lib.
> If so, that's a huge mistake. Use distributed cache instead.
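For reference, the distributed cache route looks roughly like this with the 0.20-era API (the HDFS path and job setup are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithCachedJars {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ship a job-specific dependency via the distributed cache and have it
        // added to the task's classpath, instead of dropping it into
        // HADOOP_HOME/lib on every node. The HDFS path is made up.
        DistributedCache.addFileToClassPath(new Path("/libs/myjob-deps.jar"), conf);

        Job job = new Job(conf, "example");
        // ... set mapper/reducer, input/output paths, then:
        // job.waitForCompletion(true);
    }
}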

