hadoop-mapreduce-issues mailing list archives

From "Gera Shegalov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6128) Automatic addition of bundled jars to distributed cache
Date Thu, 20 Nov 2014 18:45:36 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219771#comment-14219771 ]

Gera Shegalov commented on MAPREDUCE-6128:

Thanks for your comments, Jason!

bq. I was actually joking about the DO_TESTMRJOBS_HACK name.
DO_TESTMRJOBS_HACK sounded very creative and tempting to use :) I need to pause on this patch
for a bit. But I will be able to rework TestMRJobs to set the CLASSPATH explicitly.

Regarding nonManifestJarSet.clear, it's only there to remove unneeded edges from the object graph.
IMO, there is no point in nulling a local variable in a stack frame that is about to go away
anyway. Most of the time it's not even worth considering this kind of optimization for the
job client. But our Scalding jobs submitting large DAGs, in combination with memory-hungry Parquet
splits, made me think of it.

bq. I'm a bit surprised the last jar in the manifest classpath wins since if they really are
the same jar (but possibly different versions) then the first one in the classpath will win
in practice, not the last.
That's a good point too. I was reasoning from how I assumed the build system would
simply append dependencies. Looking at it from the classpath perspective, as you suggest, works better.

It was actually a response to your comment "If the manifest asks for two different jars with
the same basename then I think it will silently skip the latter entry. Intentional?"

I am happy to change it if we settle on first-entry-wins. Maybe [~sjlee0] wants to chime
in as well.
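
To make the two policies under discussion concrete, here is a minimal sketch of deduplicating manifest Class-Path entries by basename. It is not code from the patch: the class name `ManifestClasspathDedup` and both method names are hypothetical, and entries are treated as plain '/'-separated strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the MAPREDUCE-6128 patch.
public class ManifestClasspathDedup {

  /** Returns the part of the entry after the last '/'. */
  static String basename(String entry) {
    int slash = entry.lastIndexOf('/');
    return slash < 0 ? entry : entry.substring(slash + 1);
  }

  /** Keeps the first entry seen for each basename, like classpath lookup order. */
  static List<String> firstEntryWins(List<String> entries) {
    Map<String, String> byBasename = new LinkedHashMap<>();
    for (String entry : entries) {
      byBasename.putIfAbsent(basename(entry), entry);
    }
    return new ArrayList<>(byBasename.values());
  }

  /** Keeps the last entry seen for each basename (position of the first is retained). */
  static List<String> lastEntryWins(List<String> entries) {
    Map<String, String> byBasename = new LinkedHashMap<>();
    for (String entry : entries) {
      byBasename.put(basename(entry), entry);
    }
    return new ArrayList<>(byBasename.values());
  }
}
```

With first-entry-wins, the surviving jar is the one a classloader walking the classpath in order would have picked anyway, which is the argument for that policy.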

bq. I should point out that we don't have to exclude dependencies that conflict with the same
basename, since we could generate a unique linkname
Correct, that is the trick I used for [Twitter's transparent Rhs caching for Cascading HashJoin|https://github.com/twitter/cascading/blob/8271526443c9ef832415df5d9673fde3e4391620/cascading-hadoop/src/main/shared/cascading/tap/hadoop/fs/DistributedCacheFileSystem.java].
However, for code shipping I did not want to introduce any additional non-determinism.
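
For reference, the unique-linkname alternative could look roughly like the sketch below. The distributed cache takes the symlink name from the URI fragment (the part after `#`), so conflicting basenames can coexist if each gets its own link name. The naming scheme here (a numeric prefix assigned in manifest order) and the names `UniqueLinkNames` and `assignLinkNames` are assumptions for illustration; this is neither the Cascading code linked above nor part of this patch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the numbering scheme is an assumption.
public class UniqueLinkNames {

  /** Returns the part of the entry after the last '/'. */
  static String basename(String entry) {
    int slash = entry.lastIndexOf('/');
    return slash < 0 ? entry : entry.substring(slash + 1);
  }

  /**
   * Maps a unique link name to each entry, in input order. The first entry
   * with a given basename keeps it; later ones get "1-", "2-", ... prefixes.
   * A caller could then register each jar as entry + "#" + linkName.
   */
  static Map<String, String> assignLinkNames(List<String> entries) {
    Map<String, String> links = new LinkedHashMap<>();
    for (String entry : entries) {
      String base = basename(entry);
      String link = base;
      // Try increasing prefixes until the link name is unused.
      for (int n = 1; links.containsKey(link); n++) {
        link = n + "-" + base;
      }
      links.put(link, entry);
    }
    return links;
  }
}
```

Even with a scheme like this, which link name a jar ends up under depends on the order of the manifest entries, so the resulting task classpath is less obvious than with a plain skip-on-conflict rule.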

> Automatic addition of bundled jars to distributed cache 
> --------------------------------------------------------
>                 Key: MAPREDUCE-6128
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6128
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.5.1
>            Reporter: Gera Shegalov
>            Assignee: Gera Shegalov
>         Attachments: MAPREDUCE-6128.v01.patch, MAPREDUCE-6128.v02.patch, MAPREDUCE-6128.v03.patch,
> MAPREDUCE-6128.v04.patch, MAPREDUCE-6128.v05.patch, MAPREDUCE-6128.v06.patch, MAPREDUCE-6128.v07.patch,
>
> On the client side, JDK adds Class-Path elements from the job jar manifest
> on the classpath. In theory there could be many bundled jars in many directories such
> that adding them manually via libjars or similar means to task classpaths is cumbersome. If
> this property is enabled, the same jars are added
> to the task classpaths automatically.
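
As a rough illustration of the mechanism the description relies on, the sketch below reads the Class-Path attribute from a jar's manifest with the standard `java.util.jar` API and splits it into entries. The class name `ManifestClassPathReader` and its method names are hypothetical, and resolving entries against the job jar location or adding them to the distributed cache is left out.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical illustration only; not taken from the MAPREDUCE-6128 patch.
public class ManifestClassPathReader {

  /** Splits a Class-Path attribute value on whitespace; null yields an empty list. */
  static List<String> parseClassPath(String value) {
    List<String> entries = new ArrayList<>();
    if (value == null) {
      return entries;
    }
    for (String entry : value.trim().split("\\s+")) {
      if (!entry.isEmpty()) {
        entries.add(entry);
      }
    }
    return entries;
  }

  /** Reads the Class-Path entries from the main manifest section of the given jar. */
  static List<String> readClassPath(String jarPath) throws IOException {
    try (JarFile jar = new JarFile(jarPath)) {
      Manifest manifest = jar.getManifest();
      if (manifest == null) {
        return new ArrayList<>(); // jar has no manifest at all
      }
      return parseClassPath(
          manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH));
    }
  }
}
```

The entries come back in manifest order, which is the order any first-entry-wins or last-entry-wins policy would then be applied to.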
