hadoop-common-dev mailing list archives

From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1622) Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on
Date Wed, 19 Mar 2008 04:46:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580227#action_12580227 ]

Mahadev konar commented on HADOOP-1622:
---------------------------------------

I like Owen's idea. It's simple and gives users the flexibility they need.

Here is how I am implementing this:

The hadoop command line will have the following options:

hadoop jar -file <comma separated files> -jar <comma separated jars> -archive <comma separated archives>

All of these can be comma separated URIs, defaulting to the local file system if no scheme is specified.
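
To make the "defaulting to local file system" rule concrete, here is a minimal sketch of how one option value might be parsed. ResourceListParser and parse are hypothetical names used only for illustration, not existing Hadoop classes, and treating a missing scheme as "local file" is my reading of the rule above.

```java
import java.net.URI;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class ResourceListParser {

    /**
     * Splits a comma separated option value into URIs.
     * Entries without a scheme are assumed to be local paths and are
     * converted to absolute file: URIs.
     */
    static List<URI> parse(String commaSeparated) {
        List<URI> uris = new ArrayList<>();
        for (String entry : commaSeparated.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            // URI.create throws IllegalArgumentException for malformed input
            // (for example, an unescaped space in the path).
            URI uri = URI.create(trimmed);
            if (uri.getScheme() == null) {
                // No scheme given: fall back to the local file system.
                uri = Paths.get(trimmed).toUri();
            }
            uris.add(uri);
        }
        return uris;
    }
}
```

For example, parse("/tmp/a.txt,hdfs://somehost:9000/data/b.txt") would yield one file: URI and one hdfs: URI (9000 is just a placeholder port).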

The JobClient uploads the files / jars / archives onto HDFS (or whichever filesystem MapReduce is using),
under the job directory.

These files/jars/archives might have the same name but different URIs.
Example:  hadoop jar -file file:///file1,hdfs://somehost:port/file1
We would store these files as:
jobdir/file/file/file1
jobdir/hdfs_somehost_port/file1

Keeping these files in different directories, with the directory named after the URI, would
let us use DistributedCache as it is.

So we could say something like DistributedCache.addFiles(jobdir/file/file/file1, jobdir/hdfs_somehost_port/file1).
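
A rough sketch of the naming rule, to show how two resources with the same name end up in different directories. JobResourcePaths and stagedPath are hypothetical names, not Hadoop APIs. The sketch assumes a resource-kind directory ("file", "jars", "archives") in front of the URI-derived directory, as in jobdir/file/file/file1, and uses 9000 as a stand-in for "port".

```java
import java.net.URI;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class JobResourcePaths {

    /**
     * Builds jobDir/kind/uriDir/name, where uriDir is the URI scheme,
     * followed by "_" and the authority (host and port) when one is present.
     */
    static String stagedPath(String jobDir, String kind, URI source) {
        String path = source.getPath();
        if (path == null) {
            // Opaque URIs such as "mailto:x" carry no path to take a name from.
            throw new IllegalArgumentException("URI has no path: " + source);
        }
        // Assumption: a URI without a scheme refers to the local file system.
        String scheme = source.getScheme() == null ? "file" : source.getScheme();
        String authority = source.getAuthority();
        String uriDir = scheme;
        if (authority != null) {
            // Replace ':' and other characters that are awkward in a directory name.
            uriDir = scheme + "_" + authority.replaceAll("[^A-Za-z0-9.-]", "_");
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return jobDir + "/" + kind + "/" + uriDir + "/" + name;
    }

    public static void main(String[] args) {
        System.out.println(stagedPath("jobdir", "file", URI.create("file:///file1")));
        System.out.println(stagedPath("jobdir", "file", URI.create("hdfs://somehost:9000/file1")));
    }
}
```

Under these assumptions the two file1 entries map to jobdir/file/file/file1 and jobdir/file/hdfs_somehost_9000/file1, so neither overwrites the other.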

So the job directory would look like:

jobdir/jars/urischeme/<jarfiles>
jobdir/archives/urischeme/<archivefiles>
jobdir/file/urischeme/<files>

The ones in jars will be added to the classpath of all the tasks, in the order they were mentioned.
The others will be copied once per job and symlinked from the current working directory of
the task.
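
As a small illustration of "in the order they were mentioned", the classpath could be assembled roughly like this. TaskClasspath and build are hypothetical names, not Hadoop classes, and dropping a jar that is listed twice is my own assumption, not something stated in the proposal.

```java
import java.io.File;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class TaskClasspath {

    /**
     * Joins jar paths into a classpath string, preserving the order in which
     * they were given. A LinkedHashSet keeps the first mention of a duplicate
     * and drops later ones (an assumption of this sketch).
     */
    static String build(List<String> jarPaths) {
        Set<String> ordered = new LinkedHashSet<>(jarPaths);
        return String.join(File.pathSeparator, ordered);
    }
}
```

Order matters here because when two jars contain the same class, the one that appears first on the classpath is the one that gets loaded.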

comments?

> Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1622
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Mahadev konar
>             Fix For: 0.17.0
>
>         Attachments: hadoop-1622-4-20071008.patch, HADOOP-1622-5.patch, HADOOP-1622-6.patch,
>                      HADOOP-1622-7.patch, HADOOP-1622-8.patch, HADOOP-1622-9.patch,
>                      multipleJobJars.patch, multipleJobResources.patch, multipleJobResources2.patch
>
>
> More likely than not, a user's job may depend on multiple jars.
> Right now, when submitting a job through bin/hadoop, there is no way for the user to specify that.
> A workaround is to re-package all the dependent jars into a new jar, or to put the dependent
> jar files in the lib dir of the new jar.
> This workaround causes unnecessary inconvenience to the user. Furthermore, if the user does not
> own the main function (as is the case when the user uses Aggregate, datajoin, or streaming),
> the user has to re-package those system jar files too.
> It is much desired that hadoop provide a clean and simple way for the user to specify a list of
> dependent jar files at the time of job submission. Something like:
> bin/hadoop .... --depending_jars j1.jar:j2.jar 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

