hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1622) Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on
Date Wed, 12 Mar 2008 05:00:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577708#action_12577708 ]

Owen O'Malley commented on HADOOP-1622:
---------------------------------------

Dennis,
   Upon looking at this, I'm getting worried. This looks like a lot of special cases. What we really need is to support three kinds of files:

  * simple files
  * archives
  * jar files

For each of these, we would like the value to be able to come from a URI, with a local file as the most convenient default. So, I propose something like:

{code}
-file foo,bar,hdfs:baz
{code}

will upload foo and bar to an upload area and download foo, bar, and baz to the slave nodes
as the tasks are run on them.
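
For illustration, here is a minimal sketch of how such a -file value might be resolved, assuming entries without a scheme default to local files (the class and method names are hypothetical, not the actual patch):

{code}
// Hypothetical sketch: split a "-file foo,bar,hdfs:baz" value into URIs,
// defaulting schemeless entries to local files that the job client uploads.
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class ResourceOptionParser {
  public static List<URI> parse(String value) {
    List<URI> uris = new ArrayList<URI>();
    for (String entry : value.split(",")) {
      URI uri = URI.create(entry);
      if (uri.getScheme() == null) {
        // No scheme given: treat the entry as a local file to upload.
        uri = URI.create("file:" + entry);
      }
      uris.add(uri);
    }
    return uris;
  }
}
{code}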

{code}
-archive foo.zip,hdfs:baz.zip
{code}

will download foo.zip and baz.zip and expand them.
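
Expansion on the slave could be as simple as unzipping into the task's working directory; a self-contained sketch (not the actual implementation):

{code}
// Hypothetical sketch: expand a downloaded zip archive into a directory.
import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ArchiveExpander {
  public static void unzip(File archive, File destDir) throws IOException {
    ZipInputStream zin = new ZipInputStream(new FileInputStream(archive));
    try {
      byte[] buf = new byte[4096];
      ZipEntry entry;
      while ((entry = zin.getNextEntry()) != null) {
        File out = new File(destDir, entry.getName());
        if (entry.isDirectory()) {
          out.mkdirs();
          continue;
        }
        out.getParentFile().mkdirs();
        OutputStream os = new FileOutputStream(out);
        try {
          int n;
          while ((n = zin.read(buf)) > 0) {
            os.write(buf, 0, n);
          }
        } finally {
          os.close();
        }
      }
    } finally {
      zin.close();
    }
  }
}
{code}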

Finally, the -jar option would download the jars and put them on the class path. So,

{code}
-jar myjar.jar,hadoop-0.16.1-streaming.jar
{code}

would upload the files in the job client, download them to the slaves, and add them to the
class path in the given order. 
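
Since ordering matters, the task runner could simply build its class loader from the jars in the order they were given; a sketch (hypothetical class, not the actual task runner code):

{code}
// Hypothetical sketch: put downloaded jars on the task class path
// in the order they appeared on the command line.
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

public class TaskClasspath {
  public static ClassLoader build(File[] jars, ClassLoader parent)
      throws Exception {
    URL[] urls = new URL[jars.length];
    for (int i = 0; i < jars.length; i++) {
      urls[i] = jars[i].toURI().toURL();  // preserves the given order
    }
    return new URLClassLoader(urls, parent);
  }
}
{code}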

I think I'd leave the rsync functionality out and just use hdfs:_upload/$jobid/... as transient storage, deleting it when the job is done. If the user wants to save bandwidth, they can upload the files to hdfs themselves, in which case they don't need to be uploaded again.
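
A rough sketch of that staging lifecycle (the directory layout and class are illustrative assumptions, not the actual Hadoop code):

{code}
// Hypothetical sketch: local files are copied into a per-job upload
// directory, hdfs: URIs are used in place, and the upload directory
// is deleted when the job finishes.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JobStaging {
  private final FileSystem fs;
  private final Path uploadDir;  // assumed per-job transient area

  public JobStaging(Configuration conf, String jobId) throws Exception {
    this.fs = FileSystem.get(conf);
    this.uploadDir = new Path("_upload/" + jobId);
  }

  /** Returns the hdfs path for a resource, uploading local files. */
  public Path stage(URI uri) throws Exception {
    if ("hdfs".equals(uri.getScheme())) {
      return new Path(uri.toString());  // already in hdfs: no upload
    }
    // Otherwise treat the entry as a local file to upload.
    Path src = new Path(uri.getSchemeSpecificPart());
    Path dst = new Path(uploadDir, src.getName());
    fs.copyFromLocalFile(src, dst);
    return dst;
  }

  /** Delete the transient upload area when the job is done. */
  public void cleanup() throws Exception {
    fs.delete(uploadDir, true);
  }
}
{code}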
Thoughts?
 

> Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1622
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Dennis Kubes
>         Attachments: hadoop-1622-4-20071008.patch, HADOOP-1622-5.patch, HADOOP-1622-6.patch, HADOOP-1622-7.patch, HADOOP-1622-8.patch, HADOOP-1622-9.patch, multipleJobJars.patch, multipleJobResources.patch, multipleJobResources2.patch
>
>
> More likely than not, a user's job may depend on multiple jars.
> Right now, when submitting a job through bin/hadoop, there is no way for the user to specify that.
> A workaround is to re-package all the dependent jars into a new jar or put the dependent jar files in the lib dir of the new jar.
> This workaround causes unnecessary inconvenience to the user. Furthermore, if the user does not own the main function
> (as is the case when the user uses Aggregate, datajoin, or streaming), the user has to re-package those system jar files too.
> It is much desired that hadoop provide a clean and simple way for the user to specify a list of dependent jar files at the time
> of job submission. Something like:
> bin/hadoop .... --depending_jars j1.jar:j2.jar

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

