hadoop-common-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-1622) Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on
Date Thu, 27 Dec 2007 16:12:43 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554586 ]

musepwizard edited comment on HADOOP-1622 at 12/27/07 8:11 AM:
----------------------------------------------------------------

I have only gotten a chance to design, not develop, this, as I have been launching the Search
Wikia site.  Here is what I have come up with in terms of a more generalized design after
talking with both Doug and Owen about this enhancement:

1. A runjob utility.  runjar is not affected, as it is made to run only a single jar.

2. The options parser will be extended to support resources, upload, classpath, noclasspath,
compress, decompress, and cache.
  - Items that are cached are added to the distributed cache.
  - Items uploaded are by default not added to the classpath.
  - Items cached are by default added to the classpath.
  - Resources are by default added to the classpath.
  - Compress will choose resources to compress before adding them to the job.jar file.
  - Decompress will choose resources to decompress before adding them to the job.jar file.
  - Compress and decompress will only take action on resources being added to the job.  This
will include non-local resources and will need to be handled in slave local job resources.
  - Classpath is ignored for any resource that is being uploaded, as it will already be added
to the classpath due to it being in resources.
  - All options support multiple elements in comma-separated format.
  - Noclasspath will remove cached and non-cached resources from the classpath.  For example,
a jar can be added to resources and included in the local job.jar resources but not included
in its local classpath.  (I don't know if this functionality is useful.)
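
As a rough sketch of the comma-separated option handling described above, the parser might split each option value into trimmed elements.  The class and method names here are assumptions for illustration, not part of the actual patch:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: split a comma-separated option value (e.g. for
// resources or upload) into its individual elements.  Names are
// illustrative only and do not come from the actual HADOOP-1622 patch.
public class JobOptionParser {
    public static List<String> parseElements(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Collections.emptyList();
        }
        String[] parts = value.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();   // tolerate spaces after commas
        }
        return Arrays.asList(parts);
    }
}
```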

3. Resources
  - Resources are one or more items that are jarred up into the single job.jar file.
  - Resources can be files (compressed or uncompressed) or directories.
  - Resources can be from any file system.
  - Resource paths support relative and absolute paths.
  - Resources support URL-type paths to support multiple file systems.
  - If the path is not in a URL format, then it is assumed to be on the local file system as
either an absolute or relative path.
  - Only resources that exist will be included.  This is true for any file system.  The resource
must exist at the beginning of the job to be uploaded.  If the resource exists at the beginning
of the job but not when the local job starts its processing, an error will be thrown and that
task will cease operation.
  - A global configuration variable exists to choose to decompress any compressed file that
is added as a resource.
  - Non-local resources will be pulled down into the local job resources from the resource's
given file system.  This can include DFS and S3 resources added dynamically.
  - Local resources that are added to the job.jar will be resources from the resources configuration
variable passed to the local jobs.  Remaining resources will be the non-local resources that
need to be added to local job resources.
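
The path-resolution rule above (URL-type paths name their file system explicitly; everything else is treated as local) could be sketched roughly as follows.  The class name is an assumption for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of the resource path rule described above: a path
// with a URI scheme (hdfs://, s3://, file://, ...) names its file system
// explicitly; anything else, relative or absolute, is treated as local.
public class ResourcePathResolver {
    public static String fileSystemOf(String path) {
        try {
            String scheme = new URI(path).getScheme();
            return (scheme != null) ? scheme : "local";
        } catch (URISyntaxException e) {
            return "local";   // unparsable as a URI: assume a plain local path
        }
    }
}
```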

4. Uploads
  - Uploads by default are put into the user's home directory on the jobtracker file system.
  - Upload directories can be set either through a configuration variable for a global default
upload folder or through a colon path structure in the upload.  Something like path:uploadto.
  - Upload resources can be added to the classpath by the classpath option.
  - If upload resources are added to the classpath, they will be pulled down into the resources
for each job and added to the local job classpath.
  - Uploads are independent of resources.  An upload doesn't have to be a resource.  A resource
can be an uploaded element.  In this case it would be uploaded (not included in the local job.jar)
and then pulled down from the jobtracker file system as a resource.
  - Uploads will check modified date/time and size before uploading elements.  If the upload
is a directory, the upload will recursively check all files in that directory before upload
and only upload modified files.  This should give rsync-type functionality to uploading
resources and should decrease bandwidth consumption.
  - Uploads will support URL-type paths as well.  This will allow transferring resources from
one type of file system (e.g., S3) to the jobtracker's file system.  Again, resources without
a URL-type structure will be considered to be on the local file system and will support relative
and absolute paths.  Only absolute paths will be supported on non-local file systems.
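
The rsync-style modified-date/size check described above might look roughly like this.  The method and its parameters are assumptions for illustration, not taken from the actual patch:

```java
// Hypothetical sketch of the rsync-style upload check described above: an
// element is re-uploaded only if the remote copy is missing, differs in
// size, or is older than the local file.  Names are illustrative only.
public class UploadChecker {
    public static boolean needsUpload(long localLen, long localMtime,
                                      Long remoteLen, Long remoteMtime) {
        if (remoteLen == null || remoteMtime == null) {
            return true;                              // not uploaded yet
        }
        if (remoteLen.longValue() != localLen) {
            return true;                              // size changed
        }
        return remoteMtime.longValue() < localMtime;  // local copy is newer
    }
}
```

For a directory, the same check would be applied recursively to each contained file, so only modified files are re-sent.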

> Hadoop should provide a way to allow the user to specify jar file(s) the user job depends
on
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1622
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>            Assignee: Dennis Kubes
>             Fix For: 0.16.0
>
>         Attachments: hadoop-1622-4-20071008.patch, HADOOP-1622-5.patch, HADOOP-1622-6.patch,
HADOOP-1622-7.patch, HADOOP-1622-8.patch, HADOOP-1622-9.patch, multipleJobJars.patch, multipleJobResources.patch,
multipleJobResources2.patch
>
>
> More likely than not, a user's job may depend on multiple jars.
> Right now, when submitting a job through bin/hadoop, there is no way for the user to
specify that. 
> A workaround for that is to re-package all the dependent jars into a new jar or put
the dependent jar files in the lib dir of the new jar.
> This workaround causes unnecessary inconvenience to the user. Furthermore, if the user
does not own the main function 
> (like the case when the user uses Aggregate, datajoin, or streaming), the user has to
re-package those system jar files too.
> It is much desired that hadoop provides a clean and simple way for the user to specify
a list of dependent jar files at the time 
> of job submission. Something like:
> bin/hadoop .... --depending_jars j1.jar:j2.jar 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

