hadoop-common-dev mailing list archives

From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1032) Support for caching Job JARs
Date Thu, 01 Mar 2007 00:13:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476778

Mahadev konar commented on HADOOP-1032:

Some comments:
1) Can we add some javadoc to the public methods?
2) We are better off not changing the JobConf API; instead, include these as static methods in DistributedCache.
3) We should do it for both cache archives and cache files. Cache files are necessary because you do not want your lib.jar to be unjarred before it is added to the classpath.
4) Checking whether the last component of each path matches the archive name is not right. You might have two jars with the same name but different paths in HDFS, like hdfs:hostname:port/something/lib.jar or hdfs:port/something/some/lib.jar. The code will add both jars to the classpath even though that wasn't asked for.
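
To illustrate point 4, here is a minimal sketch of the collision. It is not the code from the patch: the class CacheJarMatch, its methods, and the URIs (including the host name "namenode" and port 9000) are made up for illustration, and the URIs are written in hierarchical hdfs://host:port/path form so that java.net.URI can parse a path out of them.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop or the HADOOP-1032 patch.
// It contrasts matching cached jars by file name with matching by full URI.
class CacheJarMatch {

    // Returns the last path component of a URI, e.g. "lib.jar".
    static String fileName(URI uri) {
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Name-only matching: selects every cached entry whose file name
    // equals the requested name, regardless of its directory.
    static List<URI> matchByName(List<URI> cached, String requestedName) {
        List<URI> hits = new ArrayList<>();
        for (URI uri : cached) {
            if (fileName(uri).equals(requestedName)) {
                hits.add(uri);
            }
        }
        return hits;
    }

    // Full-URI matching: selects only the entry that was actually requested.
    static List<URI> matchByUri(List<URI> cached, URI requested) {
        List<URI> hits = new ArrayList<>();
        for (URI uri : cached) {
            if (uri.equals(requested)) {
                hits.add(uri);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed example URIs: two different jars that share a file name.
        List<URI> cached = Arrays.asList(
            URI.create("hdfs://namenode:9000/something/lib.jar"),
            URI.create("hdfs://namenode:9000/something/some/lib.jar"));

        // Name-only matching picks up both jars.
        System.out.println(matchByName(cached, "lib.jar").size()); // 2
        // Full-URI matching picks up only the requested jar.
        System.out.println(matchByUri(cached, cached.get(0)).size()); // 1
    }
}
```

Comparing the whole URI (or at least the full path) avoids putting an unrequested jar on the classpath when two cached jars happen to share a name.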

> Support for caching Job JARs 
> -----------------------------
>                 Key: HADOOP-1032
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1032
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.11.2
>            Reporter: Gautam Kowshik
>            Priority: Minor
>             Fix For: 0.12.0
>         Attachments: HADOOP-1032.patch, HADOOP-1032_2.patch, HADOOP-1032_3.patch
> Often jobs need to be rerun number of times.. like a job that reads from crawled data time and again.. so having to upload job jars to every node is cumbersome. We need a caching mechanism to boost performance. Here are the features for job specific caching of jars/conf
> - Ability to resubmit jobs with jars without having to propagate same jar to all nodes.
>     The idea is to keep a store(path mentioned by user in job.xml?) local to the task node so as to speed up task initiation on tasktrackers. Assumes that the jar does not change during an MR task.
> - An independent DFS store to upload jars to (Distributed File Cache?).. that does not cleanup between jobs.
>     This might need user level configuration to indicate to the jobclient to upload files to DFSCache instead of the DFS. https://issues.apache.org/jira/browse/HADOOP-288 facilitates this. Our local cache can be client to the DFS Cache.
> - A standard cache mechanism that checks for changes in the local store and picks from dfs if found dirty.
>     This does away with versioning. The DFSCache supports a md5 checksum check, we can use that.
> Anything else? Suggestions? Thoughts?
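
The "pick from dfs if found dirty" check in the quoted description could look roughly like the sketch below. This is an assumption about the general idea only, not the DFSCache implementation: the class CacheFreshness and its methods are hypothetical, and the file contents are passed as byte arrays to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: decides whether a locally cached
// copy is stale by comparing its MD5 digest with the digest of the DFS copy.
class CacheFreshness {

    // Computes the MD5 digest of the given content.
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // The local copy is "dirty" when its digest differs from the remote one,
    // in which case it would have to be fetched from the DFS again.
    static boolean isDirty(byte[] localContent, byte[] remoteMd5) {
        return !Arrays.equals(md5(localContent), remoteMd5);
    }

    public static void main(String[] args) {
        // Stand-ins for the bytes of a jar in the DFS and two local copies.
        byte[] remote = "job jar, version 1".getBytes(StandardCharsets.UTF_8);
        byte[] sameLocal = "job jar, version 1".getBytes(StandardCharsets.UTF_8);
        byte[] staleLocal = "job jar, version 0".getBytes(StandardCharsets.UTF_8);

        byte[] remoteMd5 = md5(remote);
        System.out.println(isDirty(sameLocal, remoteMd5));  // false
        System.out.println(isDirty(staleLocal, remoteMd5)); // true
    }
}
```

Because only the digests are compared, no explicit version number has to be tracked for the cached jar.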

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
