hadoop-common-dev mailing list archives

From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-576) Enhance streaming to use the new caching feature
Date Thu, 26 Oct 2006 22:10:18 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-576?page=all ]

Mahadev konar updated HADOOP-576:
---------------------------------

    Attachment: streaming.patch

This patch makes caching available in streaming. You can specify a file or archive to cache as

bin/hadoop jar hadoop-streaming.jar -cacheFile dfs://host:port/path_in_dfs_of_file_or_archive#NAME

This creates a symlink named NAME in the task's current working directory:
(i) in the case of a file, the symlink points to the location where the file has been localized;
(ii) in the case of an archive, the symlink points to the directory where the archive has been
localized (unpacked).
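As a side note, the #NAME part is an ordinary URI fragment, so it can be inspected with java.net.URI (the parser the design notes below say these URIs must remain compatible with). A minimal standalone sketch; the host:9000 address is a made-up example:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CacheUriDemo {
    public static void main(String[] args) throws URISyntaxException {
        // A cache URI in the shape used above; host:9000 is a made-up address.
        URI cacheUri = new URI("dfs://host:9000/user/me/big.zip#big_1");

        // The fragment is the name the symlink gets in the working directory.
        System.out.println("symlink name: " + cacheUri.getFragment()); // big_1

        // The path locates the file or archive in DFS.
        System.out.println("dfs path: " + cacheUri.getPath()); // /user/me/big.zip
    }
}
```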

This patch also makes symlinks available in conventional Hadoop.
To create a symlink, providing a fragment is required (meaning you must specify NAME in
the URI).
In conventional Hadoop you would then call:

DistributedCache.createSymlink(jobconf)

There is also a method DistributedCache.checkURIs(fileURIs, archiveURIs).
It checks for fragment collisions: it returns false if any fragment is missing, or if
two URIs share the same fragment.

So the correct order in which conventional Hadoop users should call these methods is:
DistributedCache.checkURIs(fileURIs, archiveURIs)
DistributedCache.setCacheFiles() / DistributedCache.setCacheArchives()
DistributedCache.createSymlink()

Hadoop streaming takes care of all of this automatically.
I have tested it on Windows (Cygwin) as well, and it works.
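The collision rule can be sketched in plain Java. This mirrors the checkURIs contract described above, but it is not the actual DistributedCache implementation:

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

public class FragmentCheck {
    // Returns false if any URI lacks a fragment, or if two URIs share the
    // same fragment -- the same contract described for checkURIs above.
    public static boolean checkFragments(URI[] uris) {
        Set<String> seen = new HashSet<String>();
        for (URI uri : uris) {
            String frag = uri.getFragment();
            if (frag == null || !seen.add(frag)) {
                return false; // missing fragment, or fragment collision
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        URI a = new URI("dfs:/user/me/big.zip#big_1");
        URI b = new URI("dfs:/user/other/big.zip#big_2");
        URI clash = new URI("dfs:/user/you/big.zip#big_1");

        System.out.println(checkFragments(new URI[] { a, b }));     // true
        System.out.println(checkFragments(new URI[] { a, clash })); // false
    }
}
```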


> Enhance streaming to use the new caching feature
> ------------------------------------------------
>
>                 Key: HADOOP-576
>                 URL: http://issues.apache.org/jira/browse/HADOOP-576
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Michel Tourn
>         Assigned To: Mahadev konar
>         Attachments: streaming.patch
>
>
> Design proposal to expose filecache access to Hadoop streaming.
> The main difference with the pure-Java filecache code is:
> 1. As part of job launch (in hadoopStreaming client) we validate presence of
> cached archives/files in DFS.
> 2. As part of Task initialization, a symbolic link to cached files/unarchived
> directories is created in the Task working directory.
> C1. New command-line options (example)
> -cachearchive dfs:/user/me/big.zip#big_1 
> -cachefile dfs:/user/other/big.zip#big_2 
> -cachearchive dfs:/user/me/bang.zip
> This maps to API calls to static methods:
> DistributedCache.addCacheArchive(URI uri, Configuration conf)
> DistributedCache.addCacheFile(URI uri, Configuration conf)
> This is done in class StreamJob methods parseArgv() and setJobConf().
> The code should be similar to the way "-file" is handled.
> One difference is that we now require a FileSystem instance to VALIDATE the DFS
> paths in -cachefile and -cachearchive. The FileSystem instance should not be
> accessed before the filesystem is set by this: setUserJobConfProps(true);
> If the FileSystem instance is "local" and there are -cachearchive/-cachefile
> options, then fail: this is not supported.
> Else this should return true:
> fs_.isFile(Path) for each -cachearchive/-cachefile option.
> Only in verbose mode: show the isFile status of each option.
> In any verbosity mode: show the first failed isFile() status and abort using
> method StreamJob.fail().
> C2. Task initialization
> The symlinks are called:
> Workingdir/big_1 (points to directory: /cache/user/me/big_zip)
> Workingdir/big_2 (points to file: /cache/user/other/big.zip)
> Workingdir/bang.zip (points to directory /cache/user/me/bang_zip)
> This will require hadoopStreaming to create symbolic links.
> Hadoop should have code to do this in a portable way.
> Although this may not be supported on non-Unix platforms; cross-platform support is
> harder than for hard links. (Cygwin soft links are not a solution: they only work for
> applications compiled against cygwin1.dll.)
> Symbolic links make JUnit tests less portable.
> So maybe the test should run as part of ant target test-unix. (in contrib/streaming/build.xml)
> The parameters after -cachearchive and -cachefile have the following
> properties:
> A. you can optionally give a name to your symlink (after #)
> B. the default name is the leaf name (big.zip, big.zip, bang.zip)
> C. if the same leaf name appears more than once you MUST give a name. Otherwise
> streaming client aborts and complains. For example with this, Streaming client
> should complain:
> -cachearchive dfs:/user/me/big.zip 
> -cachefile dfs:/user/other/big.zip
> This complains because multiple occurrences of "big.zip" are not disambiguated
> with #big_1, #big_2.
> Ideally the Streaming client error message should then generate an example on
> how to fix the parameters:
> -cachearchive dfs:/user/me/big.zip#1
> -cachefile dfs:/user/other/big.zip#2
> ---------
> hadoop-Client note:
> Currently argv parsing is position-independent, i.e. changing the order of
> arguments never impacts the behaviour of hadoopStreaming. It would be good to
> keep this behaviour.
> URI notes:
> scheme is "dfs:" for consistency with current state of Hadoop code.
> However there is a proposal to change the scheme to "hdfs:"
> Using a URI fragment to give a local name to the resource is unusual. The main
> constraint is that the URI should remain parsable by java.net.URI(String), and
> encoding attributes in the fragment is standard (like CGI parameters in an HTTP
> GET request). (The fragment is #big_2 in dfs:/user/other/big.zip#big_2.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
