hadoop-common-dev mailing list archives

From "Philip Zeyliger (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2914) extend DistributedCache to work locally (LocalJobRunner)
Date Mon, 01 Jun 2009 05:38:07 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Philip Zeyliger updated HADOOP-2914:

    Attachment: HADOOP-2914-v1-full.patch

I set out to get DistributedCache to work with the local job runner, which wasn't too tricky,
but I ended up refactoring the DistributedCache code quite a bit, which has made this
patch large and perhaps unfriendly.

DistributedCache code is used in three places:
# In user code, to (1) configure files to be cached and (2) retrieve the URIs of those files
at runtime,
# In JobClient, to record some metadata information about the files desired in user code,
# And in TaskTracker/TaskRunner, to (1) maintain the cache, and (2) configure the cache per

Most of the code for all of these uses was in public static methods in DistributedCache.java,
though some pretty complicated logic about the DistributedCache was also in TaskTracker.java
and TaskRunner.java.  This made it tricky to tease out what the sacrosanct public APIs were.
My interpretation is that the methods described in the documentation (http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache)
are public APIs, and I have left those, and a few others, intact.  I separated out the other
logic into two other classes, so that then I could avoid duplication between TaskRunner and

The current patch depends on HADOOP-4041, so I've attached two patches: one for Hudson, and
another if you don't want to revisit the intersection with 4041 (which is largely uninteresting:
either way code moves out of TaskRunner into DistributedCacheHandle).

I've added some tests.  TestDistributedCache has become TestDistributedCacheManager, and there's
a new test in there.  TestMRWithDistributedCache tests against both local and MiniMRClusters.
I've also tested using streaming, with commands like:
bin/hadoop jar build/contrib/streaming/hadoop-0.21.0-dev-streaming.jar \
  -files /etc/passwd -input /dev/null -output /tmp/output1 -mapper 'sh -c "test ! -z $mapred_cache_localFiles"'
bin/hadoop jar build/contrib/streaming/hadoop-0.21.0-dev-streaming.jar \
  -jt local -files /etc/passwd -input /dev/null -output /tmp/output2 -mapper 'sh -c "test ! -z $mapred_cache_localFiles"'
Is there a place where tests that use streaming to check other functionality could be checked

I wanted to stop somewhere and send this out, but I can think of several potential future

* The DistributedCache is in core/, but it only makes sense with mapred, so it probably should
be relocated to mapred.
* There's more work to be done to separate out the public interfaces from the private ones.
The timestamp handling that's done by JobClient should really be done by something within
the filecache package, for example.  Much of the annoyance here stems from the haphazard ways
in which Hadoop jobs serialize some configuration data to the configuration file.  DistributedCache
uses, I believe, 6 configuration keys, just to store ("file", "archive", "file+classpath",
"archive+classpath", "filetimestamp", "archive+timestamp").
* Speaking of configuration, DistributedCache will likely not work for files with a comma
in their path, though perhaps URI encoding saves us there.
* I haven't touched the DistributedCacheManager code except to move it there, but I suspect
it could be significantly simplified now that it contains a Configuration object.
* It's my belief that SVN r696957 (HADOOP-249) turned off the symlink feature and that it
hasn't worked since then.  That said, I haven't yet written the test that would confirm this.
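
To illustrate the comma concern above, here is a minimal sketch. It assumes the cached paths are joined into one comma-separated configuration value and split on commas when read back. The class and helper names are hypothetical and not taken from the patch, and the explicit percent-encoding of commas is only one way that encoding might help, not what DistributedCache currently does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommaPathSketch {
    // Hypothetical stand-in for writing a list into one configuration value.
    static String join(List<String> values) {
        return String.join(",", values);
    }

    // Hypothetical stand-in for reading the list back out of the configuration.
    static List<String> split(String stored) {
        return Arrays.asList(stored.split(","));
    }

    // Escape '%' first so that decoding stays unambiguous, then escape ','.
    static String encode(String path) {
        return path.replace("%", "%25").replace(",", "%2C");
    }

    // Reverse of encode: restore ',' first, then '%'.
    static String decode(String encoded) {
        return encoded.replace("%2C", ",").replace("%25", "%");
    }

    static List<String> encodeAll(List<String> paths) {
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            out.add(encode(p));
        }
        return out;
    }

    static List<String> decodeAll(List<String> encoded) {
        List<String> out = new ArrayList<>();
        for (String e : encoded) {
            out.add(decode(e));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/data/a,b.txt", "/data/c.txt");

        // Naive round trip: the comma inside the first path splits it in two.
        List<String> naive = split(join(paths));
        System.out.println("naive entries: " + naive.size());

        // Encoded round trip: the original two paths come back unchanged.
        List<String> safe = decodeAll(split(join(encodeAll(paths))));
        System.out.println("round trip intact: " + safe.equals(paths));
    }
}
```

In this sketch the naive round trip turns two paths into three entries, while the encoded round trip preserves them. Whether the real code paths behave the same way would still need a test.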

Looking forward to your feedback. -- Philip

> extend DistributedCache to work locally (LocalJobRunner)
> --------------------------------------------------------
>                 Key: HADOOP-2914
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2914
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: sam rash
>            Priority: Minor
>         Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch
> The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html

> Ideally, LocalJobRunner would take care of populating the JobConf and copying remote
files to the local file system (http, assume hdfs = default fs = local fs when doing local

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
