hadoop-mapreduce-issues mailing list archives

From "Philip Zeyliger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces
Date Thu, 17 Sep 2009 18:27:57 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756676#action_12756676 ]

Philip Zeyliger commented on MAPREDUCE-989:

The use cases definitely make sense. Unpacking archives on setup tasks is often
going to be pointless.

I've been thinking about what a reasonable API for this would be (especially after working
on MAPREDUCE-476), from the job submitter's point of view. One thought is:

bq. addCacheFile(URI path, Set<TaskType> tasks, Set<DistributedCacheOptions> options);

Where the default for tasks is an ImmutableSet(EnumSet<TaskType>) containing
MAP and REDUCE. DistributedCacheOptions would cover adding the file to the classpath,
unarchiving it, and creating a symlink; the defaults are to do none of these.
(Note that we'd be creating symlinks per-file, instead of globally, which is the only
place to set the option currently.)
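
To make the proposed signature concrete, here is a minimal Java sketch of how it might look from the job submitter's side. Everything beyond the signature itself is an assumption for illustration: the class name CacheApiSketch, the CacheEntry holder, the entriesFor helper, the example paths, and the option constants ADD_TO_CLASSPATH, UNARCHIVE, and CREATE_SYMLINK are hypothetical, and TaskType is a local stand-in rather than Hadoop's own enum.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class CacheApiSketch {
    // Local stand-in for Hadoop's TaskType so the sketch compiles on its own.
    enum TaskType { MAP, REDUCE, JOB_SETUP, JOB_CLEANUP }

    // Hypothetical option names; the proposal does not spell them out.
    enum DistributedCacheOptions { ADD_TO_CLASSPATH, UNARCHIVE, CREATE_SYMLINK }

    // Default proposed in the comment: files go to maps and reduces only.
    static final Set<TaskType> DEFAULT_TASKS =
        Collections.unmodifiableSet(EnumSet.of(TaskType.MAP, TaskType.REDUCE));

    // Hypothetical holder for one cached file and its per-file settings.
    static final class CacheEntry {
        final URI path;
        final Set<TaskType> tasks;
        final Set<DistributedCacheOptions> options;

        CacheEntry(URI path, Set<TaskType> tasks, Set<DistributedCacheOptions> options) {
            this.path = path;
            this.tasks = copy(tasks, TaskType.class);
            this.options = copy(options, DistributedCacheOptions.class);
        }
    }

    // Defensive, read-only copy that also works for empty input sets.
    static <E extends Enum<E>> Set<E> copy(Set<E> in, Class<E> type) {
        EnumSet<E> out = EnumSet.noneOf(type);
        out.addAll(in);
        return Collections.unmodifiableSet(out);
    }

    private final List<CacheEntry> entries = new ArrayList<>();

    // The single method from the proposal.
    void addCacheFile(URI path, Set<TaskType> tasks, Set<DistributedCacheOptions> options) {
        entries.add(new CacheEntry(path, tasks, options));
    }

    // Convenience overload applying the defaults: MAP and REDUCE, no options.
    void addCacheFile(URI path) {
        addCacheFile(path, DEFAULT_TASKS, EnumSet.noneOf(DistributedCacheOptions.class));
    }

    // Files that a task of the given type would need localized.
    List<CacheEntry> entriesFor(TaskType type) {
        List<CacheEntry> result = new ArrayList<>();
        for (CacheEntry e : entries) {
            if (e.tasks.contains(type)) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CacheApiSketch job = new CacheApiSketch();
        // An archive only reducers need, unpacked on arrival.
        job.addCacheFile(URI.create("hdfs:///libs/dict.zip"),
            EnumSet.of(TaskType.REDUCE),
            EnumSet.of(DistributedCacheOptions.UNARCHIVE));
        // A plain file with all defaults.
        job.addCacheFile(URI.create("hdfs:///conf/lookup.txt"));

        System.out.println("map files: " + job.entriesFor(TaskType.MAP).size());
        System.out.println("reduce files: " + job.entriesFor(TaskType.REDUCE).size());
        System.out.println("setup files: " + job.entriesFor(TaskType.JOB_SETUP).size());
    }
}
```

With the default task set, setup tasks never see any cached files, which is the behaviour the use cases ask for.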

What I like about this is that it replaces five methods (addCacheFile,
addCacheArchive, addFileToClassPath, addArchiveToClassPath, createSymlink)
with one method, and doesn't lose much in the way of readability.

You could also use booleans or enums (boolean add_to_classpath, boolean
unarchive, boolean create_symlink), but that is often difficult to read.
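
The readability point can be illustrated with a small Java sketch that maps the three boolean flags onto a set of options. The enum constants and the toOptions helper are hypothetical names chosen for illustration, not part of any existing Hadoop API.

```java
import java.util.EnumSet;
import java.util.Set;

class OptionStyleSketch {
    // Hypothetical option names, one per boolean flag mentioned above.
    enum DistributedCacheOptions { ADD_TO_CLASSPATH, UNARCHIVE, CREATE_SYMLINK }

    // Boolean style: at the call site the reader has to remember argument order.
    static Set<DistributedCacheOptions> toOptions(
            boolean addToClasspath, boolean unarchive, boolean createSymlink) {
        Set<DistributedCacheOptions> options =
            EnumSet.noneOf(DistributedCacheOptions.class);
        if (addToClasspath) {
            options.add(DistributedCacheOptions.ADD_TO_CLASSPATH);
        }
        if (unarchive) {
            options.add(DistributedCacheOptions.UNARCHIVE);
        }
        if (createSymlink) {
            options.add(DistributedCacheOptions.CREATE_SYMLINK);
        }
        return options;
    }

    public static void main(String[] args) {
        // Opaque: which flag is the 'false'?
        Set<DistributedCacheOptions> fromFlags = toOptions(true, false, true);

        // Self-describing: the options are named where they are used.
        Set<DistributedCacheOptions> fromSet = EnumSet.of(
            DistributedCacheOptions.ADD_TO_CLASSPATH,
            DistributedCacheOptions.CREATE_SYMLINK);

        System.out.println(fromFlags.equals(fromSet)); // prints "true"
    }
}
```

Both calls describe the same configuration; only the set form says so at the call site.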

On the back-end, you'd need to revisit how the files to be cached are stored.
The current scheme probably needs to remain for backwards compatibility, but it would
be great to just stick everything into one configuration property:
bq. mapred.filecache = [ { "path": ..., "tasks": [ MAP, REDUCE ], ... }, ... ]
or, if it's legal:
  mapred.filecache.0 = { "path": ..., ... }
  mapred.filecache.1 = ...
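
As a rough sketch of the indexed-property variant, the Java below writes one JSON-like value per cached file under mapred.filecache.N. It is only an illustration under stated assumptions: java.util.Properties stands in for Hadoop's job configuration, the "options" field and the example path are hypothetical, task names are emitted as quoted strings, and the JSON is assembled by hand rather than with a real JSON library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.StringJoiner;

class FileCacheConfigSketch {
    // Quote a string for JSON, escaping backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Render a list of strings as a JSON array.
    static String jsonArray(List<String> items) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (String item : items) {
            joiner.add(quote(item));
        }
        return joiner.toString();
    }

    // One cache entry; "path" and "tasks" follow the example in the comment,
    // "options" is a hypothetical extra field.
    static String toJson(String path, List<String> tasks, List<String> options) {
        return "{ \"path\": " + quote(path)
            + ", \"tasks\": " + jsonArray(tasks)
            + ", \"options\": " + jsonArray(options) + " }";
    }

    // Store each entry under its own indexed key: mapred.filecache.0, .1, ...
    static Properties store(List<String> jsonEntries) {
        Properties conf = new Properties();
        for (int i = 0; i < jsonEntries.size(); i++) {
            conf.setProperty("mapred.filecache." + i, jsonEntries.get(i));
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = store(Arrays.asList(
            toJson("hdfs:///libs/dict.zip",
                Arrays.asList("REDUCE"), Arrays.asList("UNARCHIVE")),
            toJson("hdfs:///conf/lookup.txt",
                Arrays.asList("MAP", "REDUCE"), Arrays.<String>asList())));

        System.out.println(conf.getProperty("mapred.filecache.0"));
        System.out.println(conf.getProperty("mapred.filecache.1"));
    }
}
```

Indexed keys keep each value small and easy to read back one entry at a time, at the cost of needing a convention for discovering how many entries exist.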


> Allow segregation of DistributedCache for maps and reduces
> ----------------------------------------------------------
>                 Key: MAPREDUCE-989
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-989
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>            Reporter: Arun C Murthy
> Applications might have differing needs for files in the DistributedCache wrt maps and
> reduces. We should allow them to specify them separately.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
