hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mithun Radhakrishnan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2765) DistCp Rewrite
Date Wed, 24 Aug 2011 14:04:30 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090234#comment-13090234

Mithun Radhakrishnan commented on MAPREDUCE-2765:

Thanks for the thorough review, Amareshwari. I've incorporated a majority of your suggestions,
save a couple.

1. Constants for mapred.local.dir, mapred.job.name, etc.: I've switched to using the constants
from JobContext, as you've suggested.
2. ssl.client.* conf-labels: I've moved those to DistCpConstants.
3. CONF_LABEL_TOTAL_*: The reason why the values of these constants don't start with "distcp"
is that CONF_LABEL_TOTAL_NUMBER_OF_RECORDS is used from within DynamicInputFormat (which will
eventually be independent of distcp). I've kept the other one for uniformity.
4. Javadocs: You're right. I had missed the DistCp driver class. Thank you for pointing that
out. I've also resolved all the warnings from the generation of Javadocs for the project.
5. Job-id hack: I have removed the hack. I'm using Job.getJobID().toString() instead.
6. CopyCommitter documentation: I've corrected the documentation for preserveFileAttributesForDirectories()
pertaining to the target-root-path (which was previously misleading). I've also elaborated
on the workings of deleteMissing(), etc.
7. FileSplits with zero-length: For the moment, I've set the FileSplit to be set up with a
fixed non-zero size. I'd intended to initialize it with the number of paths in the FileSplit,
but that would be intrusive. If required, I'll change that in a future patch.
8. Deleting attempt tmp-files: This is a *very* good suggestion. Incorporating it would be
slightly invasive at this time. I'd like very much to include this at a later date. The cleanup
would be much quicker.

Findbugs seems to indicate 2 "bugs" that are false positives:
1. One points at a possible race condition between the cleanup-handler shutdown-hook (from
DistCp.java). There's no problem here, given that the cleanup-state is protected via synch.
I've verified this with Amareshwari.
2. The other is a silly one, indicating that some exceptions are caught and ignored. (Dead
store to local variable).

I'd like to suppress these 2 warnings, if that's alright.

I'll post a newer patch (for hadoop-mapreduce) very shortly.

> DistCp Rewrite
> --------------
>                 Key: MAPREDUCE-2765
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: distcp
>    Affects Versions:
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: distcpv2.20.203.patch, distcpv2_trunk.patch
> This is a slightly modified version of the DistCp rewrite that Yahoo uses in production
today. The rewrite was ground-up, with specific focus on:
> 1. improved startup time (postponing as much work as possible to the MR job)
> 2. support for multiple copy-strategies
> 3. new features (e.g. -atomic, -async, -bandwidth.)
> 4. improved programmatic use
> Some effort has gone into refactoring what used to be achieved by a single large (1.7
KLOC) source file, into a design that (hopefully) reads better too.
> The proposed DistCpV2 preserves command-line-compatibility with the old version, and
should be a drop-in replacement.
> New to v2:
> 1. Copy-strategies and the DynamicInputFormat:
> 	A copy-strategy determines the policy by which source-file-paths are distributed between
map-tasks. (These boil down to the choice of the input-format.) 
> 	If no strategy is explicitly specified on the command-line, the policy chosen is "uniform
size", where v2 behaves identically to old-DistCp. (The number of bytes transferred by each
map-task is roughly equal, at a per-file granularity.) 
> 	Alternatively, v2 ships with a "dynamic" copy-strategy (in the DynamicInputFormat).
This policy acknowledges that 
> 		(a)  dividing files based only on file-size might not be an even distribution (E.g.
if some datanodes are slower than others, or if some files are skipped.)
> 		(b) a "static" association of a source-path to a map increases the likelihood of long-tails
during copy.
> 	The "dynamic" strategy divides the list-of-source-paths into a number (> nMaps) of
smaller parts. When each map completes its current list of paths, it picks up a new list to
process, if available. So if a map-task is stuck on a slow (and not necessarily large) file,
other maps can pick up the slack. The thinner the file-list is sliced, the greater the parallelism
(and the lower the chances of long-tails). Within reason, of course: the number of these short-lived
list-files is capped at an overridable maximum.
> 	Internal benchmarks against source/target clusters with some slow(ish) datanodes have
indicated significant performance gains when using the dynamic-strategy. Gains are most pronounced
when nFiles greatly exceeds nMaps.
> 	Please note that the DynamicInputFormat might prove useful outside of DistCp. It is
hence available as a mapred/lib, unfettered to DistCpV2. Also note that the copy-strategies
have no bearing on the CopyMapper.map() implementation.
> 2. Improved startup-time and programmatic use:
> 	When the old-DistCp runs with -update, and creates the list-of-source-paths, it attempts
to filter out files that might be skipped (by comparing file-sizes, checksums, etc.) This
significantly increases the startup time (or the time spent in serial processing till the
MR job is launched), blocking the calling-thread. This becomes pronounced as nFiles increases.
(Internal benchmarks have seen situations where more time is spent setting up the job than
on the actual transfer.)
> 	DistCpV2 postpones as much work as possible to the MR job. The file-listing isn't filtered
until the map-task runs (at which time, identical files are skipped). DistCpV2 can now be
run "asynchronously". The program quits at job-launch, logging the job-id for tracking. Programmatically,
the DistCp.execute() returns a Job instance for progress-tracking.
> 3. New features:
> 	(a)   -async: As described in #2.
> 	(b)   -atomic: Data is copied to a (user-specifiable) tmp-location, and then moved atomically
to destination.
> 	(c)   -bandwidth: Enforces a limit on the bandwidth consumed per map.
> 	(d)   -strategy: As above.    
> A more comprehensive description the newer features, how the dynamic-strategy works,
etc. is available in src/site/xdoc/, and in the pdf that's generated therefrom, during the
> High on the list of things to do is support to parallelize copies on a per-block level.
(i.e. Incorporation of HDFS-222.)
> I look forward to comments, suggestions and discussion that will hopefully ensue. I have
this running against Hadoop I also have a port to 0.23.0 (complete with unit-tests).
> P.S.
> A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for ideas,
code, reviews and guidance. Although much of the code is mine, the idea to use the DFS to
implement "dynamic" input-splits wasn't.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message