Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Mon, 8 Aug 2011 12:06:27 +0000 (UTC)
From: "Amareshwari Sriramadasu (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: 
 <148608901.16406.1312805187623.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1489319984.1211.1312286727371.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-2765) DistCp Rewrite
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080917#comment-13080917 ]

Amareshwari Sriramadasu commented on MAPREDUCE-2765:
----------------------------------------------------

First of all, the code needs to go into a contrib project, so you need to regenerate the patch with the code placed in contrib.
Also, the build environment needs changes. Will this be blocked on the mavenization of MapReduce?

Overall, the design looks fine. Here are some comments on the code:
* CopyMapper:
  **
{noformat}
    if (targetFS.exists(targetFinalPath) && targetFS.isFile(targetFinalPath)) {
      overWrite = true; // When target is an existing file, overwrite it.
    }
{noformat}
Why is the target file overwritten irrespective of the overwrite configuration?
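
To make the concern concrete, here is a minimal sketch of a decision that honours the configured flag instead of forcing an overwrite whenever the target is an existing file. The class name OverwriteDecision, the method shouldOverwrite() and its boolean parameters are hypothetical and are not taken from the patch. What should happen to an existing target when overwrite is off (skip it or fail) is deliberately left open.

```java
/** Hypothetical helper, not part of the patch: decides whether a copy may replace the target. */
final class OverwriteDecision {

    /**
     * @param targetExists       whether the final target path already exists
     * @param targetIsFile       whether the existing target is a regular file
     * @param overwriteRequested the user's overwrite setting (assumed to come from configuration)
     * @return true only if an existing file may be replaced
     */
    static boolean shouldOverwrite(boolean targetExists, boolean targetIsFile,
                                   boolean overwriteRequested) {
        // An existing file is replaced only when the user explicitly asked for it.
        return targetExists && targetIsFile && overwriteRequested;
    }

    public static void main(String[] args) {
        // Existing file, overwrite not requested, then overwrite requested.
        System.out.println(shouldOverwrite(true, true, false));
        System.out.println(shouldOverwrite(true, true, true));
    }
}
```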

* Dynamic\*
  ** DynamicInputChunk is not public?
  ** DynamicInputFormat creates FileSplits with zero length. Shouldn't each split be created with the size of its chunk as the split size instead?
  ** DynamicRecordReader has commented-out code, which should be removed.

* CopyCommitter:
  ** Atomic commit should not delete the final directory. It should fail with an error, before the job even starts, if the directory already exists.
  ** deleteMissing() counts files that exist at neither the source nor the target path as deleted entries.
  ** Status for the root folder does not seem to be preserved at all. Can you check?
  ** If I’m not wrong, preserveFileAttributes() preserves attributes only for directories. Can we rename the method accordingly?
  ** The methods deleteMissing(), preserveFileAttributes(), etc. need more documentation.
  ** Attempt temp files are already deleted in each attempt. Why are we deleting them again in the Committer? The Committer should just delete the work path.
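
As an illustration of the atomic-commit point, the following is a rough sketch of a commit that refuses to touch an existing final directory rather than deleting it. It uses java.nio.file on the local filesystem purely as a stand-in for the Hadoop FileSystem calls the committer would actually use. The class and method names (AtomicCommitSketch, checkTargetAbsent, commit) are hypothetical. In the real job, the existence check would ideally run at submission time, before any data is copied.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch; the local filesystem stands in for the target FileSystem. */
final class AtomicCommitSketch {

    /** Fails fast if the final directory already exists. */
    static void checkTargetAbsent(Path finalDir) throws FileAlreadyExistsException {
        if (Files.exists(finalDir)) {
            throw new FileAlreadyExistsException(finalDir.toString());
        }
    }

    /** Moves the work directory into place without deleting anything at the destination. */
    static void commit(Path workDir, Path finalDir) throws IOException {
        checkTargetAbsent(finalDir);
        // A rename within one filesystem; the existing destination is never removed here.
        Files.move(workDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("atomic-commit-sketch");
        Path work = Files.createDirectory(base.resolve("work"));
        Files.createFile(work.resolve("part-0"));
        Path target = base.resolve("final");
        commit(work, target);
        System.out.println("committed: " + Files.exists(target.resolve("part-0")));
    }
}
```

The check and the move are two separate steps here, so a concurrent writer could still create the directory in between; the sketch only shows the intended behaviour, not a race-free implementation.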

General comment:
All public classes and public methods need javadoc.

I haven't looked at the test cases yet.

> DistCp Rewrite
> --------------
>
>                 Key: MAPREDUCE-2765
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: distcp
>    Affects Versions: 0.20.203.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: distcpv2.20.203.patch
>
>
> This is a slightly modified version of the DistCp rewrite that Yahoo uses in production today. The rewrite was ground-up, with specific focus on:
> 1. improved startup time (postponing as much work as possible to the MR job)
> 2. support for multiple copy-strategies
> 3. new features (e.g. -atomic, -async, -bandwidth.)
> 4. improved programmatic use
> Some effort has gone into refactoring what used to be achieved by a single large (1.7 KLOC) source file, into a design that (hopefully) reads better too.
> The proposed DistCpV2 preserves command-line-compatibility with the old version, and should be a drop-in replacement.
> New to v2:
> 1. Copy-strategies and the DynamicInputFormat:
>     A copy-strategy determines the policy by which source-file-paths are distributed between map-tasks. (These boil down to the choice of the input-format.)
>     If no strategy is explicitly specified on the command-line, the policy chosen is "uniform size", where v2 behaves identically to old-DistCp. (The number of bytes transferred by each map-task is roughly equal, at a per-file granularity.)
>     Alternatively, v2 ships with a "dynamic" copy-strategy (in the DynamicInputFormat). This policy acknowledges that
>         (a) dividing files based only on file-size might not be an even distribution (e.g. if some datanodes are slower than others, or if some files are skipped).
>         (b) a "static" association of a source-path to a map increases the likelihood of long-tails during copy.
>     The "dynamic" strategy divides the list-of-source-paths into a number (> nMaps) of smaller parts. When each map completes its current list of paths, it picks up a new list to process, if available. So if a map-task is stuck on a slow (and not necessarily large) file, other maps can pick up the slack. The thinner the file-list is sliced, the greater the parallelism (and the lower the chances of long-tails). Within reason, of course: the number of these short-lived list-files is capped at an overridable maximum.
>     Internal benchmarks against source/target clusters with some slow(ish) datanodes have indicated significant performance gains when using the dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
>     Please note that the DynamicInputFormat might prove useful outside of DistCp. It is hence available as a mapred/lib, not tied to DistCpV2. Also note that the copy-strategies have no bearing on the CopyMapper.map() implementation.
>
> 2. Improved startup-time and programmatic use:
>     When the old-DistCp runs with -update, and creates the list-of-source-paths, it attempts to filter out files that might be skipped (by comparing file-sizes, checksums, etc.). This significantly increases the startup time (or the time spent in serial processing till the MR job is launched), blocking the calling-thread. This becomes pronounced as nFiles increases. (Internal benchmarks have seen situations where more time is spent setting up the job than on the actual transfer.)
>     DistCpV2 postpones as much work as possible to the MR job. The file-listing isn't filtered until the map-task runs (at which time, identical files are skipped). DistCpV2 can now be run "asynchronously". The program quits at job-launch, logging the job-id for tracking. Programmatically, DistCp.execute() returns a Job instance for progress-tracking.
>
> 3. New features:
>     (a)   -async: As described in #2.
>     (b)   -atomic: Data is copied to a (user-specifiable) tmp-location, and then moved atomically to destination.
>     (c)   -bandwidth: Enforces a limit on the bandwidth consumed per map.
>     (d)   -strategy: As above.
>
> A more comprehensive description of the newer features, how the dynamic-strategy works, etc. is available in src/site/xdoc/, and in the pdf that's generated therefrom, during the build.
> High on the list of things to do is support to parallelize copies on a per-block level. (i.e. Incorporation of HDFS-222.)
> I look forward to comments, suggestions and discussion that will hopefully ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 0.23.0 (complete with unit-tests).
> P.S.
> A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for ideas, code, reviews and guidance. Although much of the code is mine, the idea to use the DFS to implement "dynamic" input-splits wasn't.
>
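
For readers following the discussion, the "dynamic" copy-strategy described in the quoted issue text can be pictured as a shared pool of small path lists that workers keep claiming until none remain. The sketch below is only an in-memory analogy under stated assumptions: a ConcurrentLinkedQueue stands in for the chunk files on the DFS, threads stand in for map-tasks, and the names DynamicChunkSketch, chunk(), drain() and the maxChunks parameter are hypothetical rather than taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical in-memory analogy of the dynamic strategy; not the DynamicInputFormat itself. */
final class DynamicChunkSketch {

    /** Splits the path list into at most maxChunks contiguous chunks of near-equal length. */
    static List<List<String>> chunk(List<String> paths, int maxChunks) {
        List<List<String>> chunks = new ArrayList<>();
        if (paths.isEmpty() || maxChunks <= 0) {
            return chunks;
        }
        int n = Math.min(maxChunks, paths.size()); // the cap on the number of chunks
        int base = paths.size() / n;
        int extra = paths.size() % n;              // the first 'extra' chunks get one more path
        int start = 0;
        for (int i = 0; i < n; i++) {
            int len = base + (i < extra ? 1 : 0);
            chunks.add(new ArrayList<>(paths.subList(start, start + len)));
            start += len;
        }
        return chunks;
    }

    /** Runs nWorkers threads that claim chunks until none remain; returns paths handled per worker. */
    static int[] drain(List<List<String>> chunks, int nWorkers) throws InterruptedException {
        Queue<List<String>> pending = new ConcurrentLinkedQueue<>(chunks);
        int[] handled = new int[nWorkers];
        Thread[] workers = new Thread[nWorkers];
        for (int w = 0; w < nWorkers; w++) {
            final int id = w;
            workers[w] = new Thread(() -> {
                List<String> claimed;
                // A worker that finishes early simply claims the next chunk, if any is left.
                while ((claimed = pending.poll()) != null) {
                    handled[id] += claimed.size(); // a real mapper would copy each path here
                }
            });
            workers[w].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            paths.add("/src/file-" + i);
        }
        int[] handled = drain(chunk(paths, 4), 2);
        System.out.println(Arrays.toString(handled)); // split between workers varies per run
    }
}
```

Each chunk is claimed by exactly one worker, so the per-worker counts always add up to the number of paths, while the split between workers depends on how quickly each one finishes.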
