From: "Amareshwari Sriramadasu (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Date: Mon, 8 Aug 2011 12:06:27 +0000 (UTC)
Subject: [jira] [Commented] (MAPREDUCE-2765) DistCp Rewrite
Message-ID: <148608901.16406.1312805187623.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1489319984.1211.1312286727371.JavaMail.tomcat@hel.zones.apache.org>

    [
https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080917#comment-13080917 ]

Amareshwari Sriramadasu commented on MAPREDUCE-2765:
----------------------------------------------------

First of all, the code needs to go into a contrib project, so you need to regenerate the patch with the code placed in contrib.
The build environment also needs changes. Will this be blocked on the mavenization of MapReduce?

Overall, the design looks fine. Here are some comments on the code:
* CopyMapper:
**
{noformat}
    if (targetFS.exists(targetFinalPath) && targetFS.isFile(targetFinalPath)) {
      overWrite = true; // When target is an existing file, overwrite it.
    }
{noformat}
The target file is overwritten irrespective of the overwrite configuration. Why?
* Dynamic\*
** DynamicInputChunk is not public?
** DynamicInputFormat creates FileSplits with zero length. Should each split instead be created with the size of its chunk as the split length?
** DynamicRecordReader has commented-out code, which should be removed.
* CopyCommitter:
** Atomic commit should not delete the final directory. It should throw an error if the directory exists, even before starting the job.
** deleteMissing() counts files that do not exist at both the source and target paths as deleted entries.
** Preserving status for the root folder does not happen at all? Can you check?
** If I’m not wrong, preserveFileAttributes() preserves attributes only for directories. Can we rename the method accordingly?
** The methods deleteMissing(), preserveFileAttributes(), etc. need more documentation.
** Attempt temp files are already deleted in each attempt. Why do we delete them again in the Committer? The Committer should just delete the work path.

General comment: all public classes and public methods need javadoc.

I haven't looked at the test cases.
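To make two of the comments above concrete (honouring the overwrite setting in CopyMapper, and failing fast on an existing target for atomic commit), here is a minimal sketch of the checks I have in mind. It is not code from the patch: the class CopyPreflight and both method names are hypothetical, and it uses java.nio.file in place of the Hadoop FileSystem API only so that it stands alone.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch; names are illustrative and not taken from the patch. */
public class CopyPreflight {

    /**
     * Decides whether an existing target file may be replaced.
     * The decision depends on the user's overwrite setting, not only on
     * the fact that the target happens to be an existing file.
     */
    public static boolean shouldOverwrite(Path target, boolean overwriteConfigured) {
        return Files.isRegularFile(target) && overwriteConfigured;
    }

    /**
     * Rejects an atomic copy up front when the final directory already
     * exists, instead of deleting that directory at commit time.
     */
    public static void checkAtomicTarget(Path finalDir, boolean atomic) {
        if (atomic && Files.exists(finalDir)) {
            throw new IllegalStateException(
                "Atomic commit target already exists: " + finalDir);
        }
    }
}
```

A check like checkAtomicTarget() would run before the job is submitted, so the user sees the error without waiting for the copy to finish.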
> DistCp Rewrite
> --------------
>
>                 Key: MAPREDUCE-2765
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: distcp
>    Affects Versions: 0.20.203.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: distcpv2.20.203.patch
>
>
> This is a slightly modified version of the DistCp rewrite that Yahoo uses in production today. The rewrite was ground-up, with specific focus on:
> 1. improved startup time (postponing as much work as possible to the MR job)
> 2. support for multiple copy-strategies
> 3. new features (e.g. -atomic, -async, -bandwidth.)
> 4. improved programmatic use
> Some effort has gone into refactoring what used to be achieved by a single large (1.7 KLOC) source file, into a design that (hopefully) reads better too.
> The proposed DistCpV2 preserves command-line-compatibility with the old version, and should be a drop-in replacement.
> New to v2:
> 1. Copy-strategies and the DynamicInputFormat:
>     A copy-strategy determines the policy by which source-file-paths are distributed between map-tasks. (These boil down to the choice of the input-format.)
>     If no strategy is explicitly specified on the command-line, the policy chosen is "uniform size", where v2 behaves identically to old-DistCp. (The number of bytes transferred by each map-task is roughly equal, at a per-file granularity.)
>     Alternatively, v2 ships with a "dynamic" copy-strategy (in the DynamicInputFormat). This policy acknowledges that
>         (a) dividing files based only on file-size might not be an even distribution (E.g. if some datanodes are slower than others, or if some files are skipped.)
>         (b) a "static" association of a source-path to a map increases the likelihood of long-tails during copy.
>     The "dynamic" strategy divides the list-of-source-paths into a number (> nMaps) of smaller parts.
>     When each map completes its current list of paths, it picks up a new list to process, if available. So if a map-task is stuck on a slow (and not necessarily large) file, other maps can pick up the slack. The thinner the file-list is sliced, the greater the parallelism (and the lower the chances of long-tails). Within reason, of course: the number of these short-lived list-files is capped at an overridable maximum.
>     Internal benchmarks against source/target clusters with some slow(ish) datanodes have indicated significant performance gains when using the dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
>     Please note that the DynamicInputFormat might prove useful outside of DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also note that the copy-strategies have no bearing on the CopyMapper.map() implementation.
>
> 2. Improved startup-time and programmatic use:
>     When the old-DistCp runs with -update, and creates the list-of-source-paths, it attempts to filter out files that might be skipped (by comparing file-sizes, checksums, etc.) This significantly increases the startup time (or the time spent in serial processing till the MR job is launched), blocking the calling-thread. This becomes pronounced as nFiles increases. (Internal benchmarks have seen situations where more time is spent setting up the job than on the actual transfer.)
>     DistCpV2 postpones as much work as possible to the MR job. The file-listing isn't filtered until the map-task runs (at which time, identical files are skipped). DistCpV2 can now be run "asynchronously". The program quits at job-launch, logging the job-id for tracking. Programmatically, the DistCp.execute() returns a Job instance for progress-tracking.
>
> 3. New features:
>     (a) -async: As described in #2.
>     (b) -atomic: Data is copied to a (user-specifiable) tmp-location, and then moved atomically to destination.
>     (c) -bandwidth: Enforces a limit on the bandwidth consumed per map.
>     (d) -strategy: As above.
>
> A more comprehensive description of the newer features, how the dynamic-strategy works, etc. is available in src/site/xdoc/, and in the pdf that's generated therefrom, during the build.
> High on the list of things to do is support to parallelize copies on a per-block level. (i.e. Incorporation of HDFS-222.)
> I look forward to comments, suggestions and discussion that will hopefully ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 0.23.0 (complete with unit-tests).
> P.S.
> A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for ideas, code, reviews and guidance. Although much of the code is mine, the idea to use the DFS to implement "dynamic" input-splits wasn't.
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
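
For readers following the "dynamic" copy-strategy in the quoted description, the idea of slicing the source list into more parts than there are maps, and letting each map claim a new part when it finishes its current one, can be sketched roughly as follows. This is an illustration only and not the DynamicInputFormat code: the class DynamicChunks and its methods are hypothetical, and an in-memory queue stands in for the short-lived list-files that the description says are kept on the DFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical illustration of the "dynamic" strategy; not code from the patch. */
public class DynamicChunks {

    /**
     * Splits the path list into contiguous chunks. The chunk count is the
     * desired count (normally larger than the number of maps), capped at
     * maxChunks and at the number of paths.
     */
    public static Queue<List<String>> split(List<String> paths,
                                            int desiredChunks,
                                            int maxChunks) {
        int chunks = Math.max(1,
            Math.min(Math.min(desiredChunks, maxChunks), paths.size()));
        int perChunk = (paths.size() + chunks - 1) / chunks; // ceiling division
        Queue<List<String>> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < paths.size(); i += perChunk) {
            int end = Math.min(i + perChunk, paths.size());
            queue.add(new ArrayList<>(paths.subList(i, end)));
        }
        return queue;
    }

    /**
     * One worker: keeps claiming the next unclaimed chunk until none is
     * left, and returns how many paths it handled.
     */
    public static int drain(Queue<List<String>> queue) {
        int handled = 0;
        List<String> chunk;
        while ((chunk = queue.poll()) != null) {
            handled += chunk.size(); // a real map-task would copy each file here
        }
        return handled;
    }
}
```

Because every worker pulls from the same queue, a worker stuck on one slow file simply claims fewer chunks, and the remaining chunks go to whichever workers are free.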