From: "Raghu Angadi (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Wed, 3 Nov 2010 12:46:27 -0400 (EDT)
Subject: [jira] Updated: (MAPREDUCE-2149) Distcp : setup with update is too slow when latency is high
Message-ID: <15962218.221431288802787279.JavaMail.jira@thor>
In-Reply-To: <23889712.60841288023321514.JavaMail.jira@thor>

[ https://issues.apache.org/jira/browse/MAPREDUCE-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated MAPREDUCE-2149:
------------------------------------

    Attachment: MAPREDUCE-2149.patch

A patch for the first option is attached. With it, setup should take no longer than running '-lsr' on the destination and source directories. This is the best we can do without parallelizing setup().

The fix is to store the entries of the destination directory in a map and pass it to sameFile().

> Distcp : setup with update is too slow when latency is high
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-2149
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2149
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>    Affects Versions: 0.20.2, 0.21.0
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>         Attachments: MAPREDUCE-2149.patch
>
>
> If you run distcp with the '-update' option, then for _each of the files_ present on the source cluster, setup invokes a separate RPC to the destination cluster to fetch file info.
> Usually this overhead is not very noticeable when both clusters are geographically close to each other. But if the latency is large, setup could take a couple of orders of magnitude longer.
> E.g.: the source has 10k directories, each with about 10 files, and the round-trip latency between source and destination is 75 ms (typical for coast-to-coast clusters).
> If we run distcp on the source cluster, setup would take about _2.5 hours_ irrespective of whether the destination has these files or not. '-lsr' on the same dest dir from the source cluster would take up to 12 min (depending on how many directories already exist on dest).
> * A fairly simple fix to how setup() iterates should bring the setup time down to the same as '-lsr'. I will have a patch for this (though 12 min is still too long).
> * A more scalable option is to defer the update check to the mappers.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
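For illustration only, the difference between the two setup strategies can be sketched in plain Java. This is not the attached patch and does not use the Hadoop API: `FakeRemoteFs`, `sampleTree`, `filesToCopyPerFile` and `filesToCopyPerDir` are hypothetical names, and `sameFile` here is a simplified stand-in that compares file length only. The sketch just counts round trips: one per source file in the original behaviour, versus one listing per directory whose result is kept in a map and consulted locally. With the figures from the report, roughly 100,000 per-file round trips at 75 ms each is on the order of two hours, while 10,000 directory listings come to about 12.5 minutes.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateSetupSketch {

    /** Hypothetical stand-in for a remote file system; every call counts as one round trip. */
    static class FakeRemoteFs {
        private final Map<String, Map<String, Long>> dirs; // dir -> (file name -> length)
        int roundTrips = 0;

        FakeRemoteFs(Map<String, Map<String, Long>> dirs) {
            this.dirs = dirs;
        }

        /** Per-file lookup: returns the length, or null if the file is absent. */
        Long getFileLength(String dir, String name) {
            roundTrips++;
            Map<String, Long> files = dirs.get(dir);
            return files == null ? null : files.get(name);
        }

        /** Per-directory listing: returns name -> length (empty if the directory is absent). */
        Map<String, Long> listDir(String dir) {
            roundTrips++;
            Map<String, Long> files = dirs.get(dir);
            return files == null ? new HashMap<>() : new HashMap<>(files);
        }
    }

    /** Simplified stand-in for sameFile(): compares length only (an assumption of this sketch). */
    static boolean sameFile(Long srcLen, Long dstLen) {
        return dstLen != null && dstLen.equals(srcLen);
    }

    /** Behaviour described in the report: one destination round trip per source file. */
    static int filesToCopyPerFile(Map<String, Map<String, Long>> src, FakeRemoteFs dst) {
        int toCopy = 0;
        for (Map.Entry<String, Map<String, Long>> dir : src.entrySet()) {
            for (Map.Entry<String, Long> file : dir.getValue().entrySet()) {
                Long dstLen = dst.getFileLength(dir.getKey(), file.getKey());
                if (!sameFile(file.getValue(), dstLen)) {
                    toCopy++;
                }
            }
        }
        return toCopy;
    }

    /** Idea behind the fix: list each destination directory once, then compare against the map. */
    static int filesToCopyPerDir(Map<String, Map<String, Long>> src, FakeRemoteFs dst) {
        int toCopy = 0;
        for (Map.Entry<String, Map<String, Long>> dir : src.entrySet()) {
            Map<String, Long> dstFiles = dst.listDir(dir.getKey()); // single round trip
            for (Map.Entry<String, Long> file : dir.getValue().entrySet()) {
                if (!sameFile(file.getValue(), dstFiles.get(file.getKey()))) {
                    toCopy++;
                }
            }
        }
        return toCopy;
    }

    /** Builds a synthetic tree of `dirs` directories with `filesPerDir` files each. */
    static Map<String, Map<String, Long>> sampleTree(int dirs, int filesPerDir) {
        Map<String, Map<String, Long>> tree = new LinkedHashMap<>();
        for (int d = 0; d < dirs; d++) {
            Map<String, Long> files = new LinkedHashMap<>();
            for (int f = 0; f < filesPerDir; f++) {
                files.put("part-" + f, 1024L * (f + 1));
            }
            tree.put("/data/dir" + d, files);
        }
        return tree;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Long>> src = sampleTree(100, 10);
        FakeRemoteFs perFile = new FakeRemoteFs(sampleTree(40, 10));
        FakeRemoteFs perDir = new FakeRemoteFs(sampleTree(40, 10));

        System.out.println("per-file: copy " + filesToCopyPerFile(src, perFile)
                + " files, round trips = " + perFile.roundTrips);
        System.out.println("per-dir:  copy " + filesToCopyPerDir(src, perDir)
                + " files, round trips = " + perDir.roundTrips);
    }
}
```

Both strategies select the same files to copy; only the number of round trips to the destination differs, which is what dominates setup time when latency is high.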