From: KayVajj
To: "common-user@hadoop.apache.org"
Date: Wed, 10 Apr 2013 21:12:12 -0700
Subject: Re: Copy Vs DistCP

If the cp command is not parallel, how does it work for a file partitioned across various data nodes?
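
One way to picture this, as far as I understand HDFS: a client does not need parallelism to read a file whose blocks live on different data nodes, because it reads the blocks one after another in file order and sees a single byte stream. The Java sketch below is only a local model of that idea. `Block`, `node`, and `readSequentially` are hypothetical names made up for illustration; they are not Hadoop classes or APIs.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

class BlockwiseRead {
    /** Hypothetical stand-in for one block of a file and the node that stores it. */
    static final class Block {
        final String node;   // illustrative node name, e.g. "dn1"
        final byte[] data;   // the bytes of this block

        Block(String node, byte[] data) {
            this.node = node;
            this.data = data;
        }
    }

    /** Reads the blocks in file order, one at a time, into one contiguous byte array. */
    static byte[] readSequentially(List<Block> blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Block b : blocks) {
            // The reader finishes this block before moving on to the next one,
            // regardless of which node holds it.
            out.write(b.data, 0, b.data.length);
        }
        return out.toByteArray();
    }
}
```

The caller only ever sees the concatenated bytes, so where each block is stored does not change how a sequential copy is written.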

On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
The cp command is not parallel. It just calls FileSystem, even though DFSClient has multiple threads.
DistCp can work well on the same cluster.
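
To make "not parallel" concrete, here is a minimal Java sketch of a single-process copy that handles one source file after another and streams the bytes through the calling process. It uses the local file system via `java.nio.file` purely as a stand-in; the class and method names (`SequentialCopy`, `copyAll`) are assumptions for illustration and this is not the actual code behind `hdfs dfs -cp`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class SequentialCopy {
    /** Copies each source file into destDir, one file at a time; returns total bytes copied. */
    static long copyAll(List<Path> sources, Path destDir) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        for (Path src : sources) {
            Path dst = destDir.resolve(src.getFileName());
            // Every byte passes through this one process: read a chunk, write a chunk.
            try (InputStream in = Files.newInputStream(src);
                 OutputStream out = Files.newOutputStream(dst)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                    total += n;
                }
            }
        }
        return total;
    }
}
```

With this shape, total copy time grows with the sum of all file sizes, since nothing overlaps.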


On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <vajjalak009@gmail.com> wrote:
The File System Copy utility copies files byte by byte, if I'm not wrong. Could the cp command instead work with blocks and move them, which could be significantly more efficient?

Also, how does the cp command work if the file is distributed across different data nodes?

Thanks
Kay


On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <jayunit100@gmail.com> wrote:
DistCP is a full-blown MapReduce job (mapper only, where the mappers do a "fully" parallel copy to the destination).

CP appears (correct me if I'm wrong) to simply invoke the FileSystem and issue a copy command for every source file.

I have an additional question: how is CP, which is internal to a cluster, optimized (if at all)?
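
For contrast with a one-file-at-a-time copy, the "mapper only" idea can be sketched as a set of independent copy tasks with no combining step afterwards. The Java example below is a local analogy using a thread pool and `java.nio.file`, not DistCp itself; `ParallelCopy`, `copyAll`, and the `workers` parameter are hypothetical names chosen for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCopy {
    /** Copies each source file into destDir using `workers` concurrent tasks; returns files copied. */
    static int copyAll(List<Path> sources, Path destDir, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // One independent task per file, loosely playing the role of a map task.
            List<Callable<Path>> tasks = new ArrayList<>();
            for (Path src : sources) {
                tasks.add(() -> Files.copy(src, destDir.resolve(src.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING));
            }
            int copied = 0;
            for (Future<Path> done : pool.invokeAll(tasks)) {
                done.get(); // rethrows the failure if a task could not copy its file
                copied++;
            }
            return copied;
        } finally {
            pool.shutdown();
        }
    }
}
```

There is no reduce-like step here: each task writes its own output, which is the sense in which the copy is "mapper only".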



On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <shurong.mai@qunar.com> wrote:
Hi,

I think it's better to use Copy within the same cluster and distCP between clusters. The cp command is a Hadoop internal parallel process and will not copy files locally.

麦树荣
 
From: KayVajj
Date: 2013-04-11 06:20
Subject: Copy Vs DistCP
I have a few questions regarding the usage of DistCP for copying files in the same cluster.

1) Which one is better within the same cluster, and what factors (like file size etc.) would influence the use of one over the other?

2) When we run a cp command like the one below from a client node of the cluster (not a data node), how does the cp command work:
     i) like an MR job, or
    ii) by copying files locally and then copying them back to the new location?

Example of the copy command

hdfs dfs -cp /<some_location>/file /<new_location>/

Thanks, your responses are appreciated.

-- Kay



--
Jay Vyas
http://jayunit100.blogspot.com


