Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of smani@pivotal.io designates
 209.85.218.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CACeqxwRAxdHJ8bUFqcyDuXzpoV_K2a70dtxfyHDG3Fbv65pyDw@mail.gmail.com>
References: 
 <CACeqxwQ7sMmo92RG=SFuuOhCHn2x1Rf=WvE1_9Hs38q-EBX+Kw@mail.gmail.com>
 <CALr1C9oYpmFZUnc1LK-aEtJxPGMBvqQZA7kxFbMk9+0egK=9qw@mail.gmail.com>
 <CACeqxwTcFqN6PJtz8kB9NHJts9L3a=HH-JDwjS+TL1cAWFRO3Q@mail.gmail.com>
 <CAPqjCXeDgnh2W1uhcjEvrHseANN4CL-p5g4MzwF7+ppbJcswrQ@mail.gmail.com>
 <CAKKt98TD3X2Dbphzarijx0wHTA+m=0b+BEZuQsD5R3S0oNbfYw@mail.gmail.com>
 <CAJOOh6HYVaUPFM_ddXOOzqMTt-+rd1Uf6EY1iXtyUWVHq3_k2g@mail.gmail.com>
 <CACeqxwRAxdHJ8bUFqcyDuXzpoV_K2a70dtxfyHDG3Fbv65pyDw@mail.gmail.com>
From: Shivram Mani <smani@pivotal.io>
Date: Fri, 17 Oct 2014 22:24:22 -0700
Message-ID: 
 <CAPqjCXcYXfoaZtH8g3Gi4B0TfnZgKX8CYtDCo=F85vvWFWUo8A@mail.gmail.com>
Subject: Re: how to copy data between two hdfs cluster fastly?
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=089e01538a12e10d570505abb2dd

--089e01538a12e10d570505abb2dd
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Distcp is fairly restrictive when it comes to parallelizing a copy. If all
you are copying is one large file, distcp won't make it any faster.

In distcp, a file is the smallest unit of work: each file is copied by a
single map task. So increasing the number of maps will not necessarily
increase the overall throughput.

If I'm not mistaken, the default number of mappers for distcp is 20. But
if all you are copying is one large file, only one map task is effectively
used.
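
To make the reasoning above concrete, here is a rough shell sketch. It
assumes a simplified model in which a file is never split across map
tasks, so the number of maps doing useful work is capped by both the file
count and the map limit. The effective_maps function is a hypothetical
helper for illustration only; it is not part of Hadoop, and this is not
distcp's actual scheduling logic.

```shell
# Toy model (an assumption, not distcp's real scheduler): a file is never
# split across map tasks, so at most min(file count, map limit) maps can
# be busy copying at any one time.
# effective_maps is a hypothetical helper, not a Hadoop command.
effective_maps() {
  # $1: number of files to copy, $2: maximum number of map tasks
  if [ "$1" -lt "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

effective_maps 1 20    # one large file: prints 1
effective_maps 500 20  # many files: prints 20
```

The map limit is what the -m option of distcp sets, so raising it should
only help when there are enough files to keep the extra maps busy.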

On Fri, Oct 17, 2014 at 8:18 PM, ch huang <justlooks@gmail.com> wrote:

> yes
>
> On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky <stransky.ja@gmail.com>
> wrote:
>
>> Distcp?
>> On 17 Oct 2014 20:51, "Alexander Pivovarov" <apivovarov@gmail.com>
>> wrote:
>>
>>> Try running this on a datanode in the destination cluster:
>>> $ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....
>>>
>>>
>>>
>>> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <smani@pivotal.io>
>>> wrote:
>>>
>>>> What is your approximate input size?
>>>> Do you have multiple files, or is this one large file?
>>>> What is your block size (on the source and destination clusters)?
>>>>
>>>> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <justlooks@gmail.com> wrote:
>>>>
>>>>> No, all defaults.
>>>>>
>>>>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <azuryyyu@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Did you specify how many map tasks to use?
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <justlooks@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi list,
>>>>>>> I am using distcp to migrate data from CDH4.4 to CDH5.1. It works
>>>>>>> well when copying small files, but it is very slow when
>>>>>>> transferring large amounts of data. Is there a better method you
>>>>>>> would recommend? Thanks.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Shivram
>>>>
>>>
>>>
>


--=20
Thanks
Shivram

--089e01538a12e10d570505abb2dd--