hadoop-user mailing list archives

From Shivram Mani <sm...@pivotal.io>
Subject Re: how to copy data between two hdfs cluster fastly?
Date Sat, 18 Oct 2014 05:24:22 GMT
DistCp is fairly restrictive when it comes to parallelizing a copy. If all
you are copying is one large file, DistCp won't make it any faster.

In DistCp, a file is the lowest level of granularity: each file is copied
by a single map task. So increasing the number of maps will not necessarily
increase the overall throughput.

The default number of mappers for DistCp is 20, if I'm not wrong. If all you
are doing is copying one large file, only one map task is effectively used.
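
To make the granularity point concrete, here is a rough back-of-the-envelope
model in Python. It is only a sketch, not DistCp's actual scheduling logic:
the function name, the per-map throughput figure, and the greedy assignment
of files to maps are all assumptions made for illustration. The one thing it
takes from the above is that a file is never split across map tasks.

```python
import heapq


def estimate_copy_minutes(file_sizes_gb, num_maps, per_map_gb_per_min=1.0):
    """Estimate the wall-clock time of a copy in which each file is handled
    by exactly one map task (hypothetical model, not DistCp's real code).

    per_map_gb_per_min is an assumed throughput of a single map task.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")

    # A map with no file assigned does no work, so the number of useful
    # maps can never exceed the number of files.
    useful_maps = min(num_maps, max(len(file_sizes_gb), 1))
    loads = [0.0] * useful_maps
    heapq.heapify(loads)

    # Greedy assignment: give the largest remaining file to the map that
    # currently has the least data to copy.
    for size in sorted(file_sizes_gb, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size)

    # The copy finishes when the most heavily loaded map finishes.
    return max(loads) / per_map_gb_per_min


if __name__ == "__main__":
    one_big_file = [100.0]            # a single 100 GB file
    many_small_files = [1.0] * 100    # one hundred 1 GB files

    print(estimate_copy_minutes(one_big_file, num_maps=1))
    print(estimate_copy_minutes(one_big_file, num_maps=20))
    print(estimate_copy_minutes(many_small_files, num_maps=20))
```

Under these assumptions the single 100 GB file takes the same 100 minutes
with 1 map or with 20, while the hundred 1 GB files drop to 5 minutes with
20 maps. Real numbers will depend on network, disks and cluster load.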

On Fri, Oct 17, 2014 at 8:18 PM, ch huang <justlooks@gmail.com> wrote:

> yes
>
> On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky <stransky.ja@gmail.com>
> wrote:
>
>> Distcp?
>> On 17 Oct 2014 20:51, "Alexander Pivovarov" <apivovarov@gmail.com> wrote:
>>
>>> Try running this on a datanode of the destination cluster:
>>> $ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....
>>>
>>>
>>>
>>> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <smani@pivotal.io> wrote:
>>>
>>>> What is your approximate input size?
>>>> Do you have multiple files, or is this one large file?
>>>> What is your block size (on the source and destination clusters)?
>>>>
>>>> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <justlooks@gmail.com> wrote:
>>>>
>>>>> No, all defaults.
>>>>>
>>>>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <azuryyyu@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Did you specify how many map tasks to use?
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <justlooks@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, mailing list:
>>>>>>> I am now using distcp to migrate data from CDH4.4 to CDH5.1.
>>>>>>> I find that it works very well when copying small files, but it
>>>>>>> is very slow when transferring big data. Is there a good method
>>>>>>> you can recommend? Thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Shivram
>>>>
>>>
>>>
>


-- 
Thanks
Shivram
