hadoop-hdfs-user mailing list archives

From Shivram Mani <sm...@pivotal.io>
Subject Re: how to copy data between two hdfs cluster fastly?
Date Sat, 18 Oct 2014 05:26:27 GMT
If you still want to use distcp:

1. Break the file into smaller files (only if you have the luxury of doing
this).

2. Use the "-m" option to set the number of mappers. Each map task aims to
copy roughly (total bytes across all files) / numSplits; distcp uses
UniformSizeInputFormat by default.

3. By default distcp reads through a throttled input stream, limited to
100 MB per second for each map. You can tune this to your network bandwidth
with the "-bandwidth" option.
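
To make points 2 and 3 concrete, here is a rough planning sketch in shell. It
only does arithmetic and prints a command line; it never contacts a cluster.
The function names, the example numbers, and the paths under from_cluster and
to_cluster are hypothetical placeholders, and the "effective maps" estimate is
just a simplification of the fact that files are distcp's smallest unit of
work.

```shell
#!/bin/sh
# Rough planning helpers for a distcp run (illustrative only).

# Approximate bytes each map aims to copy: total bytes / number of maps.
bytes_per_map() {
    total_bytes=$1
    num_maps=$2
    echo $(( total_bytes / num_maps ))
}

# Files are the smallest unit of work, so the number of maps doing useful
# work cannot exceed the number of files being copied.
effective_maps() {
    num_files=$1
    num_maps=$2
    if [ "$num_files" -lt "$num_maps" ]; then
        echo "$num_files"
    else
        echo "$num_maps"
    fi
}

# Print (do not run) a distcp command line with -m and -bandwidth set.
# Arguments: maps, per-map bandwidth in MB/s, source URI, destination URI.
distcp_cmd() {
    echo "hadoop distcp -m $1 -bandwidth $2 $3 $4"
}

# Hypothetical example paths; substitute your own source and destination.
SRC="hdfs://from_cluster/path/to/src"
DST="hdfs://to_cluster/path/to/dst"

# Example: ask for 40 maps, each limited to 50 MB/s.
distcp_cmd 40 50 "$SRC" "$DST"
```

With a single large file, effective_maps 1 40 reports 1, which is why raising
"-m" alone does not help in that case; splitting the file (point 1) is what
lets the extra maps do any work.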

On Fri, Oct 17, 2014 at 10:24 PM, Shivram Mani <smani@pivotal.io> wrote:

> Distcp is pretty restrictive w.r.t. parallelizing data copy. If all
> you are copying is one large file, distcp won't make it any faster.
>
> In distcp, files are the lowest level of granularity, so increasing the
> number of maps may not necessarily increase the overall throughput.
>
> The default number of mappers, if I'm not wrong, is 20 for distcp. If all
> you are doing is copying one large file, only one map task is effectively
> used.
>
> On Fri, Oct 17, 2014 at 8:18 PM, ch huang <justlooks@gmail.com> wrote:
>
>> yes
>>
>> On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky <stransky.ja@gmail.com>
>> wrote:
>>
>>> Distcp?
>>> On 17 Oct 2014 20:51, "Alexander Pivovarov" <apivovarov@gmail.com>
>>> wrote:
>>>
>>>> Try running this on a datanode of the destination cluster:
>>>> $ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....
>>>>
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <smani@pivotal.io>
>>>> wrote:
>>>>
>>>>> What is your approximate input size?
>>>>> Do you have multiple files, or is this one large file?
>>>>> What is your block size (source and destination cluster)?
>>>>>
>>>>> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <justlooks@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> no, all default
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <azuryyyu@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Did you specify how many map tasks?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <justlooks@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hi, maillist:
>>>>>>>>     I am now using distcp to migrate data from CDH4.4 to CDH5.1.
>>>>>>>> I find that copying small files works very well, but transferring
>>>>>>>> big data is very slow. Is there any good method you can recommend?
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Shivram
>>>>>
>>>>
>>>>
>>
>
>
> --
> Thanks
> Shivram
>



-- 
Thanks
Shivram
