hadoop-user mailing list archives

From Namikaze Minato <lloydsen...@gmail.com>
Subject Re: Best way to migrate PB scale data between live cluster?
Date Tue, 12 Apr 2016 10:20:12 GMT
The clean way to go is to start from the log and replay it... but I
actually have no idea how to do that.
You might find this (old) work interesting:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

I would never have tried to transmit this much data across the network;
I would always have tried to find a way to copy the hard disks and
physically ship them to the new location...

Camusensei
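The disk-shipping tradeoff above can be sized with quick arithmetic. The link speed below is an assumed figure for illustration, not something stated in the thread:

```python
# How long does 1 PB take over a dedicated cross-datacenter link?
# The 10 Gbit/s figure is an assumption; real throughput is usually
# lower due to protocol overhead and competing cluster load.

PB_BYTES = 10**15          # 1 PB in bytes (decimal)
LINK_GBPS = 10             # assumed dedicated link speed

bytes_per_sec = LINK_GBPS * 10**9 / 8
days = PB_BYTES / bytes_per_sec / 86400
print(f"1 PB over {LINK_GBPS} Gbit/s takes about {days:.1f} days")
```

At several PB, even an ideal dedicated 10 Gbit/s link needs weeks, which is roughly consistent with the one-month estimate below and explains why shipping disks is worth considering.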

On 12 April 2016 at 19:14, cs user <acldstkusr@gmail.com> wrote:
> Hi there,
>
> At some point in the near future we are also going to require exactly what
> you describe. We had hoped to use distcp.
>
> You mentioned:
>
> 1. it does not handle data deletion
>
> distcp has a -delete flag which says -
>
> "Delete the files existing in the dst but not in src"
>
> Does this not help with handling deleted data?
>
> I believe there is an issue if data is removed during a distcp run: at the
> start of the run it captures the full list of files it needs to sync, and
> if some of those files are deleted while the run is in progress, it may
> lead to errors. Is there a way to ignore these errors and have distcp
> retry on the next run?
>
> I'd be interested in how you eventually manage to accomplish the syncing
> between the two clusters, because we need to solve the very same
> problem :-)
>
> Perhaps others on the mailing list have experience with this?
>
>
> Thanks!
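A minimal sketch of the distcp invocation being discussed. `-update`, `-delete`, and `-i` are real distcp options; the NameNode addresses and paths are placeholders. `-i` (ignore failures) lets the job continue past per-file errors so a later `-update` run can retry them, which is one answer to the retry question above:

```shell
# Hypothetical invocation; cluster addresses and paths are placeholders.
# -update : copy only files that differ from the target
# -delete : remove files present on the target but missing from the source
# -i      : ignore per-file failures so the job keeps going; the next
#           -update run retries whatever was missed
cmd="hadoop distcp -update -delete -i \
  hdfs://old-nn:8020/data hdfs://new-nn:8020/data"
echo "$cmd"
```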
>
>
> On Tue, Apr 12, 2016 at 10:44 AM, raymond <rgbbones@163.com> wrote:
>>
>> Hi
>>
>>
>>
>> We have a Hadoop cluster holding several PB of data, and we need to
>> migrate it to a new cluster in another datacenter for larger capacity.
>> We estimate that the data copy itself might take close to a month to
>> finish, so we are looking for a sound solution. The requirements are:
>> 1. We cannot bring down the old cluster for that long (of course);
>> a couple of hours of downtime is acceptable.
>> 2. We need to mirror the data: not only copy new data, but also delete
>> the data that gets deleted during the migration period.
>> 3. We don't have much space left on the old cluster, roughly 30% free.
>>
>>
>>
>> Regarding distcp, although it might be the easiest way:
>>
>>
>>
>> 1. It does not handle data deletion.
>> 2. It handles newly appended files by comparing file sizes and
>> overwriting the whole file (which can waste a lot of bandwidth).
>> 3. Per-file error handling is fragile.
>> 4. Load control is difficult (we still have a heavy workload on the old
>> cluster); you can only try to split the work manually and make each
>> batch small enough to achieve the flow-control goal.
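On the load-control point, distcp does expose two throttling knobs, `-m` (maximum simultaneous map tasks) and `-bandwidth` (MB/s per map), so the aggregate rate can be capped without splitting the work by hand. A sketch, with placeholder addresses and assumed numbers:

```shell
# Cap the copy at roughly 20 maps x 50 MB/s = 1000 MB/s aggregate.
# -m and -bandwidth are real distcp options; paths are placeholders.
maps=20
mb_per_map=50
cmd="hadoop distcp -update -m $maps -bandwidth $mb_per_map \
  hdfs://old-nn:8020/data hdfs://new-nn:8020/data"
echo "aggregate cap: $((maps * mb_per_map)) MB/s"
echo "$cmd"
```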
>>
>>
>>
>> In short, it won't do well by itself for a long-running mirroring job.
>>
>>
>>
>> There is some additional work that might need to be done:
>>
>>
>>
>> We can:
>>
>>
>>
>> Do some wrapper work around distcp to make it work better (error
>> handling, checking results, extra code to sync deleted files, etc.).
>> Utilize the snapshot mechanism to better identify files that need to be
>> copied, deleted, or renamed.
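The snapshot idea exists as a built-in feature in newer distcp releases: `-diff` copies only the delta between two HDFS snapshots, including deletes and renames (HDFS-7535). A hedged sketch of the workflow; the path and the snapshot names s1/s2 are placeholders:

```shell
# Snapshot-based incremental sync (the option names are real HDFS and
# distcp options; paths and snapshot names are illustrative).
# Run the sync once the initial bulk copy has brought the target up to
# the state captured in snapshot s1:
sync_cmd="hadoop distcp -update -diff s1 s2 /data hdfs://new-nn:8020/data"

# Beforehand, snapshots must be enabled and taken on the source:
prep_cmds="hdfs dfsadmin -allowSnapshot /data
hdfs dfs -createSnapshot /data s1
hdfs dfs -createSnapshot /data s2"

printf '%s\n%s\n' "$prep_cmds" "$sync_cmd"
```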
>>
>>
>>
>> Or
>>
>>
>>
>> Forget about distcp. Use the FSImage and editlog as a change-history
>> source, and write our own code to replay the operations, handling each
>> file one by one (better per-file error handling could be achieved), but
>> this might require a lot of development work.
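The replay option can be prototyped without touching real FSImage/editlog internals. A minimal sketch, assuming a simplified (op, path) operation log rather than the actual editlog format; the point is the per-file error handling, where one failure is recorded and retried instead of failing the whole job:

```python
# Replay a simplified change log against a mirror, one file at a time.
# The (op, path) tuples stand in for parsed editlog records; a real
# implementation would decode the HDFS editlog or use snapshot diffs.

def replay(ops, mirror, copy_fn):
    """Apply a list of ('put'|'delete', path) ops; return failed ops."""
    failed = []
    for op, path in ops:
        try:
            if op == "put":
                mirror[path] = copy_fn(path)   # e.g. copy one file
            elif op == "delete":
                mirror.pop(path, None)
        except Exception:
            failed.append((op, path))          # retry these next round
    return failed

# Toy run: copies succeed except for one flaky path.
source = {"/a": 1, "/b": 2}

def copy_fn(path):
    if path == "/flaky":
        raise IOError("transient read error")
    return source[path]

mirror = {}
ops = [("put", "/a"), ("put", "/b"), ("put", "/flaky"), ("delete", "/b")]
print(replay(ops, mirror, copy_fn))   # failed ops to retry
print(sorted(mirror))
```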
>>
>>
>>
>>
>>
>> By the way, the closest thing I could find is Facebook's migration of a
>> 30 PB Hive warehouse:
>>
>>
>>
>>
>> https://www.facebook.com/notes/facebook-engineering/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920/
>>
>>
>>
>> They modified distcp to do an initial bulk load (to better handle very
>> large files and very small files, for load balancing I guess), plus a
>> replication system (not much detail on this part) to mirror the changes.
>>
>>
>>
>> But it is not clear how they handled the shortcomings of distcp I
>> mentioned above, or whether they utilized the snapshot mechanism.
>>
>>
>>
>> So, does anyone have experience with this kind of work? What do you
>> think might be the best approach for our case? Is there any existing
>> work we can reuse? Has any work been done around the snapshot mechanism
>> to ease data migration?
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org

