hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Aquilina <jaquil...@eagleeyet.net>
Subject Re: Best way to migrate PB scale data between live cluster?
Date Sun, 17 Apr 2016 10:18:11 GMT
Probably a stupid suggestion but did you guys consider rsync? Supposed
to be quick and can do deletes?

On 2016-04-12 11:44, raymond wrote:

> Hi 
> We have a hadoop cluster with several PB data. and we need to migrate it to a new cluster
across datacenter for larger volume capability. 
> We estimate that the data copy itself might took near a month to finish. So we are seeking
for a sound solution. The requirement is as below: 
> 1. we cannot bring down the old cluster for such a long time ( of course), and a couple
of hours is acceptable. 
> 2. we need to mirror the data, it means that we not only need to copy the new data, but
also need to delete the deleted data happened during the migration period. 
> 3. we don't have much space left on the old cluster, say 30% room. 
> regarding distcp, although it might be the easiest way , but  
> 1. it do not handle data delete 
> 2. it handle newly appended file by compare file size and overwrite it ( well , it might
waste a lot of bandwidth ) 
> 3. error handling base on file is triffle.  
> 4 load control is difficult ( we still have heavy work load on old cluster) you can just
try to split your work manually and make it small enough to achieve the flow control goal.

> In one word, for a long time mirror work. It won't do well by itself. 
> The are some possible works might need to be done : 
> We can: 
> * Do  some wrap work around distcp to make it works better. ( say error handling, check
results. Extra code for sync deleted files etc. )
> * Utilize Snapshot mechanisms for better identify files need to be copied and deleted.
Or renamed.
> Or 
> * Forget about distcp. Use FSIMAGE and editlog as a change history source, and write
our own code to replay the operation. Handle each file one by one. ( better per file error
handling could be achieved), but this might need a lot of dev works. 
> Btw. The closest thing I could found is facebook migration 30PB hive warehouse: 
> https://www.facebook.com/notes/facebook-engineering/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920/

> They modifiy the distcp to do a initial bulk load (to better handling large files and
very small files, for load balance I guess.) , and a replication system (not much detail on
this part) to mirror the changes. 
> But it is not clear that how they handle those shortcomings of distcp I mentioned above.
And do they utilize snapshot mechanism. 
> So , does anyone have experience on this kind of work? What do you think might be the
best approaching for our case? Is there any ready works been done that we can utilize? Is
there any works have been done around snapshot mechanism to easy data migration?
View raw message