From: Namikaze Minato
Date: Tue, 12 Apr 2016 19:20:12 +0900
Subject: Re: Best way to migrate PB scale data between live cluster?
To: cs user
Cc: raymond, common-user@hadoop.apache.org

The clean way to go is to start from the log and replay it... but I
honestly have no idea how to do that in practice. You might find this
(old) article interesting:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

I would never have tried to transmit this much data across the network;
I would always have looked for a way to copy the hard disks and
physically ship them to the new location...

Camusensei

On 12 April 2016 at 19:14, cs user wrote:
> Hi there,
>
> At some point in the near future we are also going to require exactly
> what you describe. We had hoped to use distcp.
>
> You mentioned:
>
> 1. it does not handle data deletes
>
> distcp has a -delete flag which says:
>
> "Delete the files existing in the dst but not in src"
>
> Does this not help with handling deleted data?
>
> I believe there is an issue if data is removed during a distcp run:
> at the start of the run it captures the list of files it needs to
> sync, and if some of those files are deleted during the run, it may
> lead to errors. Is there a way to ignore these errors and have distcp
> retry on the next run?
>
> I'd be interested in how you eventually accomplish the syncing between
> the two clusters, because we need to solve the very same problem :-)
>
> Perhaps others on the mailing list have experience with this?
>
> Thanks!
>
> On Tue, Apr 12, 2016 at 10:44 AM, raymond wrote:
>>
>> Hi
>>
>> We have a Hadoop cluster holding several PB of data, and we need to
>> migrate it to a new cluster in another datacenter for more capacity.
>> We estimate that the data copy itself might take close to a month to
>> finish, so we are looking for a sound solution. The requirements are:
>>
>> 1. We cannot bring down the old cluster for that long (of course);
>> a couple of hours of downtime is acceptable.
>> 2. We need to mirror the data: not only copy new data, but also
>> propagate the deletes that happen during the migration period.
>> 3. We don't have much space left on the old cluster, say 30% headroom.
>>
>> Regarding distcp: it might be the easiest way, but
>>
>> 1. it does not handle data deletes;
>> 2. it handles appended files by comparing file sizes and recopying
>> the whole file (which can waste a lot of bandwidth);
>> 3. its per-file error handling is crude;
>> 4. load control is difficult (we still have a heavy workload on the
>> old cluster); you can only split the work manually into pieces small
>> enough to achieve the flow-control goal.
>>
>> In short, it won't do well by itself for a long-running mirroring job.
>>
>> Some possible work that might need to be done. We could:
>>
>> Wrap distcp to make it work better (error handling, checking results,
>> extra code to sync deleted files, etc.), and utilize the snapshot
>> mechanism to better identify the files that need to be copied,
>> deleted, or renamed (something like the sketch below).
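>>
>> A rough sketch of the snapshot-based idea, assuming a snapshottable
>> source root /data and hypothetical namenode addresses old-nn/new-nn
>> (distcp's -diff option needs Hadoop 2.7+, and it has preconditions --
>> the target must match the source's old snapshot with no local changes
>> since -- that should be checked against the docs for your version):
>>
>>   hdfs dfsadmin -allowSnapshot /data
>>   hdfs dfs -createSnapshot /data s1
>>
>>   # initial throttled bulk copy; -bandwidth is MB/s per map, so
>>   # -m and -bandwidth together give at least coarse load control
>>   hadoop distcp -update -delete -m 100 -bandwidth 20 \
>>       hdfs://old-nn:8020/data hdfs://new-nn:8020/data
>>
>>   # later: snapshot again and ship only the changes recorded
>>   # between the two snapshots of the source tree
>>   hdfs dfs -createSnapshot /data s2
>>   hadoop distcp -update -diff s1 s2 \
>>       hdfs://old-nn:8020/data hdfs://new-nn:8020/data
>>
>> Note that -delete is only honoured together with -update or
>> -overwrite, while -diff applies the deletes and renames from the
>> snapshot diff report itself.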
>>
>> Or:
>>
>> Forget about distcp. Use the FSIMAGE and editlog as a change-history
>> source, and write our own code to replay the operations, handling
>> each file one by one (better per-file error handling could be
>> achieved that way), but this would probably need a lot of dev work.
>> (A rough sketch of reading the editlog offline is at the end of this
>> mail.)
>>
>> Btw, the closest thing I could find is Facebook migrating their 30 PB
>> Hive warehouse:
>>
>> https://www.facebook.com/notes/facebook-engineering/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920/
>>
>> They modified distcp to do an initial bulk load (to better handle
>> both very large and very small files, for load balancing I guess),
>> plus a replication system (not much detail on that part) to mirror
>> the changes.
>>
>> But it is not clear how they handled the shortcomings of distcp I
>> mentioned above, or whether they utilized the snapshot mechanism.
>>
>> So, does anyone have experience with this kind of work? What do you
>> think would be the best approach for our case? Is there any
>> ready-made work we could reuse? Has any work been done around the
>> snapshot mechanism to ease data migration?
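>>
>> For the editlog-replay idea, the offline viewers that ship with HDFS
>> could be a starting point: hdfs oev dumps an editlog segment to XML
>> and hdfs oiv dumps an fsimage, so a replay tool could parse those
>> instead of the binary formats. A minimal sketch (the segment file
>> names below are made up; real segments carry transaction-id ranges):
>>
>>   # dump one editlog segment to XML (OP_ADD, OP_DELETE, OP_RENAME...)
>>   hdfs oev -i edits_0000000000000000001-0000000000000001000 \
>>       -o edits.xml -p xml
>>
>>   # dump the fsimage (the starting state) as delimited text
>>   hdfs oiv -i fsimage_0000000000000001000 -o fsimage.txt -p Delimited
>>
>> A replay tool would then walk edits.xml in transaction order and
>> issue the matching create/delete/rename calls against the new
>> cluster's namenode.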