incubator-cassandra-user mailing list archives

From Paulo Motta <pauloricard...@gmail.com>
Subject Re: Recommended way of data migration
Date Sun, 08 Sep 2013 12:23:39 GMT
That's a good approach. You could also migrate in place if you're confident
your migration algorithm is correct, but writing to a separate CF is safer.

If you have a huge volume of data to migrate (millions of rows or more),
I'd suggest using Hadoop to perform the migration (
http://wiki.apache.org/cassandra/HadoopSupport).

If it's only a few rows, you could do it programmatically via
*get_range_slices* using the language binding of your choice. Below are some
links on how to do this with Hector or Pycassa:

* Hector:
http://stackoverflow.com/questions/8418448/cassandra-hector-how-to-retrieve-all-rows-of-a-column-family
* Pycassa:
http://pycassa.github.io/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_range
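
As a rough sketch of the get_range approach in Python: the function below
copies every row from one column family to another and rewrites a single
column on the way. It assumes the two arguments behave like pycassa
ColumnFamily objects, i.e. get_range() yields (key, columns) pairs and
insert(key, columns) writes a row. The names migrate_column, transform and
page_size are mine, not part of any library, and transform stands in for
whatever mutation (e.g. encryption) you need.

```python
def migrate_column(source_cf, target_cf, transform, column="data", page_size=100):
    """Copy all rows from source_cf to target_cf, rewriting one column.

    source_cf / target_cf are assumed to look like pycassa ColumnFamily
    objects. transform is a hypothetical callable (old_value -> new_value).
    Nothing is deleted from source_cf.
    """
    copied = 0
    # get_range() fetches rows in chunks of buffer_size behind the scenes,
    # so the caller does not have to track the last key seen.
    for key, columns in source_cf.get_range(buffer_size=page_size):
        new_columns = dict(columns)
        if column in new_columns:
            new_columns[column] = transform(new_columns[column])
        # Re-inserting the same key just overwrites the row, so the whole
        # function can be re-run after a failure.
        target_cf.insert(key, new_columns)
        copied += 1
    return copied
```

Since the source CF is left untouched, running this more than once should
be harmless, as long as transform is applied to the original values only.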

I agree with Edward that you should only delete the rows once you have made
sure they were correctly migrated.
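
One possible way to do that check in Python, again only a sketch: walk the
old CF, look each key up in the new CF, and delete from the old CF only if
every row checks out. It assumes both objects behave like pycassa
ColumnFamily objects (get_range(), multiget() returning only the keys it
found, and remove()). The names verify_then_delete and matches are
hypothetical; matches(old_value, new_value) is whatever test fits your
mutation, for example decrypting the new value and comparing.

```python
def verify_then_delete(source_cf, target_cf, matches, column="data"):
    """Delete rows from source_cf only if all of them were migrated.

    Returns the list of keys that failed verification; when that list is
    non-empty, nothing is deleted. matches is a hypothetical callable
    (old_value, new_value) -> bool.
    """
    verified, failed = [], []
    for key, columns in source_cf.get_range():
        # multiget() leaves missing keys out of its result, so a plain
        # membership test tells us whether the row was migrated at all.
        found = target_cf.multiget([key])
        if key in found and matches(columns.get(column), found[key].get(column)):
            verified.append(key)
        else:
            failed.append(key)
    if failed:
        return failed  # leave the old CF untouched and investigate
    for key in verified:
        source_cf.remove(key)
    return []
```

Collecting the keys in memory is fine for a small CF; for a large one you
would want to verify and delete in bounded batches instead.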


2013/9/7 Edward Capriolo <edlinuxguru@gmail.com>

> I would do something like you are suggesting. I would not do the delete
> until all the rows are moved. Since writes in Cassandra are idempotent, you
> can even run the migration process multiple times without harm.
>
>
> On Sat, Sep 7, 2013 at 5:31 PM, Renat Gilfanov <grennat@mail.ru> wrote:
>
>> Hello,
>>
>> Let's say we have a simple CQL3 table
>>
>> CREATE TABLE example (
>>     id UUID PRIMARY KEY,
>>     timestamp TIMESTAMP,
>>     data ASCII
>> );
>>
>> And I need to mutate (for example, encrypt) the values in the "data"
>> column for all rows.
>>
>> What's the recommended approach to perform such a migration
>> programmatically?
>>
>> For me the general approach is:
>>
>> 1. Create another column family
>> 2. extract a batch of records
>> 3. for each extracted record, perform mutation, insert it in the new cf
>> and delete from old one
>> 4. repeat until the source cf is empty
>>
>> Is this a correct approach, and if so, how can I implement some kind of
>> paging for step 2?
>>
>
>


-- 
Paulo Ricardo

-- 
European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com
