cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Migrating all rows from 0.6.13 to 0.7.5 over thrift?
Date Mon, 09 May 2011 03:51:42 GMT
Out of interest I've done some more digging. Not sure how much more I've contributed, but here goes...

Ran this against a clean v0.6.12 and it works (I expected it to fail on the first read):

    import pycassa

    # connects to localhost:9160 by default
    client = pycassa.connect()
    standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')

    # a unicode key, encoded to its UTF-8 byte string before use
    uni_str = u"数時間"
    uni_str = uni_str.encode("utf-8")

    print "Insert row", uni_str
    print uni_str, standard1.insert(uni_str, {"bar" : "baz"})

    print "Read rows"
    print "???", standard1.get("???")
    print uni_str, standard1.get(uni_str)

Ran that against the current 0.6 head from the command line and it works. Run against the same code running inside IntelliJ, it fails as expected. It also fails as expected on 0.7.5.

At one stage I grabbed the buffer created by fastbinary.encode_binary in the Python-generated batch_mutate_args.write(), and the key looked correctly UTF-8 encoded (the bytes matched the earlier UTF-8 encoding of that string).
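
If anyone wants to repeat that check, a sketch along these lines should do it (the generated module paths are an assumption based on the stock gen-py output for 0.6, and the Column/Mutation plumbing just mirrors what batch_mutate sends):

    # Sketch only: serialise a batch_mutate_args struct into a memory buffer
    # so the raw bytes of the key can be inspected. Module paths below assume
    # the stock Thrift gen-py output for Cassandra 0.6 is on the path.
    import binascii

    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol

    from cassandra import Cassandra
    from cassandra.ttypes import Column, ColumnOrSuperColumn, Mutation

    uni_key = u"数時間".encode("utf-8")

    # Build the same shape of mutation_map that batch_mutate() would send.
    column = Column(name="bar", value="baz", timestamp=1)
    mutation = Mutation(column_or_supercolumn=ColumnOrSuperColumn(column=column))
    args = Cassandra.batch_mutate_args(
        keyspace="Keyspace1",
        mutation_map={uni_key: {"Standard1": [mutation]}},
        consistency_level=1)  # ONE

    # Write into a memory buffer instead of a socket; the accelerated protocol
    # is what routes through fastbinary.encode_binary.
    trans = TTransport.TMemoryBuffer()
    args.write(TBinaryProtocol.TBinaryProtocolAccelerated(trans))
    print binascii.hexlify(trans.getvalue())  # look for e6 95 b0 ... in the key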

I've updated the git project https://github.com/amorton/cassandra-unicode-bug 

Am going to leave it there unless there is interest to keep looking into it. 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 8 May 2011, at 13:31, Jonathan Ellis wrote:

> Right, that's sort of a half-repair: it will repair differences in
> replies it got, but it won't doublecheck md5s on the rest in the
> background. So if you're doing CL.ONE reads this is a no-op.
> 
> On Sat, May 7, 2011 at 4:25 PM, aaron morton <aaron@thelastpickle.com> wrote:
>> I remembered something like that so had a look at RangeSliceResponseResolver.resolve() in 0.6.12 and it looks like it schedules the repairs...
>> 
>>            protected Row getReduced()
>>            {
>>                ColumnFamily resolved = ReadResponseResolver.resolveSuperset(versions);
>>                ReadResponseResolver.maybeScheduleRepairs(resolved, table, key, versions, versionSources);
>>                versions.clear();
>>                versionSources.clear();
>>                return new Row(key, resolved);
>>            }
>> 
>> 
>> Is that right?
>> 
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 8 May 2011, at 00:48, Jonathan Ellis wrote:
>> 
>>> range_slices respects consistencylevel, but only single-row reads and
>>> multiget do the *repair* part of RR.
>>> 
>>>> On Sat, May 7, 2011 at 1:44 AM, aaron morton <aaron@thelastpickle.com> wrote:
>>>> get_range_slices() does read repair if enabled (checked DoConsistencyChecksBoolean in the config, it's on by default) so you should be getting good reads. If you want belt-and-braces, run nodetool repair first.
>>>> 
>>>> Hope that helps.
>>>> 
>>>> 
>>>> On 7 May 2011, at 11:46, Jeremy Hanna wrote:
>>>> 
>>>>> Great!  I just wanted to make sure you were getting the information you needed.
>>>>> 
>>>>> On May 6, 2011, at 6:42 PM, Henrik Schröder wrote:
>>>>> 
>>>>>> Well, I already completed the migration program. Using get_range_slices I could migrate a few thousand rows per second, which means that migrating all of our data would take a few minutes, and we'll end up with pristine datafiles for the new cluster. Problem solved!
>>>>>> 
>>>>>> I'll see if I can create datafiles in 0.6 that are uncleanable in 0.7 so that you all can repeat this and hopefully fix it.
>>>>>> 
>>>>>> 
>>>>>> /Henrik Schröder
>>>>>> 
>>>>>> On Sat, May 7, 2011 at 00:35, Jeremy Hanna <jeremy.hanna1234@gmail.com> wrote:
>>>>>> If you're able, go into the #cassandra channel on freenode (IRC) and talk to driftx or jbellis or aaron_morton about your problem.  It could be that you don't have to do all of this based on a conversation there.
>>>>>> 
>>>>>> On May 6, 2011, at 5:04 AM, Henrik Schröder wrote:
>>>>>> 
>>>>>>> I'll see if I can make some example broken files this weekend.
>>>>>>> 
>>>>>>> 
>>>>>>> /Henrik Schröder
>>>>>>> 
>>>>>>> On Fri, May 6, 2011 at 02:10, aaron morton <aaron@thelastpickle.com> wrote:
>>>>>>> The difficulty is the different thrift clients between 0.6 and 0.7.
>>>>>>> 
>>>>>>> If you want to roll your own solution I would consider:
>>>>>>> - write an app to talk to 0.6 and pull out the data using keys from the other system (so you can check referential integrity while you are at it). Dump the data to flat file.
>>>>>>> - write an app to talk to 0.7 to load the data back in. (A rough sketch of both steps follows.)
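
Something like this is the shape of it (a sketch only: keys_from_other_system() is a stand-in for however the secondary system exposes the row keys, and the pycassa connection calls may need adjusting for whichever client version talks to each cluster; wide rows would also need column paging):

    # Sketch: dump rows from the 0.6 cluster to a flat file, then reload them
    # into the 0.7 cluster. Keys, column names and values are base64-encoded
    # so the flat file is byte-exact regardless of any unicode trouble.
    import base64, json
    import pycassa

    def open_cf():
        # Point this at whichever cluster the current step is talking to;
        # pycassa.connect() defaults to localhost:9160.
        client = pycassa.connect()
        return pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')

    def dump_rows(out_path):
        cf = open_cf()  # run against the 0.6 cluster
        with open(out_path, 'w') as out:
            for key in keys_from_other_system():  # hypothetical helper
                columns = cf.get(key)  # small rows; wide rows need column paging
                out.write(json.dumps({
                    'key': base64.b64encode(key),
                    'columns': [(base64.b64encode(n), base64.b64encode(v))
                                for n, v in columns.items()],
                }) + '\n')

    def load_rows(in_path):
        cf = open_cf()  # run against the 0.7 cluster, schema already created
        with open(in_path) as src:
            for line in src:
                record = json.loads(line)
                cf.insert(base64.b64decode(record['key']),
                          dict((base64.b64decode(n), base64.b64decode(v))
                               for n, v in record['columns']))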
>>>>>>> 
>>>>>>> I've not given up digging on your migration problem; having to manually dump and reload when you've done nothing wrong is not the best solution. I'll try to find some time this weekend to test with:
>>>>>>> 
>>>>>>> - 0.6 server, random partitioner, standard CFs, byte columns
>>>>>>> - load with python or the cli on OS X or Ubuntu (don't have a Windows machine any more)
>>>>>>> - migrate and see what's going on.
>>>>>>> 
>>>>>>> If you can spare some sample data to load, please send it over on the user group or to my email address.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> -----------------
>>>>>>> Aaron Morton
>>>>>>> Freelance Cassandra Developer
>>>>>>> @aaronmorton
>>>>>>> http://www.thelastpickle.com
>>>>>>> 
>>>>>>> On 6 May 2011, at 05:52, Henrik Schröder wrote:
>>>>>>> 
>>>>>>>> We can't do a straight upgrade from 0.6.13 to 0.7.5 because we have rows stored that have unicode keys, and Cassandra 0.7.5 thinks those rows in the sstables are corrupt, and it seems impossible to clean it up without losing data.
>>>>>>>> 
>>>>>>>> However, we can still read all rows perfectly via thrift, so we are now looking at building a simple tool that will copy all rows from our 0.6.13 cluster to a parallel 0.7.5 cluster. Our question now is how to do that and ensure that we actually get all rows migrated? It's a pretty small cluster: 3 machines, a single keyspace, a single column family, ~2 million rows, a few GB of data, and a replication factor of 3.
>>>>>>>> 
>>>>>>>> So what's the best way? Call get_range_slices and move through the entire token space? We also have all row keys in a secondary system; would it be better to use that and make calls to get_multi or get_multi_slices instead? Are we correct in assuming that if we use consistency level ALL we'll get all rows?
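
For what it's worth, the token-space walk with get_range_slices looks roughly like this against the raw 0.6 Thrift API (a sketch: the gen-py module paths and the handle_row() helper are placeholders, and whether CL.ALL is accepted for range reads is exactly the question above):

    # Sketch: page through every row by repeatedly asking for the next batch
    # of keys, starting each batch from the last key seen. start_key is
    # inclusive, so the first row of each subsequent batch is skipped.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                                  KeyRange, ConsistencyLevel)

    socket = TSocket.TSocket('localhost', 9160)
    transport = TTransport.TBufferedTransport(socket)  # 0.6 default is unframed
    transport.open()
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))

    parent = ColumnParent(column_family='Standard1')
    predicate = SlicePredicate(slice_range=SliceRange(start='', finish='',
                                                      reversed=False, count=1000))

    batch_size = 1000
    start_key = ''
    while True:
        key_range = KeyRange(start_key=start_key, end_key='', count=batch_size)
        slices = client.get_range_slices('Keyspace1', parent, predicate,
                                         key_range, ConsistencyLevel.ALL)
        for ks in slices:
            if ks.key == start_key:
                continue  # already handled at the end of the previous batch
            # ks.columns may be empty for range ghosts (deleted rows)
            handle_row(ks.key, ks.columns)  # hypothetical: write to the 0.7 cluster
        if len(slices) < batch_size:
            break  # walked off the end of the ring
        start_key = slices[-1].key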
>>>>>>>> 
>>>>>>>> 
>>>>>>>> /Henrik Schröder
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>> 
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

