incubator-cassandra-user mailing list archives

From aaron morton <>
Subject Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Date Thu, 05 May 2011 11:57:20 GMT
The hard-core way to fix the data is to export it to JSON with sstable2json, hand-edit it,
and then load it back with json2sstable.

Also, to confirm: this only happens when data is written in 0.6 and then read back
in 0.7?

And what partitioner are you using? Can you still see the keys?

Can you use sstable2json against the 0.6 data?

Looking at your last email, something looks fishy about the encoding...
My two keys that I send in my test program are 0xe695b0e69982e99693 and 0x666f6f, which decodes
to "数時間" and "foo" respectively.

There are 9 bytes encoded there; I would expect a multiple of 2 bytes for each character (using
UTF-16 surrogate pairs).

I looked the characters up, and their encoding is different here:
数 0x6570
時 0x6642 
間 0x9593

Am I missing something ?
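
For reference, a small Python sketch (not part of the original exchange) that compares the two encodings of the test keys. It suggests the 9 bytes are the UTF-8 form, at three bytes per character, while 0x6570, 0x6642 and 0x9593 are the Unicode code points. Those code points coincide with the UTF-16 code units because all three characters sit in the Basic Multilingual Plane, so no surrogate pairs are involved. The helper name `describe` is hypothetical.

```python
# Compare the UTF-8 and UTF-16 encodings of the two test keys from the thread.
keys = ["数時間", "foo"]


def describe(key: str) -> dict:
    """Return code points and encoded forms for a key (illustrative helper)."""
    utf8 = key.encode("utf-8")
    return {
        "code_points": [hex(ord(ch)) for ch in key],
        "utf8_hex": utf8.hex(),
        "utf8_len": len(utf8),
        # Big-endian UTF-16 without a byte-order mark, for easy comparison.
        "utf16_hex": key.encode("utf-16-be").hex(),
    }


for key in keys:
    print(key, describe(key))
```

Under these assumptions, the byte values quoted from the test program match a plain UTF-8 encoding of the keys.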

Hope that helps. 
Aaron Morton
Freelance Cassandra Developer

On 5 May 2011, at 23:09, Henrik Schröder wrote:

> Yes, the keys were written to 0.6, but when I looked through the thrift client code for
0.6, it explicitly converts all string keys to UTF8 before sending them over to the server
so the encoding *should* be right, and after the upgrade to 0.7.5, sstablekeys prints out
the correct byte values for those keys, but Cassandra itself is unable to get those rows.
> I ran some more tests yesterday with a clean database where I only wrote two rows, one
with an ascii key and one with a unicode key, upgraded to 0.7.5, ran nodetool cleanup, and
that actually fixed it. After cleanup, the server could fetch both rows correctly.
> However, when I tried to do the same thing with a snapshot of our live database where
we have ~2 million keys, out of which ~1000 are unicode, cleanup failed with a lot of "Keys
must be written in descending order" exceptions. I've tried various combinations of cleanup
and scrub, running cleanup before upgrading, etc, but I've yet to find something that fixes
all the problems without losing those rows.
> /Henrik
> On Thu, May 5, 2011 at 12:48, aaron morton <> wrote:
> I take it back, the problem started in 0.6 where keys were strings. Looking into how
0.6 did its thing
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> On 5 May 2011, at 22:36, aaron morton wrote:
>> Interesting but as we are dealing with keys it should not matter as they are treated
as byte buffers. 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> On 5 May 2011, at 04:53, Daniel Doubleday wrote:
>>> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds like
>>> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>>>> Hey everyone,
>>>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
just to make sure that the change in how keys are encoded wouldn't cause us any dataloss.
Unfortunately it seems that rows stored under a unicode key couldn't be retrieved after the
upgrade. We're running everything on Windows, and we're using the generated thrift client
in C# to access it.
>>>> I managed to make a minimal test to reproduce the error consistently:
>>>> First, I started up Cassandra 0.6.13 with an empty data directory, and a
really simple config with a single keyspace with a single bytestype columnfamily.
>>>> I wrote two rows, each with a single column with a simple column name and
a 1-byte value of "1". The first row had a key using only ascii chars ('foo'), and the second
row had a key using unicode chars ('ドメインウ').
>>>> Using multi_get, and both those keys, I got both columns back, as expected.
>>>> Using multi_get_slice and both those keys, I got both columns back, as expected.
>>>> I also did a get_range_slices to get all rows in the columnfamily, and I
got both columns back, as expected.
>>>> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
Cassandra 0.7.5, pointing to the same data directory, with a config containing the same keyspace,
and I run the schematool import command.
>>>> I then start up my test program that uses the new thrift api, and run some
>>>> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
only get back one column, the one under the key 'foo'. The other row I simply can't retrieve.
>>>> However, when I use get_range_slices to get all rows, I get back two rows,
with the correct column values, and the byte-array keys are identical to my encoded keys,
and when I decode the byte-arrays as UTF8 strings, I get back my two original keys. This means
that both my rows are still there, the keys as output by Cassandra are identical to the original
string keys I used when I created the rows in 0.6.13, but it's just impossible to retrieve
the second row.
>>>> To continue the test, I inserted a row with the key 'ドメインウ' encoded
as UTF-8 again, and gave it a similar column as the original, but with a 1-byte value of "2".
>>>> Now, when I use multi_get_slice with my two encoded keys, I get back two
rows, the 'foo' row has the old value as expected, and the other row has the new value as
>>>> However, when I use get_range_slices to get all rows, I get back *three*
rows, two of which have the *exact same* byte-array key, one has the old column, one has the
new column. 
>>>> How is this possible? How can there be two different rows with the exact
same key? I'm guessing that it's related to the encoding of string keys in 0.6, and that the
internal representation is off somehow. I checked the generated thrift client for 0.6, and
it UTF8-encodes all keys before sending them to the server, so it should be UTF8 all the way,
but apparently it isn't.
>>>> Has anyone else experienced the same problem? Is it a platform-specific problem?
Is there a way to avoid this and upgrade from 0.6 to 0.7 and not lose any rows? I would also
really like to know which byte-array I should send in to get back that second row, there's
gotta be some key that can be used to get it, the row is still there after all.
>>>> /Henrik Schröder
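
One way two rows with byte-identical keys could end up at different positions, in the spirit of the "Windows and encoding" guess earlier in the thread, is sketched below in Python. This is a hypothesis only and is not confirmed anywhere in this thread. It assumes an MD5-based token (as RandomPartitioner uses) and assumes that the 0.6 code path hashed the key's bytes in a platform default charset rather than UTF-8. The charset `cp1252` and the helper names `md5_token` and `token_with_charset` are illustrative choices, not Cassandra APIs.

```python
import hashlib


def md5_token(key_bytes: bytes) -> str:
    """Hex MD5 digest of the key bytes, standing in for a partitioner token."""
    return hashlib.md5(key_bytes).hexdigest()


def token_with_charset(key: str, charset: str) -> str:
    """Token computed after encoding the string key in the given charset.

    Unencodable characters are replaced with '?', which is an assumption
    about how a lossy default-charset conversion might behave.
    """
    return md5_token(key.encode(charset, errors="replace"))


for key in ("foo", "ドメインウ"):
    utf8_token = token_with_charset(key, "utf-8")
    legacy_token = token_with_charset(key, "cp1252")  # assumed default charset
    print(key, "same token:", utf8_token == legacy_token)
```

If something like this were the cause, an ASCII key such as 'foo' would hash identically under both encodings, while a non-ASCII key would be stored under one token and looked up under another. That would be consistent with the row showing up in get_range_slices but not in multi_get_slice.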
