cassandra-user mailing list archives

From Henrik Schröder <skro...@gmail.com>
Subject Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Date Thu, 05 May 2011 11:12:37 GMT
Yeah, I've seen that one, and I'm guessing that it's the root cause of my
problems, something something encoding error, but that doesn't really help
me. :-)

However, I've done all my tests with 0.7.5, so I'm gonna try them again with
0.7.4, just to see how that version reacts.


/Henrik

On Wed, May 4, 2011 at 18:53, Daniel Doubleday <daniel.doubleday@gmx.net> wrote:

> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds like
>
> https://issues.apache.org/jira/browse/CASSANDRA-2367
>
> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>
> Hey everyone,
>
> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
> just to make sure that the change in how keys are encoded wouldn't cause us
> any data loss. Unfortunately it seems that rows stored under a unicode key
> couldn't be retrieved after the upgrade. We're running everything on
> Windows, and we're using the generated thrift client in C# to access it.
>
> I managed to make a minimal test to reproduce the error consistently:
>
> First, I started up Cassandra 0.6.13 with an empty data directory, and a
> really simple config with a single keyspace with a single BytesType
> columnfamily.
> I wrote two rows, each with a single column with a simple column name and a
> 1-byte value of "1". The first row had a key using only ascii chars ('foo'),
> and the second row had a key using unicode chars ('ドメインウ').
>
> Using multi_get, and both those keys, I got both columns back, as expected.
> Using multi_get_slice and both those keys, I got both columns back, as
> expected.
> I also did a get_range_slices to get all rows in the columnfamily, and I
> got both columns back, as expected.
>
> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
> Cassandra 0.7.5, pointing to the same data directory, with a config
> containing the same keyspace, and I run the schematool import command.
>
> I then start up my test program that uses the new thrift api, and run some
> commands.
>
> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
> only get back one column, the one under the key 'foo'. The other row I
> simply can't retrieve.
>
> However, when I use get_range_slices to get all rows, I get back two rows,
> with the correct column values, and the byte-array keys are identical to my
> encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back
> my two original keys. This means that both my rows are still there, the keys
> as output by Cassandra are identical to the original string keys I used when
> I created the rows in 0.6.13, but it's just impossible to retrieve the
> second row.
>
> To continue the test, I inserted a row with the key 'ドメインウ' encoded as
> UTF-8 again, and gave it a similar column as the original, but with a 1-byte
> value of "2".
>
> Now, when I use multi_get_slice with my two encoded keys, I get back two
> rows, the 'foo' row has the old value as expected, and the other row has the
> new value as expected.
>
> However, when I use get_range_slices to get all rows, I get back *three*
> rows, two of which have the *exact same* byte-array key, one has the old
> column, one has the new column.
>
>
> How is this possible? How can there be two different rows with the exact
> same key? I'm guessing that it's related to the encoding of string keys in
> 0.6, and that the internal representation is off somehow. I checked the
> generated thrift client for 0.6, and it UTF8-encodes all keys before sending
> them to the server, so it should be UTF8 all the way, but apparently it
> isn't.
>
> Has anyone else experienced the same problem? Is it a platform-specific
> problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
> lose any rows? I would also really like to know which byte-array I should
> send in to get back that second row, there's gotta be some key that can be
> used to get it, the row is still there after all.
>
>
> /Henrik Schröder
>
>
>
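
One way to look for the byte-array that would retrieve the second row is to dump a few plausible encodings of the same string key and compare them byte by byte with what get_range_slices returns. The following is only a rough sketch in Python, not the C# thrift client used in the tests above. The helper names (candidate_key_bytes, hex_dump) are made up for illustration, and the choice of normalization forms and codecs is an assumption. Nothing in this thread establishes that any of them is what 0.6.13 actually stored.

```python
import unicodedata


def candidate_key_bytes(key):
    """Return {label: bytes} for a few plausible encodings of `key`.

    The normalization forms and codecs tried here are guesses for
    diagnostic purposes, not a statement of what Cassandra does.
    """
    candidates = {}
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, key)
        for codec in ("utf-8", "utf-16-le", "utf-16-be"):
            candidates[f"{form}/{codec}"] = normalized.encode(codec)
    return candidates


def hex_dump(raw):
    """Format bytes as space-separated hex pairs for easy comparison."""
    return " ".join(f"{b:02x}" for b in raw)


if __name__ == "__main__":
    # The two row keys from the test described above.
    for key in ("foo", "ドメインウ"):
        print(repr(key))
        for label, raw in candidate_key_bytes(key).items():
            print(f"  {label:14} {hex_dump(raw)}")
```

For the ascii key every form agrees under UTF-8, while for the unicode key the candidates differ, so printing the keys returned by get_range_slices with the same hex_dump format would show whether the two "identical" keys really match at the byte level.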
