incubator-cassandra-user mailing list archives

From Henrik Schröder <skro...@gmail.com>
Subject Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Date Thu, 05 May 2011 17:28:10 GMT
Thanks, but patching or losing keys is not an option for us. :-/


/Henrik

On Thu, May 5, 2011 at 15:00, Daniel Doubleday <daniel.doubleday@gmx.net> wrote:

> Don't know if that helps you but since we had the same SSTable corruption I
> have been looking into that very code the other day:
>
> If you could afford to drop these rows and are able to recognize them, the
> easiest way would be patching:
>
> SSTableScanner:162
>
> public IColumnIterator next()
> {
>     try
>     {
>         if (row != null)
>             file.seek(finishedAt);
>         assert !file.isEOF();
>
>         DecoratedKey key = SSTableReader.decodeKey(sstable.partitioner,
>                                                    sstable.descriptor,
>                                                    ByteBufferUtil.readWithShortLength(file));
>         long dataSize = SSTableReader.readRowSize(file, sstable.descriptor);
>         long dataStart = file.getFilePointer();
>         finishedAt = dataStart + dataSize;
>
>         if (filter == null)
>         {
>             row = new SSTableIdentityIterator(sstable, file, key, dataStart, dataSize);
>             return row;
>         }
>         else
>         {
>             return row = filter.getSSTableColumnIterator(sstable, file, key);
>         }
>     }
>     catch (IOException e)
>     {
>         throw new RuntimeException(SSTableScanner.this + " failed to provide next columns from " + this, e);
>     }
> }
>
> The string key is new String(ByteBufferUtil.getArray(key.key), "UTF-8").
> If you find one that you don't like, just skip it.
>
> This way compaction goes through, but obviously you'll lose data.
>
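A minimal, self-contained sketch of the recognize-and-skip idea above. The bad-key set and the class/method names here are mine, not part of an actual patch; the real change would live inside SSTableScanner.next() and seek past the offending row's data instead of returning it:

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;

public class SkipBadKeys {
    // Hypothetical: the keys you have decided you can afford to drop.
    static final Set<String> BAD_KEYS = Set.of("ドメインウ");

    // Mirrors the decode step quoted above: raw key bytes -> UTF-8 string.
    static boolean shouldSkip(byte[] keyBytes) {
        String key = new String(keyBytes, StandardCharsets.UTF_8);
        return BAD_KEYS.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip("foo".getBytes(StandardCharsets.UTF_8)));
        System.out.println(shouldSkip("ドメインウ".getBytes(StandardCharsets.UTF_8)));
    }
}
```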
> On May 5, 2011, at 1:12 PM, Henrik Schröder wrote:
>
> Yeah, I've seen that one, and I'm guessing that it's the root cause of my
> problems, something something encoding error, but that doesn't really help
> me. :-)
>
> However, I've done all my tests with 0.7.5, I'm gonna try them again with
> 0.7.4, just to see how that version reacts.
>
>
> /Henrik
>
> On Wed, May 4, 2011 at 18:53, Daniel Doubleday <daniel.doubleday@gmx.net> wrote:
>
>> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds
>> like
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-2367
>>
>> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>>
>> Hey everyone,
>>
>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
>> just to make sure that the change in how keys are encoded wouldn't cause us
>> any data loss. Unfortunately it seems that rows stored under a Unicode key
>> couldn't be retrieved after the upgrade. We're running everything on
>> Windows, and we're using the generated thrift client in C# to access it.
>>
>> I managed to make a minimal test to reproduce the error consistently:
>>
>> First, I started up Cassandra 0.6.13 with an empty data directory, and a
>> really simple config with a single keyspace containing a single BytesType
>> columnfamily.
>> I wrote two rows, each with a single column with a simple column name and
>> a 1-byte value of "1". The first row had a key using only ascii chars
>> ('foo'), and the second row had a key using unicode chars ('ドメインウ').
>>
>> Using multi_get, and both those keys, I got both columns back, as
>> expected.
>> Using multi_get_slice and both those keys, I got both columns back, as
>> expected.
>> I also did a get_range_slices to get all rows in the columnfamily, and I
>> got both columns back, as expected.
>>
>> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
>> Cassandra 0.7.5, pointing to the same data directory, with a config
>> containing the same keyspace, and I run the schematool import command.
>>
>> I then start up my test program that uses the new thrift api, and run some
>> commands.
>>
>> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
>> only get back one column, the one under the key 'foo'. The other row I
>> simply can't retrieve.
>>
>> However, when I use get_range_slices to get all rows, I get back two rows,
>> with the correct column values, and the byte-array keys are identical to my
>> encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back
>> my two original keys. This means that both my rows are still there, and the
>> keys as output by Cassandra are identical to the original string keys I used
>> when I created the rows in 0.6.13, but it's just impossible to retrieve the
>> second row.
>>
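The symptom Henrik describes can be modeled in a few lines of plain Java: if the bytes a row was stored under differ from the bytes used to probe for it, a keyed lookup misses even though a full scan still returns the row. (UTF-16 below is only a stand-in for "some other encoding"; it is not a claim about what 0.6 actually wrote to disk.)

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LookupVsScan {
    public static void main(String[] args) {
        Map<ByteBuffer, String> rows = new LinkedHashMap<>();
        String key = "ドメインウ";

        // Row stored under one byte encoding of the key...
        rows.put(ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_16)), "1");

        // ...probed with a different encoding: the keyed lookup misses.
        ByteBuffer probe = ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8));
        System.out.println(rows.containsKey(probe));

        // A scan over all rows (the get_range_slices analogue) still sees it.
        for (Map.Entry<ByteBuffer, String> e : rows.entrySet())
            System.out.println(e.getValue());
    }
}
```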
>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as
>> UTF-8 again, and gave it a similar column as the original, but with a 1-byte
>> value of "2".
>>
>> Now, when I use multi_get_slice with my two encoded keys, I get back two
>> rows, the 'foo' row has the old value as expected, and the other row has the
>> new value as expected.
>>
>> However, when I use get_range_slices to get all rows, I get back *three*
>> rows, two of which have the *exact same* byte-array key, one has the old
>> column, one has the new column.
>>
>>
>> How is this possible? How can there be two different rows with the exact
>> same key? I'm guessing that it's related to the encoding of string keys in
>> 0.6, and that the internal representation is off somehow. I checked the
>> generated thrift client for 0.6, and it UTF8-encodes all keys before sending
>> them to the server, so it should be UTF8 all the way, but apparently it
>> isn't.
>>
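One way stored key bytes can silently diverge from the UTF-8 bytes a client sends is a platform-default charset sneaking into an encode step somewhere on the server side; that this is what happened here is an assumption, but katakana is exactly the kind of input that exposes such a bug. In the sketch below, ISO-8859-1 stands in for the Windows default windows-1252; neither can represent katakana, so a default-charset encode silently degrades every char to '?':

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetMismatch {
    public static void main(String[] args) {
        String key = "ドメインウ";

        // What the Thrift clients send: UTF-8, 3 bytes per katakana char.
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);

        // What a default-charset encode produces when the charset (here
        // ISO-8859-1, standing in for windows-1252) cannot map the chars:
        // each one is replaced by '?' (0x3F).
        byte[] latin1 = key.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(latin1));
    }
}
```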
>> Has anyone else experienced the same problem? Is it a platform-specific
>> problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
>> lose any rows? I would also really like to know which byte-array I should
>> send in to get back that second row, there's gotta be some key that can be
>> used to get it, the row is still there after all.
>>
>>
>> /Henrik Schröder
>>
>>
>>
>
>
