cassandra-user mailing list archives

From Daniel Doubleday <>
Subject Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Date Thu, 05 May 2011 13:00:17 GMT
Don't know if that helps you but since we had the same SSTable corruption I have been looking
into that very code the other day:

If you can afford to drop these rows and are able to recognize them, the easiest way would
be to patch the scanner's next() method in SSTableScanner:


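public IColumnIterator next()
{
    try
    {
        // Move past whatever is left of the previous row before reading the next key.
        if (row != null)
            file.seek(finishedAt);
        assert !file.isEOF();

        // The remaining arguments of this call are cut off in the archived message.
        DecoratedKey key = SSTableReader.decodeKey(sstable.partitioner, ...);
        long dataSize = SSTableReader.readRowSize(file, sstable.descriptor);
        long dataStart = file.getFilePointer();
        finishedAt = dataStart + dataSize;

        // A patch would inspect key at this point and skip the rows to be dropped.
        if (filter == null)
        {
            row = new SSTableIdentityIterator(sstable, file, key, dataStart, dataSize);
            return row;
        }
        else
        {
            return row = filter.getSSTableColumnIterator(sstable, file, key);
        }
    }
    catch (IOException e)
    {
        throw new RuntimeException(SSTableScanner.this + " failed to provide next columns from " + this, e);
    }
}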

The string key is new String(ByteBufferUtil.getArray(key.key), "UTF-8")
If you find one that you don't like just skip it.
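
A minimal sketch of that check, using only java.nio so that it runs outside Cassandra. The class name KeyFilter, the helpers decodeKey and shouldSkip, and the rule "skip every key that contains a non-ASCII character" are hypothetical and only illustrate the idea; the real test would be whatever lets you recognize the rows you are willing to drop.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class KeyFilter {
    // Decode the raw key bytes as UTF-8. duplicate() keeps the caller's
    // buffer position untouched, so the key can still be used afterwards.
    static String decodeKey(ByteBuffer key) {
        return StandardCharsets.UTF_8.decode(key.duplicate()).toString();
    }

    // Hypothetical rule: any key with a character above 127 is treated as suspect.
    // Malformed UTF-8 decodes to U+FFFD, which is also above 127 and gets skipped.
    static boolean shouldSkip(ByteBuffer key) {
        String s = decodeKey(key);
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ByteBuffer ascii = ByteBuffer.wrap("foo".getBytes(StandardCharsets.UTF_8));
        // Unicode escapes keep the example independent of the source file encoding.
        ByteBuffer unicode = ByteBuffer.wrap("\u30C9\u30E1\u30A4\u30F3\u30A6".getBytes(StandardCharsets.UTF_8));
        System.out.println("skip foo?     " + shouldSkip(ascii));
        System.out.println("skip unicode? " + shouldSkip(unicode));
    }
}
```

In the patched next() you would run a check like this on the decoded key and move on to the following row whenever it returns true.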

This way compaction goes through, but obviously you'll lose data.

On May 5, 2011, at 1:12 PM, Henrik Schröder wrote:

> Yeah, I've seen that one, and I'm guessing that it's the root cause of my problems, something
something encoding error, but that doesn't really help me. :-)
> However, I've done all my tests with 0.7.5, I'm gonna try them again with 0.7.4, just
to see how that version reacts.
> /Henrik
> On Wed, May 4, 2011 at 18:53, Daniel Doubleday <> wrote:
> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds like
> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>> Hey everyone,
>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, just to
make sure that the change in how keys are encoded wouldn't cause us any data loss. Unfortunately
it seems that rows stored under a unicode key couldn't be retrieved after the upgrade. We're
running everything on Windows, and we're using the generated thrift client in C# to access
>> I managed to make a minimal test to reproduce the error consistently:
>> First, I started up Cassandra 0.6.13 with an empty data directory, and a really simple
config with a single keyspace with a single BytesType column family.
>> I wrote two rows, each with a single column with a simple column name and a 1-byte
value of "1". The first row had a key using only ascii chars ('foo'), and the second row had
a key using unicode chars ('ドメインウ').
>> Using multi_get, and both those keys, I got both columns back, as expected.
>> Using multi_get_slice and both those keys, I got both columns back, as expected.
>> I also did a get_range_slices to get all rows in the columnfamily, and I got both
columns back, as expected.
>> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up Cassandra
0.7.5, pointing to the same data directory, with a config containing the same keyspace, and
I run the schematool import command.
>> I then start up my test program that uses the new thrift api, and run some commands.
>> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I only get
back one column, the one under the key 'foo'. The other row I simply can't retrieve.
>> However, when I use get_range_slices to get all rows, I get back two rows, with the
correct column values, and the byte-array keys are identical to my encoded keys, and when
I decode the byte-arrays as UTF8 strings, I get back my two original keys. This means that
both my rows are still there, the keys as output by Cassandra are identical to the original
string keys I used when I created the rows in 0.6.13, but it's just impossible to retrieve
the second row.
>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as
UTF-8 again, and gave it a similar column as the original, but with a 1-byte value of "2".
>> Now, when I use multi_get_slice with my two encoded keys, I get back two rows, the
'foo' row has the old value as expected, and the other row has the new value as expected.
>> However, when I use get_range_slices to get all rows, I get back *three* rows, two
of which have the *exact same* byte-array key, one has the old column, one has the new column.

>> How is this possible? How can there be two different rows with the exact same key?
I'm guessing that it's related to the encoding of string keys in 0.6, and that the internal
representation is off somehow. I checked the generated thrift client for 0.6, and it UTF8-encodes
all keys before sending them to the server, so it should be UTF8 all the way, but apparently
it isn't.
>> Has anyone else experienced the same problem? Is it a platform-specific problem?
Is there a way to avoid this and upgrade from 0.6 to 0.7 and not lose any rows? I would also
really like to know which byte-array I should send in to get back that second row, there's
gotta be some key that can be used to get it, the row is still there after all.
>> /Henrik Schröder
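
On the question of which byte array identifies a row, the sketch below shows only the general point that the bytes produced for one and the same string depend on the charset used to encode it. The class name KeyEncodingDemo and its helpers are hypothetical, and ISO-8859-1 is picked purely as a second charset for contrast; this is an assumption for illustration, not a diagnosis of what 0.6 actually does with keys.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyEncodingDemo {
    // The unicode key from the test, written with escapes so that the
    // source file encoding cannot alter it.
    static final String KEY = "\u30C9\u30E1\u30A4\u30F3\u30A6";

    // Encode the key with an explicit charset instead of the platform default.
    static byte[] encode(String key, Charset charset) {
        return key.getBytes(charset);
    }

    // Render a byte array as lowercase hex for easy comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = encode(KEY, StandardCharsets.UTF_8);
        byte[] latin1 = encode(KEY, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8      (" + utf8.length + " bytes): " + hex(utf8));
        System.out.println("ISO-8859-1 (" + latin1.length + " bytes): " + hex(latin1));
        System.out.println("same bytes: " + Arrays.equals(utf8, latin1));

        // Decoding with the charset used for encoding restores the original key.
        System.out.println("UTF-8 round trip ok: "
                + new String(utf8, StandardCharsets.UTF_8).equals(KEY));
    }
}
```

The UTF-8 form is 15 bytes long, while ISO-8859-1 cannot represent these characters and yields 5 replacement bytes, so a lookup keyed on one byte array would not match data stored under the other.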
