From: aaron morton
To: user@cassandra.apache.org
Subject: Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Date: Thu, 5 May 2011 22:48:20 +1200

I take it back, the problem started in 0.6 where keys were strings.
Looking into how 0.6 did its thing.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 5 May 2011, at 22:36, aaron morton wrote:

> Interesting, but as we are dealing with keys it should not matter, as they are treated as byte buffers.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5 May 2011, at 04:53, Daniel Doubleday wrote:
>
>> This is a bit of a wild guess, but Windows and encoding and 0.7.5 sounds like
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-2367
>>
>> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>>
>>> Hey everyone,
>>>
>>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, just to make sure that the change in how keys are encoded wouldn't cause us any data loss. Unfortunately it seems that rows stored under a Unicode key couldn't be retrieved after the upgrade. We're running everything on Windows, and we're using the generated Thrift client in C# to access it.
>>>
>>> I managed to make a minimal test to reproduce the error consistently:
>>>
>>> First, I started up Cassandra 0.6.13 with an empty data directory, and a really simple config with a single keyspace with a single BytesType column family.
>>> I wrote two rows, each with a single column with a simple column name and a 1-byte value of "1". The first row had a key using only ASCII chars ('foo'), and the second row had a key using Unicode chars ('ドメインウ').
>>>
>>> Using multi_get, and both those keys, I got both columns back, as expected.
>>> Using multi_get_slice and both those keys, I got both columns back, as expected.
>>> I also did a get_range_slices to get all rows in the column family, and I got both columns back, as expected.
>>>
>>> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up Cassandra 0.7.5, pointing to the same data directory, with a config containing the same keyspace, and I run the schematool import command.
>>>
>>> I then start up my test program that uses the new Thrift API, and run some commands.
>>>
>>> Using multi_get_slice, and those two keys encoded as UTF-8 byte-arrays, I only get back one column, the one under the key 'foo'. The other row I simply can't retrieve.
>>>
>>> However, when I use get_range_slices to get all rows, I get back two rows, with the correct column values, and the byte-array keys are identical to my encoded keys, and when I decode the byte-arrays as UTF-8 strings, I get back my two original keys. This means that both my rows are still there, and the keys as output by Cassandra are identical to the original string keys I used when I created the rows in 0.6.13, but it's just impossible to retrieve the second row.
>>>
>>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8 again, and gave it a similar column as the original, but with a 1-byte value of "2".
>>>
>>> Now, when I use multi_get_slice with my two encoded keys, I get back two rows: the 'foo' row has the old value as expected, and the other row has the new value as expected.
>>>
>>> However, when I use get_range_slices to get all rows, I get back *three* rows, two of which have the *exact same* byte-array key; one has the old column, one has the new column.
>>>
>>> How is this possible? How can there be two different rows with the exact same key? I'm guessing that it's related to the encoding of string keys in 0.6, and that the internal representation is off somehow. I checked the generated Thrift client for 0.6, and it UTF-8-encodes all keys before sending them to the server, so it should be UTF-8 all the way, but apparently it isn't.
>>>
>>> Has anyone else experienced the same problem? Is it a platform-specific problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not lose any rows? I would also really like to know which byte-array I should send in to get back that second row. There's gotta be some key that can be used to get it; the row is still there after all.
>>>
>>> /Henrik Schröder
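
For anyone wanting to repeat the two checks described in the quoted report, here is a minimal sketch in Python (not the C# Thrift client used in the thread). It shows the UTF-8 encoding of the two test keys and a way to spot rows that come back from a range scan under byte-identical keys. The `rows` list is made-up sample data standing in for a get_range_slices result, the column name "col" is invented, and `encode_key` and `find_duplicate_keys` are hypothetical helpers, not part of any Cassandra client. It does not talk to a server and does not explain why the old row cannot be fetched.

```python
from collections import defaultdict

# The two row keys used in the test described in the quoted message.
ASCII_KEY = "foo"
UNICODE_KEY = "ドメインウ"


def encode_key(key: str) -> bytes:
    """Encode a string row key as a UTF-8 byte array."""
    return key.encode("utf-8")


def find_duplicate_keys(rows):
    """Group (key_bytes, columns) pairs by key bytes.

    Returns only the keys that appear more than once, mapped to the
    list of column dicts seen under that key. `rows` is a hypothetical
    stand-in for the result of a range scan.
    """
    grouped = defaultdict(list)
    for key_bytes, columns in rows:
        grouped[key_bytes].append(columns)
    return {k: v for k, v in grouped.items() if len(v) > 1}


# Made-up sample data mirroring the three rows reported after the upgrade.
rows = [
    (encode_key(ASCII_KEY), {"col": b"1"}),
    (encode_key(UNICODE_KEY), {"col": b"1"}),  # row written under 0.6.13
    (encode_key(UNICODE_KEY), {"col": b"2"}),  # row written under 0.7.5
]

if __name__ == "__main__":
    # Hex dump of the encoded Unicode key, for comparison with server output.
    print(encode_key(UNICODE_KEY).hex(" "))
    for key_bytes, versions in find_duplicate_keys(rows).items():
        print(key_bytes.decode("utf-8"), versions)
```

Comparing the hex dump with the raw key bytes returned by the server is one way to confirm that the keys really are byte-identical rather than merely rendering the same.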

