From: aaron morton
Subject: Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Date: Thu, 5 May 2011 22:36:19 +1200
To: user@cassandra.apache.org

Interesting, but as we are dealing with keys it should not matter, as they are treated as byte buffers.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 5 May 2011, at 04:53, Daniel Doubleday wrote:

This is a bit of a wild guess, but Windows and encoding and 0.7.5 sounds like

https://issues.apache.org/jira/browse/CASSANDRA-2367

On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:

Hey everyone,

We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, just to make sure that the change in how keys are encoded wouldn't cause us any data loss. Unfortunately, it seems that rows stored under a Unicode key couldn't be retrieved after the upgrade. We're running everything on Windows, and we're using the generated Thrift client in C# to access it.

I managed to make a minimal test to reproduce the error consistently:

First, I started up Cassandra 0.6.13 with an empty data directory, and a really simple config with a single keyspace with a single BytesType column family.
I wrote two rows, each with a single column with a simple column name and a 1-byte value of "1". The first row had a key using only ASCII chars ('foo'), and the second row had a key using Unicode chars ('ドメインウ').
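
As a point of reference, the two test keys can be turned into byte arrays with a plain UTF-8 encode. This is only a sketch in Python (the test described here used the generated C# client), and the variable names are illustrative rather than taken from the original program.

```python
# Illustrative only: the test in this thread was written in C#.
ascii_key = "foo"
unicode_key = "ドメインウ"

# Encode both keys as UTF-8, the byte form the keys are sent in.
ascii_bytes = ascii_key.encode("utf-8")
unicode_bytes = unicode_key.encode("utf-8")

print(ascii_bytes.hex())    # one byte per ASCII character
print(unicode_bytes.hex())  # three bytes per katakana character
```

The ASCII key encodes to 3 bytes and the katakana key to 15, so only the second key depends on which encoding is used.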

Using multi_get, and both those keys, I got both columns back, as expected.
Using multi_get_slice and both those keys, I got both columns back, as expected.
I also did a get_range_slices to get all rows in the columnfamily, and I got both columns back, as expected.

So far so good. Then I drained and shut down Cassandra 0.6.13, and started up Cassandra 0.7.5, pointing to the same data directory, with a config containing the same keyspace, and I ran the schematool import command.

I then started up my test program that uses the new Thrift API, and ran some commands.

Using multi_get_slice, and those two keys encoded as UTF-8 byte arrays, I only got back one column, the one under the key 'foo'. The other row I simply couldn't retrieve.

However, when I use get_range_slices to get all rows, I get back two rows with the correct column values. The byte-array keys are identical to my encoded keys, and when I decode the byte arrays as UTF-8 strings, I get back my two original keys. This means that both my rows are still there, and the keys as output by Cassandra are identical to the original string keys I used when I created the rows in 0.6.13, but it's just impossible to retrieve the second row.
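
The comparison described above (byte-for-byte equality plus a decode round trip) can be sketched as follows. This is a minimal Python illustration, not the original C# test; `describe_key` and `returned_key` are hypothetical names, and the returned key is a hard-coded stand-in for what a range scan might hand back.

```python
def describe_key(returned: bytes, original: str) -> dict:
    """Compare a key returned by the server with the original string key."""
    expected = original.encode("utf-8")
    return {
        # True when the raw bytes match the UTF-8 encoding of the original.
        "bytes_equal": returned == expected,
        # True when decoding the bytes as UTF-8 gives the original string
        # (decode raises UnicodeDecodeError if the bytes are not valid UTF-8).
        "decoded_equal": returned.decode("utf-8") == original,
        "hex": returned.hex(),
    }

# Stand-in for a key as it might come back from a range scan.
returned_key = bytes.fromhex("e38389e383a1e382a4e383b3e382a6")
result = describe_key(returned_key, "ドメインウ")
print(result)
```

Printing the hex form alongside the two checks makes it easier to spot a key that decodes correctly but differs in its raw bytes.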

To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8 again, and gave it a similar column as the original, but with a 1-byte value of "2".

Now, when I use multi_get_slice with my two encoded keys, I get back two rows: the 'foo' row has the old value as expected, and the other row has the new value as expected.

However, when I use get_range_slices to get all rows, I get back *three* rows, two of which have the *exact same* byte-array key: one has the old column, one has the new column.


How is this possible? How can there be two different rows with the exact same key? I'm guessing that it's related to the encoding of string keys in 0.6, and that the internal representation is off somehow. I checked the generated thrift client for 0.6, and it UTF8-encodes all keys before sending them to the server, so it should be UTF8 all the way, but apparently it isn't.
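
To make the guess above concrete, here is a purely hypothetical Python sketch. It assumes, without any confirmation in this thread, that somewhere a key string is converted to bytes with a non-UTF-8 charset before being hashed. `cp1252` stands in for a Windows default charset and MD5 is used only as an example of a hash over key bytes; neither is claimed to be what Cassandra actually does here.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex digest of the given key bytes (example hash only)."""
    return hashlib.md5(data).hexdigest()

for key in ("foo", "ドメインウ"):
    utf8 = key.encode("utf-8")
    # Hypothetical legacy path: characters the charset cannot represent
    # are replaced with "?", so the katakana key collapses to b"?????".
    legacy = key.encode("cp1252", errors="replace")
    same_bytes = utf8 == legacy
    same_hash = md5_hex(utf8) == md5_hex(legacy)
    print(key, same_bytes, same_hash)
```

Under that assumption an ASCII key such as 'foo' produces the same bytes and hash either way, while the katakana key does not, which would be consistent with only the second row becoming unreachable.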

Has anyone else experienced the same problem? Is it a platform-specific problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not lose any rows? I would also really like to know which byte array I should send in to get back that second row. There's gotta be some key that can be used to get it; the row is still there, after all.


/Henrik Schröder