From: Henrik Schröder
To: user@cassandra.apache.org
Date: Thu, 5 May 2011 13:12:37 +0200
Subject: Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
In-Reply-To: <80052FA0-2D52-4B1A-ABAA-9B04F843EFC4@gmx.net>

Yeah, I've seen that one, and I'm guessing that it's the root cause of my
problems, something something encoding error, but that doesn't really help
me. :-)

However, I've done all my tests with 0.7.5, I'm gonna try them again with
0.7.4, just to see how that version reacts.


/Henrik

On Wed, May 4, 2011 at 18:53, Daniel Doubleday <daniel.doubleday@gmx.net> wrote:

> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds
> like
>
> https://issues.apache.org/jira/browse/CASSANDRA-2367
>
>
> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>
> Hey everyone,
>
> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
> just to make sure that the change in how keys are encoded wouldn't cause us
> any data loss. Unfortunately it seems that rows stored under a Unicode key
> couldn't be retrieved after the upgrade. We're running everything on
> Windows, and we're using the generated Thrift client in C# to access it.
>
> I managed to make a minimal test to reproduce the error consistently:
>
> First, I started up Cassandra 0.6.13 with an empty data directory, and a
> really simple config with a single keyspace with a single bytestype
> columnfamily.
> I wrote two rows, each with a single column with a simple column name and a
> 1-byte value of "1". The first row had a key using only ASCII chars ('foo'),
> and the second row had a key using Unicode chars ('ドメインウ').
>
> Using multi_get, and both those keys, I got both columns back, as expected.
> Using multi_get_slice and both those keys, I got both columns back, as
> expected.
> I also did a get_range_slices to get all rows in the columnfamily, and I
> got both columns back, as expected.
>
> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
> Cassandra 0.7.5, pointing to the same data directory, with a config
> containing the same keyspace, and I run the schematool import command.
>
> I then start up my test program that uses the new Thrift API, and run some
> commands.
>
> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
> only get back one column, the one under the key 'foo'. The other row I
> simply can't retrieve.
>
> However, when I use get_range_slices to get all rows, I get back two rows,
> with the correct column values, and the byte-array keys are identical to my
> encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back
> my two original keys. This means that both my rows are still there, the keys
> as output by Cassandra are identical to the original string keys I used when
> I created the rows in 0.6.13, but it's just impossible to retrieve the
> second row.
>
> To continue the test, I inserted a row with the key 'ドメインウ' encoded as
> UTF-8 again, and gave it a similar column as the original, but with a 1-byte
> value of "2".
>
> Now, when I use multi_get_slice with my two encoded keys, I get back two
> rows, the 'foo' row has the old value as expected, and the other row has the
> new value as expected.
>
> However, when I use get_range_slices to get all rows, I get back *three*
> rows, two of which have the *exact same* byte-array key, one has the old
> column, one has the new column.
>
>
> How is this possible? How can there be two different rows with the exact
> same key? I'm guessing that it's related to the encoding of string keys in
> 0.6, and that the internal representation is off somehow. I checked the
> generated Thrift client for 0.6, and it UTF8-encodes all keys before sending
> them to the server, so it should be UTF8 all the way, but apparently it
> isn't.
>
> Has anyone else experienced the same problem? Is it a platform-specific
> problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
> lose any rows? I would also really like to know which byte-array I should
> send in to get back that second row, there's gotta be some key that can be
> used to get it, the row is still there after all.
>
>
> /Henrik Schröder
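
The suspicion voiced in the thread, that 0.6 and 0.7 disagree on the byte representation of a non-ASCII key, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the helper names (`utf8_bytes`, `legacy_bytes`, `digest`), cp1252 as a stand-in for a Windows default charset, and MD5 as a stand-in for whatever function maps a key to its storage position. None of it is taken from the Cassandra source, and the thread does not confirm that this is the actual cause.

```python
import hashlib

# The two row keys used in the test described in the message.
ASCII_KEY = "foo"
UNICODE_KEY = "\u30c9\u30e1\u30a4\u30f3\u30a6"  # the five katakana characters


def utf8_bytes(key: str) -> bytes:
    """Encode the key as UTF-8, as the test program does for the new API."""
    return key.encode("utf-8")


def legacy_bytes(key: str, charset: str = "cp1252") -> bytes:
    """HYPOTHETICAL: encode the key with a platform default charset.

    cp1252 is only an assumed example; characters it cannot represent are
    replaced with '?'.
    """
    return key.encode(charset, errors="replace")


def digest(raw: bytes) -> str:
    """HYPOTHETICAL stand-in for a function that derives a position from key bytes."""
    return hashlib.md5(raw).hexdigest()


for key in (ASCII_KEY, UNICODE_KEY):
    same = digest(utf8_bytes(key)) == digest(legacy_bytes(key))
    # ascii() keeps the output printable on consoles that are not UTF-8.
    print(ascii(key), utf8_bytes(key).hex(), "digests match:", same)
```

Under these assumptions the key 'foo' yields the same bytes either way, while the Unicode key does not. That would be consistent with only the second row being unreachable by direct lookup even though a full scan still returns it, and with a newly written row under the same key landing in a different place than the old one.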
