Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <80052FA0-2D52-4B1A-ABAA-9B04F843EFC4@gmx.net>
References: <BANLkTi=e0yPeVuuyMU-5T+zJsuoEoUKh5w@mail.gmail.com>
	<80052FA0-2D52-4B1A-ABAA-9B04F843EFC4@gmx.net>
Date: Thu, 5 May 2011 13:12:37 +0200
Message-ID: <BANLkTinj6MQ3vtghFQrX2-ghxbFtnEwpjw@mail.gmail.com>
Subject: Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
From: Henrik Schröder <skrolle@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0016367657bc72c4e904a28573d1

--0016367657bc72c4e904a28573d1
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Yeah, I've seen that one, and I'm guessing that it's the root cause of my
problems, something something encoding error, but that doesn't really help
me. :-)

However, I've done all my tests with 0.7.5, I'm gonna try them again with
0.7.4, just to see how that version reacts.


/Henrik

On Wed, May 4, 2011 at 18:53, Daniel Doubleday <daniel.doubleday@gmx.net>
wrote:

> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds
> like
>
> https://issues.apache.org/jira/browse/CASSANDRA-2367
>
> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>
> Hey everyone,
>
> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
> just to make sure that the change in how keys are encoded wouldn't cause
> us any data loss. Unfortunately it seems that rows stored under a Unicode
> key couldn't be retrieved after the upgrade. We're running everything on
> Windows, and we're using the generated thrift client in C# to access it.
>
> I managed to make a minimal test to reproduce the error consistently:
>
> First, I started up Cassandra 0.6.13 with an empty data directory, and a
> really simple config with a single keyspace with a single bytestype
> columnfamily.
> I wrote two rows, each with a single column with a simple column name and
> a 1-byte value of "1". The first row had a key using only ASCII chars
> ('foo'), and the second row had a key using Unicode chars ('ドメインウ').
>
> Using multi_get, and both those keys, I got both columns back, as
> expected.
> Using multi_get_slice and both those keys, I got both columns back, as
> expected.
> I also did a get_range_slices to get all rows in the columnfamily, and I
> got both columns back, as expected.
>
> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
> Cassandra 0.7.5, pointing to the same data directory, with a config
> containing the same keyspace, and I run the schematool import command.
>
> I then start up my test program that uses the new thrift api, and run
> some commands.
>
> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
> only get back one column, the one under the key 'foo'. The other row I
> simply can't retrieve.
>
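
For anyone who wants to compare bytes, here is a minimal sketch of the
byte-arrays in question. It is written in Python rather than the C# client
used in the test, and the variable names are made up for illustration; the
only thing taken from the test above is the two key strings.

```python
# The two row keys from the test, as text.
ascii_key = "foo"
unicode_key = "ドメインウ"

# The 0.7 thrift api takes keys as byte-arrays, so the client encodes them.
ascii_bytes = ascii_key.encode("utf-8")
unicode_bytes = unicode_key.encode("utf-8")

print(ascii_bytes.hex())    # 666f6f
print(unicode_bytes.hex())  # e38389e383a1e382a4e383b3e382a6

# Decoding the bytes as UTF-8 gives back the original key text.
assert unicode_bytes.decode("utf-8") == unicode_key
```

The hex dump is handy for comparing against whatever bytes a range scan
returns, since two keys can look the same as text in a console and still
differ as bytes.
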
> However, when I use get_range_slices to get all rows, I get back two
> rows, with the correct column values, and the byte-array keys are
> identical to my encoded keys, and when I decode the byte-arrays as UTF8
> strings, I get back my two original keys. This means that both my rows
> are still there, the keys as output by Cassandra are identical to the
> original string keys I used when I created the rows in 0.6.13, but it's
> just impossible to retrieve the second row.
>
> To continue the test, I inserted a row with the key 'ドメインウ' encoded
> as UTF-8 again, and gave it a similar column as the original, but with a
> 1-byte value of "2".
>
> Now, when I use multi_get_slice with my two encoded keys, I get back two
> rows, the 'foo' row has the old value as expected, and the other row has
> the new value as expected.
>
> However, when I use get_range_slices to get all rows, I get back *three*
> rows, two of which have the *exact same* byte-array key, one has the old
> column, one has the new column.
>
>
> How is this possible? How can there be two different rows with the exact
> same key? I'm guessing that it's related to the encoding of string keys
> in 0.6, and that the internal representation is off somehow. I checked
> the generated thrift client for 0.6, and it UTF8-encodes all keys before
> sending them to the server, so it should be UTF8 all the way, but
> apparently it isn't.
>
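
To make that guess concrete, here is a rough Python sketch of how the same
string key could end up in two different places. Everything in it is an
assumption for illustration, not a confirmed diagnosis: it assumes rows are
located by an MD5 hash of the key bytes, and that some step in 0.6 turned
the key into bytes with a legacy single-byte charset (cp1252 is just a
stand-in) instead of UTF-8.

```python
import hashlib

key = "ドメインウ"

# What a 0.7 client sends: the key encoded as UTF-8.
utf8_bytes = key.encode("utf-8")

# Hypothetical legacy path: a charset that cannot represent these
# characters, so each unmappable character is replaced with "?".
legacy_bytes = key.encode("cp1252", errors="replace")

# If lookups hash the key bytes, the two paths disagree on where the row is.
utf8_digest = hashlib.md5(utf8_bytes).hexdigest()
legacy_digest = hashlib.md5(legacy_bytes).hexdigest()
assert utf8_digest != legacy_digest

# A pure-ASCII key encodes to the same bytes either way, so it is unaffected.
assert "foo".encode("cp1252") == "foo".encode("utf-8")
```

If something like this is going on, it would fit the symptoms: 'foo' keeps
working, the old Unicode row is still on disk but can't be found by a UTF-8
lookup, and a fresh insert creates a second row next to it.
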
> Has anyone else experienced the same problem? Is it a platform-specific
> problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
> lose any rows? I would also really like to know which byte-array I should
> send in to get back that second row, there's gotta be some key that can
> be used to get it, the row is still there after all.
>
>
> /Henrik Schröder
>
>
>

--0016367657bc72c4e904a28573d1--