Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: local policy)
From: aaron morton <aaron@thelastpickle.com>
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: multipart/alternative; boundary=Apple-Mail-18--132214115
Subject: Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Date: Thu, 5 May 2011 22:48:20 +1200
In-Reply-To: <CE05FB9F-200C-4BB8-94B9-3616B5FF9034@thelastpickle.com>
To: user@cassandra.apache.org
References: <BANLkTi=e0yPeVuuyMU-5T+zJsuoEoUKh5w@mail.gmail.com>
 <80052FA0-2D52-4B1A-ABAA-9B04F843EFC4@gmx.net>
 <CE05FB9F-200C-4BB8-94B9-3616B5FF9034@thelastpickle.com>
Message-Id: <5828F3D8-B354-4E1D-B740-E52ADB983D76@thelastpickle.com>


--Apple-Mail-18--132214115
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

I take it back, the problem started in 0.6 where keys were strings. =
Looking into how 0.6 did its thing.


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 5 May 2011, at 22:36, aaron morton wrote:

> Interesting but as we are dealing with keys it should not matter as =
they are treated as byte buffers.=20
>=20
> On 5 May 2011, at 04:53, Daniel Doubleday wrote:
>=20
>> This is a bit of a wild guess but Windows and encoding and 0.7.5 =
sounds like
>>=20
>> https://issues.apache.org/jira/browse/CASSANDRA-2367
>>=20
>> =20
>> On May 3, 2011, at 5:15 PM, Henrik Schr=C3=B6der wrote:
>>=20
>>> Hey everyone,
>>>=20
>>> We did some tests before upgrading our Cassandra cluster from 0.6 to =
0.7, just to make sure that the change in how keys are encoded wouldn't =
cause us any data loss. Unfortunately, it seems that rows stored under a =
Unicode key couldn't be retrieved after the upgrade. We're running =
everything on Windows, and we're using the generated Thrift client in C# =
to access it.
>>>=20
>>> I managed to make a minimal test to reproduce the error =
consistently:
>>>=20
>>> First, I started up Cassandra 0.6.13 with an empty data directory, =
and a really simple config with a single keyspace with a single =
BytesType column family.
>>> I wrote two rows, each with a single column with a simple column =
name and a 1-byte value of "1". The first row had a key using only ASCII =
chars ('foo'), and the second row had a key using Unicode chars =
('=E3=83=89=E3=83=A1=E3=82=A4=E3=83=B3=E3=82=A6').
>>>=20
>>> Using multi_get, and both those keys, I got both columns back, as =
expected.
>>> Using multi_get_slice and both those keys, I got both columns back, =
as expected.
>>> I also did a get_range_slices to get all rows in the column family, =
and I got both columns back, as expected.
>>>=20
>>> So far so good. Then I drain and shut down Cassandra 0.6.13, and =
start up Cassandra 0.7.5, pointing to the same data directory, with a =
config containing the same keyspace, and I run the schematool import =
command.
>>>=20
>>> I then start up my test program that uses the new Thrift API, and =
run some commands.
>>>=20
>>> Using multi_get_slice, and those two keys encoded as UTF8 =
byte-arrays, I only get back one column, the one under the key 'foo'. =
The other row I simply can't retrieve.
>>>=20
>>> However, when I use get_range_slices to get all rows, I get back two =
rows, with the correct column values, and the byte-array keys are =
identical to my encoded keys, and when I decode the byte-arrays as UTF8 =
strings, I get back my two original keys. This means that both my rows =
are still there, the keys as output by Cassandra are identical to the =
original string keys I used when I created the rows in 0.6.13, but it's =
just impossible to retrieve the second row.
>>>=20
>>> To continue the test, I inserted a row with the key =
'=E3=83=89=E3=83=A1=E3=82=A4=E3=83=B3=E3=82=A6' encoded as UTF-8 again, =
and gave it a similar column as the original, but with a 1-byte value of =
"2".
>>>=20
>>> Now, when I use multi_get_slice with my two encoded keys, I get back =
two rows, the 'foo' row has the old value as expected, and the other row =
has the new value as expected.
>>>=20
>>> However, when I use get_range_slices to get all rows, I get back =
*three* rows, two of which have the *exact same* byte-array key, one has =
the old column, one has the new column.=20
>>>=20
>>>=20
>>> How is this possible? How can there be two different rows with the =
exact same key? I'm guessing that it's related to the encoding of string =
keys in 0.6, and that the internal representation is off somehow. I =
checked the generated Thrift client for 0.6, and it UTF8-encodes all =
keys before sending them to the server, so it should be UTF8 all the =
way, but apparently it isn't.
>>>=20
>>> Has anyone else experienced the same problem? Is it a =
platform-specific problem? Is there a way to avoid this and upgrade from =
0.6 to 0.7 and not lose any rows? I would also really like to know which =
byte-array I should send in to get back that second row, there's gotta =
be some key that can be used to get it, the row is still there after =
all.
>>>=20
>>>=20
>>> /Henrik Schr=C3=B6der
>>=20
>=20
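A minimal sketch of the hypothesis raised in this thread, not a confirmed
diagnosis. It assumes that 0.6 at some point turned the string key into
bytes with a charset other than UTF-8 (cp1252 and UTF-16 are used here
purely as examples), so that anything derived from those bytes, such as
an MD5-style digest, would differ from what is derived from the UTF-8
bytes the 0.7 client sends. The function names are hypothetical and this
is not Cassandra code.

```python
import hashlib


def unicode_key():
    # The non-ASCII key from the test, written with escapes (ASCII-safe).
    return "\u30c9\u30e1\u30a4\u30f3\u30a6"


def key_digest(key, charset):
    # Encode the string key with the given charset, then hash the bytes.
    # Characters the charset cannot represent are replaced with "?";
    # that lossy behaviour is an assumption made for this illustration.
    return hashlib.md5(key.encode(charset, "replace")).hexdigest()


# Compare the digests of both test keys under a few example charsets.
for charset in ("utf-8", "cp1252", "utf-16-le"):
    print(charset, key_digest("foo", charset))
    print(charset, key_digest(unicode_key(), charset))
```

Under this assumption the ASCII key 'foo' hashes identically for utf-8
and cp1252, while the non-ASCII key does not, which would be consistent
with 'foo' staying retrievable and the second row not being found.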



--Apple-Mail-18--132214115--