From: Henrik Schröder
To: user@cassandra.apache.org
Date: Wed, 4 May 2011 17:33:05 +0200
Subject: Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
In-Reply-To: <86844899-58A8-43E8-AA38-85DBDD06594D@thelastpickle.com>
References: <86844899-58A8-43E8-AA38-85DBDD06594D@thelastpickle.com>

My two keys that I send in my test program are 0xe695b0e69982e99693 and 0x666f6f, which decode to "数時間" and "foo" respectively.
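
The hex-to-text mapping above can be checked with a few lines of Python. This is a minimal sketch and not part of the original test program; the helper name decode_key is hypothetical, and it assumes the key bytes are UTF-8, as the message states.

```python
# Illustrative only: decode the two row keys quoted above from hex to text.
# Assumption: the key bytes are UTF-8 encoded strings.
keys_hex = ["e695b0e69982e99693", "666f6f"]


def decode_key(hex_key: str) -> str:
    """Turn a hex-encoded key into the string it represents (UTF-8 assumed)."""
    return bytes.fromhex(hex_key).decode("utf-8")


for hex_key in keys_hex:
    text = decode_key(hex_key)
    # Print code points rather than the characters themselves, so the output
    # is unambiguous in any terminal encoding.
    print(hex_key, "->", [f"U+{ord(ch):04X}" for ch in text])
```

The first key is nine bytes long and decodes to three characters, each taking three bytes in UTF-8; the second is plain ASCII.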

So I ran my tests again: I started with a clean 0.6.13, wrote two rows with those two keys, drained, shut down, started 0.7.5, and imported my keyspace.

In my test program, when I do multi_get_slice, I send in those two keys and get back a data structure that contains the exact same keys, but only the structure under the key 0x666f6f contains any columns.

When I do a simple get with the first key, I get a NotFoundException. The second key works fine.

Doing get_range_slices, I get back two KeySlices; the keys are exactly the same as the ones I wrote, and both have their columns.

If I run sstablekeys on the datafile, it prints out:
e695b0e69982e99693
666f6f

If I run sstable2json on the datafile, it prints out:
{
"e695b0e69982e99693": [["00", "01", 1304519723589, false]],
"666f6f": [["00", "01", 1304519721274, false]]
}


After that I re-inserted a row with the first key and then ran my tests again. Now both single gets work fine, multi_get_slice works fine, but get_range_slices returns a structure with three keys:
0xe695b0e69982e99693
0xe695b0e69982e99693
0x666f6f

I restarted Cassandra to make it flush the commitlog, and my data directory now has two data files. When I run sstablekeys on the first one it still prints out:
e695b0e69982e99693
666f6f

And running it on the second datafile makes it print out:
e695b0e69982e99693


After all that, I forced a compaction with nodetool and restarted the server, ending up with a single datafile. When I run sstable2json on that, it prints out:
{
"e695b0e69982e99693": [["00", "01", 1304519723589, false]],
"e695b0e69982e99693": [["00", "02", 1304521931818, false]],
"666f6f": [["00", "01", 1304519721274, false]]
}
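
A dump like the one above is easy to misread with ordinary JSON tooling, because most parsers keep only the last value when an object repeats a key. The following is a minimal sketch of one way to spot such duplicates with Python's standard json module. The helper name duplicate_keys is hypothetical, and the dump text is copied from the output above.

```python
import json
from collections import Counter

# sstable2json output after compaction, copied from the message above.
dump = """
{
"e695b0e69982e99693": [["00", "01", 1304519723589, false]],
"e695b0e69982e99693": [["00", "02", 1304521931818, false]],
"666f6f": [["00", "01", 1304519721274, false]]
}
"""


def duplicate_keys(json_text: str) -> dict:
    """Return {key: count} for every top-level key that occurs more than once."""
    # object_pairs_hook receives the (key, value) pairs of each JSON object in
    # document order, so repeated keys are still visible when we count them.
    pairs = json.loads(json_text, object_pairs_hook=list)
    counts = Counter(key for key, _ in pairs)
    return {key: n for key, n in counts.items() if n > 1}


print(duplicate_keys(dump))  # {'e695b0e69982e99693': 2}
```

A plain json.loads(dump) would return a dict with only two entries, silently hiding the row that was written under 0.6.13.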

So I now have an SSTable with two rows with identical keys, except one of the rows doesn't really work? So, now what? And how did I end up in this state?
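
The thread does not establish the cause, and it does not say which partitioner or which encodings were involved. Purely as an illustration of the "change in the encoding" hypothesis from the quoted reply, the sketch below assumes a hash-based partitioner such as RandomPartitioner, which derives a row's token from an MD5 of the key. It shows that the same string hashes differently once its byte encoding changes, while an ASCII-only key is unaffected by many such changes. The encodings chosen and the helper name md5_by_encoding are assumptions for the example, not what either Cassandra version actually does.

```python
import hashlib


def md5_by_encoding(text: str, encodings) -> dict:
    """Map each encoding name to the MD5 hex digest of text in that encoding."""
    return {enc: hashlib.md5(text.encode(enc)).hexdigest() for enc in encodings}


# The non-ASCII key from the message, recovered from its hex form.
unicode_key = bytes.fromhex("e695b0e69982e99693").decode("utf-8")

# Hypothetical pair of encodings: different bytes in, different hash out.
for enc, digest in md5_by_encoding(unicode_key, ("utf-8", "utf-16-be")).items():
    print("unicode key", enc, digest)

# "foo" is pure ASCII, so UTF-8 and Latin-1 produce identical bytes and hashes.
for enc, digest in md5_by_encoding("foo", ("utf-8", "latin-1")).items():
    print("ascii key  ", enc, digest)
```

If something along these lines happened, a row could sit in the SSTable at a position derived from the old byte form of its key: a range scan would still return it, while a lookup computed from the new byte form would miss it. That would match the symptoms described, but it remains an unverified guess.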


/Henrik Schröder


On Tue, May 3, 2011 at 22:10, aaron morton <aaron@thelastpickle.com> wrote:

> Can you provide some details of the data returned when you do the
> get_range()? It will be interesting to see the raw bytes returned for
> the keys. The likely culprit is a change in the encoding. Can you also
> try to grab the bytes sent for the key when doing the single select that
> fails.
>
> You can grab these either on the client and/or by turning the logging
> up to DEBUG in conf/log4j-server.properties
>
> Thanks
> Aaron
>
> On 4 May 2011, at 03:19, Henrik Schröder wrote:
>
> > The way we solved this problem is that it turned out we had only a few
> > hundred rows with unicode keys, so we simply extracted them, upgraded to
> > 0.7, and wrote them back. However, this means that among the rows, there
> > are a few hundred weird duplicate rows with identical keys.
> >
> > Is this going to be a problem in the future? Is there a chance that the
> > good duplicate is cleaned out in favour of the bad duplicate so that we
> > suddenly lose those rows again?
> >
> >
> > /Henrik Schröder

