From: Dave Viner
To: user@cassandra.apache.org
Date: Wed, 28 Jul 2010 17:23:17 -0700
Subject: Re: iterating over all rows keys gets duplicate key returns
In-Reply-To: <4C50B3C5.20005@digg.com>

Just as a followup, here's what seems to be the resolution:

1. 0.6.4 should fix this problem.
2. Using OPP as the DHT should solve it as well.
3. Prior to 0.6.4, when using RandomPartitioner as the DHT, there's no good
   way to guarantee that you see *all* row keys for a column family.

Strategies tried:

A. Iterate over the keys returned until the "start_key" is identical to the
   "last key returned". When start_key == last key returned, exit.
   -> Fails, since duplicate keys can appear anywhere, even as the last key
   returned.

B. Iterate over the keys returned, adding the keys to a hash table. When an
   iteration returns no new keys, assume that all keys have been seen and
   exit.
   -> This also fails, since a particular result set can be full of
   duplicates even though the iteration has not traversed the entire row-key
   spectrum.

Dave Viner

On Wed, Jul 28, 2010 at 3:48 PM, Rob Coli wrote:
> On 7/28/10 2:43 PM, Dave Viner wrote:
>
>> Hi all,
>>
>> I'm having a strange result in trying to iterate over all row keys for a
>> particular column family. The iteration works, but I see the same row
>> key returned multiple times during the iteration.
>>
>> I'm using cassandra 0.6.3, and I've put the code in use at
>>
>
> For those not playing along on IRC, this was determined to be caused by:
>
> http://issues.apache.org/jira/browse/CASSANDRA-1042
>
> Which is fixed in 0.6.4.
>
> =Rob
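
For anyone landing on this thread later, here is a rough sketch of the kind of paging loop the strategies above are built around. It is not the code from the original post. `fetch_page` is a hypothetical stand-in for whatever range query call your client exposes, `make_fake_store` is an in-memory substitute for a cluster, and the sketch assumes the start key is inclusive and that keys come back in a stable order (which, per the resolution above, should hold on 0.6.4 or with OPP). It would not rescue the pre-0.6.4 RandomPartitioner case, since it relies on exactly the ordering guarantee that was broken there.

```python
from typing import Callable, Iterator, List

# Hypothetical signature: returns up to `count` row keys in the store's
# iteration order, starting at `start_key` inclusive ("" means the beginning).
FetchPage = Callable[[str, int], List[str]]


def iter_row_keys(fetch_page: FetchPage, page_size: int = 100) -> Iterator[str]:
    """Yield every row key once, assuming pages come back in a stable order."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start_key = ""
    first_page = True
    while True:
        page = fetch_page(start_key, page_size)
        # start_key is inclusive, so every page after the first repeats it.
        keys = page if first_page else page[1:]
        for key in keys:
            yield key
        if len(page) < page_size:
            return  # a short page means the range is exhausted
        start_key = page[-1]
        first_page = False


def make_fake_store(keys: List[str]) -> FetchPage:
    """In-memory stand-in for a cluster that returns keys in a fixed order."""
    ordered = sorted(keys)

    def fetch_page(start_key: str, count: int) -> List[str]:
        begin = ordered.index(start_key) if start_key else 0
        return ordered[begin:begin + count]

    return fetch_page


if __name__ == "__main__":
    store = make_fake_store([f"key{i:03d}" for i in range(25)])
    for row_key in iter_row_keys(store, page_size=4):
        print(row_key)
```

The loop stops on a short page rather than on "start_key == last key returned" or "no new keys seen", which is a swapped-in termination rule and not either of the strategies tried above. It avoids the hash table from strategy B, but only because it trusts the ordering.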