Date: Tue, 11 May 2010 17:03:21 +0300
Subject: Re: Is multiget_slice performant when you're looking for lots of keys?
From: David Boxenhorn
To: user@cassandra.apache.org

I have a similar issue, but I can't create a CF per type, because types are an open-ended set in my case (they are geographical locations). So I wanted to have one CF for types, and a supercolumn for each type, with the keys as columns per supercolumn.

Is it a problem for me to have millions of columns in a supercolumn?

On Tue, May 11, 2010 at 4:29 PM, Jonathan Ellis wrote:
> Multiget performs in O(N) with the number of rows requested. So will
> range scanning.
>
> If you want to query millions of records of one type, I would create a
> CF per type and use Hadoop to parallelize the computation.
>
> On Fri, May 7, 2010 at 6:16 PM, James wrote:
> > Hi all,
> > Apologies if I'm still stuck in RDBMS mentality - first project using
> > Cassandra!
> > I'll be using Cassandra to store quite a lot (10s of millions) of records,
> > each of which has a type.
> > I'll want to query the records to get all of a certain type; it's an
> > analogous situation to the TaggedPosts schema from Arin's blog post
> > (http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
> > The thing is, each type (or tag) row key will be pointing at millions of
> > records. I know I can use multiget_slice with all those record IDs as one
> > request, but is this The Right Way of "filtering" a large column family by
> > type?
> > Coming from an RDBMS-ingrained mindset, it seems kind of awkward...
> > Thanks!
> > James
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
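To make the cost discussed above concrete, here is a minimal sketch in plain Python, not Cassandra client code. It assumes a type index (type name to list of record keys) and a hypothetical `multiget` callable standing in for whatever multiget_slice wrapper a real client offers; `batched`, `fetch_by_type`, and `batch_size` are illustrative names, not part of any Cassandra API. Total work stays linear in the number of keys, as described in the thread; splitting the keys into batches is an added assumption that only bounds the size of each individual request.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]


def batched(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    it = iter(keys)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fetch_by_type(
    index: Dict[str, List[str]],
    record_type: str,
    multiget: Callable[[List[str]], Dict[str, Row]],
    batch_size: int = 1000,
) -> Iterator[Tuple[str, Row]]:
    """Fetch every record of one type, one bounded batch at a time.

    `multiget` is a hypothetical stand-in for a client's multiget_slice
    call: it takes a list of row keys and returns {key: row}.
    The number of keys fetched overall is still N, so the cost is O(N).
    """
    for chunk in batched(index.get(record_type, []), batch_size):
        yield from multiget(chunk).items()
```

With 2,500 keys indexed under one type and `batch_size=1000`, this issues three calls to `multiget` (1000, 1000, and 500 keys) instead of a single request carrying every key.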