From: Ed Anuff
To: user@cassandra.apache.org
Date: Thu, 25 Aug 2011 16:51:56 -0400
Subject: Re: Customized Secondary Index Schema

Agreed, that's what I meant by "there are a lot of simple ways to split it up over multiple rows", assuming it necessary.

On Thu, Aug 25, 2011 at 4:24 PM, Konstantin Naryshkin <konstantinn@a-bb.net> wrote:
Why are you keeping all your indexes in the same row? We do a similar thing (maintain several indexes over the same data) and we just have an index column family with keys like "dest192.168.0.1" which means destination index of 192.168.0.1. You can do rows like User_Keys_By_Last_Name_adams and User_Keys_By_Last_Name_alden. You can keep the matching main column family key as the column name. This will ensure that your index is evenly distributed throughout your cluster.
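
A minimal sketch of the row-per-value layout described above, using an in-memory dict in place of a real column family. The names `index_row_key`, `toy_node_for`, `add_to_index`, `lookup`, the 4-node cluster, and the `user-000N` keys are all hypothetical, and the MD5-modulo placement is only a toy stand-in for Cassandra's token ranges:

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size, for illustration only


def index_row_key(index_name, value):
    """Build one index row key per indexed value, e.g. User_Keys_By_Last_Name_adams."""
    return f"{index_name}_{value}"


def toy_node_for(row_key, num_nodes=NUM_NODES):
    """Toy placement: hash the row key and take it modulo the node count.

    This is an assumption made for illustration; a real cluster assigns
    rows to nodes by token range, not by a simple modulo.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


# Stand-in for the index column family: row key -> {column name: column value}.
index_cf = defaultdict(dict)


def add_to_index(index_name, value, main_key):
    """Store the main column family key as the column name; the value is unused."""
    index_cf[index_row_key(index_name, value)][main_key] = ""


def lookup(index_name, value):
    """Return the main column family keys indexed under the given value."""
    return sorted(index_cf.get(index_row_key(index_name, value), {}))


add_to_index("User_Keys_By_Last_Name", "adams", "user-0001")
add_to_index("User_Keys_By_Last_Name", "adams", "user-0002")
add_to_index("User_Keys_By_Last_Name", "alden", "user-0003")

for row_key in sorted(index_cf):
    print(row_key, "-> toy node", toy_node_for(row_key))
```

Because each last name gets its own row key, the rows hash independently, so no single node has to hold the whole index.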

----- Original Message -----
From: "Ed Anuff" <ed@anuff.com>
To: user@cassandra.apache.org
Sent: Thursday, August 25, 2011 12:48:49 PM
Subject: Re: Customized Secondary Index Schema

How many unique last names do you anticipate having? How many characters in the last name do you anticipate keeping in your index? You can easily do the math to figure out how many you could fit on a node. I think you'll find that the ceiling might be quite a bit higher than you think. If you have over a couple of hundred million users it might not be the best approach. There are a lot of very simple ways to split it up over multiple rows. As is the case with most things regarding Cassandra, the off-the-cuff assumptions only get you so far before you have to do some math and do some tests.
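
The kind of back-of-the-envelope math suggested above could look roughly like the sketch below. Every number in it is an assumption to replace with your own: the 15-byte per-column overhead, the 10-byte average last name, the 16-byte user key, and the 200 million entries. The function name `estimate_index_row_bytes` is hypothetical as well.

```python
# Assumed per-column storage overhead in bytes (timestamp, lengths, flags).
# Treat this as a placeholder and measure on your own cluster.
COLUMN_OVERHEAD_BYTES = 15


def estimate_index_row_bytes(num_entries, avg_name_bytes, key_bytes,
                             overhead=COLUMN_OVERHEAD_BYTES):
    """Rough size of one index row: one column per indexed entry.

    Each column is assumed to cost its name (the last name), its value
    (the user key), and a fixed overhead. Replication, compaction
    headroom, and on-disk index structures are deliberately ignored.
    """
    return num_entries * (avg_name_bytes + key_bytes + overhead)


# Hypothetical workload: 200 million entries, 10-byte names, 16-byte keys.
size_bytes = estimate_index_row_bytes(200_000_000, 10, 16)
print(f"Estimated single-row index size: {size_bytes / 1e9:.1f} GB")
```

Comparing that figure with the usable disk per node gives a first idea of where the single-row ceiling sits; it is no substitute for testing with real data.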

As I mentioned in my talk, for simple use cases like this, you probably should just start with the built-in secondary indexes, but I assume you have already explored those.

Ed


On Thu, Aug 25, 2011 at 9:27 AM, Alvin UW <alvinuw@gmail.com> wrote:


Yes, this is what I am worrying about.


2011/8/24 Ryan King <ryan@twitter.com>

On Tue, Aug 23, 2011 at 10:03 AM, Alvin UW <alvinuw@gmail.com> wrote:
> Hello,
>
> As mentioned by Ed Anuff in his blog and slides, one way to build customized
> secondary index is:
> We use one CF, each row to represent a secondary index, with the secondary
> index name as row key.
> For example,
>
> Indexes = {
>   "User_Keys_By_Last_Name" : {
>     "adams" : "e5d61f2b-…",
>     "alden" : "e80a17ba-…",
>     "anderson" : "e5d61f2b-…",
>     "davis" : "e719962b-…",
>     "doe" : "e78ece0f-…",
>     "franks" : "e66afd40-…",
>     … : …,
>   }
> }
>
> But the whole secondary index is partitioned into a single node, because of
> the row key.
> All the queries against this secondary index will go to this node. Of
> course, there are some replica nodes.
>
> Do you think this is a scalability problem, or any better solution to solve
> it?

It's certainly a scalability problem in that this solution has a hard
ceiling (this index can't get larger than the capacity of any single
node). It will probably work on small datasets, but if your dataset is
small then why are you using Cassandra?

-ryan


