Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <fe5757dd-910c-4a18-b5b1-c4f19c960967@klap>
References: 
 <CAL+hLkQe8Pv5-WxS9kgaPC62z2ZM42v8TMdaFRb6m7iJDrNXUw@mail.gmail.com>
	<fe5757dd-910c-4a18-b5b1-c4f19c960967@klap>
Date: Thu, 25 Aug 2011 16:51:56 -0400
Message-ID: 
 <CAL+hLkR+U0npqKbtTEs_bbqjTub7vb6MO31zkc_deKv4TH9fbw@mail.gmail.com>
Subject: Re: Customized Secondary Index Schema
From: Ed Anuff <ed@anuff.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0015174fef6c6ef01f04ab5a9961

--0015174fef6c6ef01f04ab5a9961
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Agreed, that's what I meant by "there are a lot of simple ways to split it
up over multiple rows", assuming it's necessary.

On Thu, Aug 25, 2011 at 4:24 PM, Konstantin Naryshkin
<konstantinn@a-bb.net> wrote:

> Why are you keeping all your indexes in the same row? We do a similar thing
> (maintain several indexes over the same data) and we just have an index
> column family with keys like "dest192.168.0.1" which means destination index
> of 192.168.0.1. You can do rows like User_Keys_By_Last_Name_adams and
> User_Keys_By_Last_Name_alden. You can keep the matching main column family
> key as the column name. This will ensure that your index is evenly
> distributed throughout your cluster.
>
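
A minimal sketch of the layout described above, in case it helps to see it
spelled out. It models the index column family as an in-memory dict rather
than using a real Cassandra client, and the node placement is a toy
hash-modulo stand-in, not Cassandra's actual token ring. NODE_COUNT, the
helper names, and the "user-000N" keys are all hypothetical.

```python
import hashlib
from collections import defaultdict

NODE_COUNT = 4  # hypothetical cluster size, for illustration only


def index_row_key(index_name, value):
    """Build one index row key per indexed value, e.g. User_Keys_By_Last_Name_adams."""
    return f"{index_name}_{value}"


def node_for(row_key, node_count=NODE_COUNT):
    """Toy placement: hash the row key and take it modulo the node count.

    This is a simplified stand-in for hash-based partitioning, not
    Cassandra's actual token ring.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


# In-memory model of the index column family:
# row key -> {column name (main CF key): empty column value}
index_cf = defaultdict(dict)


def add_to_index(index_name, value, main_key):
    """Store the main CF key as a column name in the row for this value."""
    index_cf[index_row_key(index_name, value)][main_key] = b""


def lookup(index_name, value):
    """Return the main CF keys stored as column names in one index row."""
    return sorted(index_cf.get(index_row_key(index_name, value), {}))


# Hypothetical main CF keys; the keys shown in the thread are truncated.
add_to_index("User_Keys_By_Last_Name", "adams", "user-0001")
add_to_index("User_Keys_By_Last_Name", "adams", "user-0003")
add_to_index("User_Keys_By_Last_Name", "alden", "user-0002")

print(lookup("User_Keys_By_Last_Name", "adams"))
for row_key in sorted(index_cf):
    print(row_key, "-> node", node_for(row_key))
```

Because each indexed value gets its own row key, placement is decided per
value rather than once for the whole index. The trade-off is that a range
scan over last names now touches many rows instead of one.
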
> ----- Original Message -----
> From: "Ed Anuff" <ed@anuff.com>
> To: user@cassandra.apache.org
> Sent: Thursday, August 25, 2011 12:48:49 PM
> Subject: Re: Customized Secondary Index Schema
>
> How many unique last names do you anticipate having? How many characters in
> the last name do you anticipate keeping in your index? You can easily do the
> math to figure out how many you could fit on a node. I think you'll find
> that the ceiling might be quite a bit higher than you think. If you have
> over a couple of hundred million users it might not be the best approach.
> There are a lot of very simple ways to split it up over multiple rows. As is
> the case with most things regarding Cassandra, the off-the-cuff assumptions
> only get you so far before you have to do some math and do some tests.
>
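
The "do the math" step above can be sketched roughly as follows. This is a
back-of-the-envelope estimate only. The 10-byte average name, the 16-byte
value (a raw UUID), and the 15 bytes of per-column overhead are assumptions
picked for illustration, not measured figures; the 200 million entries
correspond to the "couple of hundred million users" mentioned above.

```python
def estimate_index_row_bytes(entries, avg_name_bytes, value_bytes,
                             per_column_overhead=15):
    """Rough size of one wide index row: one column per indexed entry.

    per_column_overhead is an assumed figure for column bookkeeping
    (lengths, flags, timestamp); measure it on your own version.
    """
    return entries * (avg_name_bytes + value_bytes + per_column_overhead)


# Hypothetical inputs, chosen only to make the arithmetic concrete.
entries = 200_000_000   # one column per user
avg_name_bytes = 10     # average last-name length kept in the index
value_bytes = 16        # a UUID stored as raw bytes

total = estimate_index_row_bytes(entries, avg_name_bytes, value_bytes)
print(f"{total / 1e9:.1f} GB before replication and compaction overhead")
```

Plugging in your own numbers and comparing the result against the disk and
memory on a single node gives a first idea of where the ceiling sits; it
does not replace testing with real data.
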
> As I mentioned in my talk, for simple use cases like this, you probably
> should just start with the built-in secondary indexes, but I assume you
> have already explored those.
>
> Ed
>
>
> On Thu, Aug 25, 2011 at 9:27 AM, Alvin UW < alvinuw@gmail.com > wrote:
>
>
> Yes, this is what I am worrying about.
>
>
> 2011/8/24 Ryan King < ryan@twitter.com >
>
> On Tue, Aug 23, 2011 at 10:03 AM, Alvin UW < alvinuw@gmail.com > wrote:
> > Hello,
> >
> > As mentioned by Ed Anuff in his blog and slides, one way to build a
> > customized secondary index is:
> > We use one CF, with each row representing a secondary index and the
> > secondary index name as the row key.
> > For example,
> >
> > Indexes = {
> > "User_Keys_By_Last_Name" : {
> > "adams" : "e5d61f2b-…",
> > "alden" : "e80a17ba-…",
> > "anderson" : "e5d61f2b-…",
> > "davis" : "e719962b-…",
> > "doe" : "e78ece0f-…",
> > "franks" : "e66afd40-…",
> > … : …,
> > }
> > }
> >
> > But the whole secondary index ends up on a single node, because of the
> > row key.
> > All the queries against this secondary index will go to this node. Of
> > course, there are some replica nodes.
> >
> > Do you think this is a scalability problem, or is there a better way to
> > solve it?
>
> It's certainly a scalability problem in that this solution has a hard
> ceiling (this index can't get larger than the capacity of any single
> node). It will probably work on small datasets, but if your dataset is
> small then why are you using Cassandra?
>
> -ryan
>
>
>

--0015174fef6c6ef01f04ab5a9961--