incubator-cassandra-user mailing list archives

From "Konstantin Naryshkin" <>
Subject Re: Customized Secondary Index Schema
Date Thu, 25 Aug 2011 22:06:56 GMT
Well, you could group all the duplicate adams entries as columns in the same row. This has several advantages:
* One: I am not sure which partitioner you plan to use, but if you plan to do key range queries
over all the same last names, you cannot use RandomPartitioner, since it does not support
key ranges properly.
* Two: since you will always be looking up all the adams entries together (since you do not know
which one you want), it makes sense to store them in a single row so that they are grouped together
on disk. One of the big disadvantages of having many columns in a row is that they must all
be read together. If you are going to query for all of them at the same time anyway, this
disadvantage does not apply to you.
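
A minimal sketch of this layout, using a plain Python dictionary as a stand-in for the index column family. The names index_cf, index_user and users_with_last_name, and the "user-0001" style keys, are hypothetical and are not part of any Cassandra client API:

```python
from collections import defaultdict

# In-memory stand-in for an index column family: row key -> {column name: value}.
# This only models the shape of the data, not Cassandra's storage engine.
index_cf = defaultdict(dict)

def index_user(last_name, user_key):
    # Row key is the last name; the column name is the base column family key.
    # The column value carries no information here, so it is left empty.
    index_cf[last_name.lower()][user_key] = b""

def users_with_last_name(last_name):
    # A single row read returns every matching user key, in sorted column order.
    return sorted(index_cf.get(last_name.lower(), {}))

index_user("Adams", "user-0002")
index_user("Adams", "user-0001")
index_user("Alden", "user-0003")
print(users_with_last_name("adams"))  # ['user-0001', 'user-0002']
```

Because every adams entry is a column of one row, the lookup is a single-row read rather than a range over several row keys.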

----- Original Message -----
From: "Alvin UW" <>
Sent: Thursday, August 25, 2011 5:11:07 PM
Subject: Re: Customized Secondary Index Schema

Assume I use this approach: use the last names as the row keys of the secondary index, and use
the base column family key as the column name.
There may be a duplicate key issue. We may solve it with a composite key, like "adams_1", "adams_2".
Then we can query the index with a range query starting with "adams_".
Am I right?
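
The prefix lookup over composite keys can be sketched as follows. This assumes the row keys are available in sorted order, as they would be under an order-preserving partitioner; under RandomPartitioner rows are ordered by the hash of the key, so a prefix range like this would not be contiguous. The function name prefix_range and the sample keys are hypothetical:

```python
import bisect

def prefix_range(sorted_keys, prefix):
    # Find the contiguous run of keys that start with the prefix.
    # "\uffff" serves as an upper bound that sorts after any ordinary suffix.
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

row_keys = sorted(["alden_1", "adams_2", "adamson_1", "adams_1"])
print(prefix_range(row_keys, "adams_"))  # ['adams_1', 'adams_2']
```

Including the trailing underscore in the prefix keeps "adamson_1" out of the result.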

I want to know: what is the cost difference between a range query and a slice query?
If I can use either composite key or composite column name, which one gives me less query

2011/8/25 Konstantin Naryshkin < >

Why are you keeping all your indexes in the same row? We do a similar thing (maintain several
indexes over the same data) and we just have an index column family with keys like "dest192.168.0.1",
which means the destination index of 192.168.0.1. You can do rows like User_Keys_By_Last_Name_adams
and User_Keys_By_Last_Name_alden. You can keep the matching main column family key as the
column name. This will ensure that your index is evenly distributed throughout your cluster.
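
A rough sketch of why this key scheme spreads the index out. RandomPartitioner places rows by the MD5 hash of the row key; the modulo step below is a deliberate simplification of token-range assignment, and NUM_NODES, index_row_key and node_for are hypothetical names used only for illustration:

```python
import hashlib

NUM_NODES = 4  # assumed cluster size, for illustration only

def index_row_key(index_name, value):
    # One index row per indexed value, e.g. "User_Keys_By_Last_Name_adams".
    return f"{index_name}_{value}"

def node_for(row_key, num_nodes=NUM_NODES):
    # Simplified placement: MD5 of the row key, modulo the node count.
    # A real cluster assigns token ranges to nodes instead of using modulo.
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

for name in ["adams", "alden", "anderson", "davis", "doe", "franks"]:
    key = index_row_key("User_Keys_By_Last_Name", name)
    print(key, "-> node", node_for(key))
```

With a single row key such as "User_Keys_By_Last_Name", every lookup hashes to the same node; with one row per last name, the rows hash independently.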

----- Original Message -----
From: "Ed Anuff" < >
Sent: Thursday, August 25, 2011 12:48:49 PM
Subject: Re: Customized Secondary Index Schema

How many unique last names do you anticipate having? How many characters in the last name
do you anticipate keeping in your index? You can easily do the math to figure out how many
you could fit on a node. I think you'll find that the ceiling might be quite a bit higher
than you think. If you have over a couple of hundred million users it might not be the best
approach. There are a lot of very simple ways to split it up over multiple rows. As is the
case with most things regarding Cassandra, the off-the-cuff assumptions only get you so far
before you have to do some math and do some tests.
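
The kind of back-of-the-envelope math meant here might look like the sketch below. Every number in it is an assumption chosen for illustration: a 10-character average last name, a 16-byte UUID value, and 15 bytes of per-column overhead. The function name index_row_bytes is hypothetical:

```python
def index_row_bytes(num_entries, avg_name_chars, value_bytes=16, overhead_bytes=15):
    # Estimated size of one index row: each entry is a column made of the
    # name, the value, and a fixed per-column overhead (all assumed figures).
    return num_entries * (avg_name_chars + value_bytes + overhead_bytes)

size = index_row_bytes(200_000_000, avg_name_chars=10)
print(f"{size / 1e9:.1f} GB")  # 8.2 GB
```

Comparing an estimate like this against the disk available on one node gives a first idea of where the ceiling is; tests on real data are still needed to confirm it.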

As I mentioned in my talk, for simple use cases like this, you probably should just start
with the built-in secondary indexes, but I assume you have already explored those.


On Thu, Aug 25, 2011 at 9:27 AM, Alvin UW < > wrote:

Yes, this is what I am worrying about.

2011/8/24 Ryan King < >

On Tue, Aug 23, 2011 at 10:03 AM, Alvin UW < > wrote:
> Hello,
> As mentioned by Ed Anuff in his blog and slides, one way to build customized
> secondary index is:
> We use one CF, each row to represent a secondary index, with the secondary
> index name as row key.
> For example,
> Indexes = {
> "User_Keys_By_Last_Name" : {
> "adams" : "e5d61f2b-…",
> "alden" : "e80a17ba-…",
> "anderson" : "e5d61f2b-…",
> "davis" : "e719962b-…",
> "doe" : "e78ece0f-…",
> "franks" : "e66afd40-…",
> … : …,
> }
> }
> But the whole secondary index is partitioned into a single node, because of
> the row key.
> All the queries against this secondary index will go to this node. Of
> course, there are some replica nodes.
> Do you think this is a scalability problem, or any better solution to solve
> it?

It's certainly a scalability problem in that this solution has a hard
ceiling (this index can't get larger than the capacity of any single
node). It will probably work on small datasets, but if your dataset is
small then why are you using Cassandra?

