cassandra-user mailing list archives

From Ed Anuff ...@anuff.com>
Subject Re: Homebrew CF-indexing vs secondary indexing
Date Fri, 25 Feb 2011 20:43:32 GMT
At the risk of recapitulating a conversation that happens with some
frequency on this list, the answer boils down to "it depends on your
data model".  That said, using rows as indexes is one of the core usage
patterns of Cassandra, whether you store the list of row keys as column
names in another column family or build inverted indexes.  That's why
columns are sorted and can be retrieved cheaply by range.  If you're
building test instances, you'll find out what's best for your particular
application pretty quickly.  I think the best advice I've ever seen on
this list about how to do something with Cassandra has been "do a test
with both and see what happens", and of course, share what you find with
the rest of us :)
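
As a rough illustration of the "row as index" pattern described above, here is a minimal in-memory sketch in Python. It is not Cassandra or Hector client code: the IndexRow class, set_value, lookup, and the "\uffff" upper-bound sentinel are all hypothetical names and assumptions made for the example, and the category codes are only borrowed from the test output quoted below as sample values. It models one wide index row whose sorted column names are (indexed value, row key) pairs, a range read over those names in the spirit of get_slice, and the extra index write the client has to do whenever an indexed value changes.

```python
import bisect


class IndexRow:
    """Hypothetical model of one wide row used as an index.

    Column names are (indexed_value, row_key) tuples kept in sorted
    order, standing in for columns sorted by a comparator.
    """

    def __init__(self):
        self._names = []

    def insert(self, name):
        i = bisect.bisect_left(self._names, name)
        if i == len(self._names) or self._names[i] != name:
            self._names.insert(i, name)

    def remove(self, name):
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            del self._names[i]

    def slice(self, start, finish, count=100):
        """Return up to `count` names with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return self._names[lo:hi][:count]


def set_value(data, index, row_key, new_value):
    """Write a value and keep the client-managed index in step."""
    old_value = data.get(row_key)
    if old_value is not None:
        # The stale index entry has to be removed by the client.
        index.remove((old_value, row_key))
    data[row_key] = new_value
    index.insert((new_value, row_key))


def lookup(index, value):
    """Find row keys for `value` with a single range read on the index row.

    Assumes every row key sorts below the "\uffff" sentinel.
    """
    names = index.slice((value, ""), (value, "\uffff"))
    return [row_key for _, row_key in names]


data, index = {}, IndexRow()
set_value(data, index, "item1", "THS")
set_value(data, index, "item2", "TRS")
set_value(data, index, "item3", "THS")
print(lookup(index, "THS"))

set_value(data, index, "item1", "TRS")  # re-indexing costs a remove plus an insert
print(lookup(index, "THS"))
```

The point of the sketch is only that the lookup is one contiguous range read over sorted names, while every change to an indexed value costs the client an additional remove and insert on the index row. How that trade-off plays out in a real cluster is exactly what the tests discussed in this thread would have to show.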


On Fri, Feb 25, 2011 at 12:10 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> Does it mean that we should design the data model such that row keys
> actually become columns (and create a secondary index) so that data
> retrieval is faster? I am soon setting up big test instances to test
> all this.
>
> On Fri, Feb 25, 2011 at 11:18 AM, Ed Anuff <ed@anuff.com> wrote:
> > It's nice to see some testing in this regard; however, it's worth
> > pointing out something that gets lost in CF index vs secondary index
> > discussions.  What you're really proving is that get_slice (across
> > columns) is faster than get_indexed_slices (across keys).  For up to a
> > certain size (and it would be nice if there were some empirical testing
> > to determine what that size is), get_slice should be one of the most
> > performant operations Cassandra can do.  CF index approaches are
> > basically all about getting your data into a situation where you can
> > use get_slice to quickly perform the search.  The reasons for using
> > Cassandra's built-in secondary index support, IMHO, are that (1) it's
> > easy to use whereas CF indexes are managed by the client and (2)
> > there's concern about how large an index you'd be able to effectively
> > store in a CF index row.  The first point is more about Cassandra being
> > easier for newcomers; the latter point is something I'd like to see
> > some more data around.  Maybe you want to run your tests up to much
> > larger sizes and see if there's a point where the results change?
> > FWIW, I recently switched back to CF-based indexes from secondary
> > indexes, largely for the flexibility in the types of queries that
> > became possible, but it's nice to see there's some performance benefit.
> > The other thing that would be good to look at is timing the overhead of
> > what it takes to update your index as you change the values that are
> > being indexed.
> >
> >
> >
> > On Fri, Feb 25, 2011 at 10:23 AM, Ron Siemens <rsiemens@greatergood.com>
> > wrote:
> >>
> >> I updated the Cassandra version in the Hector package from 0.7.0 to
> >> 0.7.2.  The occasional slow-down in the CF index went away.  I then
> >> upped the heap to 512MB, and the secondary indexing then works.  Seems
> >> awfully memory hungry for my small dataset.  Even the CF index was
> >> faster with more heap.  These are the times with Cassandra-0.7.2 and
> >> 512M heap.  Slightly different testing: I'm varying the index used,
> >> which gives different data size results.  It still surprises me that
> >> the CF index does substantially better.
> >>
> >> Secondary Index
> >>
> >> DEBUG Retrieved THS / 7293 rows, in 1051 ms
> >> DEBUG Retrieved TRS / 7289 rows, in 1448 ms
> >> DEBUG Retrieved BCS / 7788 rows, in 1553 ms
> >> DEBUG Retrieved ARS / 7426 rows, in 1479 ms
> >> DEBUG Retrieved CHS / 7290 rows, in 1575 ms
> >> DEBUG Retrieved MS / 4523 rows, in 766 ms
> >> DEBUG Retrieved PRS / 562 rows, in 40 ms
> >> DEBUG Retrieved GGF / 1162 rows, in 122 ms
> >> DEBUG Retrieved VET / 7313 rows, in 1193 ms
> >> DEBUG Retrieved AUT / 7287 rows, in 1746 ms
> >> DEBUG Retrieved LIT / 7291 rows, in 1331 ms
> >>
> >> CF Index
> >>
> >> DEBUG Retrieved THS / 7293 rows, in 17 + 759 ms
> >> DEBUG Retrieved TRS / 7289 rows, in 19 + 734 ms
> >> DEBUG Retrieved BCS / 7788 rows, in 23 + 736 ms
> >> DEBUG Retrieved ARS / 7426 rows, in 23 + 1448 ms
> >> DEBUG Retrieved CHS / 7290 rows, in 18 + 638 ms
> >> DEBUG Retrieved MS / 4523 rows, in 32 + 622 ms
> >> DEBUG Retrieved PRS / 562 rows, in 2 + 50 ms
> >> DEBUG Retrieved GGF / 1162 rows, in 3 + 79 ms
> >> DEBUG Retrieved VET / 7313 rows, in 17 + 686 ms
> >> DEBUG Retrieved AUT / 7287 rows, in 17 + 758 ms
> >> DEBUG Retrieved LIT / 7291 rows, in 17 + 745 ms
> >>
> >> On Feb 24, 2011, at 3:39 PM, Ron Siemens wrote:
> >>
> >> >
> >> > I failed to mention: this is just doing repeated data retrievals using
> >> > the index.
> >> >
> >> >> ...
> >> >>
> >> >> Sample run: Secondary index.
> >> >>
> >> >> DEBUG Retrieved THS / 7293 rows, in 2012 ms
> >> >> DEBUG Retrieved THS / 7293 rows, in 1956 ms
> >> >> DEBUG Retrieved THS / 7293 rows, in 1843 ms
> >> > ...
> >> >
> >>
> >
> >
>
