cassandra-dev mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: secondary index support in Cassandra
Date Tue, 24 Mar 2009 17:48:30 GMT
This adds a lot of complexity but I definitely see people wanting easy
indexing out of the box.  So +1 in principle.

A few high-level comments:

First, for maximum flexibility, you probably want to allow indexes to
be defined in code.  That is, you'd define something like

  <ColumnFamily name="foo">
    <Index generator="com.ibm.cassandra.indexGenerator"/>
  </ColumnFamily>

and allow index generators to be loaded at runtime.  Nobody else is
going to need the specific case of
hash(rowkey):attribute1:attribute2:rowkey so abstract that out and
make it pluggable for whatever weird-ass requirements people have.
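
A minimal sketch of what such a pluggable generator could look like. The
IndexGenerator interface, the HashPrefixedIndexGenerator class and the
IndexGenerators.load helper are all hypothetical names chosen for
illustration, not existing Cassandra APIs, and the sketch assumes a
generator exposes a public no-argument constructor so it can be loaded by
class name.

```java
import java.util.List;
import java.util.Map;

// Hypothetical plug-in interface; not part of Cassandra.
interface IndexGenerator {
    // Builds the index key for one row from its row key and column values.
    String indexKey(String rowKey, Map<String, String> columns);
}

// One possible generator, reproducing the
// hash(rowkey):attribute1:attribute2:rowkey layout from the proposal.
final class HashPrefixedIndexGenerator implements IndexGenerator {
    private final List<String> attributes;

    HashPrefixedIndexGenerator(List<String> attributes) {
        this.attributes = List.copyOf(attributes);
    }

    @Override
    public String indexKey(String rowKey, Map<String, String> columns) {
        // String.hashCode() stands in for whatever hash the partitioner uses.
        StringBuilder key = new StringBuilder(Integer.toHexString(rowKey.hashCode()));
        for (String attribute : attributes) {
            // Missing attributes are indexed as the empty string (an assumption).
            key.append(':').append(columns.getOrDefault(attribute, ""));
        }
        return key.append(':').append(rowKey).toString();
    }
}

final class IndexGenerators {
    // Loads a generator by class name, the way an <Index generator="..."/>
    // entry might be resolved at startup. Requires a no-argument constructor.
    static IndexGenerator load(String className) throws ReflectiveOperationException {
        return Class.forName(className)
                .asSubclass(IndexGenerator.class)
                .getDeclaredConstructor()
                .newInstance();
    }
}
```

With this shape, the server only ever calls indexKey(); the key layout
becomes the generator's business.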

Second, I'm not a fan of queries by parsing strings.  The whole rdbms
world has been moving _away_ from SQL and towards OO interfaces for
the last 10 years.  I like the thrift API for this reason.  (It is a
little clunky in Java, but _everything_ is a little clunky in Java.
Much better in Python/Ruby/etc.)
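
To make the contrast concrete, here is a rough sketch of the query quoted
below expressed as an object instead of a string. IndexQuery and its
builder methods are invented for illustration; nothing here is taken from
the actual thrift API.

```java
// Hypothetical query object; none of these names exist in Cassandra.
final class IndexQuery {
    final String groupId;
    final String filterAttribute;
    final String filterValue;
    final String orderBy;
    final boolean descending;
    final int limit;

    private IndexQuery(Builder b) {
        this.groupId = b.groupId;
        this.filterAttribute = b.filterAttribute;
        this.filterValue = b.filterValue;
        this.orderBy = b.orderBy;
        this.descending = b.descending;
        this.limit = b.limit;
    }

    // Entry point: every query is scoped to one group, hence one node.
    static Builder forGroup(String groupId) {
        return new Builder(groupId);
    }

    static final class Builder {
        private final String groupId;
        private String filterAttribute;
        private String filterValue;
        private String orderBy;
        private boolean descending;
        private int limit = Integer.MAX_VALUE; // no limit unless one is set

        private Builder(String groupId) {
            this.groupId = groupId;
        }

        Builder whereEquals(String attribute, String value) {
            this.filterAttribute = attribute;
            this.filterValue = value;
            return this;
        }

        Builder orderByDescending(String attribute) {
            this.orderBy = attribute;
            this.descending = true;
            return this;
        }

        Builder limit(int maxRows) {
            this.limit = maxRows;
            return this;
        }

        IndexQuery build() {
            return new IndexQuery(this);
        }
    }
}
```

The example query would then read
IndexQuery.forGroup(groupId).whereEquals("attributeX", "x")
.orderByDescending("attributeY").limit(50).build(), with no parser
involved.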

Finally, as an implementation detail, Cassandra already does too much
in-memory when writing and merging sstables.  Don't make it worse. :)

-Jonathan

P.S. the partitioner abstraction layer in CASSANDRA-3 will allow you
to do the per-node grouping you want without weird contortions.

On Tue, Mar 24, 2009 at 11:21 AM, Jun Rao <junrao@almaden.ibm.com> wrote:
> To address the above problems, we are thinking of the following new
> implementation. Each entity is mapped to a row in Cassandra and uses a
> two-part key (groupID, entityID). We use the groupID to hash an entity to a
> node. This way, all entities for a group will be collocated in the same
> node. We then define a special CF to serve as the secondary index. In the
> definition, we specify which entity attributes need to be indexed and in
> what order. Within a node, this special CF will index all rows stored
> locally. Every time we insert a new entity, the server automatically
> extracts the index key based on the index definition (for example, the
> index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey")
> and adds the index entry to the special CF. We can then access the entities
> using an extended version of the query language in Cassandra. For example,
> if we issue the following query and there is an index defined by
> (attributeX, attributeY), the query can be evaluated using the index in the
> special CF. (Note that AppEngine supports this flavor of queries.)
>
> select attributeZ
> from ROWS(HASH = hash(groupID))
> where attributeX="x"
> order by attributeY desc
> limit 50
>
> We are in the middle of prototyping this approach. We'd like to hear if
> other people are interested in this too or if people think there are better
> alternatives.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
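
The index-backed evaluation described in the quoted proposal can be
sketched with an in-memory sorted map standing in for the special CF.
LocalSecondaryIndex is a hypothetical class, not the prototype's code. It
assumes index keys of the form hash(groupID):attributeX:attributeY:rowkey,
that attribute values never contain ':', and that attributeY sorts
correctly as a plain string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for the per-node index CF: index key -> row key.
final class LocalSecondaryIndex {
    private final NavigableMap<String, String> entries = new TreeMap<>();

    // Called on every entity insert to add the matching index entry.
    void insert(String groupId, String attributeX, String attributeY, String rowKey) {
        entries.put(prefix(groupId, attributeX) + attributeY + ':' + rowKey, rowKey);
    }

    // Evaluates: where attributeX = value, order by attributeY desc, limit N.
    // Returns row keys; the selected columns would be read from those rows.
    List<String> query(String groupId, String attributeX, int limit) {
        String prefix = prefix(groupId, attributeX);
        // All keys sharing the prefix form one contiguous range in sort order.
        NavigableMap<String, String> range =
                entries.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
        List<String> rowKeys = new ArrayList<>();
        for (String rowKey : range.descendingMap().values()) {
            if (rowKeys.size() == limit) {
                break;
            }
            rowKeys.add(rowKey);
        }
        return rowKeys;
    }

    private static String prefix(String groupId, String attributeX) {
        // String.hashCode() stands in for the real partitioning hash.
        return Integer.toHexString(groupId.hashCode()) + ':' + attributeX + ':';
    }
}
```

Because the equality attribute comes first in the key and the sort
attribute second, the query becomes a single range scan read backwards,
which is why the order of attributes in the index definition matters.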
