incubator-cassandra-dev mailing list archives

From: Alexander Staubo
Subject: Thoughts on a possible query language
Date: Mon, 22 Jun 2009 18:12:34 GMT

Has anyone given thought to how an SQL-like query language could be
integrated into Cassandra?

I'm thinking of something which would let you evaluate a limited set
of relational select operators. For example:

  * first_name = 'Bob'
  * age > 32
  * created_at between '2009-08' and '2009-09'
  * employer_id in (34543, 13177, 9338)
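
To make the four operators above concrete, here is a rough Python
sketch of how they might be evaluated against a single row. Everything
in it is an assumption for illustration: a row is modelled as a plain
dict of column name to value, and the helper names (eq, gt, between,
in_, matches) are hypothetical, not part of Cassandra.

```python
from typing import Any, Callable, Dict, Iterable

Row = Dict[str, Any]               # assumption: a row is a flat dict
Predicate = Callable[[Row], bool]


def eq(column: str, value: Any) -> Predicate:
    """column = value"""
    return lambda row: column in row and row[column] == value


def gt(column: str, value: Any) -> Predicate:
    """column > value"""
    return lambda row: column in row and row[column] > value


def between(column: str, low: Any, high: Any) -> Predicate:
    """column between low and high (inclusive on both ends)"""
    return lambda row: column in row and low <= row[column] <= high


def in_(column: str, values: Iterable[Any]) -> Predicate:
    """column in (v1, v2, ...)"""
    allowed = set(values)
    return lambda row: column in row and row[column] in allowed


def matches(row: Row, predicates: Iterable[Predicate]) -> bool:
    """A row matches when every predicate holds (implicit AND)."""
    return all(p(row) for p in predicates)


# The four example conditions, combined with AND.
query = [
    eq("first_name", "Bob"),
    gt("age", 32),
    between("created_at", "2009-08", "2009-09"),  # compared as strings
    in_("employer_id", [34543, 13177, 9338]),
]

sample_row = {
    "first_name": "Bob",
    "age": 40,
    "created_at": "2009-08-15",
    "employer_id": 13177,
}
```

A row that lacks a referenced column simply fails the predicate here.
That is one possible choice; a real design would have to decide how
missing columns behave.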

First, is such functionality desired within the framework of
Cassandra, or do people prefer to keep this functionality in a
completely separate server component? There are pros and cons to
keeping queries inside Cassandra. I could enumerate them, but I would
like to hear other people's thoughts first.

An alternative to a text-based query syntax would be to borrow
CouchDB's concept of views [1]. In CouchDB, views are pre-defined
indexes which are populated by filtering data through a pair of
map/reduce functions, which are usually written in JavaScript. Views
are somewhat limited in expressiveness and flexibility, and do not
address all possible use cases, but they are very efficient to compute
and store, and are a fairly elegant system.
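
As a rough illustration of the view idea, the sketch below builds an
index by running documents through a map function and, optionally, a
reduce function. It is written in Python rather than JavaScript and is
not CouchDB's actual API: build_view, map_by_employer, and the sample
documents are hypothetical, and real views are maintained
incrementally rather than rebuilt from scratch.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Doc = Dict[str, Any]
MapFn = Callable[[Doc], Iterator[Tuple[Any, Any]]]
ReduceFn = Callable[[List[Any]], Any]


def build_view(docs: Iterable[Doc], map_fn: MapFn,
               reduce_fn: Optional[ReduceFn] = None) -> Dict[Any, Any]:
    """Run every document through map_fn and group emitted values by key.

    Without reduce_fn the view maps each key to the list of emitted
    values; with reduce_fn each list is collapsed to a single value.
    Keys are stored in sorted order, as an index would keep them.
    """
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    ordered = sorted(grouped.items())
    if reduce_fn is None:
        return dict(ordered)
    return {key: reduce_fn(values) for key, values in ordered}


def map_by_employer(doc: Doc) -> Iterator[Tuple[Any, Any]]:
    """Emit (employer_id, 1) for every document that has an employer."""
    if "employer_id" in doc:
        yield doc["employer_id"], 1


docs = [
    {"first_name": "Bob", "employer_id": 34543},
    {"first_name": "Alice", "employer_id": 34543},
    {"first_name": "Carol", "employer_id": 13177},
    {"first_name": "Dave"},  # no employer, so the map emits nothing
]

# Number of people per employer.
people_per_employer = build_view(docs, map_by_employer, sum)
```

The point of the design is that the query is fixed when the view is
defined, so answering it later is a key lookup rather than a scan.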

Some challenges come to mind:

Cassandra's distributed nature means that a node's queryable indexes
can/should only reference data in that same node's partition, and that
a query might have to be executed on multiple nodes. For performance,
the query processing needs to be parallelized and pipelined.
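
A minimal sketch of the fan-out part, assuming each node can evaluate
a predicate against its own partition only. The Node class and the
scatter_gather function are hypothetical stand-ins, threads stand in
for network calls, and pipelining of partial results is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


class Node:
    """Stand-in for one node holding only its own partition of the rows."""

    def __init__(self, name: str, rows: List[Row]) -> None:
        self.name = name
        self.rows = rows

    def local_query(self, predicate: Predicate) -> List[Row]:
        # A real node would consult a local index; this sketch scans.
        return [row for row in self.rows if predicate(row)]


def scatter_gather(nodes: List[Node], predicate: Predicate) -> List[Row]:
    """Send the predicate to every node in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # pool.map yields the partial results in the order of `nodes`.
        partials = pool.map(lambda node: node.local_query(predicate), nodes)
        return [row for partial in partials for row in partial]


node_a = Node("A", [{"first_name": "Bob", "age": 40},
                    {"first_name": "Alice", "age": 29}])
node_b = Node("B", [{"first_name": "Carol", "age": 51}])

older_than_32 = scatter_gather([node_a, node_b], lambda row: row["age"] > 32)
```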

Could a query planner/optimizer reduce the number of nodes required to
satisfy a query by looking at the distribution of column values across
nodes? For example, if the column "first_name" value "Foo" only occurs
on node A, there's no need to involve node B. But such knowledge
requires the maintenance of statistics on each node that cover all
known peers, and the statistics must be kept up to date to avoid
glaring consistency issues.
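
The pruning idea might look something like the sketch below. It
assumes, purely for illustration, that the planner keeps an exact set
of observed values per column for each peer; NodeStats and plan_nodes
are hypothetical names, and a real system would more likely use a
compact summary such as a Bloom filter or histogram.

```python
from typing import Any, Dict, List, Set


class NodeStats:
    """Per-node summary of which values have been seen for each column."""

    def __init__(self) -> None:
        self.values: Dict[str, Set[Any]] = {}

    def record(self, column: str, value: Any) -> None:
        self.values.setdefault(column, set()).add(value)

    def may_contain(self, column: str, value: Any) -> bool:
        seen = self.values.get(column)
        if seen is None:
            # No statistics for this column: the node cannot be ruled out.
            return True
        return value in seen


def plan_nodes(stats_by_node: Dict[str, NodeStats],
               column: str, value: Any) -> List[str]:
    """Return the names of nodes that might hold rows with column == value."""
    return sorted(name for name, stats in stats_by_node.items()
                  if stats.may_contain(column, value))


stats_a = NodeStats()
stats_a.record("first_name", "Foo")
stats_b = NodeStats()
stats_b.record("first_name", "Bar")
cluster_stats = {"A": stats_a, "B": stats_b}

nodes_for_foo = plan_nodes(cluster_stats, "first_name", "Foo")
```

Note the failure mode: if node B later stores a row with first_name
"Foo" and the planner's copy of B's statistics has not caught up, the
query silently skips B. That is the staleness problem described above.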

Given the nature of Cassandra's column families it's not immediately
obvious to me how to best address columns in such a language.


