Mailing-List: contact cassandra-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: cassandra-dev@incubator.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-Id: <302CC611-1BD3-44D0-888B-1945CA4A7AE7@Holsman.net>
From: Ian Holsman <ian@holsman.net>
To: cassandra-dev@incubator.apache.org
In-Reply-To: <a073486d0906221142w6b1bc716jc167caee957fc446@mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v935.3)
Subject: Re: Thoughts on a possible query language
Date: Tue, 23 Jun 2009 08:13:19 +1000
References: <88daf38c0906221112r9a0316bg6f3611eb4e6c40da@mail.gmail.com>
 <a073486d0906221142w6b1bc716jc167caee957fc446@mail.gmail.com>

Hey,
any chance of using Hypertable's or HBase's query language as a base?

http://code.google.com/p/hypertable/wiki/HQLTutorial
http://wiki.apache.org/hadoop/Hbase/HbaseShell

Both of these are column-oriented DBs, which would have similar
semantics to ours.

If possible, I want to avoid yet another tool-specific query language
cropping up.

That said, I don't have the time to code it, so take it as a wish; I
will be happy with anything that makes Cassandra easier to use.

On 23/06/2009, at 4:42 AM, Sandeep Tata wrote:

> There is some (unfinished) code in the current repo for CQL, a
> SQL-like Cassandra Query Language that is super simple and (AFAIK)
> limited to single-node queries.
>
> I suspect there are bigger questions to tackle before we get to query
> languages in the sense we're talking about:
> 1. Data model -- Cassandra's values are byte arrays. Any proposal for a
> language needs to figure out precisely what data model you're planning
> to support. (Your examples include numbers, dates, strings.)
> 2. Secondary indexes
> 3. Query runtime (queries that run on a single node, multiple nodes,
> query optimizer?)
>
> I've never understood the value of a parallel-programming abstraction
> (map-reduce) for a single-node database (CouchDB) ... and I certainly
> don't think we're ready to build a map-reduce view engine *in*
> Cassandra right now.
>
> IMHO, there are a bunch of interesting issues we will need to solve
> before we can seriously talk about a query language.
>
>
> On Mon, Jun 22, 2009 at 11:12 AM, Alexander Staubo <alex@bengler.no> wrote:
>
>> Has anyone given thought to how an SQL-like query language could be
>> integrated into Cassandra?
>>
>> I'm thinking of something which would let you evaluate a limited set
>> of relational select operators. For example:
>>
>> * first_name = 'Bob'
>> * age > 32
>> * created_at between '2009-08' and '2009-09'
>> * employer_id in (34543, 13177, 9338)
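
A rough Python sketch of how the four operators listed above might be evaluated against rows whose values are raw byte arrays (the data-model point raised in this thread). Everything here is an assumption made for illustration: the row layout, the decoder functions and the helper names are hypothetical and are not part of Cassandra or the existing CQL code.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical row layout: column name -> raw bytes, because values are byte arrays.
Row = Dict[str, bytes]
Predicate = Callable[[Row], bool]


def as_str(raw: bytes) -> str:
    """Assumed decoder: interpret the bytes as UTF-8 text."""
    return raw.decode("utf-8")


def as_int(raw: bytes) -> int:
    """Assumed decoder: interpret the bytes as an ASCII decimal number."""
    return int(raw.decode("ascii"))


def eq(column: str, value, decode=as_str) -> Predicate:
    return lambda row: column in row and decode(row[column]) == value


def gt(column: str, value, decode=as_int) -> Predicate:
    return lambda row: column in row and decode(row[column]) > value


def between(column: str, low, high, decode=as_str) -> Predicate:
    # Inclusive range; ISO-style date strings compare correctly as text.
    return lambda row: column in row and low <= decode(row[column]) <= high


def is_in(column: str, values, decode=as_int) -> Predicate:
    allowed = set(values)
    return lambda row: column in row and decode(row[column]) in allowed


def select(rows: Iterable[Row], *predicates: Predicate) -> List[Row]:
    """Return the rows that satisfy every predicate (a full scan, no index)."""
    return [row for row in rows if all(p(row) for p in predicates)]


rows = [
    {"first_name": b"Bob", "age": b"41",
     "created_at": b"2009-08-15", "employer_id": b"34543"},
    {"first_name": b"Bob", "age": b"29",
     "created_at": b"2009-07-01", "employer_id": b"13177"},
    {"first_name": b"Alice", "age": b"35",
     "created_at": b"2009-08-20", "employer_id": b"11111"},
]

matches = select(
    rows,
    eq("first_name", "Bob"),
    gt("age", 32),
    between("created_at", "2009-08", "2009-09"),
    is_in("employer_id", [34543, 13177, 9338]),
)
```

Each predicate has to be told how to decode the bytes before it can compare anything, which is why the data model needs settling before the syntax.
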
>>
>> First, is such functionality desired within the framework of
>> Cassandra, or do people prefer to keep this functionality in a
>> completely separate server component? There are pros and cons to
>> keeping queries inside Cassandra. I could enumerate them, but I would
>> like to hear other people's thoughts first.
>>
>> An alternative to a text-based query syntax would be to borrow
>> CouchDB's concept of views [1]. In CouchDB, views are pre-defined
>> indexes which are populated by filtering data through a pair of
>> map/reduce functions, which are usually written in JavaScript. Views
>> are somewhat limited in expressiveness and flexibility, and do not
>> address all possible use cases, but they are very efficient to
>> compute and store, and are a fairly elegant system.
>>
>> Some challenges come to mind:
>>
>> Cassandra's distributed nature means that a node's queryable indexes
>> can/should only reference data in that same node's partition, and
>> that a query might have to be executed on multiple nodes. For
>> performance, the query processing needs to be parallelized and
>> pipelined.
>>
>> Could a query planner/optimizer reduce the number of nodes required
>> to satisfy a query by looking at the distribution of column values
>> across nodes? For example, if the column "first_name" value "Foo"
>> only occurs on node A, there's no need to involve node B. But such
>> knowledge requires the maintenance of statistics on each node that
>> cover all known peers, and the statistics must be kept up to date to
>> avoid glaring consistency issues.
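
The node-pruning idea in the paragraph above could be sketched roughly as follows in Python. The statistics format (a set of known values per node and column) and the function name are hypothetical, chosen only to illustrate the idea; nothing like this exists in Cassandra as far as this thread indicates.

```python
from typing import Dict, List, Set

# Hypothetical statistics: node -> column -> set of values known to live there.
NodeStats = Dict[str, Dict[str, Set[str]]]


def nodes_for(stats: NodeStats, column: str, value: str,
              all_nodes: List[str]) -> List[str]:
    """Pick the nodes an equality query on `column` must contact.

    A node is skipped only when statistics exist for that column and
    show the value is absent; a node without statistics is always kept.
    """
    targets = []
    for node in all_nodes:
        known_values = stats.get(node, {}).get(column)
        if known_values is None or value in known_values:
            targets.append(node)
    return targets


stats: NodeStats = {
    "A": {"first_name": {"Foo", "Bob"}},
    "B": {"first_name": {"Alice"}},
    # Node "C" has reported no statistics yet.
}

targets = nodes_for(stats, "first_name", "Foo", ["A", "B", "C"])
```

With these sample statistics, node B is skipped and node C is still contacted because nothing is known about it. The sketch trusts the statistics completely, so a stale entry for B would silently drop results, which is the consistency problem described above.
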
>>
>> Given the nature of Cassandra's column families it's not immediately
>> obvious to me how to best address columns in such a language.
>>
>> [1] http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>>
>> A.
>>

--
Ian Holsman
Ian@Holsman.net