Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of
 gcdcu-cassandra-user-1@m.gmane.org designates 80.91.229.3 as permitted
 sender)
To: user@cassandra.apache.org
From: Oleg Dulin <oleg.dulin@gmail.com>
Subject: Text searches and free form queries
Date: Mon, 3 Sep 2012 09:25:25 -0400
Lines: 58
Message-ID: <k22b41$evb$1@ger.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Unison/2.1.9

Dear Distinguished Colleagues:

I need to add full-text search and somewhat free form queries to my 
application. Our data is made up of "items" that are stored in a single 
column family, and we have a bunch of secondary indices for look ups. 
An item has header fields and data fields, and the structure of the 
items CF is a super column family with row-key being item's natural ID, 
super column for header, super column for data.

Our application is made up of a several redundant/load balanced servers 
all pointing at a Cassandra cluster. Our servers run embedded Jetty.

I need to be able to find items by a combination of field values. 
Currently I have an index for items by field value which works 
reasonably well. I could also add support for data types and index 
items by fields of appropriate types, so we can do range queries on 
items.

Ultimately, though, what we want is full text search with suggestions 
and human language sensitivity. We want to search by date ranges, by 
field values, etc. I did some homework on this topic, and here is what 
I see as options:

1) Use an SQL database as a helper. This is rather clunky, not sure 
what it gets us since just about anything that can be done in SQL can 
be done in Cassandra with proper structures. Then the problem here also 
is where am I going to get an open source database that can handle the 
workload ? Probably nowhere, nor do I get natural language support.
2) Each of our servers can index data using Lucene, but again we have 
to come up with a clunky mechanism where either one of the servers does 
the indexing and results are replicated, or each server does its own 
indexing.
3) We can use Solr as is, perhaps with some small modifications it can 
run within our server JVM -- since we already run embedded Jetty. I 
like this idea, actually, but I know that Solr indexing doesn't take 
advantage of Cassandra.
4) Datastax Enterprise with search, presumably, supports Solr indexing 
of existing column families -- but for the life of me I couldn't figure 
out how exactly it does that. The Wikipedia example shows that Solr can 
create column families based on Solr schemas that I can then query 
using Cassandra itself (which is great) and supposedly I can modify 
those column families directly and Solr will reindex them (which is 
even better), but I am not sure how that fits into our server design. 
The other concern is locking in to a commercial product, something I am 
very much worried about.

So, one possibility I can see is using Solr embedded within our own 
server solution but storing its indexes in the file system outside of 
Cassandra. This is not optimal, and maybe over time i can add my own 
support for storing Solr index in Cassandra w/o relying on the Datastax 
solution.

In any case, what are your thoughts and experiences ?


Regards,
Oleg