To: user@cassandra.apache.org
From: Oleg Dulin
Subject: Text searches and free form queries
Date: Mon, 3 Sep 2012 09:25:25 -0400

Dear Distinguished Colleagues:

I need to add full-text search and somewhat free-form queries to my application. Our data is made up of "items" that are stored in a single column family, and we have a number of secondary indices for lookups. An item has header fields and data fields. The items CF is a super column family: the row key is the item's natural ID, with one super column for the header and one for the data.

Our application is made up of several redundant, load-balanced servers, all pointing at a Cassandra cluster. Our servers run embedded Jetty.

I need to be able to find items by a combination of field values. Currently I have an index for items by field value, which works reasonably well. I could also add support for data types and index items by fields of appropriate types, so that we can do range queries on items.

Ultimately, though, what we want is full-text search with suggestions and human language sensitivity. We want to search by date ranges, by field values, etc.

I did some homework on this topic, and here is what I see as options:

1) Use an SQL database as a helper. This is rather clunky, and I am not sure what it gets us, since just about anything that can be done in SQL can be done in Cassandra with proper structures. The other problem is where I am going to get an open source database that can handle the workload. Probably nowhere, and I would not get natural language support either.

2) Each of our servers can index data using Lucene, but again we have to come up with a clunky mechanism where either one of the servers does the indexing and the results are replicated, or each server does its own indexing.

3) We can use Solr as is; perhaps with some small modifications it can run within our server JVM, since we already run embedded Jetty. I like this idea, actually, but I know that Solr indexing doesn't take advantage of Cassandra.

4) DataStax Enterprise with search presumably supports Solr indexing of existing column families, but for the life of me I couldn't figure out how exactly it does that. The Wikipedia example shows that Solr can create column families based on Solr schemas that I can then query using Cassandra itself (which is great), and supposedly I can modify those column families directly and Solr will reindex them (which is even better), but I am not sure how that fits into our server design. The other concern is lock-in to a commercial product, something I am very much worried about.

So, one possibility I can see is using Solr embedded within our own server solution but storing its indexes in the file system outside of Cassandra. This is not optimal, and maybe over time I can add my own support for storing the Solr index in Cassandra without relying on the DataStax solution.

In any case, what are your thoughts and experiences?

Regards,
Oleg
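
For illustration only, the "index for items by field value" and the typed range lookup described earlier in the message could be sketched roughly as below. This is an in-memory stand-in, not the Cassandra schema from the post and not a full-text engine. All names (FieldIndex, add, find_all, find_range) and the sample items are hypothetical, and the sketch assumes that every value stored under one field has the same comparable type.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict
from datetime import date


class FieldIndex:
    """Toy in-memory index from field values to item IDs (hypothetical)."""

    def __init__(self):
        self._exact = defaultdict(set)    # (field, value) -> set of item IDs
        self._sorted = defaultdict(list)  # field -> sorted list of (value, item ID)

    def add(self, item_id, fields):
        """Index one item under each of its field values."""
        for field, value in fields.items():
            self._exact[(field, value)].add(item_id)
            insort(self._sorted[field], (value, item_id))

    def find_all(self, **criteria):
        """Return IDs of items matching every field=value pair."""
        sets = [self._exact.get((f, v), set()) for f, v in criteria.items()]
        return set.intersection(*sets) if sets else set()

    def find_range(self, field, low, high):
        """Return IDs of items whose value for `field` lies in [low, high]."""
        entries = self._sorted.get(field, [])
        values = [value for value, _ in entries]
        start = bisect_left(values, low)
        stop = bisect_right(values, high)
        return {item_id for _, item_id in entries[start:stop]}


# Made-up sample data, purely to exercise the sketch.
index = FieldIndex()
index.add("item-1", {"status": "open", "owner": "alice", "created": date(2012, 8, 30)})
index.add("item-2", {"status": "open", "owner": "bob", "created": date(2012, 9, 2)})
index.add("item-3", {"status": "closed", "owner": "alice", "created": date(2012, 9, 3)})

open_for_alice = index.find_all(status="open", owner="alice")
early_september = index.find_range("created", date(2012, 9, 1), date(2012, 9, 3))
```

Exact-match combinations reduce to set intersection, and range queries need the values kept in sorted order per field. Neither covers tokenized text, suggestions or language-aware matching, which is the gap Lucene or Solr would fill.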