incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Index search in provided list of rows (list of rowKeys).
Date Tue, 13 Sep 2011 21:55:17 GMT
Not sure it's a feature Cassandra needs; it would radically change the meaning of get_indexed_slices().
If you already know the row keys, the assumption would be that you know they are the rows you want
to get.

Feel free to add a Jira though. 

IMHO this sounds more like Sphinx not supporting all the features you need, rather than Cassandra.
Can you use a different search engine such as Solr, Solandra or Elastic Search? Or 

Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 13/09/2011, at 10:27 AM, Evgeniy Ryabitskiy wrote:

> Something like this.
> 
> Actually I think it's better to extend the get_indexed_slices() API instead of creating a new
> Thrift method.
> I wish to have something like this:
> 
> // First run the query against the external search engine
> List<byte[]> keys = performSphinxQuery(someFullTextSearchQuery);
> IndexClause indexClause = new IndexClause();
> 
> // Proposed API (does not exist yet): restrict the index search to these keys
> indexClause.setKeys(keys);
> indexClause.setExpressions(someFilteringExpressions);
> List<KeySlice> finalResult = get_indexed_slices(colParent, indexClause, colPredicate, cLevel);
> 
> 
> 
> I can't solve my issue with a single get_indexed_slices() call.
> Here is the issue in more detail:
> 1) I have ~6 million records; in future there could be many more
> 2) I have > 10k different properties (stored as column values in Cassandra); in future
> there could be many more
> 3) properties are text descriptions, int/float values and string values
> 4) I need to implement search over all properties. For text descriptions: full text search.
> For int/float properties: range search.
> 5) A search query could use any combination of properties, e.g. a full text search on a
> description plus a range expression on an int/float field.
> 6) I have an external search engine (Sphinx) that has indexed all string and text properties
> 7) I still need to perform range searches on int and float fields.
> 
> So now I split my query expressions into 2 groups:
> 1) expressions that can be handled by search engine
> 2) others (additional filters)
> 
> For example, I run the first query against Sphinx and get a list of 100k rowKeys
> (call it RESULT1).
> Now I need to filter it by the second group of expressions. Say I have a simple expression:
> "age > 25".
> If I ran get_indexed_slices() with this expression alone, it could return half
> of all my records (call it RESULT2).
> Then I would need to intersect RESULT1 and RESULT2 on the client side, which
> could take a lot of time and memory.
> That is why I can't use a single get_indexed_slices() call here.
> 
> For me it is better to iterate over RESULT1 (100k records) on the client side, filter by age,
> and get 10-50k records as the final result. The disadvantage is that I have to fetch all 100k
> records.
> 
> Evgeny.
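
The client-side approach described in the quoted message (iterate over RESULT1 and keep only the rows that pass the range check) could look roughly like the sketch below. It is not code from this thread: BatchedKeyFilter, filterByIntColumn and the fetchBatch callback are hypothetical names, and fetchBatch stands in for a multi-row read such as a Thrift multiget_slice call restricted to the one column being filtered. Reading the keys in batches means only one batch of rows is held in memory at a time, although every candidate row is still fetched once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.IntPredicate;

public class BatchedKeyFilter {

    /**
     * Filters candidate row keys (RESULT1) by an integer column value.
     *
     * fetchBatch is a hypothetical stand-in for a multi-row read: given a
     * batch of keys it returns key -> column value for the rows that have
     * the column. Rows missing from the returned map are treated as
     * non-matching.
     */
    public static List<String> filterByIntColumn(
            List<String> candidateKeys,
            int batchSize,
            Function<List<String>, Map<String, Integer>> fetchBatch,
            IntPredicate predicate) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < candidateKeys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, candidateKeys.size());
            List<String> batch = candidateKeys.subList(start, end);

            // One round trip per batch instead of one per key.
            Map<String, Integer> values = fetchBatch.apply(batch);

            // Keep the original key order from the search engine result.
            for (String key : batch) {
                Integer value = values.get(key);
                if (value != null && predicate.test(value)) {
                    matches.add(key);
                }
            }
        }
        return matches;
    }
}
```

With a fetchBatch that reads the "age" column, a call such as filterByIntColumn(result1Keys, 1000, fetchBatch, age -> age > 25) would return the subset of RESULT1 that satisfies "age > 25". The batch size of 1000 is an arbitrary illustration, not a recommendation.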

