incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Evgeniy Ryabitskiy <evgeniy.ryabits...@wikimart.ru>
Subject Re: Index search in provided list of rows (list of rowKeys).
Date Mon, 12 Sep 2011 22:27:57 GMT
Something like this.

Actually I think it's better to extend get_indexed_slice() API instead of
creating new one thrift method.
I wish to have something like this:

//here we run query to external search engine
List<byte[]> keys = performSphinxQuery(someFullTextSearchQuery);
IndexClause indexClause = new IndexClause();

//required API to set list of keys
indexClause.setKeys(keys);
indexClause.setExpressions(someFilteringExpressions);
List finalResult = get_indexed_slices(colParent, indexClause, colPredicate,
cLevel);



I can't solve my issue with single get_indexed_slice().
Here is issue in more details:
1) have ~ 6 millions records, in feature could be much more
2) have  > 10k different properties (stored as column values in Cassandra),
in feature could be much more
3) properties are text descriptions , int/float values, string values
4) need to implement search over all properties. For text descriptions: full
text search. for int/float properties: range search.
5) Search query could use any combination of property descriptions. Like
full text search description and some range expression for int/float field.
6) have external search engine (Sphinx) that indexed all string and text
properties
7) still need to perform range search for int, float fields.

So now I split my query expressions in 2 groups:
1) expressions that can be handled by search engine
2) others (additional filters)

For example I run first query to Sphinx and got list of rowKeys, with length
of 100k.  (mark as RESULT1)
Now I need to filter it by second group of expressions. For example I have
simple expression: "age > 25".
So imagine I would run get_indexed_slice() with this query and could
possibly get half of my records in result. (mark as RESULT2)
Then I would need to get intersection between RESULT1 and RESULT2 on client
side, which could take a lot of time and memory.
That is why I can't use single get_indexed_slice here.

For me is better to iterate RESULT1 (with 100k records) at client side to
filter by age and got 10-50k record as final result. Disadvantage here is
that I have to fetch all 100k records.

Evgeny.

Mime
View raw message