Hi Aaron,

I failed to mention that my queries have very high frequency so this is not
a data analytics problem.

That's the reason I wanted to reduce transferring a lot of data to client or
using standard map/reduce solutions that can do the job but add too much latency.

Pig or hive are appropriate for periodically re-building the custom secondary index
but not for running the queries.

Thanks for suggestions
-Miro

On Tue, Apr 5, 2011 at 7:41 PM, aaron morton <aaron@thelastpickle.com> wrote:
Couple of thoughts...

- Use only the custom secondary indexes you have built using your categories. Pull back all items that match and then do additional filtering and sorting client side. 

- If you really need very flexible query semantics and the use case allows for it consider using something like Pig or Hive to perform the analysis. 

There is some work going on to improve indexes, not sure if add's all the things you are looking for. 

Aaron


On 4 Apr 2011, at 22:31, Miroslav Madecki wrote:

Hi,

I would like to combine custom secondary index with Cassandra's
built-in secondary indexes.

Custom secondary index uses multiple columns and is based on
historical but incomplete info about matching data in target column family. It produces list of "candidate" rows which
could be too large to be processed at the client side.

For matching on remaining attributes/columns I'd like to use
built-in secondary indexes to filter out the data at the server side.

Now given the context above I have following questions:

1) Am I using the right model for the problem at hand?
   Detailed description can be found below.

2) If it is just a question of missing functionality
   what is the best way to add it?
   Add IN/OR operator to indexed_clause/get_indexed_slice, add
   filter clause support with IN operator to multiget or
   something else.
   More info on this at the end of detailed description below.
   
3) Is somebody already working on functionality I'm missing?

Thanks
Miro


ItemCF

Each Item is a member of multiple category subsets:
A, B, C, D ...
Membership is represented with composite column names,
one column for each subset element:
A:1,A:2,A:3, B:15,B:16:B17, C:5,C6
Each item has several hundreds of such value-less columns.

In addition to that elements have multiple name/value columns:
X, Y, Z with either string or numeric values.

---------+---+---+---+-----+-----+-----+-----+------+------+---+
Row Key  | X | Y | Z | ... | A:1 | A:2 | A:3 | B:15 | B:16 |.. |
---------+---+---+---+-----+-----+-----+-----+------+------+---+    
<item id>|   |   |   |     |     |     |     |      |      |   |
---------+---+---+---+-----+-----+-----+-----+------+------+---+

Problem

I'm trying to answer query like this:

Which items are most likely to occur in a specific combination
of categories A=2 AND B=15 AND C=6 ...
that also satisfy condition such as:
X=25 AND Y=default AND (C=some OR C=none)
results are sorted by number of occurrences in descending order.

To get sorted number of occurrences in specific A, B, C category
combination I'm using secondary index. Index is created using
very large number of statistical samples.

ItemFrequencyCF

-----------+----------+----------+----------+-----------+-----+
Row Key    | <1><id1> | <5><id2> | <5><id3> | <17><id4> | ... |
-----------+----------+----------+----------+-----------+-----+
<A>:<B>:<C>|    <id1> |    <id2> |    <id3> |     <id4> |     |
-----------+----------+----------+----------+-----------+-----+

Column names are composite integers where frequency
is stored in upper half and item id in lower half to get sorting
by frequency.
Row width can reach hundreds or sometimes thousands of columns.
We are interested in most cases in only dozens or so items
but their attributes X, Y, Z may not match so we have to
do additional filtering on items from ItemCF and look further
down the list.

That filtering can't be performed by extending ItemFrequencyCF
index for following reasons.
- Statistical samples we use to create index are missing data
  for X,Y,Z
- filtering is not always performed with EQ and AND operators

In order to perform final step of getting matching rows from
ItemCF I'd like to use id keys from ItemFrequencyCF and some
sort of index clause to perform filtering on remaining
attributes X, Y, Z.

I can get rows using MULTIGET or do X, Y, Z filtering using
GET_INDEXED_SLICE but there is no operation that combines
these two.
Also indexed clause doesn't support OR/IN operator needed when
columns can contain more than one value.

Actually OR/IN operator may be used as MULTIGET equivalent
if we add 'id' column to ItemCF that has the item id (row key)
and create index expression such as:
- (id=<id1> OR id=<id2> OR id=<id3> OR id=<id4> ......)
or
- id IN (<id1>, <id2>, <id3>, <id4>)

Obviously 'id' column would have to be declared as index
column and the expression should be the first one in an index clause used (limitation of current built-in secondary index
implementation).