incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Miroslav Madecki <miro.made...@gmail.com>
Subject Re: Questions on combining custom with built-in secondary indexes
Date Wed, 06 Apr 2011 04:33:50 GMT
Hi Aaron,

I failed to mention that my queries have very high frequency so this is not
a data analytics problem.

That's the reason I wanted to reduce transferring a lot of data to client or
using standard map/reduce solutions that can do the job but add too much
latency.

Pig or hive are appropriate for periodically re-building the custom
secondary index
but not for running the queries.

Thanks for suggestions
-Miro

On Tue, Apr 5, 2011 at 7:41 PM, aaron morton <aaron@thelastpickle.com>wrote:

> Couple of thoughts...
>
> - Use only the custom secondary indexes you have built using your
> categories. Pull back all items that match and then do additional filtering
> and sorting client side.
>
> - If you really need very flexible query semantics and the use case allows
> for it consider using something like Pig or Hive to perform the analysis.
>
> There is some work going on to improve indexes, not sure if add's all the
> things you are looking for.
>
> Aaron
>
>
> On 4 Apr 2011, at 22:31, Miroslav Madecki wrote:
>
> Hi,
>
> I would like to combine custom secondary index with Cassandra's
> built-in secondary indexes.
>
> Custom secondary index uses multiple columns and is based on
> historical but incomplete info about matching data in target column family.
> It produces list of "candidate" rows which
> could be too large to be processed at the client side.
>
> For matching on remaining attributes/columns I'd like to use
> built-in secondary indexes to filter out the data at the server side.
>
> Now given the context above I have following questions:
>
> 1) Am I using the right model for the problem at hand?
>    Detailed description can be found below.
>
> 2) If it is just a question of missing functionality
>    what is the best way to add it?
>    Add IN/OR operator to indexed_clause/get_indexed_slice, add
>    filter clause support with IN operator to multiget or
>    something else.
>    More info on this at the end of detailed description below.
>
> 3) Is somebody already working on functionality I'm missing?
>
> Thanks
> Miro
>
>
> ItemCF
>
> Each Item is a member of multiple category subsets:
> A, B, C, D ...
> Membership is represented with composite column names,
> one column for each subset element:
> A:1,A:2,A:3, B:15,B:16:B17, C:5,C6
> Each item has several hundreds of such value-less columns.
>
> In addition to that elements have multiple name/value columns:
> X, Y, Z with either string or numeric values.
>
> ---------+---+---+---+-----+-----+-----+-----+------+------+---+
> Row Key  | X | Y | Z | ... | A:1 | A:2 | A:3 | B:15 | B:16 |.. |
> ---------+---+---+---+-----+-----+-----+-----+------+------+---+
> <item id>|   |   |   |     |     |     |     |      |      |   |
> ---------+---+---+---+-----+-----+-----+-----+------+------+---+
>
> Problem
>
> I'm trying to answer query like this:
>
> Which items are most likely to occur in a specific combination
> of categories A=2 AND B=15 AND C=6 ...
> that also satisfy condition such as:
> X=25 AND Y=default AND (C=some OR C=none)
> results are sorted by number of occurrences in descending order.
>
> To get sorted number of occurrences in specific A, B, C category
> combination I'm using secondary index. Index is created using
> very large number of statistical samples.
>
> ItemFrequencyCF
>
> -----------+----------+----------+----------+-----------+-----+
> Row Key    | <1><id1> | <5><id2> | <5><id3> | <17><id4>
| ... |
> -----------+----------+----------+----------+-----------+-----+
> <A>:<B>:<C>|    <id1> |    <id2> |    <id3> |   
 <id4> |     |
> -----------+----------+----------+----------+-----------+-----+
>
> Column names are composite integers where frequency
> is stored in upper half and item id in lower half to get sorting
> by frequency.
> Row width can reach hundreds or sometimes thousands of columns.
> We are interested in most cases in only dozens or so items
> but their attributes X, Y, Z may not match so we have to
> do additional filtering on items from ItemCF and look further
> down the list.
>
> That filtering can't be performed by extending ItemFrequencyCF
> index for following reasons.
> - Statistical samples we use to create index are missing data
>   for X,Y,Z
> - filtering is not always performed with EQ and AND operators
>
> In order to perform final step of getting matching rows from
> ItemCF I'd like to use id keys from ItemFrequencyCF and some
> sort of index clause to perform filtering on remaining
> attributes X, Y, Z.
>
> I can get rows using MULTIGET or do X, Y, Z filtering using
> GET_INDEXED_SLICE but there is no operation that combines
> these two.
> Also indexed clause doesn't support OR/IN operator needed when
> columns can contain more than one value.
>
> Actually OR/IN operator may be used as MULTIGET equivalent
> if we add 'id' column to ItemCF that has the item id (row key)
> and create index expression such as:
> - (id=<id1> OR id=<id2> OR id=<id3> OR id=<id4> ......)
> or
> - id IN (<id1>, <id2>, <id3>, <id4>)
>
> Obviously 'id' column would have to be declared as index
> column and the expression should be the first one in an index clause used
> (limitation of current built-in secondary index
> implementation).
>
>
>

Mime
View raw message