incubator-cassandra-user mailing list archives

From Shaun Cutts <sh...@cuttshome.net>
Subject Re: order of index expressions
Date Sun, 06 Feb 2011 17:03:10 GMT
Ok,

So I understand now. You choose the index with the smallest average number of matches per
key. Unfortunately this doesn't work out so well for me. I am running a query against the "edges"
columnfamily of a graph database, which should return the edges whose source and target labels
equal given values.

I have about 30M edges, and on average the target labels have more matching rows. Unfortunately,
in this particular case there are 2 matches on the target label and about 100K on the source label,
and I have 5000 similar queries to perform for the overall task.
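
To put rough numbers on this, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that all 5000 queries look like the one described (2 target matches, about 100K source matches); the function name is hypothetical.

```python
def rows_examined(num_queries, matches_per_query):
    """Total index entries walked if each query scans one index fully."""
    return num_queries * matches_per_query

# Scanning the source-label index (about 100K matches per query):
via_source = rows_examined(5000, 100_000)
# Scanning the target-label index (2 matches per query):
via_target = rows_examined(5000, 2)

print(via_source, via_target, via_source // via_target)
```

Under that assumption the choice of index changes the work by a factor of 50,000.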

What I think you should be doing is the following: open iterators on the matching keys for
each of the indexes; the inner loop would pick an iterator at random and pull a match from
it. This would ensure that the expected number of entries examined is a small multiple (the
number of other indexes) of the number of matches in the index with the most "precision".
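
A minimal sketch of this idea in Python, not Cassandra's actual scan code. It assumes each index can be exposed as a plain iterator over row keys and that a hypothetical `row_matches(key)` callback checks all index expressions against the row. The scan stops as soon as any one iterator runs dry, since every full match must appear in every index.

```python
import random

def random_interleaved_scan(index_iters, row_matches, rng=None):
    """Pull candidate keys from randomly chosen index iterators.

    index_iters: iterables of row keys, one per index expression (assumed).
    row_matches: hypothetical callback, True if the row satisfies ALL expressions.
    Returns (matching_keys, number_of_index_entries_examined).
    """
    rng = rng or random.Random()
    iters = [iter(it) for it in index_iters]
    seen = set()       # keys already checked, to avoid duplicate results
    results = []
    examined = 0
    while iters:
        it = rng.choice(iters)          # uniform pick; no statistics needed
        try:
            key = next(it)
        except StopIteration:
            # One index is exhausted: every full match appears in it,
            # so all results have already been found.
            break
        examined += 1
        if key in seen:
            continue
        seen.add(key)
        if row_matches(key):
            results.append(key)
    return results, examined

# Toy data: 1000 edges share the source label, only 2 share the target label.
rows = {k: {"source": "a", "target": "b" if k in (7, 500) else "x"}
        for k in range(1000)}
rows[2000] = {"source": "z", "target": "b"}
source_index = [k for k, r in rows.items() if r["source"] == "a"]
target_index = [k for k, r in rows.items() if r["target"] == "b"]
matches = lambda k: rows[k]["source"] == "a" and rows[k]["target"] == "b"

found, examined = random_interleaved_scan(
    [source_index, target_index], matches, random.Random(0))
print(sorted(found), examined)
```

With uniform picks, the small target index is typically exhausted after only a handful of pulls, so the scan ends long before the 1000-entry source index has been walked.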

Then (if you want) you can optimize using overall statistics to adjust the initial probabilities.
But as you process the query you should mix these initial probabilities with
probabilities proportional to the actual fraction of overall matches generated by each
index. (I guess you can control the speed of mixing using the standard deviations on the initial
key counts.)
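
One possible way to do the mixing, sketched in Python. The formula is an assumption chosen for illustration (a simple pseudo-count blend), and `prior_strength` is a hypothetical knob standing in for "speed of mixing"; the message does not specify either.

```python
import random

def mixed_weights(priors, hits, prior_strength=10.0):
    """Blend initial probabilities with observed match fractions.

    priors: initial pick probabilities per index (assumed to sum to 1).
    hits: number of full matches each index has produced so far.
    prior_strength: hypothetical pseudo-count (> 0); larger values make
        the weights move away from the priors more slowly.
    """
    denom = prior_strength + sum(hits)
    return [(prior_strength * p + h) / denom for p, h in zip(priors, hits)]

def pick_index(priors, hits, rng, prior_strength=10.0):
    """Choose which index iterator to pull from next."""
    weights = mixed_weights(priors, hits, prior_strength)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Before any matches, the weights equal the priors.
print(mixed_weights([0.5, 0.5], [0, 0]))
# After 90 matches that all came from index 1, it dominates the picks.
print(mixed_weights([0.5, 0.5], [0, 90]))
print(pick_index([0.5, 0.5], [0, 90], random.Random(0)))
```

Because the priors keep a nonzero share, no index is ever starved completely, which keeps the scan robust if the early matches are unrepresentative.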

I know you have a new type of index in the works... but it doesn't look like "trunk" has any
modifications for "scan", and presumably the strategy I just mentioned is pretty general (not
depending on histograms, etc). Does it sound like a good idea?

-- Shaun

On Feb 6, 2011, at 12:15 AM, Jonathan Ellis wrote:

> ColumnFamilyStore.scan
> 
> On Sat, Feb 5, 2011 at 10:32 PM, Shaun Cutts <shaun@cuttshome.net> wrote:
>> Thanks for the response!
>> 
>> So.. I *may* have a bug to report (at least I can generate radically different response
>> times based on expression order with a multiply indexed columnfamily), but first I'll have
>> to upgrade to a stable version (currently I have 7.0rc2 installed).
>> 
>> I was also wondering where the code that does this is... is it in
>> 
>> java.org.apache.cassandra.db.columniterator.IndexedSliceReader?
>> 
>> 
>> Thanks,
>> 
>> -- Shaun
>> 
>> On Feb 5, 2011, at 2:39 PM, Jonathan Ellis wrote:
>> 
>>> On Sat, Feb 5, 2011 at 8:48 AM, Shaun Cutts <shaun@cuttshome.net> wrote:
>>>> Hello,
>>>> I'm wondering if cassandra is sensitive to the order of index expressions in
>>>> (pycassa call) get_indexed_slices?
>>> 
>>> No.
>>> 
>>>> If I have several column indexes available, will it attempt to optimize the
>>>> order?
>>> 
>>> Yes.
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>> 
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

