cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6933) Optimise Read Comparison Costs in collectTimeOrderedData
Date Wed, 02 Apr 2014 21:41:15 GMT


Benedict commented on CASSANDRA-6933:

Note that I am discussing the _worst case_ behaviour here. i.e. the maximal degradation of
behaviour is just about no worse than a couple of extra binary search, and realistically it
is likely to be less than one binary search extra cost (across the range we query everytime
without the optimisation) - so it only has to get it right approximately once in order to
be better. I definitely think it is the correct behaviour, given that the potential upside
is higher than the maximal downside. It's not really designed for catching "uniformly distributed"
- it just assumes it for simple modelling purposes. Uniformly distributed is actually a misnomer
anyway - this would imply dividing the search space by the names we're searching for; in reality
it simply needs to pick a sensible number to search over that is smaller than the whole range,
so it simply picks twice the last delta. 

Basically, it's predicated on the fact that you must by definition rarely have to search the
entire range for the answer you want, since we expect the range should contain multiple answers
we're looking for (if it doesn't, we'll finish early anyway and all optimisations are moot).
All we want to do is reduce the number of times we do a full binary search - any sensible
number (sqrt(n), max(diff), last(diff)) are all suitable - so long as we pick one of them
and only search a subrange.

The numbers I gave above I think demonstrated this clearly - it offers almost no increased
cost even when it gets it completely wrong, but it reduces the worst case algorithmic complexity
of the total work to O(lg2(m).lg2(n)) from m lg(n) - which is an order of magnitude reduction.
This is definitely a good idea.

> Optimise Read Comparison Costs in collectTimeOrderedData
> --------------------------------------------------------
>                 Key: CASSANDRA-6933
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>         Attachments: 6933-v3.txt
> Introduce a new SearchIterator construct, which can be obtained from a ColumnFamily,
which permits efficiently iterating a subset of the cells in ascending order. Essentially,
it saves the previously visited position and searches from there, but also tries to avoid
searching the whole remaining space if possible.

This message was sent by Atlassian JIRA

View raw message