[ https://issues.apache.org/jira/browse/CASSANDRA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13948721#comment13948721 ]
Benedict edited comment on CASSANDRA-6933 at 3/27/14 1:04 AM:

bq. OTOH for the common case of "select *" [the Thrift equivalent anyway] either is going
to be worse than just linearly scanning from i.
Note that I always do a comparison of the next element before applying my next optimisation,
so we're no worse than a linear scan.
I had wanted to do gradually increasing ranges of 1, sqrt(n), n to ensure overhead stayed
barely measurable and performance optimal, but I think the current option is a close enough
approximation. Possibly tracking the max number skipped would also be fine as an alternative.
I must admit my math is rusty for proving this, but if you think through some obvious worst
cases, the performance of the current algorithm cannot degrade much beyond bsearch: we never
rescan the same range, and we immediately fall back to a full bsearch over the remaining range.
Since we double the previous space, the worst case behaviour is something like having
doubling-plus-one distances between the keys we want to select, so that we always do the most
extra work possible. In this case there are lg N extra ranges looked at, costing approximately
lg N, 1/2 lg N, 1/4 lg N, etc. comparisons each, which sums to roughly 2 lg N extra
comparisons. So it's the cost of doing two extra bsearches over the whole range.
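As a back-of-the-envelope check on that figure (my own arithmetic here, not part of the patch): summing lg N + 1/2 lg N + 1/4 lg N + ... over the ~lg N doubling ranges converges to 2 lg N:

```java
public class ExtraCostCheck
{
    // Sum the geometric series lgN * (1 + 1/2 + 1/4 + ...) over ~lg N terms;
    // the total approaches 2 * lg N, i.e. the cost of two full bsearches.
    static double extraComparisons(int n)
    {
        double lgN = Math.log(n) / Math.log(2);
        double total = 0, term = lgN;
        for (int i = 0; i < (int) lgN; i++)
        {
            total += term;
            term /= 2;
        }
        return total;
    }
}
```

For N = 2^20 (lg N = 20) the sum comes out just under 40, i.e. 2 lg N.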
On the other hand, in the worst case without the optimisation we're possibly doing lg N + lg
(N - 1) + lg (N - 2) + ... extra comparisons, which sums to lg(N!) ≈ N lg N, certainly
far more.
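To make the doubling-then-bsearch idea concrete, here is a rough standalone sketch (not the actual patch; `searchFrom` and its signature are invented for this illustration):

```java
import java.util.Arrays;

public class GallopingSearch
{
    /**
     * Find key in sorted a[], starting from 'from' (the position after the
     * previous hit). One comparison against the immediate next element first,
     * so we are never worse than a linear scan; then double the probe distance
     * until we overshoot, and bsearch only the last doubled range.
     * Returns the index of key, or -1 if absent. (Illustrative only.)
     */
    static int searchFrom(int[] a, int from, int key)
    {
        if (from >= a.length)
            return -1;
        // cheap check of the very next element before galloping at all
        if (a[from] >= key)
            return a[from] == key ? from : -1;
        // gallop: probe at distances 1, 2, 4, 8, ... until we pass the key
        int bound = 1;
        while (from + bound < a.length && a[from + bound] < key)
            bound *= 2;
        // the key, if present, lies within the last doubled range
        int lo = from + bound / 2;
        int hi = Math.min(a.length, from + bound + 1); // toIndex is exclusive
        int idx = Arrays.binarySearch(a, lo, hi, key);
        return idx >= 0 ? idx : -1;
    }
}
```

Because each successful lookup hands its position back in as the next `from`, a sequence of ascending lookups never rescans the same range.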
> Optimise Read Comparison Costs in collectTimeOrderedData
> 
>
> Key: CASSANDRA-6933
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6933
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: Benedict
> Priority: Minor
> Labels: performance
> Fix For: 2.1
>
>
> Introduce a new SearchIterator construct, which can be obtained from a ColumnFamily,
which permits efficiently iterating a subset of the cells in ascending order. Essentially,
it saves the previously visited position and searches from there, but also tries to avoid
searching the whole remaining space if possible.

This message was sent by Atlassian JIRA
(v6.2#6252)
