cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-6933) Optimise Read Comparison Costs in collectTimeOrderedData
Date Thu, 27 Mar 2014 01:04:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948721#comment-13948721
] 

Benedict edited comment on CASSANDRA-6933 at 3/27/14 1:04 AM:
--------------------------------------------------------------

bq. OTOH for the common case of "select *" [the Thrift equivalent anyway] either is going
to be worse than just linearly scanning from i.

Note that I always do a comparison of the next element before applying my next optimisation,
so we're no worse than a linear scan

I had wanted to do gradually increasing ranges of 1, sqrt(n), n to ensure overhead stayed
barely measurable and performance optimal, but I think the current option is a close enough
approximation. Possibly tracking the max number skipped would also be fine as an alternative.

I must admit my math is rusty for proving this, but if you think through some obvious worst
cases, the performance of the current algorithm cannot degrade much beyond bsearch: we never
re-scan the same range, and we immediately fall back to full bsearch over the remaining range;
since we double the previous space, the worst case behaviour is something like having doubling
+ 1 distances between the keys we want to select, so that we always do the most extra work
possible. In this case there are lg2 N extra ranges looked at, each approx. lg N, 1/2 lg N,
1/4 lg N, etc comparisons. Which is roughly 2 lg N extra comparisons. So it's the cost of
doing two extra bsearch over the whole range.

On the other hand in the worst case without the optimisation we're possibly doing lg N + lg
(N - 1) + lg (N - 2) ... extra comparisons... which I don't know how to sum, but is certainly
more.



was (Author: benedict):
bq. OTOH for the common case of "select *" [the Thrift equivalent anyway] either is going
to be worse than just linearly scanning from i.

Note that I always do a comparison of the next element before applying my next optimisation.

I had wanted to do gradually increasing ranges of 1, sqrt(n), n to ensure overhead stayed
barely measurable and performance optimal, but I think the current option is a close enough
approximation. Possibly tracking the max number skipped would also be fine as an alternative.

I must admit my math is rusty for proving this, but if you think through some obvious worst
cases, the performance of the current algorithm cannot degrade much beyond bsearch: we never
re-scan the same range, and we immediately fall back to full bsearch over the remaining range;
since we double the previous space, the worst case behaviour is something like having doubling
+ 1 distances between the keys we want to select, so that we always do the most extra work
possible. In this case there are lg2 N extra ranges looked at, each approx. lg N, 1/2 lg N,
1/4 lg N, etc comparisons. Which is roughly 2 lg N extra comparisons. So it's the cost of
doing two extra bsearch over the whole range.

On the other hand in the worst case without the optimisation we're possibly doing lg N + lg
(N - 1) + lg (N - 2) ... extra comparisons... which I don't know how to sum, but is certainly
more.


> Optimise Read Comparison Costs in collectTimeOrderedData
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6933
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6933
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>
>
> Introduce a new SearchIterator construct, which can be obtained from a ColumnFamily,
which permits efficiently iterating a subset of the cells in ascending order. Essentially,
it saves the previously visited position and searches from there, but also tries to avoid
searching the whole remaining space if possible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message