Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Wed, 2 Apr 2014 21:41:15 +0000 (UTC)
From: "Benedict (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12703776.1395857932463.53820.1396474875669@arcas>
In-Reply-To: <JIRA.12703776.1395857932463@arcas>
References: <JIRA.12703776.1395857932463@arcas>
Subject: [jira] [Commented] (CASSANDRA-6933) Optimise Read Comparison Costs
 in collectTimeOrderedData
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958211#comment-13958211 ] 

Benedict commented on CASSANDRA-6933:
-------------------------------------

Note that I am discussing the _worst case_ behaviour here. i.e. the maximal degradation of behaviour is just about no worse than a couple of extra binary search, and realistically it is likely to be less than one binary search extra cost (across the range we query everytime without the optimisation) - so it only has to get it right approximately once in order to be better. I definitely think it is the correct behaviour, given that the potential upside is higher than the maximal downside. It's not really designed for catching "uniformly distributed" - it just assumes it for simple modelling purposes. Uniformly distributed is actually a misnomer anyway - this would imply dividing the search space by the names we're searching for; in reality it simply needs to pick a sensible number to search over that is smaller than the whole range, so it simply picks twice the last delta. 

Basically, it's predicated on the fact that you must by definition rarely have to search the entire range for the answer you want, since we expect the range should contain multiple answers we're looking for (if it doesn't, we'll finish early anyway and all optimisations are moot). All we want to do is reduce the number of times we do a full binary search - any sensible number (sqrt(n), max(diff), last(diff)) are all suitable - so long as we pick one of them and only search a subrange.

The numbers I gave above I think demonstrated this clearly - it offers almost no increased cost even when it gets it completely wrong, but it reduces the worst case algorithmic complexity of the total work to O(lg2(m).lg2(n)) from m lg(n) - which is an order of magnitude reduction. This is definitely a good idea.

> Optimise Read Comparison Costs in collectTimeOrderedData
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6933
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6933
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>
>         Attachments: 6933-v3.txt
>
>
> Introduce a new SearchIterator construct, which can be obtained from a ColumnFamily, which permits efficiently iterating a subset of the cells in ascending order. Essentially, it saves the previously visited position and searches from there, but also tries to avoid searching the whole remaining space if possible.


--
This message was sent by Atlassian JIRA
(v6.2#6252)