Return-Path: X-Original-To: apmail-cassandra-commits-archive@www.apache.org Delivered-To: apmail-cassandra-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2E60110D18 for ; Wed, 2 Apr 2014 21:41:19 +0000 (UTC) Received: (qmail 36541 invoked by uid 500); 2 Apr 2014 21:41:17 -0000 Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org Received: (qmail 36204 invoked by uid 500); 2 Apr 2014 21:41:16 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 36172 invoked by uid 99); 2 Apr 2014 21:41:15 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 02 Apr 2014 21:41:15 +0000 Date: Wed, 2 Apr 2014 21:41:15 +0000 (UTC) From: "Benedict (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-6933) Optimise Read Comparison Costs in collectTimeOrderedData MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CASSANDRA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958211#comment-13958211 ] Benedict commented on CASSANDRA-6933: ------------------------------------- Note that I am discussing the _worst case_ behaviour here. i.e. the maximal degradation of behaviour is just about no worse than a couple of extra binary search, and realistically it is likely to be less than one binary search extra cost (across the range we query everytime without the optimisation) - so it only has to get it right approximately once in order to be better. I definitely think it is the correct behaviour, given that the potential upside is higher than the maximal downside. It's not really designed for catching "uniformly distributed" - it just assumes it for simple modelling purposes. Uniformly distributed is actually a misnomer anyway - this would imply dividing the search space by the names we're searching for; in reality it simply needs to pick a sensible number to search over that is smaller than the whole range, so it simply picks twice the last delta. Basically, it's predicated on the fact that you must by definition rarely have to search the entire range for the answer you want, since we expect the range should contain multiple answers we're looking for (if it doesn't, we'll finish early anyway and all optimisations are moot). All we want to do is reduce the number of times we do a full binary search - any sensible number (sqrt(n), max(diff), last(diff)) are all suitable - so long as we pick one of them and only search a subrange. The numbers I gave above I think demonstrated this clearly - it offers almost no increased cost even when it gets it completely wrong, but it reduces the worst case algorithmic complexity of the total work to O(lg2(m).lg2(n)) from m lg(n) - which is an order of magnitude reduction. This is definitely a good idea. > Optimise Read Comparison Costs in collectTimeOrderedData > -------------------------------------------------------- > > Key: CASSANDRA-6933 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6933 > Project: Cassandra > Issue Type: Improvement > Components: Core > Reporter: Benedict > Assignee: Benedict > Priority: Minor > Labels: performance > Fix For: 2.1 > > Attachments: 6933-v3.txt > > > Introduce a new SearchIterator construct, which can be obtained from a ColumnFamily, which permits efficiently iterating a subset of the cells in ascending order. Essentially, it saves the previously visited position and searches from there, but also tries to avoid searching the whole remaining space if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)