cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
Date Wed, 03 Jun 2015 13:22:39 GMT


Benedict commented on CASSANDRA-8915:

Thanks for taking the time to provide this. I have some comments and suggestions for clarifying
the proof for future readers. It is always tricky to find your bearings in a new proof, and
I found myself confused by some subtle ambiguities that are hard to spot as the author.

* It would help to state that {{=}} refers to the partial order over key equality, not node
identity (the first sentence in particular lead to some double-takes) 
* "stands for" and "predicate" (and use of a function-like identifier) to descibe {{σ(N)}}
is confusing. This caused a surprising number of misunderstandings that took a good hour to
crystalise in the realisation that {{σ(N)}} meant the flag or property that encodes the result
of the predicate {{ρ(N) = N}}, not a short-hand for that predicate
* It would help to state {{P ≙ ep-heap(M,N)}} to indicate the ep-heap P rooted at N that
leads to M, to avoid confusion with the similarity of "rooted at" and "leading to" 
* I also find that switching case to capitals for heaps/structures and retaining lower case
for nodes/values is helpful for clarity. P and R are in the same run of variables as N, but
refer to a different category of things. i.e. I would have found it clearer to follow the
statement {{P ≙ ep-heap(m,n)}}.
* \{\{\}\} can then be used to mark the boundaries of mathematical expressions (so lower case
chars are still easily parsed as not prose) 

bq. 3. ... If ¬σ(P), set σ(R) := (P = R), otherwise leave σ(R) unchanged.

Perhaps clarify, that σ(P) => σ(R), so already correct. I realise you state this later,
but here is where the question is raised. The later sentence is also very hard to parse, and
could be simplified with the earlier short explanation.

bq. 3. ...  Use the algorithm recursively to *turn* the two ep-subheaps leading to P at its
children and the new value C > P into an ep-heap leading to P

Consider *merge*, to clarify that this step restores the binary-tree property.

I haven't yet corroborated the code against the  proof, but will do so shortly. I would consider
supplying the initial proof outline with its three cases along with the code, so that we can
annotate each major branch with a short demonstration of invariant maintenance.

> Improve MergeIterator performance
> ---------------------------------
>                 Key: CASSANDRA-8915
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Branimir Lambov
>            Assignee: Branimir Lambov
>            Priority: Minor
>              Labels: compaction, performance
>             Fix For: 3.x
> The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}}
operations for every item in the resulting sequence. This is quite inefficient as {{poll}}
necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires
another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}}
is the number of iterators being merged).
> This can easily be replaced with a simple custom structure that can perform replacement
of the top of the queue in a single step, which will very often complete after a couple of
comparisons and in the worst case scenarios will match the complexity of the current implementation.
> This should significantly improve merge performance for iterators with limited overlap
(e.g. levelled compaction).

This message was sent by Atlassian JIRA

View raw message