Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Fri, 12 Jun 2015 14:55:06 +0000 (UTC)
From: "Sylvain Lebresne (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12747304.1412951422000.56410.1434120905993@Atlassian.JIRA>
In-Reply-To: <JIRA.12747304.1412951422000@Atlassian.JIRA>
References: <JIRA.12747304.1412951422000@Atlassian.JIRA>
 <JIRA.12747304.1412951422468@arcas>
Subject: [jira] [Commented] (CASSANDRA-8099) Refactor and modernize the
 storage engine
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583530#comment-14583530 ] 

Sylvain Lebresne commented on CASSANDRA-8099:
---------------------------------------------

Some update on this. I've pushed a rebased (and squashed because that made it a *lot* easier to rebase) version in [my 8099 branch|https://github.com/pcmanus/cassandra/tree/8099].

It's still missing wire backward compatibility ([~thobbs] is finishing this so this should be ready hopefully soon). Regarding tests:
* unit tests are almost green: mostly it remains some failures in the hadoop tests. I could actually use the experience of someone that knows these tests and the code involved as it's not immediately clear to me what this is even doing.
* dtests still have a fair amount of failure but I've only look at them recently and it's getting down quickly.

h2. OpOrder

I think the main problem was that a local read (done through {{SP.LocalReadRunnable}}) was potentially keeping a group "open" while waiting on other nodes. I also realized this path meant local reads (the actual read of sstables) were done outside of the {{StorageProxy} methods, and so 1) not on the thread they were supposed to be on and 2) outside of the "timeout" check. I changed this so that a local response actually materialize everything upfront (similarly to what we do today), solving the problems above. This is not perfect and I'm sure we'll improve on this in the future, but that feels like a good enough option initially.

Regarding moving {{OpOrder}} out of {{close}}, the only way to do that I can see is be to move it up the stack, in {{ReadCommandVerbHandler}} and {{SP.LocalReadRunnable}} (as suggested by Brananir above). I'm working on that (I just started and might not have the time to finish today, but it'll be done early monday for sure).

h2. Branamir's review remarks

I've integrated fixes for most of the remarks. I discuss the rest below.

bq. [CompactionIterable 125|https://github.com/pcmanus/cassandra/blob/75b98620e30b5df31431618cc21e090481f33967/src/java/org/apache/cassandra/db/compaction/CompactionIterable.java#L125]: I doubt index update belongs here, as side effect of iteration. Ideally index should be collected, not updated.

Though I don't disagree on principle, this is not different from how it's done currently (it's done the same in {{LazilyCompactRow}}, but it just happens that the old {{LazilyCompactedRow}} has been merged to {{CompactionIterable}} (now {{CompactionIterator}}) because simplifications of the patch made it unnecessary to have separate classes). Happy to look at cleaning this in a separate ticket however (probably belongs to cleaning the 2ndary index API in practice).

bq. [CompactionIterable 237|https://github.com/pcmanus/cassandra/blob/75b98620e30b5df31431618cc21e090481f33967/src/java/org/apache/cassandra/db/compaction/CompactionIterable.java#L237]: Another side effect of iteration that preferably should be handled by writer.

Maybe, but it's not that simple. Merging (which is done directly by the {{CompactionIterator}}) gets rid of empty partitions and more generally we get rid of them as soon as possible. I think that it's the right thing to do as it's easier for the rest of the code, but that means we have to do invalidation in {{CompactionIterator}}. Of course, we could special case {{CompactionIterator}} to not remove empty partitions and do cache invalidation externally, but I'm not sure it would be cleaner overall (it would somewhat more error-prone). Besides, I could argue that cache invalidation is very much a product of compaction and having it in {{CompactionIterator}} is not that bad.

bq. Validation compaction now uses CompactionIterable and thus has side effects (index & cache removal).

I've fixed that but I'll note for posterity that as far as I can tell, index removal is done for validation compaction on trunk (and all previous version) due to the use of {{LazilyCompactedRow}}. I've still disabled it (for anything that wasn't a true compaction) because I think that's the right thing to do, but that's a difference of this ticket.

bq. add that there is never content between two corresponding tombstone markers on any iterator.

That's mentioned in "Dealing with tombstones and shadowed cells". More precisely, that's what "it's part of the contract of an AtomIterator that it must not shadow it's own data" means. But I need to clean up/update the guide so I'll make sure to clarify further while at it.

bq. These objects should be Iterable instead. Having that would give clear separation between the iteration process and the entity-level data

Yes, it would be cleaner from that standpoint. And the use of iterators in the first place is indeed largely carried from the existing code, I just hadn't really though of the alternative tbh. I'll try to check next week how easily such change is. That said, I'm not sure the use of iterators directly is that confusing either, so if it turns hairy, I don't think it's worth blocking on this (that is, we can very well change that later).

> Refactor and modernize the storage engine
> -----------------------------------------
>
>                 Key: CASSANDRA-8099
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8099
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 3.0 beta 1
>
>         Attachments: 8099-nit
>
>
> The current storage engine (which for this ticket I'll loosely define as "the code implementing the read/write path") is suffering from old age. One of the main problem is that the only structure it deals with is the cell, which completely ignores the more high level CQL structure that groups cell into (CQL) rows.
> This leads to many inefficiencies, like the fact that during a reads we have to group cells multiple times (to count on replica, then to count on the coordinator, then to produce the CQL resultset) because we forget about the grouping right away each time (so lots of useless cell names comparisons in particular). But outside inefficiencies, having to manually recreate the CQL structure every time we need it for something is hindering new features and makes the code more complex that it should be.
> Said storage engine also has tons of technical debt. To pick an example, the fact that during range queries we update {{SliceQueryFilter.count}} is pretty hacky and error prone. Or the overly complex ways {{AbstractQueryPager}} has to go into to simply "remove the last query result".
> So I want to bite the bullet and modernize this storage engine. I propose to do 2 main things:
> # Make the storage engine more aware of the CQL structure. In practice, instead of having partitions be a simple iterable map of cells, it should be an iterable list of row (each being itself composed of per-column cells, though obviously not exactly the same kind of cell we have today).
> # Make the engine more iterative. What I mean here is that in the read path, we end up reading all cells in memory (we put them in a ColumnFamily object), but there is really no reason to. If instead we were working with iterators all the way through, we could get to a point where we're basically transferring data from disk to the network, and we should be able to reduce GC substantially.
> Please note that such refactor should provide some performance improvements right off the bat but it's not it's primary goal either. It's primary goal is to simplify the storage engine and adds abstraction that are better suited to further optimizations.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)