cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8099) Refactor and modernize the storage engine
Date Fri, 12 Jun 2015 14:55:06 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583530#comment-14583530
] 

Sylvain Lebresne commented on CASSANDRA-8099:
---------------------------------------------

Some update on this. I've pushed a rebased (and squashed because that made it a *lot* easier
to rebase) version in [my 8099 branch|https://github.com/pcmanus/cassandra/tree/8099].

It's still missing wire backward compatibility ([~thobbs] is finishing this so this should
be ready hopefully soon). Regarding tests:
* unit tests are almost green: mostly it remains some failures in the hadoop tests. I could
actually use the experience of someone that knows these tests and the code involved as it's
not immediately clear to me what this is even doing.
* dtests still have a fair amount of failure but I've only look at them recently and it's
getting down quickly.

h2. OpOrder

I think the main problem was that a local read (done through {{SP.LocalReadRunnable}}) was
potentially keeping a group "open" while waiting on other nodes. I also realized this path
meant local reads (the actual read of sstables) were done outside of the {{StorageProxy} methods,
and so 1) not on the thread they were supposed to be on and 2) outside of the "timeout" check.
I changed this so that a local response actually materialize everything upfront (similarly
to what we do today), solving the problems above. This is not perfect and I'm sure we'll improve
on this in the future, but that feels like a good enough option initially.

Regarding moving {{OpOrder}} out of {{close}}, the only way to do that I can see is be to
move it up the stack, in {{ReadCommandVerbHandler}} and {{SP.LocalReadRunnable}} (as suggested
by Brananir above). I'm working on that (I just started and might not have the time to finish
today, but it'll be done early monday for sure).

h2. Branamir's review remarks

I've integrated fixes for most of the remarks. I discuss the rest below.

bq. [CompactionIterable 125|https://github.com/pcmanus/cassandra/blob/75b98620e30b5df31431618cc21e090481f33967/src/java/org/apache/cassandra/db/compaction/CompactionIterable.java#L125]:
I doubt index update belongs here, as side effect of iteration. Ideally index should be collected,
not updated.

Though I don't disagree on principle, this is not different from how it's done currently (it's
done the same in {{LazilyCompactRow}}, but it just happens that the old {{LazilyCompactedRow}}
has been merged to {{CompactionIterable}} (now {{CompactionIterator}}) because simplifications
of the patch made it unnecessary to have separate classes). Happy to look at cleaning this
in a separate ticket however (probably belongs to cleaning the 2ndary index API in practice).

bq. [CompactionIterable 237|https://github.com/pcmanus/cassandra/blob/75b98620e30b5df31431618cc21e090481f33967/src/java/org/apache/cassandra/db/compaction/CompactionIterable.java#L237]:
Another side effect of iteration that preferably should be handled by writer.

Maybe, but it's not that simple. Merging (which is done directly by the {{CompactionIterator}})
gets rid of empty partitions and more generally we get rid of them as soon as possible. I
think that it's the right thing to do as it's easier for the rest of the code, but that means
we have to do invalidation in {{CompactionIterator}}. Of course, we could special case {{CompactionIterator}}
to not remove empty partitions and do cache invalidation externally, but I'm not sure it would
be cleaner overall (it would somewhat more error-prone). Besides, I could argue that cache
invalidation is very much a product of compaction and having it in {{CompactionIterator}}
is not that bad.

bq. Validation compaction now uses CompactionIterable and thus has side effects (index &
cache removal).

I've fixed that but I'll note for posterity that as far as I can tell, index removal is done
for validation compaction on trunk (and all previous version) due to the use of {{LazilyCompactedRow}}.
I've still disabled it (for anything that wasn't a true compaction) because I think that's
the right thing to do, but that's a difference of this ticket.

bq. add that there is never content between two corresponding tombstone markers on any iterator.

That's mentioned in "Dealing with tombstones and shadowed cells". More precisely, that's what
"it's part of the contract of an AtomIterator that it must not shadow it's own data" means.
But I need to clean up/update the guide so I'll make sure to clarify further while at it.

bq. These objects should be Iterable instead. Having that would give clear separation between
the iteration process and the entity-level data

Yes, it would be cleaner from that standpoint. And the use of iterators in the first place
is indeed largely carried from the existing code, I just hadn't really though of the alternative
tbh. I'll try to check next week how easily such change is. That said, I'm not sure the use
of iterators directly is that confusing either, so if it turns hairy, I don't think it's worth
blocking on this (that is, we can very well change that later).

> Refactor and modernize the storage engine
> -----------------------------------------
>
>                 Key: CASSANDRA-8099
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8099
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 3.0 beta 1
>
>         Attachments: 8099-nit
>
>
> The current storage engine (which for this ticket I'll loosely define as "the code implementing
the read/write path") is suffering from old age. One of the main problem is that the only
structure it deals with is the cell, which completely ignores the more high level CQL structure
that groups cell into (CQL) rows.
> This leads to many inefficiencies, like the fact that during a reads we have to group
cells multiple times (to count on replica, then to count on the coordinator, then to produce
the CQL resultset) because we forget about the grouping right away each time (so lots of useless
cell names comparisons in particular). But outside inefficiencies, having to manually recreate
the CQL structure every time we need it for something is hindering new features and makes
the code more complex that it should be.
> Said storage engine also has tons of technical debt. To pick an example, the fact that
during range queries we update {{SliceQueryFilter.count}} is pretty hacky and error prone.
Or the overly complex ways {{AbstractQueryPager}} has to go into to simply "remove the last
query result".
> So I want to bite the bullet and modernize this storage engine. I propose to do 2 main
things:
> # Make the storage engine more aware of the CQL structure. In practice, instead of having
partitions be a simple iterable map of cells, it should be an iterable list of row (each being
itself composed of per-column cells, though obviously not exactly the same kind of cell we
have today).
> # Make the engine more iterative. What I mean here is that in the read path, we end up
reading all cells in memory (we put them in a ColumnFamily object), but there is really no
reason to. If instead we were working with iterators all the way through, we could get to
a point where we're basically transferring data from disk to the network, and we should be
able to reduce GC substantially.
> Please note that such refactor should provide some performance improvements right off
the bat but it's not it's primary goal either. It's primary goal is to simplify the storage
engine and adds abstraction that are better suited to further optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message