cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Carl Yeksigian (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11475) MV code refactor
Date Thu, 28 Apr 2016 15:14:12 GMT


Carl Yeksigian commented on CASSANDRA-11475:

Overall, this patch simplifies the code and cleans up and fixes a lot of the places that were
rushed due to the time constraints we had, so I'm really happy to see this refactor.

I'm still going over all of the details (I don't see anything obviously wrong), but I have
a few overall structure comments:
- We are still going to be holding all of the updates that we generate in memory; this refactor
does make the memory footprint lower, but still includes a potential for using up a lot of
- I think your todo is right about promoting {{TableViews}}; I'm not sure if we can completely
get rid of {{ViewManager}}, since we need something at the {{Keyspace}} level, but maybe that
stuff can be integrated into Keyspace again
- I find the {{UpdateBuilder}} pattern not to work well here. {{ViewUpdateBuilder}} should
be building {{ViewUpdate}}s, instead of {{Collection<PartitionUpdate>}}, and using the
{{build}} method at the end seems weird. I see that the mutable state is useful, but doesn't
feel like this really fits the builder pattern; to me, it seems like we'd be better off with
not exposing the rows, but I know that this is part of the refactor to remove the way we were
reading from disk.
- {{MultiViewUpdateBuilder}} doesn't really seem necessary; it seems like it should be part
of {{ViewManager}}

One small thing I noticed:
- In {{TableViews}}, {{removeByName}} is empty

> MV code refactor
> ----------------
>                 Key: CASSANDRA-11475
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 3.0.x, 3.x
> While working on CASSANDRA-5546 I run into a problem with TTLs on MVs, which looking
more closely is a bug of the MV code. But one thing leading to another I reviewed a good portion
of the MV code and found the following correction problems:
> * If a base row is TTLed then even if an update remove that TTL the view entry remained
TTLed and expires, leading to an inconsistency.
> * Due to calling the wrong ctor for {{LivenessInfo}}, when a TTL was set on the base
table, the view entry was living twice as long as the TTL. Again leading to a temporary inconsistency.
> * When reading existing data to compute view updates, all deletion informations are completely
ignored (the code uses a {{PartitionIterator}} instead of an {{UnfilteredPartitionIterator}}).
This is a serious issue since it means some deletions could be totally ignored as far as views
are concerned especially when messages are delivered to a replica out of order. I'll note
that while the 2 previous points are relatively easy to fix, I didn't find an easy and clean
way to fix this one on the current code.
> Further, I think the MV code in general has inefficiencies/code complexities that should
be avoidable:
> * {{TemporalRow.Set}} is buffering both everything read and a pretty much complete copy
of the updates. That's a potentially high memory requirement. We shouldn't have to copy the
updates and we shouldn't buffer all reads but rather work incrementally.
> * {{TemporalRow}}/{{TemporalRow.Set}}/{{TemporalCell}} classes are somewhat re-inventing
the wheel. They are really just storing both an update we're doing and the corresponding existing
data, but we already have {{Row}}/{{Partition}}/{{Cell}} for that. In practice, those {{Temporal*}}
class generates a lot of allocations that we could avoid.
> * The code from CASSANDRA-10060 to avoid multiple reads of the base table with multiple
views doesn't work when the update has partition/range tombstones because the code uses {{TemporalRow.Set.setTombstonedExisting()}}
to trigger reuse, but the {{TemporalRow.Set.withNewViewPrimaryKey()}} method is used between
view and it does not preseve the {{hasTombstonedExisting}} flag.  But that oversight, which
is trivial to fix, is kind of a good thing since if you fix it, you're left with a correction
>   The read done when there is a partition deletion depends on the view itself (if there
is clustering filters in particular) and so reusing that read for other views is wrong. Which
makes that whole reuse code really dodgy imo: the read for existing data is in {{}},
suggesting that it depends on the view (which again, it does at least for partition deletion),
but it shouldn't if we're going to reuse the result across multiple views.
> * Even ignoring the previous point, we still potentially read the base table twice if
the update mix both row updates and partition/range deletions, potentially re-reading the
same values.
> * It's probably more minor but the reading code is using {{QueryPager}}, which is probably
an artifact of the initial version of the code being pre-8099, but it's not necessary anymore
(the reads are local and locally we're completely iterator based), adding, especially when
we do page. I'll note that despite using paging, the current code still buffers everything
in {{TemporalRow.Set}} anyway .
> Overall, I suspect trying to fix the problems above (particularly the fact that existing
deletion infos are ignored) is only going to add complexity with the current code and we'd
still have to fix the inefficiencies. So I propose a refactor of that code which does 2 main
> # it removes all of {{TemporalRow}} and related classes. Instead, it directly uses the
existing {{Row}} (with all its deletion infos) and the update being applied to it and compute
the updates for the view from that. I submit that this is more clear/simple, but this also
avoid copying every cell of both the existing and update data as a {{TemporalCell}}. We can
also reuse codes like {{Rows.merge}} and {{Rows.diff}} to make the handling of deletions relatively
> # instead of dealing with each view one at a time, re-iterating over all updates each
time, it iterates over each individual updates once and deal with each view for that update.
This makes it more clear that the reads has to care about every view involved, but more importantly
allow to deal with the read data incrementally, never buffering it all.

This message was sent by Atlassian JIRA

View raw message