cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11500) Obsolete MV entry may not be properly deleted
Date Wed, 09 Aug 2017 06:21:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119461#comment-16119461
] 

Paulo Motta commented on CASSANDRA-11500:
-----------------------------------------

The virtual cell proposal is pretty clever and looks like it would solve most outstanding
issues, but I'm a bit concerned about adding new structures to the storage engine to deal
with materialized-specific issues.

While I agree we should do if it's our only choice, we should explore alternatives which reuse
existing structures if possible to avoid introducing feature-specific stuff into the storage
engine.

Looking back at the original scenario which motivated this ticket (CASSANDRA-11500):
{noformat}
CREATE TABLE t (k int PRIMARY KEY, a int, b int);
CREATE MATERIALIZED VIEW mv AS SELECT * FROM t WHERE k IS NOT NULL AND a IS NOT NULL PRIMARY
KEY (k, a);

INSERT INTO t(k, a, b) VALUES (1, 1, 1) USING TIMESTAMP 0;
UPDATE t USING TIMESTAMP 4 SET b = 2 WHERE k = 1;
UPDATE t USING TIMESTAMP 2 SET a = 2 WHERE k = 1;

SELECT * FROM mv WHERE k = 1; // This currently return 2 entries, the old (invalid) and the
new one
{noformat}

It seems to me that the problem here is applying the definition for standard tables that if
a single column is live, then the whole row is live, which does not need to be the case for
MV where we can guarantee a view entry will always contain row-level liveness info.

We could solve this  by introducing a "strict " flag to the row liveness info, which has the
following semantic:
- A strict row is only live iff it's row level liveness info is live, regardless of the liveness
of its columns

Materialized views rows would have this flag set and perform deletions with its max primary
key timestamp (instead of max timestamp of all keys), and this would solve the issue above
by ensuring the row {{(1, 1)@(liveness@0, deleted@2)=(b=2@4)}} would not be live.

In addition to solving the original problem we would not create the second problem in the
ticket description of updates to the view primary key with a smaller timestamp to be shadowed
by a shadowable tombstone using the max timestamp of a non-PK column. In this approach the
shadowing tombstone mechanism would still be orthogonal to the strict liveness and working
as it is today.

This mechanism alone would not solve all other problems but at least the ones described in
this ticket description with minimal change in the storage engine, let's now go through the
other issues to see how we could solve them:

*View row expires too fast with unselected base column (CASSANDRA-13127)*

>From the discussion on CASSANDRA-13127 it seems like you found and fixed some issues with
liveness comparison in addition to no view update being generated when there is an update
to an unselected column which seems to solve this issue in addition with the strict row concept
above. Even though this will require a read  when updating columns not in the view, the MV
user is already expecting to pay an extra price for MVs anyway so it shouldn't be a problem
- if you want performance you can build views manually or use CASSANDRA-9779 hopefully when
it's ready. :-)

*based column used in view key is TTLed (CASSANDRA-13657)*

This seems to be fixed by the fix above.

*Partial update unselected columns in base (CASSANDRA-13127)*

This seem to be more of an anomaly of the partial row update semantics which has bad consequences
for MVs than a problem with MV itself. 6.0 where thrift is gone is a good occasion to revisit
this semantics rather than trying to make MV fit into it.

Right now, inserting a non-PK column (in the example {{UPDATE base USING TIMESTAMP 10 SET
b=1 WHERE k=1 AND c=1}}) will create a row (k=1, c=1, b=1)@10 on the end-user perspective
({{SELECT * FROM base WHERE k=1 and C=1}}) while internally creating only a column, which
destroys the entire row in case the same non-PK column is removed ({{DELETE b FROM base USING
TIMESTAMP 11 WHERE k=1 AND c=1;}}).

While this semantics may make sense in a column-oriented world, it's a tad bizarre in a row
oriented world, given we can delete a row by simply unsetting a non-PK column. I think the
correct and expected semantic would be: {{a column update to a non-existing row will create
it}}.

This semantic is IMO what makes the most sense in a row oriented store and it's possible to
implement it without falling into the unexpected/inconsistent behaviors of CASSANDRA-6782/CASSANDRA-6668
by adding some kind of {{CREATE_IF_NOT_EXISTS}} flag which basically keeps the oldest liveness
entry if more than one is found with this flag when merging.

This semantic change would be the most correct while still preventing this anomaly with MVs
due to the current semantic. If you agree we can create another ticket to propose this change.

*merging should be commutative (CASSANDRA-13409)*
The second case, represented by {{testCellTombstoneAndShadowableTombstones}}, is fixed by
[using regular tombstones instead of shadowable tombstones for base table column deletions|https://github.com/pauloricardomg/cassandra/commit/1aeb0acbbaad6cc9520af6d1684043a5078eefa7]
(something which was probably overlooked on CASSANDRA-10261), as suggested on CASSANDRA-13409.

The first and third case, represented by {{testCommutativeRowDeletion}} can probably be fixed
on the view update generator by emitting column tombstones for columns not being updated when
it's detected a partial update on a previously deleted row (unless I'm missing something).

*with filter conditions (CASSANDRA-13547)*

This seems like a genuine bug on the view update generator, already discussed and proposed
patch on CASSANDRA-13547, of not including the view conditions on the base table select. The
second issue discussed there of a column with lower timestamp being wrongly shadowed should
be fixed by the strict liveness fix proposed before.

*unselected columns dropped*

We could still use the virtual cells idea (but probably as an additional flag to the current
cell structure) to support this properly, but I'm not sure it makes sense to optimize and
add additional storage overhead for this use case so we could maybe just disallow dropping
columns on tables with MVs configured altogether (at least initially).

I hacked a [strict liveness prototype|https://github.com/pauloricardomg/cassandra/commits/11500-poc]
on top of the original patch for CASSANDRA-13127 which seems to fix most tests except the
ones not covered above.

While this still requires changes to the storage engine it's mostly inclusion of new flags,
while other changes would be restricted to the view update generator. I could be easily missing
something so please let me know if there are cases or inefficiencies not covered by the suggestions
above which I did not consider.

I think the scope here is quite big, and while I understood your idea was to solve all these
issues with a single approach, if there aren't flaws with the suggestions above (which require
less changes in the storage engine) and we decide to go with them, we should probably break
up the solution to this as following:
- On this ticket implement the strict liveness idea and test it thoroughly (compaction, tombstone
purging etc) including the original patch for CASSANDRA-13127 (except the shadowable liveness
commit which is probably not required)
- Reopen CASSANDRA-13409 to update view generator to generate column tombstones when receiving
a partial update for a previously deleted row
- Reopen CASSANDRA-13547 to update view generator to include unselected views on base table
select (latest patch there should already be good enough I think?)
- Open a new ticket to deal with dropped unselected columns (either disallow or implement
a simplified virtual cell idea to deal with this)
- Open a new ticket to propose new insert-if-not-exists semantic for column update on non-existing
row
- Open a new ticket to update sstable dump to include new flags and shadowable tombstone
- In my hacked prototype the frozen collection test is failing, but I didn't investigate into
it, is this something that would be solved by any of the above or is it a different issue?
Also please let me know if I forgot to comment on any other case which is not covered above.

> Obsolete MV entry may not be properly deleted
> ---------------------------------------------
>
>                 Key: CASSANDRA-11500
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11500
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Materialized Views
>            Reporter: Sylvain Lebresne
>            Assignee: ZhaoYang
>
> When a Materialized View uses a non-PK base table column in its PK, if an update changes
that column value, we add the new view entry and remove the old one. When doing that removal,
the current code uses the same timestamp than for the liveness info of the new entry, which
is the max timestamp for any columns participating to the view PK. This is not correct for
the deletion as the old view entry could have other columns with higher timestamp which won't
be deleted as can easily shown by the failing of the following test:
> {noformat}
> CREATE TABLE t (k int PRIMARY KEY, a int, b int);
> CREATE MATERIALIZED VIEW mv AS SELECT * FROM t WHERE k IS NOT NULL AND a IS NOT NULL
PRIMARY KEY (k, a);
> INSERT INTO t(k, a, b) VALUES (1, 1, 1) USING TIMESTAMP 0;
> UPDATE t USING TIMESTAMP 4 SET b = 2 WHERE k = 1;
> UPDATE t USING TIMESTAMP 2 SET a = 2 WHERE k = 1;
> SELECT * FROM mv WHERE k = 1; // This currently return 2 entries, the old (invalid) and
the new one
> {noformat}
> So the correct timestamp to use for the deletion is the biggest timestamp in the old
view entry (which we know since we read the pre-existing base row), and that is what CASSANDRA-11475
does (the test above thus doesn't fail on that branch).
> Unfortunately, even then we can still have problems if further updates requires us to
overide the old entry. Consider the following case:
> {noformat}
> CREATE TABLE t (k int PRIMARY KEY, a int, b int);
> CREATE MATERIALIZED VIEW mv AS SELECT * FROM t WHERE k IS NOT NULL AND a IS NOT NULL
PRIMARY KEY (k, a);
> INSERT INTO t(k, a, b) VALUES (1, 1, 1) USING TIMESTAMP 0;
> UPDATE t USING TIMESTAMP 10 SET b = 2 WHERE k = 1;
> UPDATE t USING TIMESTAMP 2 SET a = 2 WHERE k = 1; // This will delete the entry for a=1
with timestamp 10
> UPDATE t USING TIMESTAMP 3 SET a = 1 WHERE k = 1; // This needs to re-insert an entry
for a=1 but shouldn't be deleted by the prior deletion
> UPDATE t USING TIMESTAMP 4 SET a = 2 WHERE k = 1; // ... and we can play this game more
than once
> UPDATE t USING TIMESTAMP 5 SET a = 1 WHERE k = 1;
> ...
> {noformat}
> In a way, this is saying that the "shadowable" deletion mechanism is not general enough:
we need to be able to re-insert an entry when a prior one had been deleted before, but we
can't rely on timestamps being strictly bigger on the re-insert. In that sense, this can be
though as a similar problem than CASSANDRA-10965, though the solution there of a single flag
is not enough since we can have to replace more than once.
> I think the proper solution would be to ship enough information to always be able to
decide when a view deletion is shadowed. Which means that both liveness info (for updates)
and shadowable deletion would need to ship the timestamp of any base table column that is
part the view PK (so {{a}} in the example below).  It's doable (and not that hard really),
but it does require a change to the sstable and intra-node protocol, which makes this a bit
painful right now.
> But I'll also note that as CASSANDRA-1096 shows, the timestamp is not even enough since
on equal timestamp the value can be the deciding factor. So in theory we'd have to ship the
value of those columns (in the case of a deletion at least since we have it in the view PK
for updates). That said, on that last problem, my preference would be that we start prioritizing
CASSANDRA-6123 seriously so we don't have to care about conflicting timestamp anymore, which
would make this problem go away.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


Mime
View raw message