cassandra-commits mailing list archives

From "Jeremiah Jordan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12991) Inter-node race condition in validation compaction
Date Wed, 07 Dec 2016 12:21:59 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15728633#comment-15728633 ]

Jeremiah Jordan commented on CASSANDRA-12991:
---------------------------------------------

So I have thought about this before and decided the complexity of trying to improve it wasn't
worth the streaming saved. The consequence is that some number of in-flight partitions will
have their unrepaired parts streamed around.

One idea I had for this was to have repair agree on a time X seconds in the future to do the
flush, and to flush only data with timestamps less than X / 2 seconds, or to split the flush
into data before that time and data after it. But that seemed like a lot of extra complexity
in flushing and repair for what I figured wouldn't be much savings.
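
To illustrate the split-flush idea (purely a sketch, not Cassandra code; the class and method
names here are hypothetical): given an agreed cutoff, the flush would write data with
timestamps at or below the cutoff separately from newer data, so both replicas end up hashing
the same "old" data.

// Illustrative sketch only -- not Cassandra code. Given a cutoff agreed between
// nodes, put cells written at or before the cutoff into one group ("repairable")
// and everything newer into another ("pending"). All names are hypothetical.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SplitFlushSketch {

    /** A single cell: partition key, value, and write timestamp in microseconds. */
    record Cell(String key, String value, long writeTimestampMicros) {}

    /** Cells at or below the cutoff go to "repairable", the rest to "pending". */
    static Map<String, List<Cell>> splitByCutoff(List<Cell> memtable, long cutoffMicros) {
        Map<String, List<Cell>> groups = new TreeMap<>();
        groups.put("repairable", new ArrayList<>());
        groups.put("pending", new ArrayList<>());
        for (Cell c : memtable) {
            String bucket = c.writeTimestampMicros() <= cutoffMicros ? "repairable" : "pending";
            groups.get(bucket).add(c);
        }
        return groups;
    }

    public static void main(String[] args) {
        long cutoff = 10_000;
        List<Cell> memtable = List.of(
                new Cell("k1", "old", 9_999),   // flushed and hashed by both replicas
                new Cell("k2", "new", 10_001)); // stays out of the validation hash
        splitByCutoff(memtable, cutoff)
                .forEach((bucket, cells) -> System.out.println(bucket + " -> " + cells));
    }
}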

> Inter-node race condition in validation compaction
> --------------------------------------------------
>
>                 Key: CASSANDRA-12991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that the Merkle trees
> differ due to in-flight mutations even though the data is actually consistent.
> Example:
> t = 10000: 
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> The hashes of nodes A and B will differ, but the data is consistent when viewed as of a
> snapshot at t = 10000.
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic CFs or
> partitions, but on high-traffic CFs and very big partitions it can have a bigger impact
> and is a waste of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing a validation
> compaction (a rough sketch of this filtering follows below the quoted description):
> - Cells with a timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled:
>  - Bounds have to be removed if the deletion timestamp > snapshot time
>  - Boundary markers have to be either changed to a bound or removed completely, depending
> on whether the start, the end, or both are affected
> This is probably known behaviour. Have there been any discussions about this in the past?
> I did not find a matching issue, so I created this one.
> I am happy about any feedback whatsoever.
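
As a rough illustration of the snapshot-timestamp filtering proposed in the description
(again only a sketch: simple cells are assumed, tombstone range markers are omitted, the
Merkle tree is replaced by a single flat digest for brevity, and all names are hypothetical),
the validation hash would skip any cell written after the snapshot time:

// Illustrative sketch only -- not Cassandra's actual validation code. When
// hashing data for repair, ignore any cell whose write timestamp is newer than
// the agreed snapshot time, so both replicas hash the same view of the data.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class SnapshotValidationSketch {

    /** A cell with its partition key, value, and write timestamp in microseconds. */
    record Cell(String key, String value, long writeTimestampMicros) {}

    /** Hash only cells written at or before the snapshot time. */
    static byte[] hashAtSnapshot(List<Cell> cells, long snapshotMicros) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        for (Cell c : cells) {
            if (c.writeTimestampMicros() > snapshotMicros) {
                continue; // arrived after the snapshot: skip so replicas agree
            }
            digest.update(c.key().getBytes(StandardCharsets.UTF_8));
            digest.update(c.value().getBytes(StandardCharsets.UTF_8));
        }
        return digest.digest();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        long snapshot = 10_000; // t = 10000 from the example above
        // Node A saw the mutation at t = 10001, node B at t = 10002; both cells
        // are excluded, so the two digests match despite the racing flush.
        List<Cell> nodeA = List.of(new Cell("k1", "v0", 9_000), new Cell("k1", "v1", 10_001));
        List<Cell> nodeB = List.of(new Cell("k1", "v0", 9_000), new Cell("k1", "v1", 10_002));
        System.out.println(java.util.Arrays.equals(
                hashAtSnapshot(nodeA, snapshot), hashAtSnapshot(nodeB, snapshot)));
    }
}

Run against the timeline above, both replicas produce the same digest because the mutation
arriving at t = 10001 / t = 10002 is excluded from the hash on both sides.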



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
