From "Benjamin Roth (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12991) Inter-node race condition in validation compaction
Date Thu, 15 Dec 2016 09:44:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750908#comment-15750908 ]

Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:43 AM:
---------------------------------------------------------------------

I created a little script to calculate some possible scenarios: https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}
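
For reference, below is a minimal sketch of how numbers in this ballpark can be reproduced. It
assumes the race window per range is the mutation latency plus twice the validation latency and
that writes are spread evenly over all ranges; the linked gist may differ in its details.

{code:java}
// Back-of-the-envelope estimate of unnecessary range repairs caused by the race condition.
// Assumption (not taken from the gist): a mutation can be seen by one replica's validation
// but missed by the other's if it arrives within mutationLatency + 2 * validationLatency
// of the validation start.
public class RepairRaceEstimate
{
    static void estimate(int nodes, int tokensPerNode, double mutationLatencyMs,
                         double validationLatencyMs, double requestsPerSecond)
    {
        int totalRanges = nodes * tokensPerNode;
        double raceWindowMs = mutationLatencyMs + 2 * validationLatencyMs;
        double writesPerRangePerMs = requestsPerSecond / totalRanges / 1000.0;
        double raceProbabilityPerRange = writesPerRangePerMs * raceWindowMs;
        double unnecessaryRangeRepairs = raceProbabilityPerRange * totalRanges;

        System.out.printf("%d Nodes, %d Tokens / Node, %.0fms Mutation Latency, %.0fms Validation Latency, %.0f req/s%n",
                          nodes, tokensPerNode, mutationLatencyMs, validationLatencyMs, requestsPerSecond);
        System.out.printf("Total Ranges: %d%n", totalRanges);
        System.out.printf("Likeliness for RC per range: %.2f%%%n", raceProbabilityPerRange * 100);
        System.out.printf("Unnecessary range repairs per repair: %.2f%n%n", unnecessaryRangeRepairs);
    }

    public static void main(String[] args)
    {
        estimate(3, 256, 1, 1, 1000);
        estimate(3, 256, 10, 1, 1000);
        estimate(8, 256, 10, 1, 5000);
        estimate(8, 256, 20, 1, 5000);
    }
}
{code}

With these assumptions the sketch reproduces the four scenarios quoted above.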

You may ask why I entered latencies like 10ms or 20ms - that seems quite high. It is indeed
quite high for regular tables and a cluster that is not overloaded. Under those conditions the
latency is dominated by network latency, so 1ms seems quite fair to me.
As soon as you use MVs and your cluster tends towards overload, higher latencies are not unrealistic.
You have to take into account that an MV operation does a read before write, and the latency can
vary a lot. For MVs the latency is no longer dominated (only) by network latency but by MV lock
acquisition and the read before write. Both factors can introduce MUCH higher latencies, depending
on concurrent operations on the MV, the number of SSTables, the compaction strategy - everything
that affects read performance.
If your cluster is overloaded, these effects have an even bigger impact.
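
To make the latency argument concrete, here is a conceptual sketch (simplified stand-ins, not
Cassandra's actual MV write path) of where an MV base-table write spends its time:

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Conceptual illustration of why an MV base-table write is slower and more variable than a
// plain write: it serializes on a partition lock and performs a local read before building
// the view updates. All methods below are hypothetical stubs.
final class MvWriteSketch
{
    private final ReentrantLock partitionLock = new ReentrantLock(); // stands in for the per-partition MV lock

    void applyBaseMutation(String partitionKey, String newValue)
    {
        partitionLock.lock();                               // MV lock acquisition: queues behind concurrent writers
        try
        {
            String before = readExistingRow(partitionKey);  // read before write: cost depends on SSTable count,
                                                            // compaction strategy and everything else that affects reads
            String viewUpdate = buildViewUpdate(before, newValue);
            writeBaseRow(partitionKey, newValue);
            sendToViewReplicas(viewUpdate);                 // may end up as hints if view replicas are overloaded
        }
        finally
        {
            partitionLock.unlock();
        }
    }

    private String readExistingRow(String key) { return null; }
    private String buildViewUpdate(String before, String after) { return after; }
    private void writeBaseRow(String key, String value) {}
    private void sendToViewReplicas(String update) {}
}
{code}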

I observed MANY situations on our production system where writes timed out during streaming
because of lock contention and/or read-before-write (RBW) impacts. These situations mainly pop up
during repair sessions, when streams cause bulk mutation applies (see the StreamReceiverTask path
for MVs). The impact is even higher due to CASSANDRA-12888. Parallel repairs, as e.g. Reaper runs
them, make the situation even more unpredictable and increase the "drift" between nodes: for
example, Node A is overloaded but Node B is not, because Node A receives a stream from a different
repair and Node B does not.

This is a vicious circle driven by several factors:
- Streams put pressure on nodes - especially larg(er) partitions
- hints tend to queue up
- hint delivery puts more pressure
- retransmission of failed hint delivery puts even more pressure
- latencies go up
- stream validations drift
- more (unnecessary) streams
- goto 0

This calculation example is just hypothetical. It *may* happen as calculated, but it totally
depends on the data model, cluster dimensions, cluster load, write activity, distribution of writes
and repair execution. I don't claim that fixing this issue will remove all MV performance problems,
but it may help to remove one impediment in the mentioned vicious circle.

My proposal is NOT to control flushes. That is far too complicated and won't help. A flush,
whenever it happens and whatever range it flushes, may or may not contain a mutation that
_should_ be there. What helps is to cut off all data retrospectively at a synchronized and
fixed timestamp when executing the validation. You can define a grace period (GP): when
validation starts at time VS on the repair coordinator, you expect all mutations that were
created before VS - GP to have arrived no later than VS. That can be done at the SSTable scanner
level by filtering out all events (cells, tombstones) newer than VS - GP during validation
compaction - something like the opposite of purging tombstones after GCGS.
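
A minimal sketch of such a cut-off filter, using simplified stand-in types instead of Cassandra's
real SSTable scanner and Unfiltered hierarchy (an illustration of the idea, not a patch):

{code:java}
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: "Event" stands in for whatever the validation scanner emits
// (cells, tombstones); a real implementation would hook into the SSTable scanner
// used for validation compaction.
final class ValidationCutoffFilter
{
    interface Event
    {
        long timestampMicros();
    }

    private final long cutoffMicros; // VS - GP, agreed on by all replicas

    ValidationCutoffFilter(long validationStartMicros, long gracePeriodMicros)
    {
        this.cutoffMicros = validationStartMicros - gracePeriodMicros;
    }

    /** Keep only events that every replica is expected to have received by VS. */
    List<Event> filter(List<Event> events)
    {
        return events.stream()
                     .filter(e -> e.timestampMicros() <= cutoffMicros)
                     .collect(Collectors.toList());
    }
}
{code}

All replicas would have to agree on VS, and GP has to be larger than the worst mutation latency
you expect - analogous to how GCGS bounds tombstone purging.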



> Inter-node race condition in validation compaction
> --------------------------------------------------
>
>                 Key: CASSANDRA-12991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that due to in-flight
> mutations the merkle trees differ even though the data is actually consistent.
> Example:
> t = 10000: 
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> The hashes of nodes A and B will differ, but the data is consistent from the point of view of a
> snapshot at t = 10000.
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic CFs or partitions,
> but on high-traffic CFs and maybe very big partitions it may have a bigger impact and is a waste
> of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing a validation
> compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled
>  - Bounds have to be removed if delete timestamp > snapshot time
>  - Boundary markers have to be either changed to a bound or removed completely, depending on
> whether the start side, the end side, or both are affected (see the sketch after the issue
> description below)
> Probably this is a known behaviour. Have there been any discussions about this in the past? I
> did not find a matching issue, so I created this one.
> I am happy about any feedback, whatsoever.
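
As a companion to the cut-off filter sketched in the comment above, the boundary-marker handling
from the possible solution could look roughly like this (again with simplified stand-in types, not
Cassandra's real range tombstone marker classes):

{code:java}
import java.util.Optional;

// A boundary marker closes one range deletion and opens the next at the same position.
// When filtering at the snapshot timestamp, it is kept, downgraded to a single bound,
// or dropped, depending on which of its two deletions are newer than the snapshot.
final class BoundaryFilterSketch
{
    interface Marker {}

    /** Opens (isStart == true) or closes a single range deletion. */
    record Bound(boolean isStart, long deletionMicros) implements Marker {}

    /** Closes one range deletion and opens the next at the same position. */
    record Boundary(long closingDeletionMicros, long openingDeletionMicros) implements Marker {}

    /** Returns the marker that survives filtering at the snapshot timestamp, if any. */
    static Optional<Marker> filter(Boundary b, long snapshotMicros)
    {
        boolean keepClose = b.closingDeletionMicros() <= snapshotMicros;
        boolean keepOpen  = b.openingDeletionMicros() <= snapshotMicros;
        if (keepClose && keepOpen)
            return Optional.of(b);                                           // keep the boundary as-is
        if (keepClose)
            return Optional.of(new Bound(false, b.closingDeletionMicros())); // only the end bound survives
        if (keepOpen)
            return Optional.of(new Bound(true, b.openingDeletionMicros()));  // only the start bound survives
        return Optional.empty();                                             // both deletions filtered out
    }
}
{code}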



