cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams
Date Tue, 26 Sep 2017 15:15:01 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180923#comment-16180923
] 

Paulo Motta commented on CASSANDRA-13299:
-----------------------------------------

Thanks for the updates, this is looking good! I managed to reproduce an OOM when repairing
a wide partition with 100K rows and verified that this patch avoids the OOM by splitting the
partition in multiple batches (found CASSANDRA-13899 on the way). Awesome job!

While splitting on the happy case is working nicely, we need to ensure the range tombstone
handling (specially range deletions) is working correctly and well tested before committing
this.

I noticed that the previous {{ThrottledUnfilteredIterator}} implementation could [potentially
return|https://github.com/jasonstack/cassandra/blob/b8cb49035d3cf77198b31df7c51a174fffe3edaf/src/java/org/apache/cassandra/db/rows/ThrottledUnfilteredIterator.java#L166]
{{throttle+2}} unfiltereds, differently from the documentation which states that the maximum
number of unfiltereds per batch is {{throttle+1}}. I also noticed that the case when there
is a row between two markers was not being tested by existing tests, since we need a range
deletion to reproduce this scenario. I fixed this and added more tests on [this commit|https://github.com/pauloricardomg/cassandra/commit/47d8ca3592cb6382bb4c308720646395306a0a69].
Could you also modify [complexThrottleWithTombstoneTest|https://github.com/jasonstack/cassandra/commit/b8cb49035d3cf77198b31df7c51a174fffe3edaf#diff-5162644c24391628b339b88c3619427cR66]
to test range deletions?

The previous change [requires|https://github.com/pauloricardomg/cassandra/commit/47d8ca3592cb6382bb4c308720646395306a0a69#diff-2acee8fea5cd82a51fda4af6e38faf13R60]
the minimum throttle size to be 2, otherwise it would not be possible to make progress on
the iterator in the presence of open and close markers.

I think that instead of throwing an {{AssertionError}} when the returned iterator is not exhausted,
we could simply exhaust it, effectively skipping entries, since this might be a possible usage
of {{ThrottledUnfilteredIterator}} so I did this on [this commit|https://github.com/pauloricardomg/cassandra/commit/04ed5ecb5183195601950fc9efd2ca9123596487].

I also added an utility method {{ThrottledUnfilteredIterator.throttle(UnfilteredPartitionIterator
partitionIterator, int maxBatchSize)}} to allow throttling an {{UnfilteredPartitionIterator}}
transparently and used that on {{StreamReceiveTask}} [on this commit|https://github.com/pauloricardomg/cassandra/commit/4f8c3b8faa2644133d301ac7bf7b748f7ec265ee].

I had another look at the {{throttled_partition_update_test}} [dtest|https://github.com/riptano/cassandra-dtest/commit/f3307adef349f232ec0ae64e902164684f32cca0]
and think we can make the following improvements:
* Right now we're [verifying the results|https://github.com/riptano/cassandra-dtest/commit/f3307adef349f232ec0ae64e902164684f32cca0#diff-62ba429edee6a4681782f078246c9893R1410]
with all the nodes UP, but it's possible that another node responds the query even though
one of the inconsistent nodes did not stream correctly. I think we should check the results
on each node individually (with the others down) to ensure they streamed data correctly from
other nodes.
*  Add [range deletions|https://issues.apache.org/jira/browse/CASSANDRA-6237] since that's
when the range tombstones special cases will be properly exercised.

Please let me know what do you think about these suggestions.

> Potential OOMs and lock contention in write path streams
> --------------------------------------------------------
>
>                 Key: CASSANDRA-13299
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Benjamin Roth
>            Assignee: ZhaoYang
>             Fix For: 4.x
>
>
> I see a potential OOM, when a stream (e.g. repair) goes through the write path as it
is with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators and they
again produce mutations. So every partition creates a single mutation, which in case of (very)
big partitions can result in (very) big mutations. Those are created on heap and stay there
until they finished processing.
> I don't think it is necessary to create a single mutation for each partition. Why don't
we implement a PartitionUpdateGeneratorIterator that takes a UnfilteredRowIterator and a max
size and spits out PartitionUpdates to be used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size, max_mutation_size,
commitlog_segment_size / 2). reasonable_absolute_max_size could be like 16M or sth.
> A mutation shouldn't be too large as it also affects MV partition locking. The longer
a MV partition is locked during a stream, the higher chances are that WTE's occur during streams.
> I could also imagine that a max number of updates per mutation regardless of size in
bytes could make sense to avoid lock contention.
> Love to get feedback and suggestions, incl. naming suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


Mime
View raw message