cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
Date Thu, 10 Mar 2016 17:53:40 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189628#comment-15189628
] 

Ariel Weisberg commented on CASSANDRA-11327:
--------------------------------------------

bq. Perhaps you should outline precisely the algorithm you propose, since there's a whole
class of similar algorithms and it would narrow the discussion?

There is probably some tuning that could be done to make this smarter, but basically if right
now 1/4 of the heap is the memtable memory limit, change it to 1/8th (cut it in half). Let's
ignore 2i and look at just a memtable flush. Let's say we know the expected on-disk size
as well as the number of partitions or rows, and we can guess at the average weight of each
partition or row. Every N partitions or rows we can update the amount of free memory to reflect
the weight of what was flushed. Or we could be more precise if tracking the weight of
what is flushed isn't difficult.
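
To make the accounting concrete, here is a minimal sketch of what I mean. It is not the existing
MemtablePool/MemtableAllocator code: the class and method names (FlushReleaseSketch, startFlush,
onPartitionFlushed) are hypothetical, and it assumes a single flush writer thread per memtable
and a known expected partition count. Memory is released in proportion to flush progress every
N partitions, and the remainder is released when the flush completes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of incremental release during flush; not Cassandra's actual API. */
final class FlushReleaseSketch {
    private final long limitBytes;                      // e.g. heap/8 instead of heap/4
    private final AtomicLong reservedBytes = new AtomicLong();

    FlushReleaseSketch(long limitBytes) { this.limitBytes = limitBytes; }

    /** Writers reserve memory; false means the writer would have to block. */
    boolean tryReserve(long bytes) {
        while (true) {
            long cur = reservedBytes.get();
            if (cur + bytes > limitBytes) return false;
            if (reservedBytes.compareAndSet(cur, cur + bytes)) return true;
        }
    }

    long reserved() { return reservedBytes.get(); }

    /** Begin tracking a flush of a memtable that holds memtableBytes of accounted memory. */
    Flush startFlush(long memtableBytes, long expectedPartitions, long everyN) {
        return new Flush(memtableBytes, expectedPartitions, everyN);
    }

    /** Progress tracker for one flush; assumed to be called from a single flush thread. */
    final class Flush {
        private final long memtableBytes;
        private final long expectedPartitions;
        private final long everyN;
        private long flushedPartitions;
        private long releasedBytes;

        private Flush(long memtableBytes, long expectedPartitions, long everyN) {
            this.memtableBytes = memtableBytes;
            this.expectedPartitions = expectedPartitions;
            this.everyN = everyN;
        }

        /** Every N partitions, release the estimated weight of what has been flushed so far. */
        void onPartitionFlushed() {
            flushedPartitions++;
            if (flushedPartitions % everyN == 0)
                releaseUpTo(memtableBytes * flushedPartitions / expectedPartitions);
        }

        /** Release whatever the estimate has not yet released. */
        void onComplete() { releaseUpTo(memtableBytes); }

        private void releaseUpTo(long target) {
            long capped = Math.min(target, memtableBytes); // never release more than was held
            long delta = capped - releasedBytes;
            if (delta > 0) {
                reservedBytes.addAndGet(-delta);
                releasedBytes = capped;
            }
        }
    }
}
```

The released bytes are only permits. The flushed memtable still occupies real memory until the
flush finishes, which is why the limit has to be halved to keep the peak footprint the same.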

Peak footprint remains the same since we have cut the limit in half, but actual footprint
will vary between the limit and double the limit as flushing releases memory to writers while
the memory is still committed.

bq. By reducing their size, transient overload becomes more frequent, and SLAs are not met
or the cluster capacity must be increased.
I agree this is the biggest problem. I think you are right in terms of dealing with variance:
in the worst case it reduces memory utilization by half, but in the average or real case maybe
it's not so bad? Maybe flushing isn't super far behind, it's just a little behind?

bq. So I don't personally see the rationale for making transient overload (Cassandra's strong
suit) worse, in exchange for a really temporary reprieve on sustained overload.
I don't think we should dismiss this out of hand. I think there are users who do care about
saturating load and who care about the difficulty of determining exactly how fast they can
write to the database. Spark and bulk loading are both pain points. Right now it's very difficult
because the database doesn't provide any notice that you are about to saturate; you just start
getting mass timeouts instead of backpressure.

When timeouts do occur, don't those also introduce additional workload amplification in the
form of retries, hinted handoff, and repair? I am not completely sold that this kind of thing
would cripple the ability of memtables to handle variance in arrival distribution. It certainly
reduces the window and magnitude of variance that can be tolerated, but for capacity planning
purposes peak throughput isn't the only factor.

> Maintain a histogram of times when writes are blocked due to no available memory
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11327
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Ariel Weisberg
>
> I have a theory that part of the reason C* is so sensitive to timeouts during saturating
write load is that throughput is basically a sawtooth with valleys at zero. This is something
I have observed and it gets worse as you add 2i to a table or do anything that decreases the
throughput of flushing.
> I think the fix for this is to incrementally release memory pinned by memtables and 2i
during flushing instead of releasing it all at once. I know that's not really possible, but
we can fake it with memory accounting that tracks how close to completion flushing is and
releases permits for additional memory. This will lead to a bit of a sawtooth in real memory
usage, but we can account for that so the peak footprint is the same.
> I think the end result of this change will be a sawtooth, but the valley of the sawtooth
will not be zero; it will be the rate at which flushing progresses. Optimizing the rate at
which flushing progresses and its fairness with other work can then be tackled separately.
> Before we do this I think we should demonstrate that pinned memory due to flushing is
actually the issue by getting better visibility into the distribution of instances of not
having any memory by maintaining a histogram of spans of time where no memory is available
and a thread is blocked.
> [MemtableAllocator$SubPool.allocate(long)|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java#L186]
should be a relatively straightforward entry point for this. The first thread to block can
mark the start of memory starvation and the last thread out can mark the end. Have a periodic
task that tracks the amount of time spent blocked per interval of time and if it is greater
than some threshold log with more details, possibly at debug.
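
A rough sketch of the first-in/last-out bookkeeping described above, under stated assumptions:
StarvationTrackerSketch and its methods are hypothetical names, not part of MemtableAllocator,
timestamps are passed in explicitly (in practice System.nanoTime()), and a plain synchronized
block stands in for whatever cheaper coordination a real patch would use. Completed starvation
spans go into a power-of-two histogram, and a periodic task can call drainInterval to get the
time spent blocked since its last call and compare it against a threshold.

```java
/** Hypothetical sketch of memory-starvation tracking; not Cassandra's actual API. */
final class StarvationTrackerSketch {
    // Bucket i counts completed spans whose length in nanos satisfies floor(log2(nanos)) == i.
    private final long[] spanHistogram = new long[64];
    private int blockedThreads;
    private long spanStartNanos;      // when the current starvation span began
    private long intervalMarkNanos;   // start of the not-yet-reported part of the span
    private long blockedNanosInInterval;

    /** Called just before a thread blocks waiting for memory. */
    synchronized void onBlock(long nowNanos) {
        if (blockedThreads++ == 0) {  // first thread in marks the start of starvation
            spanStartNanos = nowNanos;
            intervalMarkNanos = nowNanos;
        }
    }

    /** Called when a blocked thread obtains memory and resumes. */
    synchronized void onUnblock(long nowNanos) {
        if (--blockedThreads == 0) {  // last thread out marks the end of starvation
            blockedNanosInInterval += nowNanos - intervalMarkNanos;
            long span = Math.max(1, nowNanos - spanStartNanos);
            spanHistogram[63 - Long.numberOfLeadingZeros(span)]++;
        }
    }

    /** Periodic task: returns nanos spent starved since the previous call, then resets. */
    synchronized long drainInterval(long nowNanos) {
        if (blockedThreads > 0) {     // still starved: count the span up to now
            blockedNanosInInterval += nowNanos - intervalMarkNanos;
            intervalMarkNanos = nowNanos;
        }
        long result = blockedNanosInInterval;
        blockedNanosInInterval = 0;
        return result;
    }

    synchronized long spansInBucket(int bucket) { return spanHistogram[bucket]; }
}
```

The periodic task would divide the drained value by the interval length and log, possibly at
debug, when the fraction exceeds the chosen threshold.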



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
