cassandra-commits mailing list archives

From "Pavel Yaskevich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
Date Wed, 05 Mar 2014 17:56:49 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921110#comment-13921110
] 

Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------

bq. I've stated clearly what this introduces as a benefit: overwrite workloads no longer cause
excessive flushes

If you make a copy of the memtable buffer up front, you can clearly return it to the allocator
once it is overwritten or otherwise becomes useless, in the process of merging columns with
the previous row contents.
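
Roughly, the idea is something like the following minimal sketch; SketchAllocator and
SketchMemtable are hypothetical stand-ins for illustration only, not the actual Cassandra
allocator/memtable classes, and the sketch only tracks how much memory is considered live:

{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical allocator: in the real system this would hand out off-heap or
// pooled memory; here it only tracks how many bytes are currently "live".
final class SketchAllocator
{
    private long live;

    ByteBuffer allocate(int size) { live += size; return ByteBuffer.allocate(size); }
    void free(ByteBuffer buf)     { live -= buf.capacity(); }
    long liveBytes()              { return live; }
}

// Hypothetical memtable: on overwrite, the new value is copied into
// allocator-owned space and the superseded buffer is handed back to the
// allocator immediately, instead of waiting for a flush to reclaim it.
final class SketchMemtable
{
    private final SketchAllocator allocator = new SketchAllocator();
    private final Map<String, ByteBuffer> rows = new HashMap<>();

    void put(String key, byte[] value)
    {
        ByteBuffer copy = allocator.allocate(value.length);
        copy.put(value).flip();
        ByteBuffer previous = rows.put(key, copy);
        if (previous != null)
            allocator.free(previous);   // reclaimed as soon as it becomes useless
    }

    long liveBytes() { return allocator.liveBytes(); }

    public static void main(String[] args)
    {
        SketchMemtable mt = new SketchMemtable();
        mt.put("row1", new byte[64]);
        mt.put("row1", new byte[64]);   // overwrite: the first buffer is freed right away
        System.out.println("live bytes: " + mt.liveBytes());   // 64, not 128
    }
}
{code}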

bq. Your next sentence states how this is a large cause of memory consumption, so surely we
should be using that memory if possible for other uses (returning it to the buffer cache,
or using it internally for more caching)?

It doesn't state that it is a *large cause of memory consumption*; it states that it has an
additional cost, but in the steady state it won't be allocating over the limit because of the
properties of the system that we have, namely the fixed number of threads.
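
To put a purely illustrative number on that bound (not a measurement from this ticket): with a
fixed pool of, say, 32 writer threads, each holding at most one in-flight copy of at most 64 KB,
the extra transient memory from copying stays under 32 * 64 KB = 2 MB in the steady state,
regardless of the overall write rate.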

bq. Are you performing a full object tree copy, and doing this with a running system to see
how it affects the performance of other system components? If not, it doesn't seem to be a
useful comparison. Note that this will still create a tremendous amount of heap churn, as
most of the memory used by objects right now is on-heap. So copying the records is almost
certainly no better for young gen pressure than what we currently do - in fact, it probably
makes the situation worse.

Do you mean this? Let's say we copy a Cell (or Column object), which is one level deep, so we
just allocate additional space for the object headers and do a copy; most of the work would be
spent copying the data (name/value) anyway. Since we want to stay inside ParNew, see how many
such allocations you can do in e.g. 1 second, then wipe the whole thing and do it again. We are
doing mlockall too, which should make that even faster as we are sure that the heap is
pre-faulted already.
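
Something along these lines, as a rough self-contained sketch: the Cell here is a simplified
stand-in for the real Column/Cell classes, the sizes are arbitrary, and the periodic clear just
keeps the sketch's own live set small:

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class CopyRateSketch
{
    // Simplified one-level-deep cell: copying it means a new object header
    // plus a copy of the name/value bytes.
    static final class Cell
    {
        final ByteBuffer name;
        final ByteBuffer value;
        Cell(ByteBuffer name, ByteBuffer value) { this.name = name; this.value = value; }

        Cell copy()
        {
            ByteBuffer n = ByteBuffer.allocate(name.remaining());
            n.put(name.duplicate()).flip();
            ByteBuffer v = ByteBuffer.allocate(value.remaining());
            v.put(value.duplicate()).flip();
            return new Cell(n, v);
        }
    }

    public static void main(String[] args)
    {
        Cell template = new Cell(ByteBuffer.wrap(new byte[16]), ByteBuffer.wrap(new byte[64]));
        List<Cell> batch = new ArrayList<>();
        for (int round = 0; round < 5; round++)
        {
            long copies = 0;
            long deadline = System.nanoTime() + 1_000_000_000L;
            while (System.nanoTime() < deadline)
            {
                batch.add(template.copy());
                copies++;
                if (batch.size() == 100_000)   // bound the live set so everything stays young-gen sized
                    batch.clear();
            }
            batch.clear();                     // wipe the whole thing and do it again
            System.out.println("round " + round + ": " + copies + " shallow copies in ~1s");
        }
    }
}
{code}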

bq. It may not be causing the young gen pressure you're seeing, but it certainly offers some
benefit here by keeping more rows in memory so recent queries are more likely to be answered
with zero allocation, so reducing young gen pressure; it is also a foundation for improving
the row cache and introducing a shared page cache which could bring us closer to zero allocation
reads. _And so on...._

I'm not sure how this would help in the case of the row cache: once a reference is added to the
row cache, it means the memtable would hang around until that row is purged. So if there is a
long-lived row (write once, read multiple times) in each of the regions (and we reclaim based
on regions), wouldn't that keep the memtable around longer than expected?

bq. It's also not clear to me how you would be managing the reclaim of the off-heap allocations
without OpOrder, or do you mean to only use off-heap buffers for readers, or to ref-count
any memory as you're reading it? Not using off-heap memory for the memtables would negate
the main original point of this ticket: to support larger memtables, thus reducing write amplification.
Ref-counting incurs overhead linear to the size of the result set, much like copying, and
is also fiddly to get right (not convinced it's cleaner or neater), whereas OpOrder incurs
overhead proportional to the number of times you reclaim. So if you're using OpOrder, all
you're really talking about is a new RefAction: copyToAllocator() or something. So it doesn't
notably reduce complexity, it just reduces the quality of the end result.

In terms of memory usage, copying does add a linear cost, yes, but at the same time it makes
the system's behavior more controllable/predictable, which is what ops usually care about;
whereas, even on the artificial stress test, there seems to be a performance drop once the
off-heap feature is enabled, which is no surprise once you look at how much complexity it
actually adds.

bq. Also, I'd love to see some evidence for this (particularly the latter). I'm not disputing
it, just would like to see what caused you to reach these conclusions. These definitely warrant
separate tickets IMO, but if you have evidence for it, it would help direct any work.

Well, it seems like you have never operated a real Cassandra cluster, have you? All of the
problems that I have listed here are well known; you can even simulate this with Docker VMs by
making the internal network gradually slower. There is no built-in back pressure mechanism, so
right now Cassandra would accept a bunch of operations at normal speed (if the outgoing link is
physically different from the internal one) but would suddenly just stop accepting anything and
fail internally because of a GC storm caused by all of the internode buffers hanging around.


> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
