cassandra-commits mailing list archives

From "Peter Schuller (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-2463) Flush and Compaction Unnecessarily Allocate 256MB Contiguous Buffers
Date Tue, 12 Apr 2011 22:45:05 GMT


Peter Schuller commented on CASSANDRA-2463:

A noteworthy factor here is that unless fsync() plus fadvise()/madvise() has evicted the data,
in the normal case it should still be in the page cache for any reasonably sized row.
For truly huge rows, the penalty of seeking back should be insignificant anyway.

Total +1 on avoiding huge allocations. I was surprised to realize, when this ticket came along,
that this was happening ;)

I have suspected that the bloom filters are also a major concern with respect to triggering
promotion failures (but I haven't done testing to confirm this). Are there cases other than
this and the bloom filters where we know that we're doing large allocations?

> Flush and Compaction Unnecessarily Allocate 256MB Contiguous Buffers
> --------------------------------------------------------------------
>                 Key: CASSANDRA-2463
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.4
>         Environment: Any
>            Reporter: C. Scott Andreas
>              Labels: patch
>             Fix For: 0.7.4
>         Attachments: 2463-v2.txt, patch.diff
>   Original Estimate: 72h
>  Remaining Estimate: 72h
> Currently, Cassandra 0.7.x allocates a 256MB contiguous byte array at the beginning of
a memtable flush or compaction (presently hard-coded as Config.in_memory_compaction_limit_in_mb).
When several memtable flushes are triggered at once (as by `nodetool flush` or `nodetool snapshot`),
the tenured generation will typically experience extreme pressure as it attempts to locate
[n] contiguous 256MB chunks of heap to allocate. This will often trigger a promotion failure,
resulting in a stop-the-world GC until the allocation can be made. (Note that in the case
of the "release valve" being triggered, the problem is even further exacerbated; the release
valve will ironically trigger two contiguous 256MB allocations when attempting to flush the
two largest memtables).
> This patch sets the buffer to be used by BufferedRandomAccessFile to Math.min(bytesToWrite,
BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE) rather than a hard-coded 256MB. The typical
resulting buffer size is 64KB.
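> The sizing rule described above can be sketched as follows. This is a minimal illustration,
not the attached patch: the class WriteBufferSizing and the method chooseBufferSize are hypothetical
names, and the 64KB constant is an assumption taken from the "typical resulting buffer size"
mentioned above rather than from the BufferedRandomAccessFile source.

```java
// Hypothetical sketch of the sizing rule; not the actual CASSANDRA-2463 patch.
final class WriteBufferSizing {
    // Assumed stand-in for BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE (64KB).
    static final int DEFAULT_BUFFER_SIZE = 64 * 1024;

    // Returns the number of bytes to allocate for a write of bytesToWrite bytes:
    // never more than the default buffer, and never more than the data itself.
    static int chooseBufferSize(long bytesToWrite) {
        if (bytesToWrite < 0) {
            throw new IllegalArgumentException("bytesToWrite must be non-negative");
        }
        // The result is at most DEFAULT_BUFFER_SIZE, so the int cast is safe.
        return (int) Math.min(bytesToWrite, (long) DEFAULT_BUFFER_SIZE);
    }

    public static void main(String[] args) {
        long smallWrite = 4L * 1024;           // 4KB of data
        long largeWrite = 256L * 1024 * 1024;  // 256MB of data
        System.out.println(chooseBufferSize(smallWrite)); // prints 4096
        System.out.println(chooseBufferSize(largeWrite)); // prints 65536
    }
}
```

> With a rule of this shape the buffer is bounded by the default size regardless of how much
data the flush or compaction writes, so no single write requests a large contiguous array.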
> I've taken some time to measure the impact of this change on the base 0.7.4 release and
with this patch applied. This test involved launching Cassandra, performing four million writes
across three column families from three clients, and monitoring heap usage and garbage collections.
Cassandra was launched with 2GB of heap and the default JVM options shipped with the project.
This configuration has 7 column families with a total of 15GB of data.
> Here's the base 0.7.4 release:
> Note that on launch, we see a flush + compaction triggered almost immediately, resulting
in at least seven very quick 256MB allocations that max out the heap, causing a promotion
failure and a full GC. As flushes proceed, we see that most of these have a corresponding
CMS, consistent with the pattern of a large allocation and immediate collection. We see a
second promotion failure and full GC at the 75% mark as the allocations cannot be satisfied
without a collection, along with several CMSs in between. In the failure cases, the allocation
requests occur so quickly that a standard CMS phase cannot complete before a ParNew attempts
to promote the surviving byte array into the tenured generation. The heap usage and GC profile
shown in this graph is very unhealthy.
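> As a rough back-of-the-envelope check of the figures above, the following sketch compares
seven 256MB buffers against the 2GB heap used in the test. The class and method names are
hypothetical, the 64KB patched size is the typical value quoted earlier, and treating all seven
buffers as live at the same time is a simplifying assumption.

```java
// Hypothetical arithmetic sketch; names and the "all buffers live at once" model are assumptions.
final class HeapPressureEstimate {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    // Total bytes requested when `count` buffers of `bufferBytes` each are allocated.
    static long totalBufferBytes(int count, long bufferBytes) {
        return Math.multiplyExact((long) count, bufferBytes);
    }

    // Fraction of the heap that `bytes` would occupy.
    static double heapFraction(long bytes, long heapBytes) {
        return (double) bytes / (double) heapBytes;
    }

    public static void main(String[] args) {
        long heap = 2048 * MB;                              // 2GB heap from the test setup
        long unpatched = totalBufferBytes(7, 256 * MB);     // seven 256MB buffers
        long patched = totalBufferBytes(7, 64 * KB);        // seven 64KB buffers

        System.out.println(unpatched / MB);                 // prints 1792
        System.out.println(heapFraction(unpatched, heap));  // prints 0.875
        System.out.println(patched / KB);                   // prints 448
    }
}
```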
> Here's the 0.7.4 release with this patch applied:
> This graph is very different. At launch, rather than an immediate spike to full allocation
and a promotion failure, we see a slow allocation slope reaching only 1/8th of total heap
size. As writes begin, we see several flushes and compactions, but none result in immediate,
large allocations. The ParNew collector keeps up with collections far more ably, resulting
in only one healthy CMS collection with no promotion failure. Unlike the unhealthy rapid allocation
and massive collection pattern we see in the first graph, this graph depicts a healthy sawtooth
pattern of ParNews and an occasional effective CMS with no danger of heap fragmentation resulting
in a promotion failure.
> The bottom line is that there's no need to allocate a hard-coded 256MB write buffer for
flushing memtables and compactions to disk. Doing so results in unhealthy rapid allocation
patterns and increases the probability of triggering promotion failures and full stop-the-world
GCs which can cause nodes to become unresponsive and shunned from the ring during flushes
and compactions.

