incubator-cassandra-user mailing list archives

From Peter Schuller <peter.schul...@infidyne.com>
Subject Re: Cassandra behaviour
Date Mon, 26 Jul 2010 19:07:39 GMT
> be the most portable thing to do.  I had been thinking that the bloom
> filters were created on startup, but further reading of the docs
> indicates that they are in the SSTable Index.  What is cassandra
> doing, then, when it's printing out that it's sampling indices while
> it starts?

It's reading through the keys in the index and keeping offset
information for roughly every 128th entry in RAM, in order to speed up
reads. Performing a binary search in an sstable from scratch would be
expensive. Because of the high cost of disk seeks, most storage
systems use btrees with a high branching factor to keep the number of
seeks low. Cassandra instead uses binary searching (possible because
sstables are sorted on disk), but pre-seeded with the information
gained from index sampling, which keeps the number of seeks bounded
even for very large sstables.
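
As a rough illustration of the idea (not Cassandra's actual code), here
is a minimal sketch in Python. The names build_sample, lookup and
SAMPLE_INTERVAL are made up for the example; a sorted in-memory list
stands in for the on-disk index, and list positions stand in for file
offsets.

```python
import bisect

SAMPLE_INTERVAL = 128  # "roughly every 128th entry", as described above


def build_sample(index_keys, interval=SAMPLE_INTERVAL):
    """Keep every interval-th key together with its position.

    index_keys is a sorted list standing in for the on-disk index; the
    position plays the role of a file offset. Returns two parallel lists.
    """
    positions = list(range(0, len(index_keys), interval))
    sampled_keys = [index_keys[p] for p in positions]
    return sampled_keys, positions


def lookup(index_keys, sample, key, interval=SAMPLE_INTERVAL):
    """Return the position of key in index_keys, or None if it is absent."""
    sampled_keys, positions = sample
    # Find the last sampled key <= key. This search is purely in memory.
    slot = bisect.bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first entry in the index
    start = positions[slot]
    end = min(start + interval, len(index_keys))
    # Only this bounded window would have to be read from disk.
    pos = bisect.bisect_left(index_keys, key, start, end)
    if pos < end and index_keys[pos] == key:
        return pos
    return None
```

However large the index grows, the second search only ever touches a
window of at most SAMPLE_INTERVAL entries; the cost is the memory held
by the sampled keys.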

This should translate to memory use that scales linearly with the
number of keys too, though I don't have a good feel for the overhead
you can expect.
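
To make the linear scaling concrete, here is a back-of-the-envelope
sketch. The per-entry cost is deliberately left as a parameter, since
the real overhead (key bytes, offset, JVM object headers) is not known
here; both function names and any number you plug in for
bytes_per_entry are assumptions, not measurements.

```python
def sampled_entries(total_keys, interval=128):
    """Number of index entries held in RAM when every interval-th key is kept."""
    return -(-total_keys // interval)  # ceiling division


def estimated_sample_bytes(total_keys, bytes_per_entry, interval=128):
    """Rough memory estimate; bytes_per_entry is a guess supplied by the caller."""
    return sampled_entries(total_keys, interval) * bytes_per_entry
```

For example, one million keys gives 7813 sampled entries; at a guessed
64 bytes per entry that is about 0.5 MB, and doubling the number of
keys roughly doubles the figure.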

At least this is my current understanding. I have not looked much at
the read path in cassandra yet. This is based on
SSTableReader.loadIndexFile() and callees of that method.

> I think that is what happened; in the INFO printouts, it was saying
> CompactionManager had 50+ pending operations.  If I set commitlog_sync
> to batch and commitlog_sync_period_in_ms to 0, then cassandra can't
> write data any faster than my drives can keep up, right?  Would that
> have any effect in preventing a huge compaction backlog, or would it
> just thrash my drives a ton?

Those settings only directly affect, as far as I know, the interaction
with the commit log. Now, if your system is truly disk bound rather
than CPU bound on compaction, writes to the commit log will indeed
have the capability to effectively throttle the write speed. In such a
case I would expect more frequent fsync() calls on the commit log to
throttle writes to a higher degree than they would if the commit log
were just periodically fsync()ed in the background once per minute;
however, I would not use this as the means to throttle writes.

The other thing which may happen is that memtables aren't flushed fast
enough to keep up with writes. I don't remember whether or not there
was already a fix for this; I think there is, at least in trunk.
Previously you could trigger an out-of-memory condition by writing
faster than memtable flushing was happening.

However even if that is fixed (again, I'm not sure), I'm pretty sure
there is still no mechanism to throttle based on background
compaction. It's not entirely trivial to do in a sensible fashion
given how extremely asynchronous compaction is with respect to writes.

Hopefully one of the cassandra developers will chime in if I'm
misrepresenting something.

>> * Increasing the memtable size. Increasing memtable size directly
>> affects the number of times a given entry will end up having to be
>> compacted on average; i.e., it decreases the total compaction work
>> that must be done for a given insertion workload. The default is
>> something like 64 MB; on a large system you probably want this
>> significantly larger, even up to several gigs (depending on heap sizes
>> and other concerns of course).
>
> Is that {binary_,}memtable_throughput_in_mb?  It definitely sounds
> like fewer compactions would help me, so I will give that a shot.

That (minus the binary one for normal operations) and
MemtableOperationsInMillions, depending on your workload.

For large mutations MemtableOperationsInMillions may be irrelevant; it
is more likely to matter the smaller your average piece of data is,
because the smaller each piece is, the higher the relative overhead of
keeping it in memory. In most cases you probably want to change both
at the same time unless you are specifically looking to tweak them in
relation to each other.
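
A small sketch of why the data size matters, under the assumption that
a memtable is flushed as soon as either limit is reached and that all
mutations are the same size. The function name and the figures used in
the example (64 MB and 0.3 million operations) are purely illustrative,
and the in-memory overhead per entry is ignored.

```python
def first_threshold_hit(avg_mutation_bytes, throughput_mb, operations_millions):
    """Return which limit a uniform write stream reaches first.

    "throughput" means the size limit is hit first;
    "operations" means the operation-count limit is hit first.
    """
    ops_to_fill_size_limit = (throughput_mb * 1024 * 1024) / avg_mutation_bytes
    ops_limit = operations_millions * 1000000
    if ops_to_fill_size_limit <= ops_limit:
        return "throughput"
    return "operations"
```

With those illustrative limits, 10 KB mutations fill 64 MB after about
6,500 writes, long before the operation count matters, while 50-byte
mutations reach 300,000 operations with only about 15 MB written. That
is the case where raising only the throughput setting would change
nothing.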

> Is this anything other than ensuring that -Xmx in JVM_OPTS is
> something reasonably large?

Not as far as I know. Though I failed to list the index summary
information; I believe that goes into the same category (i.e., a need
to increase the heap size but not to adjust cassandra settings).

-- 
/ Peter Schuller
