cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Schuller (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1576) Improve the I/O subsystem for ROW-READ stage
Date Wed, 13 Oct 2010 17:46:32 GMT


Peter Schuller commented on CASSANDRA-1576:

We spoke on IRC a bit and I just wanted to summarize the results here (Chris, please correct
me if I am mis-representing anything):

* We agree mmap() should be the most efficient approach for cached data.

* The synchronization concern was synchronization implicit in filling up the RRS rather than
explicit synchronization in the read path.

* With respect to categorizing in-memory queries vs. queries that will go down to disk, Chris
pointed out mincore(2) which allows a userland app to test whether a range is in memory or
not (but it's not immediately obvious whether a mincore() call is cheap enough to be used
in this context).

* The intent with libaio isn't that a libaio syscall is necessarily cheaper than a synchronous
I/O call, but rather the intended goal is to have an I/O concurrency/queue depth significantly
higher than RRS concurrency.

* While libaio is probably not worth it to saturate the disk subsystem (where concurrency
is reasonably limited assuming non-humongous nodes), some workloads may benefit quite a lot
from significant queue depth if the reads are particularly susceptible to optimization by
the underlying disks (due to platter location/relative positioning) etc.

> Improve the I/O subsystem for ROW-READ stage
> --------------------------------------------
>                 Key: CASSANDRA-1576
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.5, 0.7 beta 2
>            Reporter: Chris Goffinet
> I did some profiling awhile ago, and noticed that there is quite a bit of overhead that
is happening in the ROW-READ stage of Cassandra. My testing was on 0.6 branch. Jonathan mentioned
there is endpoint snitch caching in 0.7. One of the pain points is that we do synchronize
I/O in our threads. I have observed through profiling and other benchmarks, that even having
a very powerful machine (16-core Nehalem, 32GB of RAM), the amount of overhead of going through
to the page cache can still be between 2-3ms (with mmap). I observed at least 800 microseconds
more overhead if not using mmap. There is definitely overhead in this stage. I propose we
seriously consider moving to doing Asynchronous I/O in each of these threads instead. 
> Imagine the following scenario:
> 3ms with mmap to read from page cache + 1.1ms of function call overhead (observed google
iterators in 0.6, could be much better in 0.7)
> That's 4.1ms per message. With 32 threads, at best the machine is only going to be able
to serve:
> 7,804 messages/s. 
> This number also means that all your data has to be in page cache. If you start to dip
into any set of data that isn't in cache, this number is going to drop substantially, even
if your hit rate was 99%.
> Anyone with a serious data set that is greater than the total page cache of the cluster,
is going to be victim of major slowdowns as soon as any requests come in needing to fetch
data not in cache. If you run without the Direct I/O patch, and you actually have a pretty
good write load, you can expect your cluster to fall victim even more with page cache thrashing
as new SSTables are read/writen using compaction.
> All of these scenarios mentioned above were seen at Digg with 45-node cluster, 16-core
machines with a dataset larger than total page cache.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message