cassandra-commits mailing list archives

From Apache Wiki <>
Subject [Cassandra Wiki] Update of "LargeDataSetConsiderations" by PeterSchuller
Date Tue, 27 Dec 2011 18:32:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "LargeDataSetConsiderations" page has been changed by PeterSchuller:

   * Cassandra will read through sstable index files on start-up, doing what is known as "index
sampling". This is used to keep a subset (currently and by default, 1 out of 100) of keys
and their on-disk location in the index, in memory. See [[ArchitectureInternals]]. This
means that the larger the index files are, the longer it takes to perform this sampling. Thus,
for very large indexes (typically when you have a very large number of keys) the index sampling
on start-up may be a significant issue.
   * A negative side-effect of a large row-cache is start-up time. The periodic saving of
the row cache information only saves the keys that are cached; the data has to be pre-fetched
on start-up. On a large data set, this is probably going to be seek-bound and the time it
takes to warm up the row cache will be linear with respect to the row cache size (assuming
sufficiently large amounts of data that the seek-bound I/O is not subject to optimization
by disks).
    * Potential future improvement: [[|CASSANDRA-1625]].
+  * The total number of rows per node correlates directly with the size of bloom filters
and sampled index entries. Expect the base memory requirement of a node to increase linearly
with the number of keys (assuming the average row key size remains constant).
+   * You can decrease the memory use due to index sampling by increasing the index sampling
interval in cassandra.yaml, so that fewer keys are kept in memory.
+   * You should soon be able to tweak the bloom filter sizes too, once [[|CASSANDRA-3497]]
is done.
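
To make the linear relationship above concrete, here is a rough back-of-the-envelope sketch in Python. It is not Cassandra code: the function name, the per-entry overhead, and the bloom filter bits-per-key value are all assumptions chosen for illustration, and real per-node figures will differ. It only shows that, with a constant average key size, the estimate grows in proportion to the number of keys, and that sampling fewer keys shrinks the index-sample share.

```python
def estimate_base_memory_bytes(num_keys, avg_key_bytes, sampling_interval=100,
                               bloom_bits_per_key=10, entry_overhead_bytes=32):
    """Rough estimate of per-node memory used by index samples and bloom filters.

    All parameters are hypothetical illustration values, not Cassandra settings:
      sampling_interval    -- keep 1 key out of this many in memory
      bloom_bits_per_key   -- assumed bloom filter bits spent per key
      entry_overhead_bytes -- assumed object overhead per sampled entry
    """
    # One sampled entry holds the key plus an 8-byte on-disk position.
    sampled_entries = num_keys // sampling_interval
    index_sample_bytes = sampled_entries * (avg_key_bytes + 8 + entry_overhead_bytes)
    # Bloom filters cover every key, not only the sampled ones.
    bloom_bytes = num_keys * bloom_bits_per_key // 8
    return index_sample_bytes + bloom_bytes


if __name__ == "__main__":
    for keys in (1_000_000, 2_000_000, 4_000_000):
        print(keys, estimate_base_memory_bytes(keys, avg_key_bytes=16))
```

Under these assumptions, doubling the number of keys doubles the estimate, while raising the sampling interval only reduces the index-sample part; the bloom filter part is unaffected.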
