cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "LargeDataSetConsiderations" by jeremyhanna
Date Wed, 04 Sep 2013 13:35:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "LargeDataSetConsiderations" page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=21&rev2=22

Comment:
Revising for current Cassandra features and considerations.

  = Using Cassandra for large data sets (lots of data per node) =
  
- This page aims to to give some advise as to the issues one may need to consider when using
Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent
is not to make original claims, but to collect in one place some issues that are operationally
relevant. Other parts of the wiki are highly recommended in order to fully understand the
issues involved.
+ This page aims to give some advice on the issues one may need to consider when using
Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent
is not to make original claims, but to collect in one place some issues that are operationally
relevant. Other parts of the wiki are highly recommended in order to fully understand the
issues involved.
  
- This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced
has been resolved but this document has not been updated), please help by editing or e-mail:ing
cassandra-user.
+ This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced
has been resolved but this document has not been updated), please help by editing or e-mailing
cassandra-user.
  
- Note that not all of these issues are specific to Cassandra (for example, any storage system
is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always
be strongly correlated with the percentage of requests that penetrate caching layers).
+ Note that not all of these issues are specific to Cassandra.  For example, any storage system
is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always
be strongly correlated with the percentage of requests that penetrate caching layers.  Also
of note, the more data stored per node, the more data will have to be streamed when bootstrapping
new or replacement nodes.
  
- Unless otherwise noted, the points refer to Cassandra 0.7 and above.
+ '''Assumes Cassandra 1.2+'''
  
+ Significant work has been done to allow for more data to be stored on each node:
+  * Row cache can be serialized off-heap.  Keep in mind that the existing row cache implementation
still keeps entire rows off-heap, and when a row is requested the whole row is deserialized
into the heap.  Alternate row cache implementations are being worked on to make
the row cache more generally useful; see [[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
+  * Bloom filters:
+   * Moved off-heap [[https://issues.apache.org/jira/browse/CASSANDRA-4865|CASSANDRA-4865]]
+   * Tunable via bloom_filter_fp_chance.  Starting in Cassandra 1.2 there are better defaults:
0.01 for column families using the !SizeTieredCompactionStrategy and 0.1 for column families
using the !LeveledCompactionStrategy.  Note that a change to this property will take effect
as new sstables are built.
+  * Compression metadata has been moved off-heap in 1.2 (reference)
+  * Partition summary has been reduced in 1.2.5 and moved off-heap in 2.0
+  * Key cache has been serialized off-heap in 2.0 (reference)
-  * Disk space usage in Cassandra can vary fairly suddenly over time. If you have significant
amounts of data such that available disk space is not significantly higher than usage, consider:
-   * Compaction of a column family can up to double the disk space used by said column family
(in the case of a major compaction and no deletions). If your data is predominantly made up
of a single, or a select few, column families then doubling the disk space for a CF may be
a significant amount compared to your total disk usage.
-   * Repair operations can increase disk space demands (particularly in 0.6, less so in 0.7;
TODO: provide actual maximum growth and what it depends on).
-  * As your data set becomes larger and larger (assuming significantly larger than memory),
you become more and more dependent on caching to elide I/O operations. As you plan and test
your capacity, keep in mind that:
-   * The cassandra row cache is in the JVM heap and unaffected (remains warm) by compactions
and repair operations. This is a plus, but the down-side is that the row cache is not very
memory efficient compared to the operating system page cache.
-   * For 0.6.8 and below, the key cache is affected by compaction because it is per-sstable,
and compaction moves data to new sstables.
-    * Was fixed/improved as of [[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]],
for 0.6.9 and 0.7.0.
-   * The operating system's page cache is affected by compaction and repair operations. If
you are relying on the page cache to keep the active set in memory, you may see significant
degradation on performance as a result of compaction and repair operations.
-    * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]],
[[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
-  * Prior to 0.7.1 (fixed in [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]),
if you had column families with more than 143 million row keys in them, bloom filter false
positive rates would be likely to go up because of implementation concerns that limited the
maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom
filters are used. The negative effects of hitting this limit is that reads will start taking
additional seeks to disk as the row count increases. Note that the effect you are seeing at
any given moment will depend on when compaction was last run, because the bloom filter limit
is per-sstable. It is an issue for column families because after a major compaction, the entire
column family will be in a single sstable.
-  * Compaction is currently not concurrent, so only a single compaction runs at a time. This
means that sstable counts may spike during larger compactions as several smaller sstables
are written while a large compaction is happening. This can cause additional seeks on reads.
-   * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]]
and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]]
+  * Parallel compactions as in [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]]
and [[https://issues.apache.org/jira/browse/CASSANDRA-4310|CASSANDRA-4310]]
-   * Potentially already fixed for 0.8 (todo: go through ticket history and make sure what
it implies): [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]]
+  * Multi-threaded compaction for high IO hardware (reference)
+  * Virtual nodes to increase the parallelism and reduce the time of bootstrapping new nodes
+  * sstable index files are no longer loaded on startup (reference)
+ 
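As a rough illustration of why the bloom_filter_fp_chance setting above matters for memory on nodes with many rows, the sketch below uses the textbook formula for an optimally sized Bloom filter, bits per key = -ln(p) / (ln 2)^2. This is an assumption made for illustration, not Cassandra's actual sizing code, and the function names are hypothetical.

```python
import math


def bloom_bits_per_key(fp_chance):
    """Textbook bits per key for an optimally sized Bloom filter.

    Assumption: the standard formula -ln(p) / (ln 2)^2; Cassandra's own
    implementation may round or size its filters differently.
    """
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_filter_bytes(num_keys, fp_chance):
    """Approximate total filter size in bytes for num_keys row keys."""
    return math.ceil(num_keys * bloom_bits_per_key(fp_chance) / 8)


if __name__ == "__main__":
    # Compare the two default false-positive chances mentioned above.
    for p in (0.01, 0.1):
        size_mib = bloom_filter_bytes(1_000_000_000, p) / 2**20
        print(f"fp_chance={p}: {bloom_bits_per_key(p):.2f} bits/key, "
              f"~{size_mib:.0f} MiB per billion keys")
```

Under this model, moving from a false positive chance of 0.01 to 0.1 halves the filter size, at the cost of more reads that touch an sstable unnecessarily.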
+ Other points to consider:
+ 
+  * Moving data structures off-heap means that the structure stays serialized off-heap until
it is needed.  Then it is deserialized temporarily in the heap and is then garbage collected
when it is no longer used.
+  * Disk space usage in Cassandra can vary over time:
+   * Compaction: with the !SizeTieredCompactionStrategy, compaction can up to double the
disk space used.  With the !LeveledCompactionStrategy, it usually requires only about 10% overhead
(see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra).
+   * Repair: repair operations can increase disk space demands, see http://www.datastax.com/dev/blog/advanced-repair-techniques
for details and how it can be improved.
-  * Consider the choice of file system. Removal of large files is notoriously slow and seek
bound on e.g. ext2/ext3. Consider xfs or ext4fs. This affects background unlink():ing of sstables
that happens every now and then, and also affects start-up time (if there are sstables pending
removal when a node is starting up, they are removed as part of the start-up proceess; it
may thus be detrimental if removing a terabyte of sstables takes an hour (numbers are ballparks,
not accurately measured and depends on circumstances)).
+  * Consider the choice of file system. Removal of large files is notoriously slow and seek
bound on e.g. ext2/ext3. Consider xfs or ext4. This affects the background unlink() of sstables
that happens every now and then, and also affects start-up time: if there are sstables pending
removal when a node is starting up, they are removed as part of the start-up process, so it may
be detrimental if removing a terabyte of sstables takes an hour (numbers are ballparks,
not accurately measured, and depend on circumstances).
   * Adding nodes is a slow process if each node is responsible for a large amount of data.
Plan for this; do not try to throw additional hardware at a cluster at the last minute.
+  * The operating system's page cache is affected by compaction and repair operations. If
you are relying on the page cache to keep the active set in memory, you may see significant
degradation in performance as a result of compaction and repair operations.  See cassandra.yaml
for settings to reduce this impact.
+  * The partition (or sampled) index entries for each sstable can start to add up.  You can
reduce the memory usage by tuning the interval at which it samples.  The setting is index_interval
in cassandra.yaml.  See the comments there for more information.
-  * Cassandra will read through sstable index files on start-up, doing what is known as "index
sampling". This is used to keep a subset (currently and by default, 1 out of 100) of keys
and and their on-disk location in the index, in memory. See [[ArchitectureInternals]]. This
means that the larger the index files are, the longer it takes to perform this sampling. Thus,
for very large indexes (typically when you have a very large number of keys) the index sampling
on start-up may be a significant issue.
-  * A negative side-effect of a large row-cache is start-up time. The periodic saving of
the row cache information only saves the keys that are cached; the data has to be pre-fetched
on start-up. On a large data set, this is probably going to be seek-bound and the time it
takes to warm up the row cache will be linear with respect to the row cache size (assuming
sufficiently large amounts of data that the seek bound I/O is not subject to optimization
by disks).
-   * Potential future improvement: [[https://issues.apache.org/jira/browse/CASSANDRA-1625|CASSANDRA-1625]].
-  * The total number of rows per node correlates directly with the size of bloom filters
and sampled index entries. Expect the base memory requirement of a node to increase linearly
with the number of keys (assuming the average row key size remains constant). If you are not
using caching at all (e.g. you are doing analysis type workloads), expect these two to be
the two biggest consumers of memory.
-   * You can decrease the memory use due to index sampling by changing the index sampling
interval in cassandra.yaml
-   * You should soon be able to tweak the bloom filter sizes too once [[https://issues.apache.org/jira/browse/CASSANDRA-3497|CASSANDRA-3497]]
is done
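
The disk-space and index-sampling points above lend themselves to a back-of-the-envelope calculation. The sketch below is a minimal illustration: the overhead figures (up to 100% for the SizeTieredCompactionStrategy, about 10% for the LeveledCompactionStrategy) are the rough ones quoted on this page, while the function names and the one-sample-per-interval model are assumptions made for illustration and are not Cassandra APIs.

```python
# Worst-case extra disk space during compaction, as a percentage of live data.
# These are the rough figures quoted on this page, not measured values.
COMPACTION_OVERHEAD_PERCENT = {
    "SizeTieredCompactionStrategy": 100,
    "LeveledCompactionStrategy": 10,
}


def peak_disk_bytes(data_bytes, strategy):
    """Estimate peak disk usage while compaction is in progress."""
    percent = COMPACTION_OVERHEAD_PERCENT[strategy]
    return data_bytes + data_bytes * percent // 100


def sampled_index_entries(num_keys, index_interval):
    """Estimate in-memory index samples, assuming one per index_interval keys."""
    if index_interval < 1:
        raise ValueError("index_interval must be at least 1")
    return -(-num_keys // index_interval)  # ceiling division on integers


if __name__ == "__main__":
    one_tib = 2**40
    for name in COMPACTION_OVERHEAD_PERCENT:
        peak_gib = peak_disk_bytes(one_tib, name) / 2**30
        print(f"{name}: plan for roughly {peak_gib:.0f} GiB per TiB of data")
    # Doubling the (hypothetical) interval halves the number of samples held in memory.
    print(sampled_index_entries(1_000_000_000, 128))
    print(sampled_index_entries(1_000_000_000, 256))
```

The interval values used here are arbitrary examples; check the comments in your own cassandra.yaml for the value that applies to your version.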
  
+ Other references to improvements:
+  * [[http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2|Performance
improvements in Cassandra 1.2]]
+  * [[http://www.datastax.com/dev/blog/six-mid-series-changes-to-know-about-in-1-2-x|Six
mid-series changes in Cassandra 1.2]]
+ 
