From: Apache Wiki
To: Apache Wiki
Date: Wed, 04 Sep 2013 13:35:32 -0000
Message-ID: <20130904133532.1586.32032@eos.apache.org>
Subject: [Cassandra Wiki] Update of "LargeDataSetConsiderations" by jeremyhanna

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "LargeDataSetConsiderations" page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=21&rev2=22

Comment:
Revising for current Cassandra features and considerations.

= Using Cassandra for large data sets (lots of data per node) =

- This page aims to to give some advise as to the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place some issues that are operationally relevant. Other parts of the wiki are highly recommended in order to fully understand the issues involved.
+ This page aims to give some advice on the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place some issues that are operationally relevant. Other parts of the wiki are highly recommended in order to fully understand the issues involved.

- This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced has been resolved but this document has not been updated), please help by editing or e-mail:ing cassandra-user.
+ This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced has been resolved but this document has not been updated), please help by editing or e-mailing cassandra-user.
- Note that not all of these issues are specific to Cassandra (for example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers).
+ Note that not all of these issues are specific to Cassandra. For example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers. Also of note: the more data stored per node, the more data has to be streamed when bootstrapping new or replacement nodes.

- Unless otherwise noted, the points refer to Cassandra 0.7 and above.
+ '''Assumes Cassandra 1.2+'''

+ Significant work has been done to allow more data to be stored on each node:
+ * The row cache can be serialized off-heap. Keep in mind that the existing row cache implementation still keeps entire rows off-heap; when a row is requested, it is deserialized into the heap. Alternate row cache implementations are being worked on to make the row cache more generally useful; see [[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
+ * Bloom filters:
+  * Moved off-heap in [[https://issues.apache.org/jira/browse/CASSANDRA-4865|CASSANDRA-4865]].
+  * Tunable via bloom_filter_fp_chance (see the CQL sketch after the diff). Starting in Cassandra 1.2 there are better defaults: 0.01 for column families using the !SizeTieredCompactionStrategy and 0.1 for column families using the !LeveledCompactionStrategy. Note that a change to this property only takes effect as new sstables are built.
+ * Compression metadata has been moved off-heap in 1.2 (reference).
+ * The partition summary has been reduced in size in 1.2.5 and moved off-heap in 2.0.
+ * The key cache has been serialized off-heap in 2.0 (reference).
- * Disk space usage in Cassandra can vary fairly suddenly over time. If you have significant amounts of data such that available disk space is not significantly higher than usage, consider:
-  * Compaction of a column family can up to double the disk space used by said column family (in the case of a major compaction and no deletions). If your data is predominantly made up of a single, or a select few, column families then doubling the disk space for a CF may be a significant amount compared to your total disk usage.
-  * Repair operations can increase disk space demands (particularly in 0.6, less so in 0.7; TODO: provide actual maximum growth and what it depends on).
- * As your data set becomes larger and larger (assuming significantly larger than memory), you become more and more dependent on caching to elide I/O operations. As you plan and test your capacity, keep in mind that:
-  * The cassandra row cache is in the JVM heap and unaffected (remains warm) by compactions and repair operations. This is a plus, but the down-side is that the row cache is not very memory efficient compared to the operating system page cache.
-  * For 0.6.8 and below, the key cache is affected by compaction because it is per-sstable, and compaction moves data to new sstables.
-   * Was fixed/improved as of [[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]], for 0.6.9 and 0.7.0.
-  * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation on performance as a result of compaction and repair operations.
-   * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], [[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
- * Prior to 0.7.1 (fixed in [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]), if you had column families with more than 143 million row keys in them, bloom filter false positive rates would be likely to go up because of implementation concerns that limited the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effects of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you are seeing at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. It is an issue for column families because after a major compaction, the entire column family will be in a single sstable.
- * Compaction is currently not concurrent, so only a single compaction runs at a time. This means that sstable counts may spike during larger compactions as several smaller sstables are written while a large compaction is happening. This can cause additional seeks on reads.
-  * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]]
+ * Parallel compactions, as in [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]] and [[https://issues.apache.org/jira/browse/CASSANDRA-4310|CASSANDRA-4310]].
-  * Potentially already fixed for 0.8 (todo: go through ticket history and make sure what it implies): [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]]
+ * Multi-threaded compaction for high-IO hardware (reference); see the cassandra.yaml sketch after the diff.
+ * Virtual nodes, which increase parallelism and reduce the time needed to bootstrap new nodes.
+ * sstable index files are no longer loaded on startup (reference).
+
+ Other points to consider:
+
+ * Moving a data structure off-heap means that it is kept serialized off-heap until it is needed; it is then deserialized temporarily into the heap and garbage collected when no longer used.
+ * Disk space usage in Cassandra can vary over time:
+  * Compaction: with the !SizeTieredCompactionStrategy, compaction can up to double the disk space used. The !LeveledCompactionStrategy usually requires only about 10% overhead (see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra).
+  * Repair: repair operations can increase disk space demands; see http://www.datastax.com/dev/blog/advanced-repair-techniques for details and ways to mitigate this.
- * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4fs. This affects background unlink():ing of sstables that happens every now and then, and also affects start-up time (if there are sstables pending removal when a node is starting up, they are removed as part of the start-up proceess; it may thus be detrimental if removing a terabyte of sstables takes an hour (numbers are ballparks, not accurately measured and depends on circumstances)).
+ * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4. This affects the background unlink():ing of sstables that happens every now and then, and also affects start-up time (if there are sstables pending removal when a node is starting up, they are removed as part of the start-up process; it may thus be detrimental if removing a terabyte of sstables takes an hour; numbers are ballparks, not accurately measured, and depend on circumstances).
  * Adding nodes is a slow process if each node is responsible for a large amount of data. Plan for this; do not try to throw additional hardware at a cluster at the last minute.
+ * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation in performance as a result of compaction and repair operations. See cassandra.yaml for settings that reduce this impact (and the sketch after the diff).
+ * The partition (or sampled) index entries for each sstable can start to add up. You can reduce the memory usage by tuning the interval at which the index is sampled. The setting is index_interval in cassandra.yaml; see the comments there for more information (and the sketch after the diff).
- * Cassandra will read through sstable index files on start-up, doing what is known as "index sampling". This is used to keep a subset (currently and by default, 1 out of 100) of keys and and their on-disk location in the index, in memory. See [[ArchitectureInternals]]. This means that the larger the index files are, the longer it takes to perform this sampling. Thus, for very large indexes (typically when you have a very large number of keys) the index sampling on start-up may be a significant issue.
- * A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks).
-  * Potential future improvement: [[https://issues.apache.org/jira/browse/CASSANDRA-1625|CASSANDRA-1625]].
- * The total number of rows per node correlates directly with the size of bloom filters and sampled index entries. Expect the base memory requirement of a node to increase linearly with the number of keys (assuming the average row key size remains constant). If you are not using caching at all (e.g. you are doing analysis type workloads), expect these two to be the two biggest consumers of memory.
-  * You can decrease the memory use due to index sampling by changing the index sampling interval in cassandra.yaml
-  * You should soon be able to tweak the bloom filter sizes too once [[https://issues.apache.org/jira/browse/CASSANDRA-3497|CASSANDRA-3497]] is done
+
+ Other references to improvements:
+  * [[http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2|Performance improvements in Cassandra 1.2]]
+  * [[http://www.datastax.com/dev/blog/six-mid-series-changes-to-know-about-in-1-2-x|Six mid-series changes in Cassandra 1.2]]
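
As a concrete illustration of the per-column-family tuning mentioned in the bloom filter and compaction bullets above, here is a minimal CQL3 sketch as shipped with Cassandra 1.2. The keyspace and table names are hypothetical, and the values are examples rather than recommendations.

{{{
-- Hypothetical table. Shrink bloom filters by accepting a higher
-- false-positive chance (costs extra disk seeks on some reads).
-- Only takes effect as new sstables are built, e.g. by compaction.
ALTER TABLE my_keyspace.my_table
  WITH bloom_filter_fp_chance = 0.01;

-- Leveled compaction keeps the space overhead near 10%, versus up to
-- 2x for size-tiered compaction, at the cost of more compaction I/O.
ALTER TABLE my_keyspace.my_table
  WITH compaction = { 'class' : 'LeveledCompactionStrategy' };
}}}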
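
The compaction-parallelism, virtual-node, and page-cache points above correspond to a handful of cassandra.yaml settings. A minimal sketch, assuming 1.2-era option names; the values are illustrative and should be tuned to your hardware:

{{{
# cassandra.yaml (Cassandra 1.2) -- illustrative values only
num_tokens: 256                       # virtual nodes: parallelizes and shortens
                                      # bootstrap of new or replacement nodes
concurrent_compactors: 4              # how many compactions may run at once
multithreaded_compaction: false       # parallelism within a single compaction;
                                      # generally only pays off on high-IO hardware
compaction_throughput_mb_per_sec: 16  # throttle compaction to soften its impact
                                      # on the page cache and foreground reads
}}}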
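
Likewise, the index sampling mentioned above is a single knob in 1.2. A sketch of the relevant cassandra.yaml line, assuming the 1.2 default of sampling one partition index entry in 128:

{{{
# cassandra.yaml (Cassandra 1.2)
index_interval: 128   # raising this (e.g. to 256 or 512) shrinks the in-memory
                      # index summary roughly proportionally, at the cost of a
                      # little more disk I/O on each read
}}}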