From: Apache Wiki
To: Apache Wiki
Date: Wed, 04 Sep 2013 13:35:32 -0000
Message-ID: <20130904133532.1586.32032@eos.apache.org>
Subject: [Cassandra Wiki] Update of "LargeDataSetConsiderations" by jeremyhanna

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "LargeDataSetConsiderations" page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=21&rev2=22

Comment:
Revising for current Cassandra features and considerations.

= Using Cassandra for large data sets (lots of data per node) =

- This page aims to to give some advise as to the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place some issues that are operationally relevant. Other parts of the wiki are highly recommended in order to fully understand the issues involved.
+ This page aims to give some advice on the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place some issues that are operationally relevant. Other parts of the wiki are highly recommended in order to fully understand the issues involved.

- This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced has been resolved but this document has not been updated), please help by editing or e-mail:ing cassandra-user.
+ This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced has been resolved but this document has not been updated), please help by editing or e-mailing cassandra-user.
- Note that not all of these issues are specific to Cassandra (for example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers).
+ Note that not all of these issues are specific to Cassandra. For example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers. Also of note: the more data stored per node, the more data has to be streamed when bootstrapping new or replacement nodes.

- Unless otherwise noted, the points refer to Cassandra 0.7 and above.
+ '''Assumes Cassandra 1.2+'''

+ Significant work has been done to allow more data to be stored on each node:
+ * The row cache can be serialized off-heap. Keep in mind that the existing row cache implementation still keeps entire rows off-heap; when a row is requested, it is deserialized into the heap. Alternate row cache implementations are being worked on to make the row cache more generally useful; see [[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
+ * Bloom filters:
+  * Moved off-heap in [[https://issues.apache.org/jira/browse/CASSANDRA-4865|CASSANDRA-4865]].
+  * Tunable via bloom_filter_fp_chance (see the CQL sketch after the diff). Starting in Cassandra 1.2 there are better defaults: 0.01 for column families using the !SizeTieredCompactionStrategy and 0.1 for column families using the !LeveledCompactionStrategy. Note that a change to this property only takes effect as new sstables are built.
+ * Compression metadata has been moved off-heap in 1.2 (reference).
+ * The partition summary has been reduced in size in 1.2.5 and moved off-heap in 2.0.
+ * The key cache has been serialized off-heap in 2.0 (reference).
- * Disk space usage in Cassandra can vary fairly suddenly over time. If you have significant amounts of data such that available disk space is not significantly higher than usage, consider:
-  * Compaction of a column family can up to double the disk space used by said column family (in the case of a major compaction and no deletions). If your data is predominantly made up of a single, or a select few, column families then doubling the disk space for a CF may be a significant amount compared to your total disk usage.
-  * Repair operations can increase disk space demands (particularly in 0.6, less so in 0.7; TODO: provide actual maximum growth and what it depends on).
- * As your data set becomes larger and larger (assuming significantly larger than memory), you become more and more dependent on caching to elide I/O operations. As you plan and test your capacity, keep in mind that:
-  * The cassandra row cache is in the JVM heap and unaffected (remains warm) by compactions and repair operations. This is a plus, but the down-side is that the row cache is not very memory efficient compared to the operating system page cache.
-  * For 0.6.8 and below, the key cache is affected by compaction because it is per-sstable, and compaction moves data to new sstables.
-   * Was fixed/improved as of [[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]], for 0.6.9 and 0.7.0.
-  * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation on performance as a result of compaction and repair operations.
-   * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], [[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
- * Prior to 0.7.1 (fixed in [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]), if you had column families with more than 143 million row keys in them, bloom filter false positive rates would be likely to go up because of implementation concerns that limited the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effects of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you are seeing at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. It is an issue for column families because after a major compaction, the entire column family will be in a single sstable.
- * Compaction is currently not concurrent, so only a single compaction runs at a time. This means that sstable counts may spike during larger compactions as several smaller sstables are written while a large compaction is happening. This can cause additional seeks on reads.
-  * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]]
+ * Parallel compactions, as in [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]] and [[https://issues.apache.org/jira/browse/CASSANDRA-4310|CASSANDRA-4310]].
-  * Potentially already fixed for 0.8 (todo: go through ticket history and make sure what it implies): [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]]
+ * Multi-threaded compaction for high-IO hardware (reference); see the cassandra.yaml sketch after the diff.
+ * Virtual nodes, which increase parallelism and reduce the time needed to bootstrap new nodes.
+ * sstable index files are no longer loaded on startup (reference).
+
+ Other points to consider:
+
+ * Moving a data structure off-heap means that it is kept serialized off-heap until it is needed; it is then deserialized temporarily into the heap and garbage collected when no longer used.
+ * Disk space usage in Cassandra can vary over time:
+  * Compaction: with the !SizeTieredCompactionStrategy, compaction can up to double the disk space used. The !LeveledCompactionStrategy usually requires only about 10% overhead (see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra).
+  * Repair: repair operations can increase disk space demands; see http://www.datastax.com/dev/blog/advanced-repair-techniques for details and ways to mitigate this.
- * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4fs. This affects background unlink():ing of sstables that happens every now and then, and also affects start-up time (if there are sstables pending removal when a node is starting up, they are removed as part of the start-up proceess; it may thus be detrimental if removing a terabyte of sstables takes an hour (numbers are ballparks, not accurately measured and depends on circumstances)).
+ * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4. This affects the background unlink():ing of sstables that happens every now and then, and also affects start-up time (if there are sstables pending removal when a node is starting up, they are removed as part of the start-up process; it may thus be detrimental if removing a terabyte of sstables takes an hour; numbers are ballparks, not accurately measured, and depend on circumstances).
  * Adding nodes is a slow process if each node is responsible for a large amount of data. Plan for this; do not try to throw additional hardware at a cluster at the last minute.
+ * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation in performance as a result of compaction and repair operations. See cassandra.yaml for settings that reduce this impact (and the sketch after the diff).
+ * The partition (or sampled) index entries for each sstable can start to add up. You can reduce the memory usage by tuning the interval at which the index is sampled. The setting is index_interval in cassandra.yaml; see the comments there for more information (and the sketch after the diff).
- * Cassandra will read through sstable index files on start-up, doing what is known as "index sampling". This is used to keep a subset (currently and by default, 1 out of 100) of keys and and their on-disk location in the index, in memory. See [[ArchitectureInternals]]. This means that the larger the index files are, the longer it takes to perform this sampling. Thus, for very large indexes (typically when you have a very large number of keys) the index sampling on start-up may be a significant issue.
- * A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks).
-  * Potential future improvement: [[https://issues.apache.org/jira/browse/CASSANDRA-1625|CASSANDRA-1625]].
- * The total number of rows per node correlates directly with the size of bloom filters and sampled index entries. Expect the base memory requirement of a node to increase linearly with the number of keys (assuming the average row key size remains constant). If you are not using caching at all (e.g. you are doing analysis type workloads), expect these two to be the two biggest consumers of memory.
-  * You can decrease the memory use due to index sampling by changing the index sampling interval in cassandra.yaml
-  * You should soon be able to tweak the bloom filter sizes too once [[https://issues.apache.org/jira/browse/CASSANDRA-3497|CASSANDRA-3497]] is done
+
+ Other references to improvements:
+  * [[http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2|Performance improvements in Cassandra 1.2]]
+  * [[http://www.datastax.com/dev/blog/six-mid-series-changes-to-know-about-in-1-2-x|Six mid-series changes in Cassandra 1.2]]
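
As a concrete illustration of the per-column-family tuning mentioned in the bloom filter and compaction bullets above, here is a minimal CQL3 sketch as shipped with Cassandra 1.2. The keyspace and table names are hypothetical, and the values are examples rather than recommendations.

{{{
-- Hypothetical table. Shrink bloom filters by accepting a higher
-- false-positive chance (costs extra disk seeks on some reads).
-- Only takes effect as new sstables are built, e.g. by compaction.
ALTER TABLE my_keyspace.my_table
  WITH bloom_filter_fp_chance = 0.01;

-- Leveled compaction keeps the space overhead near 10%, versus up to
-- 2x for size-tiered compaction, at the cost of more compaction I/O.
ALTER TABLE my_keyspace.my_table
  WITH compaction = { 'class' : 'LeveledCompactionStrategy' };
}}}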
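
The compaction-parallelism, virtual-node, and page-cache points above correspond to a handful of cassandra.yaml settings. A minimal sketch, assuming 1.2-era option names; the values are illustrative and should be tuned to your hardware:

{{{
# cassandra.yaml (Cassandra 1.2) -- illustrative values only
num_tokens: 256                       # virtual nodes: parallelizes and shortens
                                      # bootstrap of new or replacement nodes
concurrent_compactors: 4              # how many compactions may run at once
multithreaded_compaction: false       # parallelism within a single compaction;
                                      # generally only pays off on high-IO hardware
compaction_throughput_mb_per_sec: 16  # throttle compaction to soften its impact
                                      # on the page cache and foreground reads
}}}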
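
Likewise, the index sampling mentioned above is a single knob in 1.2. A sketch of the relevant cassandra.yaml line, assuming the 1.2 default of sampling one partition index entry in 128:

{{{
# cassandra.yaml (Cassandra 1.2)
index_interval: 128   # raising this (e.g. to 256 or 512) shrinks the in-memory
                      # index summary roughly proportionally, at the cost of a
                      # little more disk I/O on each read
}}}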