cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "LargeDataSetConsiderations" by PeterSchuller
Date Sat, 18 Dec 2010 17:02:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "LargeDataSetConsiderations" page has been changed by PeterSchuller.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=7&rev2=8

--------------------------------------------------

    * The operating system's page cache is affected by compaction and repair operations. If
you are relying on the page cache to keep the active set in memory, you may see significant
degradation in performance as a result of compaction and repair operations.
     * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]],
[[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
   * If you have column families with more than 143 million row keys in them, bloom filter
false positive rates are likely to go up because of implementation concerns that limit the
maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom
filters are used. The negative effect of hitting this limit is that reads will start taking
additional seeks to disk as the row count increases. Note that the effect you are seeing at
any given moment will depend on when compaction was last run, because the bloom filter limit
is per-sstable. The limit matters at the column family level because after a major compaction,
the entire column family will be in a single sstable.
-   * This will likely be addressed in the future: See [[https://issues.apache.org/jira/browse/CASSANDRA-1608|CASSANDRA-1608]]
and TODO: bigger-bf jira
+   * This will likely be addressed in the future: See [[https://issues.apache.org/jira/browse/CASSANDRA-1608|CASSANDRA-1608]]
and [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]
   * Compaction is currently not concurrent, so only a single compaction runs at a time. This
means that sstable counts may spike during larger compactions as several smaller sstables
are written while a large compaction is happening. This can cause additional seeks on reads.
-   * TODO: link to parallel compaction JIRA ticket, file another one specifically for ensuring
this issue is addressed (the pre-existing only deals with using multiple cores for throughput
reasons)
+   * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]]
and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]]
   * Consider the choice of file system. Removal of large files is notoriously slow and
seek-bound on e.g. ext2/ext3. Consider xfs or ext4.
   * Adding nodes is a slow process if each node is responsible for a large amount of data.
Plan for this; do not try to throw additional hardware at a cluster at the last minute.
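
To illustrate the bloom filter item above, here is a minimal sketch of how the false positive rate grows once a filter can no longer get bigger. It uses the standard approximation p = (1 - e^(-kn/m))^k for n keys, m bits and k hash functions. The cap of 2^31 - 1 bits, the target of 15 bits per key and the choice of 10 hash functions are assumptions made for illustration only (they are not stated on this page); with those assumed values the cap is reached at roughly 143 million keys.

```python
import math

# Assumed values for illustration only; not taken from the wiki page.
MAX_BITS = 2**31 - 1   # hypothetical cap on the number of bits in one filter
BITS_PER_KEY = 15      # hypothetical target number of bits per row key


def false_positive_rate(num_keys, num_bits, num_hashes):
    """Standard bloom filter approximation: p = (1 - exp(-k*n/m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def capped_filter_rate(num_keys, num_hashes=10):
    """False positive rate when the filter cannot grow past MAX_BITS."""
    num_bits = min(num_keys * BITS_PER_KEY, MAX_BITS)
    return false_positive_rate(num_keys, num_bits, num_hashes)


if __name__ == "__main__":
    print("keys at which the assumed cap is reached:", MAX_BITS // BITS_PER_KEY)
    for n in (100_000_000, 143_000_000, 300_000_000, 1_000_000_000):
        print(f"{n:>13,} keys -> p ~ {capped_filter_rate(n):.6f}")
```

Below the cap the rate stays flat because the filter grows with the key count; above it, every additional key in the sstable raises the rate, which is why reads start taking extra seeks.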
  
