cassandra-commits mailing list archives

From slebre...@apache.org
Subject [01/34] cassandra git commit: Add initial in-tree documentation (very incomplete so far)
Date Mon, 27 Jun 2016 18:33:56 GMT
Repository: cassandra
Updated Branches:
  refs/heads/trunk 48e4d5dae -> c7b9401bf


http://git-wip-us.apache.org/repos/asf/cassandra/blob/cad277be/doc/source/faq.rst
----------------------------------------------------------------------
diff --git a/doc/source/faq.rst b/doc/source/faq.rst
new file mode 100644
index 0000000..4ac0be4
--- /dev/null
+++ b/doc/source/faq.rst
@@ -0,0 +1,20 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..     http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+Frequently Asked Questions
+==========================
+
+.. TODO: todo

http://git-wip-us.apache.org/repos/asf/cassandra/blob/cad277be/doc/source/getting_started.rst
----------------------------------------------------------------------
diff --git a/doc/source/getting_started.rst b/doc/source/getting_started.rst
new file mode 100644
index 0000000..c30fb1e
--- /dev/null
+++ b/doc/source/getting_started.rst
@@ -0,0 +1,252 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..     http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+.. highlight:: none
+
+Getting Started
+===============
+
+Installing Cassandra
+--------------------
+
+Prerequisites
+^^^^^^^^^^^^^
+
+- The latest version of Java 8, either the `Oracle Java Standard Edition 8
+  <http://www.oracle.com/technetwork/java/javase/downloads/index.html>`__ or `OpenJDK 8 <http://openjdk.java.net/>`__. To
+  verify that you have the correct version of Java installed, type ``java -version``.
+
+- For using cqlsh, the latest version of `Python 2.7 <https://www.python.org/downloads/>`__. To verify that you have
+  the correct version of Python installed, type ``python --version``.
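+
+To quickly confirm both prerequisites from a shell:
+
+::
+
+    java -version
+    python --version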
+
+Installation from binary tarball files
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- Download the latest stable release from the `Apache Cassandra downloads website <http://cassandra.apache.org/download/>`__.
+
+- Untar the file somewhere, for example:
+
+::
+
+    tar -xvf apache-cassandra-3.6-bin.tar.gz
+
+The files will be extracted into ``apache-cassandra-3.6``; substitute 3.6 with the release number that you have
+downloaded.
+
+- Optionally add ``apache-cassandra-3.6/bin`` to your path.
+- Start Cassandra in the foreground by invoking ``bin/cassandra -f`` from the command line. Press "Control-C" to stop
+  Cassandra. Start Cassandra in the background by invoking ``bin/cassandra`` from the command line. Invoke ``kill pid``
+  or ``pkill -f CassandraDaemon`` to stop Cassandra, where pid is the Cassandra process id, which you can find for
+  example by invoking ``pgrep -f CassandraDaemon``.
+- Verify that Cassandra is running by invoking ``bin/nodetool status`` from the command line.
+- Configuration files are located in the ``conf`` sub-directory.
+- Since Cassandra 2.1, log and data directories are located in the ``logs`` and ``data`` sub-directories respectively.
+  Older versions defaulted to ``/var/log/cassandra`` and ``/var/lib/cassandra``; on those versions it is necessary to
+  either start Cassandra with root privileges or change ``conf/cassandra.yaml`` to use directories owned by the current
+  user, as explained below in the section on changing the location of directories.
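+
+For example, a typical first run from the tarball might look like this (the version number is illustrative):
+
+::
+
+    tar -xvf apache-cassandra-3.6-bin.tar.gz
+    cd apache-cassandra-3.6
+    bin/cassandra -f
+
+and then, from another terminal:
+
+::
+
+    bin/nodetool status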
+
+Installation from Debian packages
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- Add the Apache repository of Cassandra to ``/etc/apt/sources.list.d/cassandra.sources.list``, for example for version
+  3.6:
+
+::
+
+    echo "deb http://www.apache.org/dist/cassandra/debian 36x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
+
+- Update the repositories:
+
+::
+
+    sudo apt-get update
+
+- If you encounter this error:
+
+::
+
+    GPG error: http://www.apache.org 36x InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 749D6EEC0353B12C
+
+Then add the public key 749D6EEC0353B12C as follows:
+
+::
+
+    gpg --keyserver pgp.mit.edu --recv-keys 749D6EEC0353B12C
+    gpg --export --armor 749D6EEC0353B12C | sudo apt-key add -
+
+and repeat ``sudo apt-get update``. The actual key may be different; you get it from the error message itself. For a
+full list of Apache contributors' public keys, you can refer to `this link <https://www.apache.org/dist/cassandra/KEYS>`__.
+
+- Install Cassandra:
+
+::
+
+    sudo apt-get install cassandra
+
+- You can start Cassandra with ``sudo service cassandra start`` and stop it with ``sudo service cassandra stop``.
+  However, normally the service will start automatically. For this reason, be sure to stop it if you need to make any
+  configuration changes.
+- Verify that Cassandra is running by invoking ``nodetool status`` from the command line.
+- The default location of configuration files is ``/etc/cassandra``.
+- The default location of log and data directories is ``/var/log/cassandra/`` and ``/var/lib/cassandra``.
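+
+For example, to stop the service before making configuration changes and then bring it back up:
+
+::
+
+    sudo service cassandra stop
+    # edit /etc/cassandra/cassandra.yaml as needed
+    sudo service cassandra start
+    nodetool status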
+
+Configuring Cassandra
+---------------------
+
+For running Cassandra on a single node, the steps above are enough; you don't really need to change any configuration.
+However, when you deploy a cluster of nodes, or use clients that are not on the same host, there are some parameters
+that must be changed.
+
+The Cassandra configuration files can be found in the ``conf`` directory of tarballs. For packages, the configuration
+files will be located in ``/etc/cassandra``.
+
+Main runtime properties
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Most of the configuration in Cassandra is done via yaml properties that can be set in ``cassandra.yaml``. At a minimum
+you should consider setting the following properties:
+
+- ``cluster_name``: the name of your cluster.
+- ``seeds``: a comma separated list of the IP addresses of your cluster seeds.
+- ``storage_port``: you don't necessarily need to change this, but make sure that there are no firewalls blocking this
+  port.
+- ``listen_address``: the IP address of your node; this is what allows other nodes to communicate with this node, so it
+  is important that you change it. Alternatively, you can set ``listen_interface`` to tell Cassandra which interface to
+  use, and consequently which address to use. Set only one, not both.
+- ``native_transport_port``: as for ``storage_port``, make sure this port is not blocked by firewalls, as clients will
+  communicate with Cassandra on this port.
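+
+As a minimal sketch, the corresponding fragment of ``cassandra.yaml`` for a small cluster might look like this (the
+cluster name and addresses are illustrative; note that ``seeds`` lives under ``seed_provider`` in the stock file):
+
+::
+
+    cluster_name: 'Test Cluster'
+    seed_provider:
+        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
+          parameters:
+              - seeds: "192.168.1.10,192.168.1.11"
+    listen_address: 192.168.1.10
+    storage_port: 7000
+    native_transport_port: 9042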
+
+Changing the location of directories
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following yaml properties control the location of directories:
+
+- ``data_file_directories``: one or more directories where data files are located.
+- ``commitlog_directory``: the directory where commitlog files are located.
+- ``saved_caches_directory``: the directory where saved caches are located.
+- ``hints_directory``: the directory where hints are located.
+
+For performance reasons, if you have multiple disks, consider putting commitlog and data files on different disks.
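+
+For example, a sketch of these settings in ``cassandra.yaml``, placing the commitlog on a separate disk (the paths are
+illustrative and must be writable by the user running Cassandra):
+
+::
+
+    data_file_directories:
+        - /disk1/cassandra/data
+    commitlog_directory: /disk2/cassandra/commitlog
+    saved_caches_directory: /disk1/cassandra/saved_caches
+    hints_directory: /disk1/cassandra/hints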
+
+Environment variables
+^^^^^^^^^^^^^^^^^^^^^
+
+JVM-level settings such as heap size can be set in ``cassandra-env.sh``. You can add any additional JVM command line
+argument to the ``JVM_OPTS`` environment variable; when Cassandra starts, these arguments will be passed to the JVM.
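+
+For instance, a sketch of the kind of settings typically found in ``conf/cassandra-env.sh`` (the heap values and the
+extra flag are illustrative):
+
+::
+
+    MAX_HEAP_SIZE="4G"
+    HEAP_NEWSIZE="800M"
+    JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"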
+
+Logging
+^^^^^^^
+
+The logger in use is logback. You can change logging properties by editing ``logback.xml``. By default it will log at
+INFO level into a file called ``system.log`` and at DEBUG level into a file called ``debug.log``. When running in the
+foreground, it will also log at INFO level to the console.
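+
+As an illustration, the log level for a particular package can be raised by adding a ``logger`` element to
+``logback.xml`` (the package name here is only an example):
+
+::
+
+    <logger name="org.apache.cassandra.db" level="DEBUG"/>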
+
+
+cqlsh
+-----
+
+.. todo:: TODO
+
+
+Cassandra client drivers
+------------------------
+
+Here are known Cassandra client drivers organized by language. Before choosing a driver, you should verify the Cassandra
+version and functionality supported by a specific driver.
+
+Java
+^^^^
+
+- `Achilles <http://achilles.archinnov.info/>`__
+- `Astyanax <https://github.com/Netflix/astyanax/wiki/Getting-Started>`__
+- `Casser <https://github.com/noorq/casser>`__
+- `Datastax Java driver <https://github.com/datastax/java-driver>`__
+- `Kundera <https://github.com/impetus-opensource/Kundera>`__
+- `PlayORM <https://github.com/deanhiller/playorm>`__
+
+Python
+^^^^^^
+
+- `Datastax Python driver <https://github.com/datastax/python-driver>`__
+
+Ruby
+^^^^
+
+- `Datastax Ruby driver <https://github.com/datastax/ruby-driver>`__
+
+C# / .NET
+^^^^^^^^^
+
+- `Cassandra Sharp <https://github.com/pchalamet/cassandra-sharp>`__
+- `Datastax C# driver <https://github.com/datastax/csharp-driver>`__
+- `Fluent Cassandra <https://github.com/managedfusion/fluentcassandra>`__
+
+Nodejs
+^^^^^^
+
+- `Datastax Nodejs driver <https://github.com/datastax/nodejs-driver>`__
+- `Node-Cassandra-CQL <https://github.com/jorgebay/node-cassandra-cql>`__
+
+PHP
+^^^
+
+- `CQL \| PHP <http://code.google.com/a/apache-extras.org/p/cassandra-pdo>`__
+- `Datastax PHP driver <https://github.com/datastax/php-driver/>`__
+- `PHP-Cassandra <https://github.com/aparkhomenko/php-cassandra>`__
+- `PHP Library for Cassandra <http://evseevnn.github.io/php-cassandra-binary/>`__
+
+C++
+^^^
+
+- `Datastax C++ driver <https://github.com/datastax/cpp-driver>`__
+- `libQTCassandra <http://sourceforge.net/projects/libqtcassandra>`__
+
+Scala
+^^^^^
+
+- `Datastax Spark connector <https://github.com/datastax/spark-cassandra-connector>`__
+- `Phantom <https://github.com/newzly/phantom>`__
+- `Quill <https://github.com/getquill/quill>`__
+
+Clojure
+^^^^^^^
+
+- `Alia <https://github.com/mpenet/alia>`__
+- `Cassaforte <https://github.com/clojurewerkz/cassaforte>`__
+- `Hayt <https://github.com/mpenet/hayt>`__
+
+Erlang
+^^^^^^
+
+- `CQerl <https://github.com/matehat/cqerl>`__
+- `Erlcass <https://github.com/silviucpp/erlcass>`__
+
+Go
+^^
+
+- `CQLc <http://relops.com/cqlc/>`__
+- `Gocassa <https://github.com/hailocab/gocassa>`__
+- `GoCQL <https://github.com/gocql/gocql>`__
+
+Haskell
+^^^^^^^
+
+- `Cassy <https://github.com/ozataman/cassy>`__
+
+Rust
+^^^^
+
+- `Rust CQL <https://github.com/neich/rust-cql>`__

http://git-wip-us.apache.org/repos/asf/cassandra/blob/cad277be/doc/source/index.rst
----------------------------------------------------------------------
diff --git a/doc/source/index.rst b/doc/source/index.rst
new file mode 100644
index 0000000..9573729
--- /dev/null
+++ b/doc/source/index.rst
@@ -0,0 +1,35 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..     http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+Welcome to Apache Cassandra's documentation!
+============================================
+
+This is the official documentation for `Apache Cassandra <http://cassandra.apache.org>`__ |version|. If you would like
+to contribute to this documentation, you are welcome to do so by submitting your contribution like any other patch
+following `these instructions <https://wiki.apache.org/cassandra/HowToContribute>`__.
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   getting_started
+   architecture
+   cql
+   operations
+   troubleshooting
+   faq
+   contactus

http://git-wip-us.apache.org/repos/asf/cassandra/blob/cad277be/doc/source/operations.rst
----------------------------------------------------------------------
diff --git a/doc/source/operations.rst b/doc/source/operations.rst
new file mode 100644
index 0000000..8228746
--- /dev/null
+++ b/doc/source/operations.rst
@@ -0,0 +1,369 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..     http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+.. highlight:: none
+
+Operating Cassandra
+===================
+
+Replication Strategies
+----------------------
+
+.. todo:: todo
+
+Snitches
+--------
+
+.. todo:: todo
+
+Adding, replacing, moving and removing nodes
+--------------------------------------------
+
+.. todo:: todo
+
+Repair
+------
+
+.. todo:: todo
+
+Read repair
+-----------
+
+.. todo:: todo
+
+Hints
+-----
+
+.. todo:: todo
+
+Compaction
+----------
+
+Size Tiered
+^^^^^^^^^^^
+
+.. todo:: todo
+
+Leveled
+^^^^^^^
+
+.. todo:: todo
+
+TimeWindow
+^^^^^^^^^^
+
+.. todo:: todo
+
+DateTiered
+^^^^^^^^^^
+
+.. todo:: todo
+
+Tombstones and Garbage Collection (GC) Grace
+--------------------------------------------
+
+Why Tombstones
+^^^^^^^^^^^^^^
+
+When a delete request is received by Cassandra it does not actually remove the data from the underlying store. Instead
+it writes a special piece of data known as a tombstone. The tombstone represents the delete and prevents all values
+written before the tombstone from appearing in queries to the database. This approach is used instead of removing
+values because of the distributed nature of Cassandra.
+
+Deletes without tombstones
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Imagine a three-node cluster which has the value [A] replicated to every node::
+
+    [A], [A], [A]
+
+If one of the nodes fails and our delete operation only removes existing values, we can end up with a cluster that
+looks like::
+
+    [], [], [A]
+
+Then a repair operation would copy the value [A] back onto the two
+nodes which are missing the value::
+
+    [A], [A], [A]
+
+This would cause our data to be resurrected even though it had been
+deleted.
+
+Deletes with Tombstones
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Starting again with a three-node cluster which has the value [A] replicated to every node::
+
+    [A], [A], [A]
+
+If instead of removing data we add a tombstone record, our single-node failure situation will look like this::
+
+    [A, Tombstone[A]], [A, Tombstone[A]], [A]
+
+Now when we issue a repair, the tombstone will be copied to the replica, rather than the deleted data being
+resurrected::
+
+    [A, Tombstone[A]], [A, Tombstone[A]], [A, Tombstone[A]]
+
+Our repair operation will correctly restore the state of the system to what we expect, with the record [A] marked as
+deleted on all nodes. This does mean we will end up accruing tombstones, which would permanently consume disk space. To
+avoid keeping tombstones forever, we have a parameter known as ``gc_grace_seconds`` for every table in Cassandra.
+
+The gc_grace_seconds parameter and Tombstone Removal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The table-level ``gc_grace_seconds`` parameter controls how long Cassandra will retain tombstones through compaction
+events before finally removing them. This duration should directly reflect the amount of time a user expects to allow
+before recovering a failed node. After ``gc_grace_seconds`` has expired, the tombstone can be removed, meaning there
+will no longer be any record that a certain piece of data was deleted. This means that if a node remains down or
+disconnected for longer than ``gc_grace_seconds``, its deleted data will be repaired back to the other nodes and
+re-appear in the cluster. This is basically the same as in the "Deletes without Tombstones" section. Note that
+tombstones will not be removed until a compaction event, even if ``gc_grace_seconds`` has elapsed.
+
+The default value for ``gc_grace_seconds`` is 864000, which is equivalent to 10 days. This can be set when creating or
+altering a table using ``WITH gc_grace_seconds``.
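+
+For example (keyspace and table names are placeholders):
+
+::
+
+    ALTER TABLE keyspace.table WITH gc_grace_seconds = 864000;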
+
+
+Bloom Filters
+-------------
+
+In the read path, Cassandra merges data on disk (in SSTables) with data in RAM (in memtables). To avoid checking every
+SSTable data file for the partition being requested, Cassandra employs a data structure known as a bloom filter.
+
+Bloom filters are a probabilistic data structure that allows Cassandra to determine one of two possible states:
+
+- The data definitely does not exist in the given file, or
+- The data probably exists in the given file.
+
+While bloom filters cannot guarantee that the data exists in a given SSTable, bloom filters can be made more accurate
+by allowing them to consume more RAM. Operators have the opportunity to tune this behavior per table by adjusting
+the ``bloom_filter_fp_chance`` to a float between 0 and 1.
+
+The default value for ``bloom_filter_fp_chance`` is 0.1 for tables using LeveledCompactionStrategy and 0.01 for all
+other cases.
+
+Bloom filters are stored in RAM, but offheap (outside the JVM heap), so operators should not consider bloom filters
+when selecting the maximum heap size. As accuracy improves (as the ``bloom_filter_fp_chance`` gets closer to 0), memory
+usage increases non-linearly - the bloom filter for ``bloom_filter_fp_chance = 0.01`` will require about three times as
+much memory as the same table with ``bloom_filter_fp_chance = 0.1``.
+
+Typical values for ``bloom_filter_fp_chance`` are between 0.01 (1%) and 0.1 (10%) false-positive chance, where
+Cassandra may scan an SSTable for a row, only to find that it does not exist on the disk. The parameter should be tuned
+by use case:
+
+- Users with more RAM and slower disks may benefit from setting the ``bloom_filter_fp_chance`` to a numerically lower
+  number (such as 0.01) to avoid excess IO operations
+- Users with less RAM, more dense nodes, or very fast disks may tolerate a higher ``bloom_filter_fp_chance`` in order to
+  save RAM at the expense of excess IO operations
+- In workloads that rarely read, or that only perform reads by scanning the entire data set (such as analytics
+  workloads), setting the ``bloom_filter_fp_chance`` to a much higher number is acceptable.
+
+Changing
+^^^^^^^^
+
+The bloom filter false positive chance is visible in the ``DESCRIBE TABLE`` output as the field
+``bloom_filter_fp_chance``. Operators can change the value with an ``ALTER TABLE`` statement:
+
+::
+
+    ALTER TABLE keyspace.table WITH bloom_filter_fp_chance = 0.01;
+
+Operators should be aware, however, that this change is not immediate: the bloom filter is calculated when the file is
+written, and persisted on disk as the Filter component of the SSTable. Upon issuing an ``ALTER TABLE`` statement, new
+files on disk will be written with the new ``bloom_filter_fp_chance``, but existing SSTables will not be modified until
+they are compacted - if an operator needs a change to ``bloom_filter_fp_chance`` to take effect, they can trigger an
+SSTable rewrite using ``nodetool scrub`` or ``nodetool upgradesstables -a``, both of which will rebuild the SSTables on
+disk, regenerating the bloom filters in the process.
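+
+For example, to rewrite the SSTables of a single table (keyspace and table names are placeholders):
+
+::
+
+    nodetool upgradesstables -a keyspace table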
+
+
+Compression
+-----------
+
+Cassandra offers operators the ability to configure compression on a per-table basis. Compression reduces the size of
+data on disk by compressing the SSTable in user-configurable compression chunks (``chunk_length_in_kb``). Because
+Cassandra SSTables are immutable, the CPU cost of compressing is only necessary when the SSTable is written -
+subsequent updates to data will land in different SSTables, so Cassandra will not need to decompress, overwrite, and
+recompress data when UPDATE commands are issued. On reads, Cassandra will locate the relevant compressed chunks on
+disk, decompress the full chunk, and then proceed with the remainder of the read path (merging data from disks and
+memtables, read repair, and so on).
+
+Configuring Compression
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Compression is configured on a per-table basis as an optional argument to ``CREATE TABLE`` or ``ALTER TABLE``. Three
+options are relevant:
+
+- ``class`` specifies the compression class - Cassandra provides three classes (``LZ4Compressor``,
+  ``SnappyCompressor``, and ``DeflateCompressor``). The default is ``LZ4Compressor``.
+- ``chunk_length_in_kb`` specifies the number of kilobytes of data per compression chunk. The default is 64KB.
+- ``crc_check_chance`` determines how likely Cassandra is to verify the checksum on each compression chunk during
+  reads. The default is 1.0.
+
+Users can set compression using the following syntax:
+
+::
+
+    CREATE TABLE keyspace.table (id int PRIMARY KEY) WITH compression = {'class': 'LZ4Compressor'};
+
+Or
+
+::
+
+    ALTER TABLE keyspace.table WITH compression = {'class': 'SnappyCompressor', 'chunk_length_in_kb': 128, 'crc_check_chance': 0.5};
+
+Once enabled, compression can be disabled with ``ALTER TABLE`` setting ``enabled`` to ``false``:
+
+::
+
+    ALTER TABLE keyspace.table WITH compression = {'enabled':'false'};
+
+Operators should be aware, however, that changing compression is not immediate. The data is compressed when the SSTable
+is written, and as SSTables are immutable, the compression will not be modified until the table is compacted. Upon
+issuing a change to the compression options via ``ALTER TABLE``, the existing SSTables will not be modified until they
+are compacted - if an operator needs compression changes to take effect immediately, the operator can trigger an SSTable
+rewrite using ``nodetool scrub`` or ``nodetool upgradesstables -a``, both of which will rebuild the SSTables on disk,
+re-compressing the data in the process.
+
+Benefits and Uses
+^^^^^^^^^^^^^^^^^
+
+Compression's primary benefit is that it reduces the amount of data written to disk. Not only does the reduced size
+save in storage requirements, it often increases read and write throughput, as the CPU time spent compressing data is
+less than the time it would take to read or write the larger volume of uncompressed data from disk.
+
+Compression is most useful in tables comprised of many rows, where the rows are similar in nature. Tables containing
+similar text columns (such as repeated JSON blobs) often compress very well.
+
+Operational Impact
+^^^^^^^^^^^^^^^^^^
+
+- Compression metadata is stored offheap and scales with data on disk. This often requires 1-3GB of offheap RAM per
+  terabyte of data on disk, though the exact usage varies with ``chunk_length_in_kb`` and compression ratios.
+
+- Streaming operations involve compressing and decompressing data on compressed tables - in some code paths (such as
+  non-vnode bootstrap), the CPU overhead of compression can be a limiting factor.
+
+- The compression path checksums data to ensure correctness - while the traditional Cassandra read path does not have a
+  way to ensure correctness of data on disk, compressed tables allow the user to set ``crc_check_chance`` (a float from
+  0.0 to 1.0) to allow Cassandra to probabilistically validate chunks on read to verify bits on disk are not corrupt.
+
+Advanced Use
+^^^^^^^^^^^^
+
+Advanced users can provide their own compression class by implementing the interface at
+``org.apache.cassandra.io.compress.ICompressor``.
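+
+A custom implementation is then referenced by its fully-qualified class name in the table's compression options; for
+example (the class name is purely illustrative):
+
+::
+
+    ALTER TABLE keyspace.table WITH compression = {'class': 'com.example.MyCompressor'};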
+
+Backups
+-------
+
+.. todo:: todo
+
+Monitoring
+----------
+
+JMX
+^^^
+
+.. todo:: todo
+
+Metric Reporters
+^^^^^^^^^^^^^^^^
+
+.. todo:: todo
+
+Security
+--------
+
+Roles
+^^^^^
+
+.. todo:: todo
+
+JMX access
+^^^^^^^^^^
+
+.. todo:: todo
+
+Nodetool (and other tooling)
+----------------------------
+
+.. todo:: Try to autogenerate this from Nodetool’s help.
+
+Hardware Choices
+----------------
+
+Like most databases, Cassandra throughput improves with more CPU cores, more RAM, and faster disks. While Cassandra can
+be made to run on small servers for testing or development environments (including Raspberry Pis), a minimal production
+server requires at least 2 cores, and at least 8GB of RAM. Typical production servers have 8 or more cores and at least
+32GB of RAM.
+
+CPU
+^^^
+
+Cassandra is highly concurrent, handling many simultaneous requests (both read and write) using multiple threads running
+on as many CPU cores as possible. The Cassandra write path tends to be heavily optimized (writing to the commitlog and
+then inserting the data into the memtable), so writes, in particular, tend to be CPU bound. Consequently, adding
+additional CPU cores often increases throughput of both reads and writes.
+
+Memory
+^^^^^^
+
+Cassandra runs within a Java VM, which will pre-allocate a fixed-size heap (the JVM's ``-Xmx`` parameter). In addition to
+the heap, Cassandra will use significant amounts of RAM offheap for compression metadata, bloom filters, row, key, and
+counter caches, and an in-process page cache. Finally, Cassandra will take advantage of the operating system's page
+cache, storing recently accessed portions of files in RAM for rapid re-use.
+
+For optimal performance, operators should benchmark and tune their clusters based on their individual workload. However,
+basic guidelines suggest:
+
+-  ECC RAM should always be used, as Cassandra has few internal safeguards to protect against bit level corruption
+-  The Cassandra heap should be no less than 2GB, and no more than 50% of your system RAM
+-  Heaps smaller than 12GB should consider ParNew/ConcurrentMarkSweep garbage collection
+-  Heaps larger than 12GB should consider G1GC
+
+Disks
+^^^^^
+
+Cassandra persists data to disk for two very different purposes. The first is to the commitlog when a new write is made
+so that it can be replayed after a crash or system shutdown. The second is to the data directory when thresholds are
+exceeded and memtables are flushed to disk as SSTables.
+
+Commitlogs receive every write made to a Cassandra node and have the potential to block client operations, but they are
+only ever read on node start-up. SSTable (data file) writes, on the other hand, occur asynchronously, but are read to
+satisfy client look-ups. SSTables are also periodically merged and rewritten in a process called compaction. The data
+held in the commitlog directory is data that has not been permanently saved to the SSTable data directories - it will be
+periodically purged once it is flushed to the SSTable data files.
+
+Cassandra performs very well on both spinning hard drives and solid state disks. In both cases, Cassandra's sorted
+immutable SSTables allow for linear reads, few seeks, and few overwrites, maximizing throughput for HDDs and lifespan of
+SSDs by avoiding write amplification. However, when using spinning disks, it's important that the commitlog
+(``commitlog_directory``) be on one physical disk (not simply a partition, but a physical disk), and the data files
+(``data_file_directories``) be set to a separate physical disk. By separating the commitlog from the data directory,
+writes can benefit from sequential appends to the commitlog without having to seek around the platter as reads request
+data from various SSTables on disk.
+
+In most cases, Cassandra is designed to provide redundancy via multiple independent, inexpensive servers. For this
+reason, using NFS or a SAN for data directories is an antipattern and should typically be avoided. Similarly, servers
+with multiple disks are often better served by using RAID0 or JBOD than RAID1 or RAID5 - replication provided by
+Cassandra obsoletes the need for replication at the disk layer, so it's typically recommended that operators take
+advantage of the additional throughput of RAID0 rather than protecting against failures with RAID1 or RAID5.
+
+Common Cloud Choices
+^^^^^^^^^^^^^^^^^^^^
+
+Many large users of Cassandra run in various clouds, including AWS, Azure, and GCE - Cassandra will happily run in any
+of these environments. Users should choose similar hardware to what would be needed in physical space. In EC2, popular
+options include:
+
+- m1.xlarge instances, which provide 1.6TB of local ephemeral spinning storage and sufficient RAM to run moderate
+  workloads
+- i2 instances, which provide both a high RAM:CPU ratio and local ephemeral SSDs
+- m4.2xlarge / c4.4xlarge instances, which provide modern CPUs, enhanced networking and work well with EBS GP2 (SSD)
+  storage
+
+Generally, disk and network performance increases with instance size and generation, so newer generations of instances
+and larger instance types within each family often perform better than their smaller or older alternatives.

http://git-wip-us.apache.org/repos/asf/cassandra/blob/cad277be/doc/source/troubleshooting.rst
----------------------------------------------------------------------
diff --git a/doc/source/troubleshooting.rst b/doc/source/troubleshooting.rst
new file mode 100644
index 0000000..2e5cf10
--- /dev/null
+++ b/doc/source/troubleshooting.rst
@@ -0,0 +1,20 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..     http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+Troubleshooting
+===============
+
+.. TODO: todo

