cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "Operations" by Chris Goffinet
Date Tue, 08 Dec 2009 23:46:47 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "Operations" page has been changed by Chris Goffinet.
http://wiki.apache.org/cassandra/Operations?action=diff&rev1=5&rev2=6

--------------------------------------------------

  The following applies to Cassandra 0.5, which is currently in '''beta'''.
  
  = Ring management =
- 
  Each Cassandra server [node] is assigned a unique Token that determines what keys it is
the primary replica for.  If you sort all nodes' Tokens, the Range of keys each is responsible
for is (!PreviousToken, !MyToken], that is, from the previous token (exclusive) to the node's
token (inclusive).  The machine with the lowest Token gets both all keys less than that token,
and all keys greater than the largest Token; this is called a "wrapping Range."
  
  (Note that there is nothing special about being the "primary" replica, in the sense of being
a point of failure.)
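The range lookup described above can be sketched in a few lines. This is a toy model of the (!PreviousToken, !MyToken] rule, not Cassandra's actual implementation; the token values are illustrative:

```python
import bisect

def primary_replica(key_token, sorted_tokens):
    """Return the Token of the node that is the primary replica for key_token.

    Each node owns (PreviousToken, MyToken]; the node with the lowest Token
    also owns everything above the highest Token (the "wrapping Range").
    """
    # Find the first node Token >= key_token.
    i = bisect.bisect_left(sorted_tokens, key_token)
    if i == len(sorted_tokens):
        return sorted_tokens[0]  # wrapping Range: key is above the largest Token
    return sorted_tokens[i]

tokens = [10, 40, 70]                # illustrative node Tokens, sorted
print(primary_replica(25, tokens))   # 40: 25 falls in (10, 40]
print(primary_replica(40, tokens))   # 40: the Range is inclusive of the node's Token
print(primary_replica(85, tokens))   # 10: wrapping Range
```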
@@ -11, +10 @@

  When the !RandomPartitioner is used, Tokens are integers from 0 to 2**127.  Keys are converted
to this range by MD5 hashing for comparison with Tokens.  (Thus, keys are always convertible
to Tokens, but the reverse is not always true.)
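The key-to-Token conversion can be sketched as follows. This is illustrative only; Cassandra's exact byte handling for !RandomPartitioner may differ:

```python
import hashlib

def key_to_token(key):
    """Map a key into the 0..2**127 Token space by MD5 hashing.

    A sketch of the idea only -- Cassandra's exact conversion may differ.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Fold the 128-bit MD5 digest into the 127-bit Token space.
    return int.from_bytes(digest, "big") >> 1

token = key_to_token("user:42")
assert 0 <= token < 2**127
```

Note the one-way nature mentioned above: any key hashes to a Token, but a Token cannot be mapped back to a key.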
  
  == Token selection ==
- 
  Using a strong hash function means !RandomPartitioner keys will, on average, be evenly spread
across the Token space, but you can still have imbalances if your Tokens do not divide up
the range evenly, so you should specify !InitialToken to your first nodes as `i * (2**127
/ N)` for i = 1 .. N.
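The `i * (2**127 / N)` formula above works out to evenly spaced Tokens, which a short sketch makes concrete:

```python
def initial_tokens(n):
    """Evenly spaced InitialTokens for the first n nodes: i * (2**127 / n),
    for i = 1 .. n, per the formula above (integer division for exactness)."""
    return [i * (2**127 // n) for i in range(1, n + 1)]

for t in initial_tokens(4):
    print(t)
```

Each node would then be configured with one of these values as its !InitialToken.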
  
  With order preserving partitioners, your key distribution will be application-dependent. 
You should still take your best guess at specifying initial tokens (guided by sampling actual
data, if possible), but you will be more dependent on active load balancing (see below) and/or
adding new nodes to hot spots.
@@ -19, +17 @@

  Once data is placed on the cluster, the partitioner may not be changed without wiping and
starting over.
  
  == Replication ==
+ A Cassandra cluster always divides up the key space into ranges delimited by Tokens as described
above, but additional replica placement is customizable via !IReplicaPlacementStrategy in
the configuration file.  The standard strategies are
  
- A Cassandra cluster always divides up the key space into ranges delimited by Tokens as described
above, but additional replica placement is customizable via !IReplicaPlacementStrategy in
the configuration file.  The standard strategies are
   * !RackUnawareStrategy: replicas are always placed on the next (in increasing Token order)
N-1 nodes along the ring
   * !RackAwareStrategy: replica 2 is is placed in the first node along the ring the belongs
in '''another''' data center than the first; the remaining N-2 replicas, if any, are placed
on the first nodes along the ring in the '''same''' rack as the first
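The !RackUnawareStrategy placement can be sketched as below. This is a toy model of "the next N-1 nodes along the ring", not Cassandra's actual strategy code (!RackAwareStrategy would additionally need data-center and rack topology):

```python
def rack_unaware_replicas(primary_index, ring, n):
    """RackUnawareStrategy sketch: the primary replica plus the next n-1
    nodes along the ring, in increasing Token order, wrapping at the end."""
    return [ring[(primary_index + i) % len(ring)] for i in range(n)]

ring = ["A", "B", "C", "D"]               # nodes in increasing Token order
print(rack_unaware_replicas(2, ring, 3))  # ['C', 'D', 'A'] -- wraps past 'D'
```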
  
@@ -29, +27 @@

  Replication strategy may not be changed without wiping your data and starting over.
  
  = Adding new nodes =
- 
  Adding new nodes is called "bootstrapping."
  
  To bootstrap a node, turn !AutoBootstrap on in the configuration file, and start it.
@@ -37, +34 @@

  If you explicitly specify an !InitialToken in the configuration, the new node will bootstrap
to that position on the ring.  Otherwise, it will pick a Token that will give it half the
keys from the node with the most disk space used that does not already have another node
bootstrapping into its Range.
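Taking "half the keys" from the most-loaded node amounts to bisecting that node's Range, which can be sketched as:

```python
def bootstrap_token(prev_token, loaded_token):
    """Sketch of automatic Token selection: bisect the most-loaded node's
    Range (prev_token, loaded_token] so the new node takes roughly half its
    keys.  Ignores the wrapping-Range case for simplicity."""
    return prev_token + (loaded_token - prev_token) // 2

print(bootstrap_token(100, 200))  # 150: the new node takes (100, 150]
```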
  
  Important things to note:
+ 
   1. You should wait long enough for all the nodes in your cluster to become aware of the
bootstrapping node via gossip before starting another bootstrap.  For most clusters 30s will
be plenty of time.
   1. Automatically picking a Token only allows doubling your cluster size at once; for more
than that, let the first group finish before starting another.
   1. As a safety measure, Cassandra does not automatically remove data from nodes that "lose"
part of their Token Range to a newly added node.  Run "nodeprobe cleanup" on the source node(s)
when you are satisfied the new node is up and working. If you do not do this the old data
will still be counted against the load on that node and future bootstrap attempts at choosing
a location will be thrown off.
  
  = Removing nodes entirely =
- 
  You can take a node out of the cluster with `nodeprobe decommission`.  The node must be
live at decommission time (until CASSANDRA-564 is done).
  
  Again, no data is removed automatically, so if you want to put the node back into service
and you don't need the data on it anymore, it should be removed manually.
  
  = Moving nodes =
- 
  Moving is essentially a convenience over decommission + bootstrap.
  
  == Load balancing ==
- 
  Also essentially a convenience over decommission + bootstrap, only instead of telling the
node where to move on the ring it will choose its location based on the same heuristic as
Token selection on bootstrap.
  
  = Consistency =
- 
- Cassandra allows clients to specify the desired consistency level on reads and writes. 
(See [[API]].)  If R + W > N, where R, W, and N are respectively the read replica count,
the write replica count, and the replication factor, all client reads will see the most recent
write.  Otherwise, readers '''may''' see older versions, for periods of typically a few ms;
this is called "eventual consistency."  See [[http://www.allthingsdistributed.com/2008/12/eventually_consistent.html]]
and [[http://queue.acm.org/detail.cfm?id=1466448]] for more.
+ Cassandra allows clients to specify the desired consistency level on reads and writes. 
(See [[API]].)  If R + W > N, where R, W, and N are respectively the read replica count,
the write replica count, and the replication factor, all client reads will see the most recent
write.  Otherwise, readers '''may''' see older versions, for periods of typically a few ms;
this is called "eventual consistency."  See http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
and http://queue.acm.org/detail.cfm?id=1466448 for more.
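The R + W > N rule above is a simple overlap check, sketched here with a few common choices of consistency level (the QUORUM labels in the comments are illustrative):

```python
def reads_see_latest_write(r, w, n):
    """R + W > N guarantees that read and write replica sets overlap in at
    least one node, so every read sees the most recent write; otherwise
    consistency is only eventual."""
    return r + w > n

print(reads_see_latest_write(2, 2, 3))  # True: e.g. QUORUM reads and writes at N=3
print(reads_see_latest_write(1, 1, 3))  # False: eventual consistency only
```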
  
  == Repairing missing or inconsistent data ==
- 
  Cassandra repairs data in two ways:
  
   1. Read Repair: every time a read is performed, Cassandra compares the versions at each
replica (in the background, if a low consistency was requested by the reader to minimize latency),
and the newest version is sent to any out-of-date replicas.
   1. Anti-Entropy: when `nodeprobe repair` is run, Cassandra performs a major compaction,
computes a Merkle Tree of the data on that node, and compares it with the versions on other
replicas, to catch any out of sync data that hasn't been read recently.  This is intended
to be run infrequently (e.g., weekly) since major compaction is relatively expensive.
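The Merkle Tree comparison behind Anti-Entropy can be sketched with a toy tree. This illustrates the idea only (hash leaves, hash pairs up to a root, compare roots across replicas); it is not Cassandra's implementation:

```python
import hashlib

def merkle_root(rows):
    """Toy Merkle root: hash each row, then pairwise-hash levels up to a
    single root.  Equal roots mean the replicas agree; differing roots flag
    data that needs repair."""
    level = [hashlib.md5(r.encode("utf-8")).hexdigest() for r in rows]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [hashlib.md5((a + b).encode("utf-8")).hexdigest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

replica_a = ["row1=v1", "row2=v2"]
replica_b = ["row1=v1", "row2=STALE"]
print(merkle_root(replica_a) == merkle_root(["row1=v1", "row2=v2"]))  # True
print(merkle_root(replica_a) == merkle_root(replica_b))               # False
```

Comparing compact trees instead of whole datasets is what makes Anti-Entropy cheap on the network, at the cost of the major compaction needed to build them.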
  
  == Handling failure ==
+ If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to
deal with any inconsistent data.  If a node goes down entirely, you should be aware of the
following as well:
  
- If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to
deal with any inconsistent data.  If a node goes down entirely, you should be aware of the
following as well:
   1. Remove the old node from the ring first, or bring up a replacement node with the same
IP and Token as the old; otherwise, the old node will stay part of the ring in a "down" state,
which will degrade your replication factor for the affected Range
    * If you don't know the Token of the old node, you can retrieve it from any of the other
nodes' `system` keyspace, !ColumnFamily `LocationInfo`, key `L`.
+   * You can also run `nodeprobe ring` to look up a node's token (unless there was some
kind of outage and the other nodes came up but not the down one).
   1. Removing the old node, then bootstrapping the new one, may be more performant than using
Anti-Entropy.  Testing needed.
    * Even brute-force rsyncing of data from the relevant replicas and running cleanup on
the replacement node may be more performant
  
  = Backing up data =
- 
  Cassandra can snapshot data while online using `nodeprobe snapshot`.  You can then back
up those snapshots using any desired system, although leaving them where they are is probably
the option that makes the most sense on large clusters.
  
  Currently, only flushed data is snapshotted (not data that only exists in the commitlog).
 Run `nodeprobe flush` first and wait for that to complete, to make sure you get '''all'''
data in the snapshot.
@@ -83, +76 @@

  To revert to a snapshot, shut down the node, clear out the old commitlog and sstables, and
move the sstables from the snapshot location to the live data directory.
  
  == Import / export ==
- 
  Cassandra can also export data as JSON with `bin/sstable2json`, and import it with `bin/json2sstable`.
 Eric to document. :)
  
  = Monitoring =
- 
  Cassandra exposes internal metrics as JMX data.  This is a common standard in the JVM world;
OpenNMS, Nagios, and Munin at least offer some level of JMX support.
  
- Chris (or other volunteer :) to describe some important metrics to watch
+ Running `nodeprobe cfstats` provides an overview of each Column Family and the important
metrics to graph for your cluster. For those who prefer non-JMX clients, there is a JMX-to-REST
bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/
  
+ Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency,
Write Count and Write Latency'''. '''Pending Tasks''' tells you whether things are backing up.
These metrics can also be exposed using any JMX client such as `jconsole`.
+ 
+ For debugging purposes you can use jconsole's MBeans tab to look at PendingTasks
for the thread pools. If you see one particular thread backing up, this can give you an indication
of a problem. One example would be ROW-MUTATION-STAGE: if a lot of tasks are building
up there, your hardware or configuration tuning is probably the bottleneck.
+ 
+ Running `nodeprobe tpstats` will dump all of those thread pools to the console if you don't
want to use jconsole. Example:
+ 
+ {{{
+ Pool Name                    Active   Pending      Completed
+ FILEUTILS-DELETE-POOL             0         0            119
+ MESSAGING-SERVICE-POOL            3         4       81594002
+ STREAM-STAGE                      0         0              3
+ RESPONSE-STAGE                    0         0       48353537
+ ROW-READ-STAGE                    0         0          13754
+ LB-OPERATIONS                     0         0              0
+ COMMITLOG                         1         0       78080398
+ GMFD                              0         0        1091592
+ MESSAGE-DESERIALIZER-POOL         0         0      126022919
+ LB-TARGET                         0         0              0
+ CONSISTENCY-MANAGER               0         0           2899
+ ROW-MUTATION-STAGE                1         2       81719765
+ MESSAGE-STREAMING-POOL            0         0            129
+ LOAD-BALANCER-STAGE               0         0              0
+ FLUSH-SORTER-POOL                 0         0            218
+ MEMTABLE-POST-FLUSHER             0         0            218
+ COMPACTION-POOL                   0         0            464
+ FLUSH-WRITER-POOL                 0         0            218
+ HINTED-HANDOFF-POOL               0         0            154
+ }}}
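A short script can watch that output for backed-up pools. This is a sketch against the column layout shown above; the sample text and threshold are illustrative:

```python
SAMPLE = """\
Pool Name                    Active   Pending      Completed
ROW-READ-STAGE                    0         0          13754
ROW-MUTATION-STAGE                1         2       81719765
MESSAGING-SERVICE-POOL            3         4       81594002
"""

def backed_up_pools(tpstats_text, threshold=1):
    """Return (pool name, Pending count) pairs whose Pending column exceeds
    threshold, parsing the whitespace-separated nodeprobe tpstats layout."""
    result = []
    for line in tpstats_text.splitlines()[1:]:   # skip the header row
        name, active, pending, completed = line.split()
        if int(pending) > threshold:
            result.append((name, int(pending)))
    return result

print(backed_up_pools(SAMPLE))  # [('ROW-MUTATION-STAGE', 2), ('MESSAGING-SERVICE-POOL', 4)]
```

In practice you would feed this the live `nodeprobe tpstats` output and alert on sustained non-zero Pending counts.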
+ 
