cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "Operations" by JonathanEllis
Date Tue, 17 May 2011 15:50:40 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "Operations" page has been changed by JonathanEllis.
The comment on this change is: add NTS notes to token selection.
http://wiki.apache.org/cassandra/Operations?action=diff&rev1=89&rev2=90

--------------------------------------------------

  === Token selection ===
  Using a strong hash function means !RandomPartitioner keys will, on average, be evenly spread
across the Token space, but you can still have imbalances if your Tokens do not divide up
the range evenly, so you should specify !InitialToken to your first nodes as `i * (2**127
/ N)` for i = 0 .. N-1. In Cassandra 0.7, you should specify `initial_token` in `cassandra.yaml`.
  
+ With !NetworkTopologyStrategy, you should calculate the tokens for the nodes in each DC independently.
Tokens still need to be unique, so you can add 1 to the tokens in the 2nd DC, add 2 in the
3rd, and so on.  Thus, for a 4-node cluster in 2 datacenters, you would have
+ {{{
+ DC1
+ node 1 = 0
+ node 2 = 85070591730234615865843651857942052864
+ 
+ DC2
+ node 1 = 1
+ node 2 = 85070591730234615865843651857942052865
+ }}}
+ 
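+ A minimal sketch of that per-DC calculation, in the same spirit as the token-generating script
further down (the `tokens_per_dc` helper and its node counts are only illustrative):
+ {{{
+ def tokens_per_dc(nodes_per_dc):
+     # compute each DC's tokens independently, then add the DC's index
+     # as an offset so no two nodes share exactly the same token
+     for dc, nodes in enumerate(nodes_per_dc):
+         print "DC%d" % (dc + 1)
+         for i in xrange(nodes):
+             print "node %d = %d" % (i + 1, 2 ** 127 / nodes * i + dc)
+ 
+ tokens_per_dc([2, 2])   # two datacenters with two nodes each, as above
+ }}}
+ 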
  With order preserving partitioners, your key distribution will be application-dependent.
 You should still take your best guess at specifying initial tokens (guided by sampling actual
data, if possible), but you will be more dependent on active load balancing (see below) and/or
adding new nodes to hot spots.
  
  Once data is placed on the cluster, the partitioner may not be changed without wiping and
starting over.
@@ -40, +51 @@

  Replication factor is not really intended to be changed in a live cluster either, but increasing
it is conceptually simple: update the replication_factor from the CLI (see below), then run
repair against each node in your cluster so that all the new replicas that are supposed to
have the data, actually do.
  
  Until repair is finished, you have 3 options:
+ 
   * read at ConsistencyLevel.QUORUM or ALL (depending on your existing replication factor)
to make sure that a replica that actually has the data is consulted
   * continue reading at lower CL, accepting that some requests will fail (usually only the
first for a given query, if ReadRepair is enabled)
   * take downtime while repair runs
@@ -49, +61 @@

  Reducing replication factor is easily done and only requires running cleanup afterwards
to remove extra replicas.
  
  To update the replication factor on a live cluster, forget about cassandra.yaml. Rather,
you want to use '''cassandra-cli''':
+ 
-     update keyspace Keyspace1 with replication_factor = 3;
+  . update keyspace Keyspace1 with replication_factor = 3;
  
  === Network topology ===
  Besides datacenters, you can also tell Cassandra which nodes are in the same rack within
a datacenter.  Cassandra will use this to route both reads and data movement for Range changes
to the nearest replicas.  This is configured by a user-pluggable !EndpointSnitch class in
the configuration file.
@@ -97, +110 @@

  
  Here's a python program which can be used to calculate new tokens for the nodes. There's
more info on the subject in Ben Black's presentation at Cassandra Summit 2010: http://www.datastax.com/blog/slides-and-videos-cassandra-summit-2010
  
-   def tokens(nodes):                        
+  . def tokens(nodes):
-       for x in xrange(nodes):         
+   . for x in xrange(nodes):
-           print 2 ** 127 / nodes * x
+    . print 2 ** 127 / nodes * x
  
- In versions of Cassandra 0.7.* and lower, there's also `nodetool loadbalance`: essentially
a convenience over decommission + bootstrap, only instead of telling the target node where
to move on the ring it will choose its location based on the same heuristic as Token selection
on bootstrap. You should not use this as it doesn't rebalance the entire ring. 
+ In versions of Cassandra 0.7.* and lower, there's also `nodetool loadbalance`: essentially
a convenience over decommission + bootstrap, only instead of telling the target node where
to move on the ring it will choose its location based on the same heuristic as Token selection
on bootstrap. You should not use this as it doesn't rebalance the entire ring.
  
- The status of move and balancing operations can be monitored using `nodetool` with the `netstat`
argument. 
+ The status of move and balancing operations can be monitored using `nodetool` with the `netstats`
argument.  (Cassandra 0.6.* and lower use the `streams` argument).
- (Cassandra 0.6.* and lower use the `streams` argument).
  
  == Consistency ==
  Cassandra allows clients to specify the desired consistency level on reads and writes. 
(See [[API]].)  If R + W > N, where R, W, and N are respectively the read replica count,
the write replica count, and the replication factor, all client reads will see the most recent
write.  Otherwise, readers '''may''' see older versions, for periods of typically a few ms;
this is called "eventual consistency."  See http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
and http://queue.acm.org/detail.cfm?id=1466448 for more.
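+ 
+ As a quick worked example of the R + W > N rule (the replica counts below are illustrative,
chosen for a replication factor of 3):
+ {{{
+ def sees_latest_write(r, w, n):
+     # a read overlaps at least one replica that took the write when R + W > N
+     return r + w > n
+ 
+ n = 3                     # replication factor
+ quorum = n / 2 + 1        # 2 when n = 3
+ print sees_latest_write(quorum, quorum, n)   # QUORUM reads + QUORUM writes -> True
+ print sees_latest_write(1, 1, n)             # CL.ONE reads and writes -> False (eventual)
+ }}}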
@@ -115, +127 @@

  Cassandra repairs data in two ways:
  
   1. Read Repair: every time a read is performed, Cassandra compares the versions at each
replica (in the background, if a low consistency was requested by the reader to minimize latency),
and the newest version is sent to any out-of-date replicas.
-  1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree for each
range of data on that node, and compares it with the versions on other replicas, to catch
any out of sync data that hasn't been read recently.  This is intended to be run infrequently
(e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU,
since it scans ALL the data on the machine (but it is is very network efficient).  
+  1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree for each
range of data on that node, and compares it with the versions on other replicas, to catch
any out of sync data that hasn't been read recently.  This is intended to be run infrequently
(e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU,
because it scans ALL the data on the machine (but it is very network efficient).  A toy sketch
of the comparison idea follows this list.
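+ 
+ The following is only a toy illustration of the Merkle-tree comparison idea, not Cassandra's
actual implementation; the range contents and hash function are made up:
+ {{{
+ import hashlib
+ 
+ def merkle(ranges):
+     # hash each range, then build a binary hash tree bottom-up
+     # (assumes the number of ranges is a power of two, for brevity)
+     level = [hashlib.md5(r).hexdigest() for r in ranges]
+     tree = [level]
+     while len(level) > 1:
+         level = [hashlib.md5(level[i] + level[i + 1]).hexdigest()
+                  for i in xrange(0, len(level), 2)]
+         tree.append(level)
+     return tree
+ 
+ def out_of_sync(tree_a, tree_b):
+     # matching roots mean no data needs to be streamed; otherwise the
+     # differing leaves identify the ranges that must be exchanged
+     if tree_a[-1] == tree_b[-1]:
+         return []
+     return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
+ 
+ replica1 = ["range0", "range1", "range2", "range3"]
+ replica2 = ["range0", "range1-stale", "range2", "range3"]
+ print out_of_sync(merkle(replica1), merkle(replica2))   # -> [1]
+ }}}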
  
- Running `nodetool repair`:
- Like all nodetool operations in 0.7, repair is blocking: it will wait for the repair to
finish and then exit.  This may take a long time on large data sets.
+ Running `nodetool repair`: Like all nodetool operations in 0.7, repair is blocking: it will
wait for the repair to finish and then exit.  This may take a long time on large data sets.
  
  It is safe to run repair against multiple machines at the same time, but to minimize the
impact on your application workload it is recommended to wait for it to complete on one node
before invoking it against the next.
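+ 
+ One simple way to follow that advice is to drive repair from a script that only moves on
once the previous node has finished; the host list here is hypothetical:
+ {{{
+ import subprocess
+ 
+ hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical node addresses
+ 
+ for host in hosts:
+     # `nodetool repair` blocks until the repair on that node completes,
+     # so this loop naturally repairs one node at a time
+     subprocess.check_call(["nodetool", "-h", host, "repair"])
+ }}}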
  
  === Frequency of nodetool repair ===
- 
- Unless your application performs no deletes, it is vital that production clusters run `nodetool
repair` periodically on all nodes in the cluster. The hard requirement for repair frequency
is the value used for GCGraceSeconds (see [[DistributedDeletes]]). Running nodetool repair
often enough to guarantee that all nodes have performed a repair in a given period GCGraceSeconds
long, ensures that deletes are not "forgotten" in the cluster.
+ Unless your application performs no deletes, it is vital that production clusters run `nodetool
repair` periodically on all nodes in the cluster. The hard requirement for repair frequency
is the value used for GCGraceSeconds (see DistributedDeletes). Running nodetool repair often
enough to guarantee that every node has performed a repair within any given GCGraceSeconds-long
window ensures that deletes are not "forgotten" in the cluster.
  
  Consider how to schedule your repairs. A repair causes additional disk and CPU activity
on the nodes participating in the repair, and it will typically be a good idea to spread repairs
out over time so as to minimize the chances of repairs running concurrently on many nodes.
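+ 
+ For example, with the default GCGraceSeconds of 10 days you might stagger the nodes evenly
across that window (a rough sketch; the cluster size is illustrative):
+ {{{
+ gc_grace_seconds = 10 * 24 * 3600    # the default GCGraceSeconds
+ nodes = 6                            # illustrative cluster size
+ interval = gc_grace_seconds / nodes  # gap between consecutive repairs
+ 
+ for i in xrange(nodes):
+     # every node still completes a repair well within one GCGraceSeconds window
+     print "node %d: start repair at +%d hours" % (i, i * interval / 3600)
+ }}}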
  
  ==== Dealing with the consequences of nodetool repair not running within GCGraceSeconds
====
- 
- If `nodetool repair` has not been run often enough to the pointthat GCGraceSeconds has passed,
you risk forgotten deletes (see [[DistributedDeletes]]). In addition to data popping up that
has been deleted, you may see inconsistencies in data return from different nodes that will
not self-heal by read-repair or further `nodetool repair`. Some further details on this latter
effect is documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
+ If `nodetool repair` has not been run often enough to the point that GCGraceSeconds has passed,
you risk forgotten deletes (see DistributedDeletes). In addition to data popping up that has
been deleted, you may see inconsistencies in data returned from different nodes that will not
self-heal by read-repair or further `nodetool repair`. Some further details on this latter
effect are documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
  
  There are at least three ways to deal with this scenario.
  
   1. Treat the node in question as failed, and replace it as described further below.
-  2. To minimize the amount of forgotten deletes, first increase GCGraceSeconds across the
cluster (rolling restart required), perform a full repair on all nodes, and then change GCRaceSeconds
back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing
the amount of data that may "pop back up" (forgotten delete).
+  1. To minimize the amount of forgotten deletes, first increase GCGraceSeconds across the
cluster (rolling restart required), perform a full repair on all nodes, and then change GCGraceSeconds
back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing
the amount of data that may "pop back up" (forgotten delete).
-  3. Yet another option, that will result in more forgotten deletes than the previous suggestion
but is easier to do, is to ensure 'nodetool repair' has been run on all nodes, and then perform
a compaction to expire toombstones. Following this, read-repair and regular `nodetool repair`
should cause the cluster to converge.
+  1. Yet another option, which will result in more forgotten deletes than the previous suggestion
but is easier to do, is to ensure `nodetool repair` has been run on all nodes, and then perform
a compaction to expire tombstones. Following this, read-repair and regular `nodetool repair`
should cause the cluster to converge.
  
  === Handling failure ===
  If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to
deal with any inconsistent data.  Remember though that if a node misses updates and is not
repaired for longer than your configured GCGraceSeconds (default: 10 days), it could have
missed remove operations permanently.  Unless your application performs no removes, you should
wipe its data directory, re-bootstrap it, and removetoken its old entry in the ring (see below).
@@ -158, +167 @@

  {{{
  Exception in thread "main" java.io.IOException: Cannot run program "ln": java.io.IOException:
error=12, Cannot allocate memory
  }}}
- This is caused by the operating system trying to allocate the child "ln" process a memory
space as large as the parent process (the cassandra server), even though '''it's not going
to use it'''. So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the
cassandra server, it will fail during this because the operating system wants 12 GB of virtual
memory before allowing you to create the process. 
+ This is caused by the operating system trying to allocate the child "ln" process a memory
space as large as the parent process (the cassandra server), even though '''it's not going
to use it'''. So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the
cassandra server, the snapshot will fail because the operating system wants 12GB of virtual
memory before allowing you to create the process.
  
  This error can be worked around by either:
  
@@ -167, +176 @@

  OR
  
   * creating a swap file, snapshotting, removing swap file
+ 
  OR
+ 
   * turning on "memory overcommit"
  
  To restore a snapshot:
@@ -210, +221 @@

  
  Running `nodetool cfstats` can provide an overview of each Column Family, and important
metrics to graph your cluster. For those who prefer to deal with non-JMX clients, there is
a JMX-to-REST bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/
  
- Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency,
Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These
metrics can also be exposed using any JMX client such as `jconsole`.  (See also [[http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html]]
for how to proxy JConsole to firewalled machines.)
+ Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency,
Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These
metrics can also be exposed using any JMX client such as `jconsole`.  (See also http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html
for how to proxy JConsole to firewalled machines.)
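+ 
+ For a quick command-line view of those numbers without a JMX client, something along these
lines (the hostname is illustrative) pulls the relevant lines out of `nodetool cfstats`:
+ {{{
+ import subprocess
+ 
+ out = subprocess.Popen(["nodetool", "-h", "10.0.0.1", "cfstats"],
+                        stdout=subprocess.PIPE).communicate()[0]
+ for line in out.splitlines():
+     # keep the per-Column Family count, latency and pending-task lines
+     if "Latency" in line or "Count" in line or "Pending" in line:
+         print line.strip()
+ }}}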
  
  You can also use jconsole, and the MBeans tab to look at PendingTasks for thread pools.
If you see one particular thread backing up, this can give you an indication of a problem.
One example would be ROW-MUTATION-STAGE indicating that write requests are arriving faster
than they can be handled. A more subtle example is the FLUSH stages: if these start backing
up, cassandra is accepting writes into memory fast enough, but the sort-and-write-to-disk
stages are falling behind.
  
@@ -240, +251 @@

  FLUSH-WRITER-POOL                 0         0            218
  HINTED-HANDOFF-POOL               0         0            154
  }}}
- 
  == Monitoring with MX4J ==
- mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 cassandra lets
you hook up mx4j very easily.
+ mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 cassandra lets
you hook up mx4j very easily. To enable mx4j on a Cassandra node:
- To enable mx4j on a Cassandra node:
+ 
   * Download mx4j-tools.jar from http://mx4j.sourceforge.net/
   * Add mx4j-tools.jar to the classpath (e.g. under lib/)
   * Start cassandra
-  * In the log you should see a message such as Http``Atapter started on port 8081
+  * In the log you should see a message such as HttpAdaptor started on port 8081
   * To choose a different port (8081 is the default) or a different listen address (0.0.0.0
is not the default) edit conf/cassandra-env.sh and uncomment #MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0"
and #MX4J_PORT="-Dmx4jport=8081"
  
  Now browse to http://cassandra:8081/ and use the HTML interface.
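+ 
+ A quick way to confirm the adaptor is answering (the address and port are the defaults
mentioned above):
+ {{{
+ import urllib2
+ 
+ # fetch the MX4J HTML console; an exception here means it is not reachable
+ print urllib2.urlopen("http://cassandra:8081/", timeout=5).getcode()
+ }}}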
