From: Apache Wiki
To: Apache Wiki
Date: Tue, 17 May 2011 15:50:40 -0000
Message-ID: <20110517155040.1287.7750@eos.apache.org>
Subject: [Cassandra Wiki] Update of "Operations" by JonathanEllis

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "Operations" page has been changed by JonathanEllis.
The comment on this change is: add NTS notes to token selection.
http://wiki.apache.org/cassandra/Operations?action=diff&rev1=89&rev2=90

--------------------------------------------------

=== Token selection ===
Using a strong hash function means !RandomPartitioner keys will, on average, be evenly spread across the Token space, but you can still have imbalances if your Tokens do not divide up the range evenly, so you should specify !InitialToken to your first nodes as `i * (2**127 / N)` for i = 0 .. N-1. In Cassandra 0.7, you should specify `initial_token` in `cassandra.yaml`.

+ With !NetworkTopologyStrategy, you should calculate the tokens for the nodes in each DC independently. Tokens still need to be unique, so you can add 1 to the tokens in the 2nd DC, add 2 in the 3rd, and so on. Thus, for a 4-node cluster in 2 datacenters, you would have
+ {{{
+ DC1
+ node 1 = 0
+ node 2 = 85070591730234615865843651857942052864
+
+ DC2
+ node 1 = 1
+ node 2 = 85070591730234615865843651857942052865
+ }}}
+
With order preserving partitioners, your key distribution will be application-dependent. You should still take your best guess at specifying initial tokens (guided by sampling actual data, if possible), but you will be more dependent on active load balancing (see below) and/or adding new nodes to hot spots.

Once data is placed on the cluster, the partitioner may not be changed without wiping and starting over.
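As a rough sketch of the arithmetic above, assuming the same two-datacenter, two-nodes-per-DC layout as the example table (adjust the counts to your own topology):

{{{
# Evenly spaced RandomPartitioner tokens, offset by a small per-datacenter
# constant so every token in the cluster stays unique.
def tokens_for_dc(nodes_per_dc, dc_offset):
    return [(2 ** 127 // nodes_per_dc) * i + dc_offset for i in range(nodes_per_dc)]

for dc in range(2):
    for i, token in enumerate(tokens_for_dc(2, dc)):
        print "DC%d node %d = %d" % (dc + 1, i + 1, token)
}}}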
@@ -40, +51 @@
Replication factor is not really intended to be changed in a live cluster either, but increasing it is conceptually simple: update the replication_factor from the CLI (see below), then run repair against each node in your cluster so that all the new replicas that are supposed to have the data, actually do.

Until repair is finished, you have 3 options:
+
 * read at ConsistencyLevel.QUORUM or ALL (depending on your existing replication factor) to make sure that a replica that actually has the data is consulted
 * continue reading at lower CL, accepting that some requests will fail (usually only the first for a given query, if ReadRepair is enabled)
 * take downtime while repair runs

@@ -49, +61 @@
Reducing replication factor is easily done and only requires running cleanup afterwards to remove extra replicas.

To update the replication factor on a live cluster, forget about cassandra.yaml. Rather you want to use '''cassandra-cli''':
+
- update keyspace Keyspace1 with replication_factor = 3;
+ . update keyspace Keyspace1 with replication_factor = 3;

=== Network topology ===
Besides datacenters, you can also tell Cassandra which nodes are in the same rack within a datacenter. Cassandra will use this to route both reads and data movement for Range changes to the nearest replicas. This is configured by a user-pluggable !EndpointSnitch class in the configuration file.

@@ -97, +110 @@

Here's a Python program which can be used to calculate new tokens for the nodes. There's more info on the subject at Ben Black's presentation at Cassandra Summit 2010. http://www.datastax.com/blog/slides-and-videos-cassandra-summit-2010

- def tokens(nodes):
+ . def tokens(nodes):
-     for x in xrange(nodes):
+ .     for x in xrange(nodes):
-         print 2 ** 127 / nodes * x
+ .         print 2 ** 127 / nodes * x

- In versions of Cassandra 0.7.* and lower, there's also `nodetool loadbalance`: essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap. You should not use this as it doesn't rebalance the entire ring.
+ In versions of Cassandra 0.7.* and lower, there's also `nodetool loadbalance`: essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap. You should not use this as it doesn't rebalance the entire ring.

- The status of move and balancing operations can be monitored using `nodetool` with the `netstats` argument.
+ The status of move and balancing operations can be monitored using `nodetool` with the `netstats` argument. (Cassandra 0.6.* and lower use the `streams` argument).
- (Cassandra 0.6.* and lower use the `streams` argument).

== Consistency ==
Cassandra allows clients to specify the desired consistency level on reads and writes. (See [[API]].) If R + W > N, where R, W, and N are respectively the read replica count, the write replica count, and the replication factor, all client reads will see the most recent write. Otherwise, readers '''may''' see older versions, for periods of typically a few ms; this is called "eventual consistency." See http://www.allthingsdistributed.com/2008/12/eventually_consistent.html and http://queue.acm.org/detail.cfm?id=1466448 for more.
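To make the R + W > N rule concrete, a small sketch, assuming a replication factor of 3 and taking QUORUM as N/2 + 1, showing which read/write level combinations guarantee that a read overlaps the most recent write:

{{{
# R + W > N means at least one replica consulted on read has already
# acknowledged the most recent write.
N = 3  # assumed replication factor
level = {"ONE": 1, "QUORUM": N // 2 + 1, "ALL": N}

for w in ("ONE", "QUORUM", "ALL"):
    for r in ("ONE", "QUORUM", "ALL"):
        strong = level[r] + level[w] > N
        print "W=%-6s R=%-6s -> %s" % (w, r, "consistent" if strong else "eventual")
}}}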
@@ -115, +127 @@
Cassandra repairs data in two ways:

 1. Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas.
- 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree for each range of data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is very network efficient).
+ 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree for each range of data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is very network efficient).

- Running `nodetool repair`:
- Like all nodetool operations in 0.7, repair is blocking: it will wait for the repair to finish and then exit. This may take a long time on large data sets.
+ Running `nodetool repair`: Like all nodetool operations in 0.7, repair is blocking: it will wait for the repair to finish and then exit. This may take a long time on large data sets.

It is safe to run repair against multiple machines at the same time, but to minimize the impact on your application workload it is recommended to wait for it to complete on one node before invoking it against the next.

=== Frequency of nodetool repair ===
-
- Unless your application performs no deletes, it is vital that production clusters run `nodetool repair` periodically on all nodes in the cluster. The hard requirement for repair frequency is the value used for GCGraceSeconds (see [[DistributedDeletes]]). Running nodetool repair often enough to guarantee that all nodes have performed a repair in a given period GCGraceSeconds long ensures that deletes are not "forgotten" in the cluster.
+ Unless your application performs no deletes, it is vital that production clusters run `nodetool repair` periodically on all nodes in the cluster. The hard requirement for repair frequency is the value used for GCGraceSeconds (see DistributedDeletes). Running nodetool repair often enough to guarantee that all nodes have performed a repair in a given period GCGraceSeconds long ensures that deletes are not "forgotten" in the cluster.

Consider how to schedule your repairs. A repair causes additional disk and CPU activity on the nodes participating in the repair, and it will typically be a good idea to spread repairs out over time so as to minimize the chances of repairs running concurrently on many nodes.
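One simple way to plan the schedule is to divide the GCGraceSeconds window by the number of nodes and stagger repairs accordingly; a rough sketch, assuming a four-node cluster, the default 10-day GCGraceSeconds, and an arbitrary 25% safety margin:

{{{
# Stagger per-node repairs so the whole cluster is repaired well before
# gc_grace_seconds elapses.  Node addresses and margin are illustrative.
gc_grace_seconds = 10 * 24 * 3600   # the default of 10 days
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
margin = 0.75                       # finish the cycle with 25% headroom

interval_hours = gc_grace_seconds * margin / len(nodes) / 3600
for i, node in enumerate(nodes):
    print "run `nodetool -h %s repair` at t + %.0f hours" % (node, i * interval_hours)
}}}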
==== Dealing with the consequences of nodetool repair not running within GCGraceSeconds ====
-
- If `nodetool repair` has not been run often enough to the point that GCGraceSeconds has passed, you risk forgotten deletes (see [[DistributedDeletes]]). In addition to data popping up that has been deleted, you may see inconsistencies in data returned from different nodes that will not self-heal by read-repair or further `nodetool repair`. Some further details on this latter effect are documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
+ If `nodetool repair` has not been run often enough to the point that GCGraceSeconds has passed, you risk forgotten deletes (see DistributedDeletes). In addition to data popping up that has been deleted, you may see inconsistencies in data returned from different nodes that will not self-heal by read-repair or further `nodetool repair`. Some further details on this latter effect are documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].

There are at least three ways to deal with this scenario.

 1. Treat the node in question as failed, and replace it as described further below.
- 2. To minimize the amount of forgotten deletes, first increase GCGraceSeconds across the cluster (rolling restart required), perform a full repair on all nodes, and then change GCGraceSeconds back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing the amount of data that may "pop back up" (forgotten delete).
+ 1. To minimize the amount of forgotten deletes, first increase GCGraceSeconds across the cluster (rolling restart required), perform a full repair on all nodes, and then change GCGraceSeconds back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing the amount of data that may "pop back up" (forgotten delete).
- 3. Yet another option, that will result in more forgotten deletes than the previous suggestion but is easier to do, is to ensure 'nodetool repair' has been run on all nodes, and then perform a compaction to expire tombstones. Following this, read-repair and regular `nodetool repair` should cause the cluster to converge.
+ 1. Yet another option, that will result in more forgotten deletes than the previous suggestion but is easier to do, is to ensure 'nodetool repair' has been run on all nodes, and then perform a compaction to expire tombstones. Following this, read-repair and regular `nodetool repair` should cause the cluster to converge.

=== Handling failure ===
If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to deal with any inconsistent data. Remember though that if a node misses updates and is not repaired for longer than your configured GCGraceSeconds (default: 10 days), it could have missed remove operations permanently. Unless your application performs no removes, you should wipe its data directory, re-bootstrap it, and removetoken its old entry in the ring (see below).

@@ -158, +167 @@
{{{
Exception in thread "main" java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
}}}
- This is caused by the operating system trying to allocate the child "ln" process a memory space as large as the parent process (the cassandra server), even though '''it's not going to use it'''. So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the cassandra server, it will fail during this because the operating system wants 12 GB of virtual memory before allowing you to create the process.

+ This is caused by the operating system trying to allocate the child "ln" process a memory space as large as the parent process (the cassandra server), even though '''it's not going to use it'''.
So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the cassandra server, it will fail during this because the operating system wants 12 GB of virtual memory before allowing you to create the process.

This error can be worked around by either:

@@ -167, +176 @@
OR

 * creating a swap file, snapshotting, removing swap file
+
OR
+
 * turning on "memory overcommit"

To restore a snapshot:

@@ -210, +221 @@

Running `nodetool cfstats` can provide an overview of each Column Family, and important metrics to graph your cluster. For folks who prefer not to deal with JMX clients, there is a JMX-to-REST bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/

- Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These metrics can also be exposed using any JMX client such as `jconsole`. (See also [[http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html]] for how to proxy JConsole to firewalled machines.)
+ Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These metrics can also be exposed using any JMX client such as `jconsole`. (See also http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html for how to proxy JConsole to firewalled machines.)

You can also use jconsole, and the MBeans tab to look at PendingTasks for thread pools. If you see one particular thread backing up, this can give you an indication of a problem. One example would be ROW-MUTATION-STAGE indicating that write requests are arriving faster than they can be handled. A more subtle example is the FLUSH stages: if these start backing up, cassandra is accepting writes into memory fast enough, but the sort-and-write-to-disk stages are falling behind.

@@ -240, +251 @@
FLUSH-WRITER-POOL                 0         0            218
HINTED-HANDOFF-POOL               0         0            154
}}}
-
== Monitoring with MX4J ==
- mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 cassandra lets you hook up mx4j very easily.
+ mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 cassandra lets you hook up mx4j very easily. To enable mx4j on a Cassandra node:
- To enable mx4j on a Cassandra node:
+
 * Download mx4j-tools.jar from http://mx4j.sourceforge.net/
 * Add mx4j-tools.jar to the classpath (e.g. under lib/)
 * Start cassandra
- * In the log you should see a message such as Http``Atapter started on port 8081
+ * In the log you should see a message such as HttpAdaptor started on port 8081
 * To choose a different port (8081 is the default) or a different listen address (0.0.0.0 is not the default) edit conf/cassandra-env.sh and uncomment #MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0" and #MX4J_PORT="-Dmx4jport=8081"

Now browse to http://cassandra:8081/ and use the HTML interface.
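Relating back to the PendingTasks discussion above, a quick non-JMX check is to parse `nodetool tpstats` output; a rough sketch, assuming the 0.7-style four-column layout (Pool Name, Active, Pending, Completed), a local node, and an arbitrary threshold:

{{{
# Flag thread pools whose Pending column exceeds a threshold.
import subprocess

HOST = "localhost"   # assumed node address
THRESHOLD = 10       # assumed: pending tasks before we complain

output = subprocess.Popen(["nodetool", "-h", HOST, "tpstats"],
                          stdout=subprocess.PIPE).communicate()[0]
for line in output.splitlines():
    parts = line.split()
    # data rows end with: <active> <pending> <completed>
    if len(parts) >= 4 and parts[-2].isdigit() and int(parts[-2]) > THRESHOLD:
        print "backlog in %s: %s pending" % (" ".join(parts[:-3]), parts[-2])
}}}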