From: Apache Wiki
To: Apache Wiki
Date: Tue, 17 May 2011 15:50:40 -0000
Message-ID: <20110517155040.1287.7750@eos.apache.org>
Subject: [Cassandra Wiki] Update of "Operations" by JonathanEllis

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "Operations" page has been changed by JonathanEllis.
The comment on this change is: add NTS notes to token selection.
http://wiki.apache.org/cassandra/Operations?action=diff&rev1=89&rev2=90

--------------------------------------------------

=== Token selection ===
Using a strong hash function means !RandomPartitioner keys will, on average, be evenly spread across the Token space, but you can still have imbalances if your Tokens do not divide up the range evenly, so you should specify !InitialToken to your first nodes as `i * (2**127 / N)` for i = 0 .. N-1. In Cassandra 0.7, you should specify `initial_token` in `cassandra.yaml`.

+ With !NetworkTopologyStrategy, you should calculate the tokens for the nodes in each DC independently. Tokens still need to be unique, so you can add 1 to the tokens in the 2nd DC, add 2 in the 3rd, and so on. Thus, for a 4-node cluster in 2 datacenters, you would have
+ {{{
+ DC1
+ node 1 = 0
+ node 2 = 85070591730234615865843651857942052864
+
+ DC2
+ node 1 = 1
+ node 2 = 85070591730234615865843651857942052865
+ }}}
+
With order preserving partitioners, your key distribution will be application-dependent. You should still take your best guess at specifying initial tokens (guided by sampling actual data, if possible), but you will be more dependent on active load balancing (see below) and/or adding new nodes to hot spots.

Once data is placed on the cluster, the partitioner may not be changed without wiping and starting over.
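As a rough sketch of the arithmetic above, assuming the same two-datacenter, two-nodes-per-DC layout as the example table (adjust the counts to your own topology):

{{{
# Evenly spaced RandomPartitioner tokens, offset by a small per-datacenter
# constant so every token in the cluster stays unique.
def tokens_for_dc(nodes_per_dc, dc_offset):
    return [(2 ** 127 // nodes_per_dc) * i + dc_offset for i in range(nodes_per_dc)]

for dc in range(2):
    for i, token in enumerate(tokens_for_dc(2, dc)):
        print "DC%d node %d = %d" % (dc + 1, i + 1, token)
}}}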
@@ -40, +51 @@
Replication factor is not really intended to be changed in a live cluster either, but increasing it is conceptually simple: update the replication_factor from the CLI (see below), then run repair against each node in your cluster so that all the new replicas that are supposed to have the data, actually do.

Until repair is finished, you have 3 options:
+
 * read at ConsistencyLevel.QUORUM or ALL (depending on your existing replication factor) to make sure that a replica that actually has the data is consulted
 * continue reading at lower CL, accepting that some requests will fail (usually only the first for a given query, if ReadRepair is enabled)
 * take downtime while repair runs

@@ -49, +61 @@
Reducing replication factor is easily done and only requires running cleanup afterwards to remove extra replicas.

To update the replication factor on a live cluster, forget about cassandra.yaml. Rather you want to use '''cassandra-cli''':
+
- update keyspace Keyspace1 with replication_factor = 3;
+ . update keyspace Keyspace1 with replication_factor = 3;

=== Network topology ===
Besides datacenters, you can also tell Cassandra which nodes are in the same rack within a datacenter. Cassandra will use this to route both reads and data movement for Range changes to the nearest replicas. This is configured by a user-pluggable !EndpointSnitch class in the configuration file.

@@ -97, +110 @@

Here's a Python program which can be used to calculate new tokens for the nodes. There's more info on the subject at Ben Black's presentation at Cassandra Summit 2010. http://www.datastax.com/blog/slides-and-videos-cassandra-summit-2010

- def tokens(nodes):
+ . def tokens(nodes):
-     for x in xrange(nodes):
+ .     for x in xrange(nodes):
-         print 2 ** 127 / nodes * x
+ .         print 2 ** 127 / nodes * x

- In versions of Cassandra 0.7.* and lower, there's also `nodetool loadbalance`: essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap. You should not use this as it doesn't rebalance the entire ring.
+ In versions of Cassandra 0.7.* and lower, there's also `nodetool loadbalance`: essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap. You should not use this as it doesn't rebalance the entire ring.

- The status of move and balancing operations can be monitored using `nodetool` with the `netstats` argument.
+ The status of move and balancing operations can be monitored using `nodetool` with the `netstats` argument. (Cassandra 0.6.* and lower use the `streams` argument).
- (Cassandra 0.6.* and lower use the `streams` argument).

== Consistency ==
Cassandra allows clients to specify the desired consistency level on reads and writes. (See [[API]].) If R + W > N, where R, W, and N are respectively the read replica count, the write replica count, and the replication factor, all client reads will see the most recent write. Otherwise, readers '''may''' see older versions, for periods of typically a few ms; this is called "eventual consistency." See http://www.allthingsdistributed.com/2008/12/eventually_consistent.html and http://queue.acm.org/detail.cfm?id=1466448 for more.
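To make the R + W > N rule concrete, a small sketch, assuming a replication factor of 3 and taking QUORUM as N/2 + 1, showing which read/write level combinations guarantee that a read overlaps the most recent write:

{{{
# R + W > N means at least one replica consulted on read has already
# acknowledged the most recent write.
N = 3  # assumed replication factor
level = {"ONE": 1, "QUORUM": N // 2 + 1, "ALL": N}

for w in ("ONE", "QUORUM", "ALL"):
    for r in ("ONE", "QUORUM", "ALL"):
        strong = level[r] + level[w] > N
        print "W=%-6s R=%-6s -> %s" % (w, r, "consistent" if strong else "eventual")
}}}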
@@ -115, +127 @@
Cassandra repairs data in two ways:

 1. Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas.
- 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree for each range of data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is very network efficient).
+ 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree for each range of data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is very network efficient).

- Running `nodetool repair`:
- Like all nodetool operations in 0.7, repair is blocking: it will wait for the repair to finish and then exit. This may take a long time on large data sets.
+ Running `nodetool repair`: Like all nodetool operations in 0.7, repair is blocking: it will wait for the repair to finish and then exit. This may take a long time on large data sets.

It is safe to run repair against multiple machines at the same time, but to minimize the impact on your application workload it is recommended to wait for it to complete on one node before invoking it against the next.

=== Frequency of nodetool repair ===
-
- Unless your application performs no deletes, it is vital that production clusters run `nodetool repair` periodically on all nodes in the cluster. The hard requirement for repair frequency is the value used for GCGraceSeconds (see [[DistributedDeletes]]). Running nodetool repair often enough to guarantee that all nodes have performed a repair in a given period GCGraceSeconds long ensures that deletes are not "forgotten" in the cluster.
+ Unless your application performs no deletes, it is vital that production clusters run `nodetool repair` periodically on all nodes in the cluster. The hard requirement for repair frequency is the value used for GCGraceSeconds (see DistributedDeletes). Running nodetool repair often enough to guarantee that all nodes have performed a repair in a given period GCGraceSeconds long ensures that deletes are not "forgotten" in the cluster.

Consider how to schedule your repairs. A repair causes additional disk and CPU activity on the nodes participating in the repair, and it will typically be a good idea to spread repairs out over time so as to minimize the chances of repairs running concurrently on many nodes.
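One simple way to plan the schedule is to divide the GCGraceSeconds window by the number of nodes and stagger repairs accordingly; a rough sketch, assuming a four-node cluster, the default 10-day GCGraceSeconds, and an arbitrary 25% safety margin:

{{{
# Stagger per-node repairs so the whole cluster is repaired well before
# gc_grace_seconds elapses.  Node addresses and margin are illustrative.
gc_grace_seconds = 10 * 24 * 3600   # the default of 10 days
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
margin = 0.75                       # finish the cycle with 25% headroom

interval_hours = gc_grace_seconds * margin / len(nodes) / 3600
for i, node in enumerate(nodes):
    print "run `nodetool -h %s repair` at t + %.0f hours" % (node, i * interval_hours)
}}}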
==== Dealing with the consequences of nodetool repair not running within GCGraceSeconds ====
-
- If `nodetool repair` has not been run often enough to the point that GCGraceSeconds has passed, you risk forgotten deletes (see [[DistributedDeletes]]). In addition to data popping up that has been deleted, you may see inconsistencies in data returned from different nodes that will not self-heal by read-repair or further `nodetool repair`. Some further details on this latter effect are documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
+ If `nodetool repair` has not been run often enough to the point that GCGraceSeconds has passed, you risk forgotten deletes (see DistributedDeletes). In addition to data popping up that has been deleted, you may see inconsistencies in data returned from different nodes that will not self-heal by read-repair or further `nodetool repair`. Some further details on this latter effect are documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].

There are at least three ways to deal with this scenario.

 1. Treat the node in question as failed, and replace it as described further below.
- 2. To minimize the amount of forgotten deletes, first increase GCGraceSeconds across the cluster (rolling restart required), perform a full repair on all nodes, and then change GCGraceSeconds back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing the amount of data that may "pop back up" (forgotten delete).
+ 1. To minimize the amount of forgotten deletes, first increase GCGraceSeconds across the cluster (rolling restart required), perform a full repair on all nodes, and then change GCGraceSeconds back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing the amount of data that may "pop back up" (forgotten delete).
- 3. Yet another option, that will result in more forgotten deletes than the previous suggestion but is easier to do, is to ensure 'nodetool repair' has been run on all nodes, and then perform a compaction to expire tombstones. Following this, read-repair and regular `nodetool repair` should cause the cluster to converge.
+ 1. Yet another option, that will result in more forgotten deletes than the previous suggestion but is easier to do, is to ensure 'nodetool repair' has been run on all nodes, and then perform a compaction to expire tombstones. Following this, read-repair and regular `nodetool repair` should cause the cluster to converge.

=== Handling failure ===
If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to deal with any inconsistent data. Remember though that if a node misses updates and is not repaired for longer than your configured GCGraceSeconds (default: 10 days), it could have missed remove operations permanently. Unless your application performs no removes, you should wipe its data directory, re-bootstrap it, and removetoken its old entry in the ring (see below).

@@ -158, +167 @@
{{{
Exception in thread "main" java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
}}}
- This is caused by the operating system trying to allocate the child "ln" process a memory space as large as the parent process (the cassandra server), even though '''it's not going to use it'''. So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the cassandra server, it will fail during this because the operating system wants 12 GB of virtual memory before allowing you to create the process.

+ This is caused by the operating system trying to allocate the child "ln" process a memory space as large as the parent process (the cassandra server), even though '''it's not going to use it'''.
So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the cassandra server, it will fail during this because the operating system wants 12 GB of virtual memory before allowing you to create the process.

This error can be worked around by either:

@@ -167, +176 @@
OR

 * creating a swap file, snapshotting, removing swap file
+
OR
+
 * turning on "memory overcommit"

To restore a snapshot:

@@ -210, +221 @@

Running `nodetool cfstats` can provide an overview of each Column Family, and important metrics to graph your cluster. For folks who prefer not to deal with JMX clients, there is a JMX-to-REST bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/

- Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These metrics can also be exposed using any JMX client such as `jconsole`. (See also [[http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html]] for how to proxy JConsole to firewalled machines.)
+ Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These metrics can also be exposed using any JMX client such as `jconsole`. (See also http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html for how to proxy JConsole to firewalled machines.)

You can also use jconsole, and the MBeans tab to look at PendingTasks for thread pools. If you see one particular thread backing up, this can give you an indication of a problem. One example would be ROW-MUTATION-STAGE indicating that write requests are arriving faster than they can be handled. A more subtle example is the FLUSH stages: if these start backing up, cassandra is accepting writes into memory fast enough, but the sort-and-write-to-disk stages are falling behind.

@@ -240, +251 @@
FLUSH-WRITER-POOL                 0         0            218
HINTED-HANDOFF-POOL               0         0            154
}}}
-
== Monitoring with MX4J ==
- mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 cassandra lets you hook up mx4j very easily.
+ mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 cassandra lets you hook up mx4j very easily. To enable mx4j on a Cassandra node:
- To enable mx4j on a Cassandra node:
+
 * Download mx4j-tools.jar from http://mx4j.sourceforge.net/
 * Add mx4j-tools.jar to the classpath (e.g. under lib/)
 * Start cassandra
- * In the log you should see a message such as Http``Atapter started on port 8081
+ * In the log you should see a message such as HttpAdaptor started on port 8081
 * To choose a different port (8081 is the default) or a different listen address (0.0.0.0 is not the default) edit conf/cassandra-env.sh and uncomment #MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0" and #MX4J_PORT="-Dmx4jport=8081"

Now browse to http://cassandra:8081/ and use the HTML interface.
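Relating back to the PendingTasks discussion above, a quick non-JMX check is to parse `nodetool tpstats` output; a rough sketch, assuming the 0.7-style four-column layout (Pool Name, Active, Pending, Completed), a local node, and an arbitrary threshold:

{{{
# Flag thread pools whose Pending column exceeds a threshold.
import subprocess

HOST = "localhost"   # assumed node address
THRESHOLD = 10       # assumed: pending tasks before we complain

output = subprocess.Popen(["nodetool", "-h", HOST, "tpstats"],
                          stdout=subprocess.PIPE).communicate()[0]
for line in output.splitlines():
    parts = line.split()
    # data rows end with: <active> <pending> <completed>
    if len(parts) >= 4 and parts[-2].isdigit() and int(parts[-2]) > THRESHOLD:
        print "backlog in %s: %s pending" % (" ".join(parts[:-3]), parts[-2])
}}}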