From: David Hawthorne
Subject: Re: ReplicateOnWrite issues
Date: Tue, 12 Jul 2011 00:37:42 -0700
To: user@cassandra.apache.org

It's definitely for counters, and some of the rows I'm inserting are long-ish, if 1.3MB is long.

Maybe it would help if I said I was using counter super columns. I'm also writing to only a handful of rows at a time, until they are full. It looks like the counter super column code in addReadCommandFromColumnFamily, in the ReplicateOnWrite handling, might not be optimized for this kind of load.

It looks like I may be able to turn off replicate_on_write for the CFs (as long as I use CL.ONE) to get around this in the short term, but I'd like to know what happens when replicate_on_write is true versus when it is false so I can make a more informed choice.
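If I do go that route, I assume it's just a schema change from cassandra-cli, something along these lines (keyspace and CF names are made up here; correct me if replicate_on_write isn't settable this way in 0.8.1):

    use MyKeyspace;
    update column family MyCounterCF with replicate_on_write = false;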

On Jul 12, 2011, at 12:20 AM, Yang wrote:

> interesting,
>
> first, just to make sure: since replicateOnWrite is for counters, you
> are using counters (you use the word "insert" instead of
> "add/increment"), right?
>
> if you are using counters, supposedly the leader runs
> replicateOnWrite; if somehow all your adds find the one box as leader,
> that's probably something worth investigating.
>
> finally, you could try debugging into those replicateOnWrite tasks and
> see what they are doing exactly; maybe there is something immediately
> wrong
>
>
> On Mon, Jul 11, 2011 at 11:57 PM, David Hawthorne wrote:
>> This is now my fourth attempt to get the message through. Apologies if you see multiple copies.
>>
>> I've tried to give as much relevant data as I can think of, but please let me know if you need any other info. I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want. I can also re-run the tests to replicate the problem if need be. As always, thanks for taking a look!
>>
>> I have a 4-host test cluster running 0.8.1 that I'm writing counters into with hector at CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results). The partitioner is the RandomPartitioner, and RF=3. nodetool ring shows that the actual data is well balanced:
>>
>> Address         DC          Rack        Status State   Load            Owns    Token
>>                                                                                127605887595351923798765477786913079296
>> 10.0.0.57       datacenter1 rack1       Up     Normal  1.59 GB         25.00%  0
>> 10.0.0.56       datacenter1 rack1       Up     Normal  1.39 GB         25.00%  42535295865117307932921825928971026432
>> 10.0.0.55       datacenter1 rack1       Up     Normal  1.24 GB         25.00%  85070591730234615865843651857942052864
>> 10.0.0.54       datacenter1 rack1       Up     Normal  1.34 GB         25.00%  127605887595351923798765477786913079296
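>>
>> In case it helps to see the shape of the load, the client is essentially a Hector mutator loop incrementing counters in a handful of rows. Here's a stripped-down sketch of that access pattern (not the actual test code: the cluster/keyspace/CF names are placeholders, and it uses plain counter columns for brevity even though the real schema uses counter super columns):
>>
>> import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
>> import me.prettyprint.cassandra.serializers.StringSerializer;
>> import me.prettyprint.hector.api.Cluster;
>> import me.prettyprint.hector.api.HConsistencyLevel;
>> import me.prettyprint.hector.api.Keyspace;
>> import me.prettyprint.hector.api.factory.HFactory;
>> import me.prettyprint.hector.api.mutation.Mutator;
>>
>> public class CounterLoadSketch {
>>     public static void main(String[] args) {
>>         // placeholder cluster name and contact host
>>         Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "10.0.0.57:9160");
>>
>>         // write at CL.ONE (the test was also run at QUORUM)
>>         ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
>>         ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
>>
>>         // placeholder keyspace and column family names
>>         Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster, ccl);
>>         Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());
>>
>>         // hammer a handful of long rows with counter increments
>>         for (long i = 0; i < 1000000; i++) {
>>             String rowKey = "row-" + (i % 4);
>>             mutator.insertCounter(rowKey, "MyCounterCF",
>>                     HFactory.createCounterColumn("col-" + i, 1L, StringSerializer.get()));
>>         }
>>
>>         HFactory.shutdownCluster(cluster);
>>     }
>> }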
>>
>> I also tested against a 1-box test cluster with RF=1.
>>
>> Both show the same results: high performance for a while, and then ReplicateOnWrite backs way up (I've seen it as high as a million pending tasks), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions. High performance in this case equates to about 5-20k inserts/sec on both the 4-box and 1-box clusters (I'm inserting one long row at a time, so performance is bounded by the one box owning that row). The cassandra process never seems to recover, even if left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks.
>>
>> GC doesn't seem to be a factor. Logs show GCs completing in < 1s on all 4 boxes in the test cluster, as well as on the 1 standalone server.
>>
>> I'm also seeing highly erratic performance from the cluster in general: inserts/sec usually start at around 3k, then maybe creep up towards 6k, then drop back down to below 1k for an extended period of time; maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't last more than 30 seconds.
>>
>> Here's the tpstats output from all 4 boxes:
>>
>> Slowest/most problematic/highest cpu usage one (this tpstats output is also representative of the one-box cluster about 20 minutes into my insert run):
>>
>> 4 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0       59095840
>> MutationStage                      0           0       31182942
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage             32      928042       28246716
>> GossipStage                        1           1         449464
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0            445
>> StreamStage                        0           0              0
>> FlushWriter                        0           0            445
>> FILEUTILS-DELETE-POOL              0           0             38
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
>>
>> Second highest cpu usage one:
>>
>> 3 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0       27413910
>> MutationStage                      0           0       27523094
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage              0           0       11111971
>> GossipStage                        0           0         335422
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0             66
>> StreamStage                        0           0              0
>> FlushWriter                        0           0             66
>> FILEUTILS-DELETE-POOL              0           0             31
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
>>
>> The other 2 that see minimal cpu usage:
>>
>> 2 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0        9213999
>> MutationStage                      0           0       33233114
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage              0           0              0
>> GossipStage                        0           0         336016
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0             77
>> StreamStage                        0           0              0
>> FlushWriter                        0           0             77
>> FILEUTILS-DELETE-POOL              0           0             40
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
>>
>> 1 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0       13796875
>> MutationStage                      0           0       36116662
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage              0           0              0
>> GossipStage                        0           0         336496
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0             91
>> StreamStage                        0           0              0
>> FlushWriter                        0           0             91
>> FILEUTILS-DELETE-POOL              0           0             43
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
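>>
>> If it's useful for whoever looks at this, the ReplicateOnWriteStage backlog above is easy to watch over time with a quick loop over the same nodetool command (just a convenience one-liner, sampling every 10 seconds):
>>
>>     while true; do ./nodetool -h localhost tpstats | grep ReplicateOnWriteStage; sleep 10; done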