Subject: Re: ReplicateOnWrite issues
From: David Hawthorne
Date: Tue, 12 Jul 2011 14:42:35 -0700
To: user@cassandra.apache.org

Well, I was using a large number of clients: I tried configuring a hector pool of 20-200 to see what effect that had on throughput. There's definitely a point after which there's no gain, so I dialed it back down.

To clarify a few other things: when I say inserts I mean increments, as this test only involves counters. When I say a row is full, I mean I've written all of the data into it that belongs in it from our application's perspective. Not all rows are the same length, and the average number of subcolumns per supercolumn is higher for some rows than for others.

I'm curious: is it necessary for a read to happen for replication when using RF=1?

I am currently running a test against a single-node cluster with RF=1, CL.ONE, and 4 connections/concurrent increments, and graphing various data points. If something interesting happens, I'll send a graph.

When you say stagger writes, do you mean just lower the overall number of increments/sec my client is attempting, or do you mean I should stagger the load amongst many CFs/rows at once? Right now I'm writing heavily into a small set of rows, each going into 3 CFs, before moving on to the next small set of rows.
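
For reference, the write path on my side boils down to something like the sketch below (Hector 0.8.x API; the cluster name, "Counters" keyspace, "Counts" CF, and row/column names are placeholders rather than our real schema, and the pool size is the knob I was varying):

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class CounterLoad {
        public static void main(String[] args) {
            // Bounded connection pool -- this is the 20-200 knob I was varying.
            CassandraHostConfigurator hosts = new CassandraHostConfigurator("10.0.0.54:9160");
            hosts.setMaxActive(20);
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", hosts);

            // All reads/writes at CL.ONE for this run.
            ConfigurableConsistencyLevel cl = new ConfigurableConsistencyLevel();
            cl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
            cl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);

            // "Counters"/"Counts" are placeholder keyspace/CF names.
            Keyspace keyspace = HFactory.createKeyspace("Counters", cluster, cl);
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

            // Hammer one long row with counter increments before moving on to the next row.
            for (long i = 0; i < 1000000; i++) {
                mutator.incrementCounter("row-0001", "Counts", "col-" + (i % 1000), 1L);
            }

            HFactory.shutdownCluster(cluster);
        }
    }
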
On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:

> When you do a counter increment at CL.ONE, the write is acknowledged as
> soon as the first replica getting the write has pushed the increment
> into its memtable. However, there is a read happening for the
> replication to the other replicas (this is necessary to the counter
> design). What is happening is that your nodes are not able to do those
> reads fast enough. Or, another way to see it is that you are writing
> faster than your cluster can handle.
>
> I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
> something about that, but for now you'll have to stagger your writes.
> I'm more surprised that tasks on the replicate on write stage would
> stack up at QUORUM, unless you're using a very large number of clients.
>
> --
> Sylvain
>
> On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne wrote:
>
>> This is now my fourth attempt to get the message through. Apologies if
>> you see multiple copies.
>>
>> I've tried to give as much relevant data as I can think of, but please
>> let me know if you need any other info. I spent the day getting
>> jmxtrans to talk to statsd with the cassandra JMX data, so I can
>> provide graphs of just about anything you want. I can also re-run the
>> tests to replicate the problem if need be. As always, thanks for
>> taking a look!
>>
>> I have a 4-host test cluster running 0.8.1 that I'm writing counters
>> into with hector at CL.ONE or CL.QUORUM (the test was run with each,
>> and both gave the same results). Partitioner is the RandomPartitioner,
>> and RF=3. nodetool ring shows that the actual data is well balanced:
>>
>> Address    DC           Rack   Status  State   Load     Owns    Token
>>                                                                 127605887595351923798765477786913079296
>> 10.0.0.57  datacenter1  rack1  Up      Normal  1.59 GB  25.00%  0
>> 10.0.0.56  datacenter1  rack1  Up      Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
>> 10.0.0.55  datacenter1  rack1  Up      Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
>> 10.0.0.54  datacenter1  rack1  Up      Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>>
>> I also tested against a 1-box test cluster with RF=1.
>>
>> Both show the same results: high performance for a while, and then
>> ReplicateOnWrite backs way up (I've seen it as high as a million
>> pending tasks), the cassandra process becomes unresponsive, and the
>> hector client starts throwing exceptions. High performance in this
>> case equates to about 5-20k inserts/sec on both the 4-box and 1-box
>> clusters (I'm inserting one long row at a time, so performance is
>> bounded by the one box owning the row). The cassandra process never
>> seems to recover, even if left for 24 hours; it's still chewing
>> through those ReplicateOnWrite pending tasks.
>>
>> GC doesn't seem to be a factor. Logs show GCs completing in < 1s on
>> all 4 boxes in the test cluster, as well as on the 1 standalone server.
>>
>> I'm also seeing highly erratic performance from the cluster in
>> general: inserts/sec usually start at around 3k, then maybe creep up
>> towards 6k, then drop back down to below 1k for an extended period of
>> time; maybe 1% of the time they'll spike up to 8k or even 15k, but
>> that doesn't last more than 30 seconds.
>>
>> Here's the tpstats output from all 4 boxes:
>>
>> Slowest/most problematic/highest cpu usage one (this tpstats output is
>> also representative of the one-box cluster about 20 minutes into my
>> insert run):
>>
>> 4 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0       59095840
>> MutationStage                  0         0       31182942
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage         32    928042       28246716
>> GossipStage                    1         1         449464
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0            445
>> StreamStage                    0         0              0
>> FlushWriter                    0         0            445
>> FILEUTILS-DELETE-POOL          0         0             38
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
>>
>> Second highest cpu usage one:
>>
>> 3 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0       27413910
>> MutationStage                  0         0       27523094
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage          0         0       11111971
>> GossipStage                    0         0         335422
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0             66
>> StreamStage                    0         0              0
>> FlushWriter                    0         0             66
>> FILEUTILS-DELETE-POOL          0         0             31
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
>>
>> The other 2 that see minimal cpu usage:
>>
>> 2 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0        9213999
>> MutationStage                  0         0       33233114
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage          0         0              0
>> GossipStage                    0         0         336016
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0             77
>> StreamStage                    0         0              0
>> FlushWriter                    0         0             77
>> FILEUTILS-DELETE-POOL          0         0             40
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
>>
>> 1 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0       13796875
>> MutationStage                  0         0       36116662
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage          0         0              0
>> GossipStage                    0         0         336496
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0             91
>> StreamStage                    0         0              0
>> FlushWriter                    0         0             91
>> FILEUTILS-DELETE-POOL          0         0             43
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
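
In case it helps frame my question above: the simplest reading of "stagger your writes" is to cap the client-side increment rate. A rough sketch of what that could look like on the Hector side (the rate, batch size, and CF/row/column names are placeholders, and the right ceiling would have to be found by watching the ReplicateOnWriteStage pending count):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class StaggeredCounterLoad {

        // Issue increments in fixed-size batches and sleep between batches so the
        // client never exceeds targetPerSecond. Tune targetPerSecond down until
        // ReplicateOnWriteStage pending stops growing.
        public static void incrementAtFixedRate(Keyspace keyspace, String cf, String rowKey,
                                                long total, int targetPerSecond)
                throws InterruptedException {
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
            int batchSize = Math.min(targetPerSecond, 500);
            long sleepMillis = 1000L * batchSize / targetPerSecond;

            long done = 0;
            while (done < total) {
                for (int i = 0; i < batchSize && done < total; i++, done++) {
                    mutator.incrementCounter(rowKey, cf, "col-" + (done % 1000), 1L);
                }
                Thread.sleep(sleepMillis); // crude pacing: one batch per sleep interval
            }
        }
    }
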
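And since I mentioned jmxtrans/statsd: the pending count that nodetool tpstats prints can also be polled directly over JMX. A minimal sketch, assuming the stage MBean is named org.apache.cassandra.request:type=ReplicateOnWriteStage as in 0.8 and that JMX is on the default port 7199 (adjust both if your setup differs):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ReplicateOnWriteBacklog {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";

            // Default Cassandra JMX port in 0.8; adjust if cassandra-env.sh says otherwise.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Assumed MBean name for the replicate-on-write thread pool.
            ObjectName stage = new ObjectName(
                    "org.apache.cassandra.request:type=ReplicateOnWriteStage");

            // Poll the same counters that nodetool tpstats prints.
            while (true) {
                Object active = mbs.getAttribute(stage, "ActiveCount");
                Object pending = mbs.getAttribute(stage, "PendingTasks");
                Object completed = mbs.getAttribute(stage, "CompletedTasks");
                System.out.println("active=" + active
                        + " pending=" + pending
                        + " completed=" + completed);
                Thread.sleep(5000);
            }
        }
    }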