Date: Wed, 28 Nov 2012 11:41:56 -0800 (PST)
From: Sergey Olefir
To: cassandra-user@incubator.apache.org
Subject: Re: counters + replication = awful performance?

Well, that is sad news then. I don't think I can consider 20k increments
per second for a two-node cluster (with RF=2) reasonable performance
(cost vs. benefit). I might have to look into other storage solutions, or
perhaps experiment with duplicate clusters with RF=1 or with
replicate_on_write=false.

Although yes, I probably should try the row cache you mentioned. I saw
that the key cache was going unused (so I saw no reason to enable the row
cache), but I think that was on RF=1; it might be different on RF=2.


Sylvain Lebresne-3 wrote
> Counter replication works differently from that of "normal" writes.
> Namely, a counter update is written to a first replica, then a read is
> performed and the result of that read is replicated to the other nodes.
> With RF=1, since there is only one replica, no read is involved, but in
> a way it's a degenerate case. So there are two reasons why RF>1 is much
> slower than RF=1:
> 1) It involves a read to replicate, and that read takes time.
>    Especially if that read hits the disk, it may even dominate the
>    insertion time.
> 2) The replication to the first replica and the replication to the rest
>    of the replicas are not done in parallel but sequentially. Note that
>    this is only true for the first replica versus the others. In other
>    words, from RF=2 to RF=3 you should not see a significant performance
>    degradation.
>
> Note that while there is nothing you can do about 2), you can try to
> speed up 1) by using the row cache, for instance (in case you weren't).
>
> In other words, with counters, it is expected that RF=1 is potentially
> much faster than RF>1. That is the way counters work.
>
> And don't get me wrong, I'm not suggesting you should use RF=1 at all.
> What I am saying is that the performance you see with RF=2 is the
> performance of counters in Cassandra.
>
> --
> Sylvain
>
>
> On Wed, Nov 28, 2012 at 7:34 AM, Sergey Olefir <solf.lists@> wrote:
>
>> I think there might be a misunderstanding as to the nature of the
>> problem.
>>
>> Say I have test set T, and I have two identical servers A and B.
>> - I tested that server A (alone) is able to handle the load of T.
>> - I tested that server B (alone) is able to handle the load of T.
>> - I then join A and B in a cluster and set replication=2. Because
>>   there are two servers and replication=2, each server effectively
>>   has to handle all the data written to the cluster, i.e. the full
>>   test load. Under these circumstances it is reasonable to assume
>>   that cluster A+B will be able to handle load T, because each server
>>   is able to do so individually.
>>
>> HOWEVER, this is not the case. In fact, A+B together are only able to
>> handle less than 1/3 of T DESPITE the fact that A and B individually
>> are able to handle T just fine.
>>
>> I think there's something wrong with Cassandra replication (possibly
>> as simple as me misconfiguring something) -- it shouldn't be three
>> times faster to write to two separate nodes in parallel than to write
>> to a 2-node Cassandra cluster with replication=2.
>>
>>
>> Edward Capriolo wrote
>> > Say you are doing 100 inserts at RF=1 on two nodes. That is 50
>> > inserts per node. If you go to RF=2, that is 100 inserts per node.
>> > If you were at 75% capacity on each node, you're now at 150%, which
>> > is not possible, so things bog down.
>> >
>> > To figure out what is going on we would need to see tpstats,
>> > iostat, and top information.
>> >
>> > I think you're looking at the performance the wrong way. Starting
>> > off at RF=1 is not the way to understand Cassandra performance.
>> >
>> > You do not get the benefits of "scale out" until you fix your RF
>> > and increase your node count, i.e. 5 nodes at RF=3 is fast, 10
>> > nodes at RF=3 even better.
>> >
>> > On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
>> >> I already do a lot of in-memory aggregation before writing to
>> >> Cassandra.
>> >>
>> >> The question here is what is wrong with Cassandra (or its
>> >> configuration) that causes a huge performance drop when moving
>> >> from 1-replication to 2-replication for counters -- and, more
>> >> importantly, how to resolve the problem. A 2x-3x drop when moving
>> >> from 1-replication to 2-replication on two nodes is reasonable.
>> >> 6x is not. Like I said, with this kind of performance degradation
>> >> it makes more sense to run two clusters with replication=1 in
>> >> parallel rather than rely on Cassandra replication.
>> >>
>> >> And yes, Rainbird was the inspiration for what we are trying to
>> >> do here :)
>> >>
>> >>
>> >>
>> >> Edward Capriolo wrote
>> >>> Cassandra's counters read on increment. Additionally, they are
>> >>> distributed, so there can be multiple reads per increment. If
>> >>> they are not fast enough and you have exhausted all tuning
>> >>> options, add more servers to handle the load.
>> >>>
>> >>> In many cases incrementing the same counter n times can be
>> >>> avoided.
>> >>>
>> >>> Twitter's Rainbird did just that. It avoided multiple counter
>> >>> increments by batching them.
>> >>>
>> >>> I have done a similar thing using Cassandra and Kafka.
>> >>>
>> >>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
>> >>>
>> >>>
>> >>> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
>> >>>> Hi, thanks for your suggestions.
>> >>>>
>> >>>> Regarding replicate=2 vs replicate=1 performance: I expected
>> >>>> that the configurations below would have similar performance:
>> >>>> - single node, replicate = 1
>> >>>> - two nodes, replicate = 2 (okay, this probably should be a bit
>> >>>>   slower due to additional overhead).
>> >>>>
>> >>>> However, what I'm seeing is that the second option (replicate=2)
>> >>>> is about THREE times slower than the single node.
>> >>>>
>> >>>>
>> >>>> Regarding replicate_on_write -- it is, in fact, a dangerous
>> >>>> option. As the JIRA discusses, if you make changes to your ring
>> >>>> (moving tokens and such) you will *silently* lose data. That is
>> >>>> on top of whatever data you might end up losing if you run
>> >>>> replicate_on_write=false and the only node that got the data
>> >>>> fails.
>> >>>>
>> >>>> But what is much worse -- with replicate_on_write being false
>> >>>> the data will NOT be replicated (in my tests) ever unless you
>> >>>> explicitly request the cell. Then it will return the wrong
>> >>>> result, and only on subsequent reads will it return adequate
>> >>>> results. I haven't tested it, but the documentation states that
>> >>>> a range query will NOT do 'read repair' and thus will not force
>> >>>> replication.
>> >>>> The test I did went like this:
>> >>>> - replicate_on_write = false
>> >>>> - write something to node A (which should in theory replicate
>> >>>>   to node B)
>> >>>> - wait for a long time (longest was on the order of 5 hours)
>> >>>> - read from node B (and here I was getting null / wrong result)
>> >>>> - read from node B again (here you get what you'd expect after
>> >>>>   read repair)
>> >>>>
>> >>>> In essence, using replicate_on_write=false with rarely read
>> >>>> data will practically defeat the purpose of having replication
>> >>>> in the first place (failover, data redundancy).
>> >>>>
>> >>>>
>> >>>> Or, in other words, this option doesn't look to be applicable
>> >>>> to my situation.
>> >>>>
>> >>>> It looks like I will get much better performance by simply
>> >>>> writing to two separate clusters rather than using a single
>> >>>> cluster with replicate=2. Which is kind of stupid :) I think
>> >>>> something's fishy with counters and replication.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Edward Capriolo wrote
>> >>>>> I misspoke, really. It is not dangerous; you just have to
>> >>>>> understand what it means. This JIRA discusses it.
>> >>>>>
>> >>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
>> >>>>>
>> >>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@> wrote:
>> >>>>>
>> >>>>>> We're having a similar performance problem. Setting
>> >>>>>> 'replicate_on_write: false' fixes the performance issue in
>> >>>>>> our tests.
>> >>>>>>
>> >>>>>> How dangerous is it? What exactly could go wrong?
>> >>>>>>
>> >>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>> >>>>>>
>> >>>>>> The difference between replication factor = 1 and
>> >>>>>> replication factor > 1 is significant. Also, it sounds like
>> >>>>>> your cluster is 2 nodes, so going from RF=1 to RF=2 means
>> >>>>>> double the load on both nodes.
>> >>>>>>
>> >>>>>> You may want to experiment with the very dangerous column
>> >>>>>> family attribute:
>> >>>>>>
>> >>>>>> - replicate_on_write: Replicate every counter update from
>> >>>>>>   the leader to the follower replicas. Accepts the values
>> >>>>>>   true and false.
>> >>>>>>
>> >>>>>> Edward
>> >>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjellman@> wrote:
>> >>>>>>
>> >>>>>>> Are you writing with QUORUM consistency or ONE?
>> >>>>>>>
>> >>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.lists@> wrote:
>> >>>>>>>
>> >>>>>>> >Hi Juan,
>> >>>>> cassandra-user@.apache
>> >>> mailing list archive at Nabble.com.
>> >>>>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584014.html
>> >> Sent from the cassandra-user@.apache mailing list archive at
>> >> Nabble.com.
>> >>
>>
>>
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584031.html
>> Sent from the cassandra-user@.apache mailing list archive at
>> Nabble.com.
>>

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584052.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
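
For reference, the in-memory aggregation mentioned in this thread (batching
increments so the same counter is not incremented n times) might look
roughly like the sketch below. It is only an illustration under stated
assumptions: IncrementBuffer and write_delta are hypothetical names, and
write_delta stands in for whatever client call issues one counter update.
It is not the API of Cassandra, Rainbird, or IronCount.

```python
import threading
from collections import defaultdict


class IncrementBuffer:
    """Hypothetical helper: sums counter deltas in memory so that each
    key needs only one counter write per flush."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = defaultdict(int)

    def increment(self, key, delta=1):
        # Cheap in-memory add; nothing is sent to the database here.
        with self._lock:
            self._pending[key] += delta

    def flush(self, write_delta):
        # Swap the buffer out under the lock, then write outside it so
        # that slow writes do not block concurrent increment() calls.
        with self._lock:
            batch, self._pending = self._pending, defaultdict(int)
        for key, delta in batch.items():
            if delta != 0:
                # write_delta is an assumed callback: one counter
                # update per key, carrying the summed delta.
                write_delta(key, delta)
        # Number of distinct keys that were in this batch.
        return len(batch)


# Usage: 1001 increments collapse into two counter writes.
sent = []
buf = IncrementBuffer()
for _ in range(1000):
    buf.increment("page:/home")
buf.increment("page:/about", 5)
keys_flushed = buf.flush(lambda key, delta: sent.append((key, delta)))
```

The trade-off is durability: increments that are buffered but not yet
flushed are lost if the process dies, so the flush interval bounds how
much can go missing.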