Subject: Re: counters + replication = awful performance?
From: Edward Capriolo <edlinuxguru@gmail.com>
To: user@cassandra.apache.org
Cc: cassandra-user@incubator.apache.org
Date: Tue, 27 Nov 2012 20:26:03 -0500

Say you are doing 100 inserts at RF=1 on two nodes. That is 50 inserts per node. If you go to RF=2 that is 100 inserts per node. If you were at 75% capacity on each node you are now at 150%, which is not possible, so things bog down.

To figure out what is going on we would need to see nodetool tpstats, iostat, and top output.

I think you're looking at performance the wrong way. Starting off at RF=1 is not the way to understand Cassandra performance. The benefits of "scale out" don't show up until you fix your RF and increase your node count, i.e. 5 nodes at RF=3 is fast, 10 nodes at RF=3 is even better.
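As a back-of-the-envelope sketch of that arithmetic (illustrative numbers only, nothing measured from this thread):

    # Every client write is applied RF times, spread across the nodes
    # that own the row's replicas (assuming writes are evenly balanced).
    client_writes_per_sec = 100
    nodes = 2

    for rf in (1, 2):
        per_node = client_writes_per_sec * rf / nodes
        print("RF=%d -> %.0f writes/sec per node" % (rf, per_node))

    # RF=1 -> 50 writes/sec per node
    # RF=2 -> 100 writes/sec per node
    # If a node can only absorb ~75 writes/sec, RF=2 pushes it past capacity.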
On Tuesday, November 27, 2012, Sergey Olefir wrote:
> I already do a lot of in-memory aggregation before writing to Cassandra.
>
> The question here is what is wrong with Cassandra (or its configuration)
> that causes a huge performance drop when moving from 1-replication to
> 2-replication for counters -- and more importantly how to resolve the
> problem. A 2x-3x drop when moving from 1-replication to 2-replication on two
> nodes is reasonable. 6x is not. Like I said, with this kind of performance
> degradation it makes more sense to run two clusters with replication=1 in
> parallel rather than rely on Cassandra replication.
>
> And yes, Rainbird was the inspiration for what we are trying to do here :)
>
>
> Edward Capriolo wrote
>> Cassandra's counters read on increment. Additionally they are distributed,
>> so there can be multiple reads on increment. If they are not fast enough and
>> you have avoided all tuning options, add more servers to handle the load.
>>
>> In many cases incrementing the same counter n times can be avoided.
>>
>> Twitter's Rainbird did just that. It avoided multiple counter increments by
>> batching them.
>>
>> I have done a similar thing using Cassandra and Kafka.
>>
>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
>>
>>
>> On Tuesday, November 27, 2012, Sergey Olefir < solf.lists@ > wrote:
>>> Hi, thanks for your suggestions.
>>>
>>> Regarding replicate=2 vs replicate=1 performance: I expected that the below
>>> configurations would have similar performance:
>>> - single node, replicate = 1
>>> - two nodes, replicate = 2 (okay, this probably should be a bit slower due
>>> to additional overhead).
>>>
>>> However what I'm seeing is that the second option (replicate=2) is about
>>> THREE times slower than a single node.
>>>
>>>
>>> Regarding replicate_on_write -- it is, in fact, a dangerous option. As JIRA
>>> discusses, if you make changes to your ring (moving tokens and such) you
>>> will *silently* lose data. That is on top of whatever data you might end up
>>> losing if you run replicate_on_write=false and the only node that got the
>>> data fails.
>>>
>>> But what is much worse -- with replicate_on_write being false the data will
>>> NOT be replicated (in my tests) ever unless you explicitly request the cell.
>>> Then it will return the wrong result. And only on subsequent reads will it
>>> return adequate results. I haven't tested it, but documentation states that
>>> a range query will NOT do 'read repair' and thus will not force replication.
>>> The test I did went like this:
>>> - replicate_on_write = false
>>> - write something to node A (which should in theory replicate to node B)
>>> - wait for a long time (longest was on the order of 5 hours)
>>> - read from node B (and here I was getting null / wrong result)
>>> - read from node B again (here you get what you'd expect after read repair)
>>>
>>> In essence, using replicate_on_write=false with rarely read data will
>>> practically defeat the purpose of having replication in the first place
>>> (failover, data redundancy).
>>>
>>>
>>> Or, in other words, this option doesn't look to be applicable to my
>>> situation.
>>>
>>> It looks like I will get much better performance by simply writing to two
>>> separate clusters rather than using a single cluster with replicate=2. Which
>>> is kind of stupid :) I think something's fishy with counters and
>>> replication.
>>>
>>>
>>> Edward Capriolo wrote
>>>> I misspoke really. It is not dangerous, you just have to understand what it
>>>> means. This JIRA discusses it.
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
>>>>
>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay < scottm@ > wrote:
>>>>
>>>>> We're having a similar performance problem. Setting 'replicate_on_write:
>>>>> false' fixes the performance issue in our tests.
>>>>>
>>>>> How dangerous is it? What exactly could go wrong?
>>>>>
>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>>>>>
>>>>> The difference between replication factor = 1 and replication factor > 1 is
>>>>> significant. Also it sounds like your cluster is 2 nodes, so going from RF=1
>>>>> to RF=2 means double the load on both nodes.
>>>>>
>>>>> You may want to experiment with the very dangerous column family attribute:
>>>>>
>>>>> - replicate_on_write: Replicate every counter update from the leader to the
>>>>> follower replicas. Accepts the values true and false.
>>>>>
>>>>> Edward
>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman < mkjellman@ > wrote:
>>>>>
>>>>>> Are you writing with QUORUM consistency or ONE?
>>>>>>
>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" < solf.lists@ > wrote:
>>>>>>
>>>>>> >Hi Juan,
>
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584014.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
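The Rainbird/IronCount-style batching mentioned in the quoted thread comes down to aggregating counter deltas in memory and issuing one increment per counter per flush instead of one per event. A minimal sketch of the idea (hypothetical names, not code from IronCount; the actual Cassandra write is stubbed out):

    import time
    from collections import defaultdict

    class CounterBatcher:
        """Aggregate counter increments in memory and flush periodically,
        so N events for the same counter become a single +N update."""

        def __init__(self, flush_interval_secs=5):
            self.flush_interval_secs = flush_interval_secs
            self.pending = defaultdict(int)  # (row_key, column) -> accumulated delta
            self.last_flush = time.time()

        def increment(self, row_key, column, delta=1):
            self.pending[(row_key, column)] += delta
            if time.time() - self.last_flush >= self.flush_interval_secs:
                self.flush()

        def flush(self):
            for (row_key, column), delta in self.pending.items():
                # One counter update per aggregated key instead of per event.
                send_counter_update(row_key, column, delta)
            self.pending.clear()
            self.last_flush = time.time()

    def send_counter_update(row_key, column, delta):
        # Placeholder for the real client call (Hector, pycassa, etc.).
        print("increment %s[%s] by %d" % (row_key, column, delta))

The trade-off is similar to the one discussed above for replicate_on_write: any deltas still buffered when the process dies are lost, so the flush interval bounds how much you are willing to drop.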