Subject: Re: counters + replication = awful performance?
From: Edward Capriolo <edlinuxguru@gmail.com>
To: user@cassandra.apache.org
Cc: cassandra-user@incubator.apache.org
Date: Tue, 27 Nov 2012 20:26:03 -0500

Say you are doing 100 inserts at RF=1 on two nodes. That is 50 inserts per node. If you go to RF=2 that is 100 inserts per node. If you were at 75% capacity on each node you are now at 150%, which is not possible, so things bog down.

To figure out what is going on we would need to see nodetool tpstats, iostat, and top output.

I think you're looking at performance the wrong way. Starting off at RF=1 is not the way to understand Cassandra performance. The benefits of "scale out" don't show up until you fix your RF and increase your node count, i.e. 5 nodes at RF=3 is fast, 10 nodes at RF=3 is even better.
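As a back-of-the-envelope sketch of that arithmetic (illustrative numbers only, nothing measured from this thread):

    # Every client write is applied RF times, spread across the nodes
    # that own the row's replicas (assuming writes are evenly balanced).
    client_writes_per_sec = 100
    nodes = 2

    for rf in (1, 2):
        per_node = client_writes_per_sec * rf / nodes
        print("RF=%d -> %.0f writes/sec per node" % (rf, per_node))

    # RF=1 -> 50 writes/sec per node
    # RF=2 -> 100 writes/sec per node
    # If a node can only absorb ~75 writes/sec, RF=2 pushes it past capacity.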
On Tuesday, November 27, 2012, Sergey Olefir wrote:
> I already do a lot of in-memory aggregation before writing to Cassandra.
>
> The question here is what is wrong with Cassandra (or its configuration)
> that causes a huge performance drop when moving from 1-replication to
> 2-replication for counters -- and more importantly how to resolve the
> problem. A 2x-3x drop when moving from 1-replication to 2-replication on two
> nodes is reasonable. 6x is not. Like I said, with this kind of performance
> degradation it makes more sense to run two clusters with replication=1 in
> parallel rather than rely on Cassandra replication.
>
> And yes, Rainbird was the inspiration for what we are trying to do here :)
>
>
> Edward Capriolo wrote
>> Cassandra's counters read on increment. Additionally they are distributed,
>> so there can be multiple reads on increment. If they are not fast enough and
>> you have avoided all tuning options, add more servers to handle the load.
>>
>> In many cases incrementing the same counter n times can be avoided.
>>
>> Twitter's Rainbird did just that. It avoided multiple counter increments by
>> batching them.
>>
>> I have done a similar thing using Cassandra and Kafka.
>>
>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
>>
>>
>> On Tuesday, November 27, 2012, Sergey Olefir < solf.lists@ > wrote:
>>> Hi, thanks for your suggestions.
>>>
>>> Regarding replicate=2 vs replicate=1 performance: I expected that the below
>>> configurations would have similar performance:
>>> - single node, replicate = 1
>>> - two nodes, replicate = 2 (okay, this probably should be a bit slower due
>>> to additional overhead).
>>>
>>> However what I'm seeing is that the second option (replicate=2) is about
>>> THREE times slower than a single node.
>>>
>>>
>>> Regarding replicate_on_write -- it is, in fact, a dangerous option. As JIRA
>>> discusses, if you make changes to your ring (moving tokens and such) you
>>> will *silently* lose data. That is on top of whatever data you might end up
>>> losing if you run replicate_on_write=false and the only node that got the
>>> data fails.
>>>
>>> But what is much worse -- with replicate_on_write being false the data will
>>> NOT be replicated (in my tests) ever unless you explicitly request the cell.
>>> Then it will return the wrong result. And only on subsequent reads will it
>>> return adequate results. I haven't tested it, but documentation states that
>>> a range query will NOT do 'read repair' and thus will not force replication.
>>> The test I did went like this:
>>> - replicate_on_write = false
>>> - write something to node A (which should in theory replicate to node B)
>>> - wait for a long time (longest was on the order of 5 hours)
>>> - read from node B (and here I was getting null / wrong result)
>>> - read from node B again (here you get what you'd expect after read repair)
>>>
>>> In essence, using replicate_on_write=false with rarely read data will
>>> practically defeat the purpose of having replication in the first place
>>> (failover, data redundancy).
>>>
>>>
>>> Or, in other words, this option doesn't look to be applicable to my
>>> situation.
>>>
>>> It looks like I will get much better performance by simply writing to two
>>> separate clusters rather than using a single cluster with replicate=2. Which
>>> is kind of stupid :) I think something's fishy with counters and
>>> replication.
>>>
>>>
>>> Edward Capriolo wrote
>>>> I misspoke really. It is not dangerous, you just have to understand what it
>>>> means. This JIRA discusses it.
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
>>>>
>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay < scottm@ > wrote:
>>>>
>>>>> We're having a similar performance problem. Setting 'replicate_on_write:
>>>>> false' fixes the performance issue in our tests.
>>>>>
>>>>> How dangerous is it? What exactly could go wrong?
>>>>>
>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>>>>>
>>>>> The difference between replication factor = 1 and replication factor > 1 is
>>>>> significant. Also it sounds like your cluster is 2 nodes, so going from RF=1
>>>>> to RF=2 means double the load on both nodes.
>>>>>
>>>>> You may want to experiment with the very dangerous column family attribute:
>>>>>
>>>>> - replicate_on_write: Replicate every counter update from the leader to the
>>>>> follower replicas. Accepts the values true and false.
>>>>>
>>>>> Edward
>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman < mkjellman@ > wrote:
>>>>>
>>>>>> Are you writing with QUORUM consistency or ONE?
>>>>>>
>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" < solf.lists@ > wrote:
>>>>>>
>>>>>> >Hi Juan,
>
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584014.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
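The Rainbird/IronCount-style batching mentioned in the quoted thread comes down to aggregating counter deltas in memory and issuing one increment per counter per flush instead of one per event. A minimal sketch of the idea (hypothetical names, not code from IronCount; the actual Cassandra write is stubbed out):

    import time
    from collections import defaultdict

    class CounterBatcher:
        """Aggregate counter increments in memory and flush periodically,
        so N events for the same counter become a single +N update."""

        def __init__(self, flush_interval_secs=5):
            self.flush_interval_secs = flush_interval_secs
            self.pending = defaultdict(int)  # (row_key, column) -> accumulated delta
            self.last_flush = time.time()

        def increment(self, row_key, column, delta=1):
            self.pending[(row_key, column)] += delta
            if time.time() - self.last_flush >= self.flush_interval_secs:
                self.flush()

        def flush(self):
            for (row_key, column), delta in self.pending.items():
                # One counter update per aggregated key instead of per event.
                send_counter_update(row_key, column, delta)
            self.pending.clear()
            self.last_flush = time.time()

    def send_counter_update(row_key, column, delta):
        # Placeholder for the real client call (Hector, pycassa, etc.).
        print("increment %s[%s] by %d" % (row_key, column, delta))

The trade-off is similar to the one discussed above for replicate_on_write: any deltas still buffered when the process dies are lost, so the flush interval bounds how much you are willing to drop.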