Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (athena.apache.org: 216.139.250.139 is neither permitted
 nor denied by domain of solf.lists@gmail.com)
Date: Wed, 28 Nov 2012 11:41:56 -0800 (PST)
From: Sergey Olefir <solf.lists@gmail.com>
To: cassandra-user@incubator.apache.org
Message-ID: <1354131716881-7584052.post@n2.nabble.com>
In-Reply-To: 
 <CAKkz8Q03X6vKcHHKHK6_PGcR2me7YfB6h9t8vG47xuzP7Wpr2Q@mail.gmail.com>
References: <CCDA4026.60C7%mkjellman@barracuda.com>
 <CAENxBwyjepNm=t7=00SzzzoC=OgBdxUPE_PwTi5jw2DU8tcCxw@mail.gmail.com>
 <50B5491C.9020101@mailchannels.com>
 <CAENxBwww+ivh8kxTMzw7C_WVGfcfQSVprrrrqBEs3WOLRCo7Ww@mail.gmail.com>
 <1354059513956-7584011.post@n2.nabble.com>
 <CAENxBww9kC2QFLPhU714jzoBtmYEXByq=qGnuUj_Fqip+yieZg@mail.gmail.com>
 <1354062178577-7584014.post@n2.nabble.com>
 <CAENxBwxqHMNNBoGQVCCAJDe6ZySm+YSe2BxfjBPjugrAVjQotQ@mail.gmail.com>
 <1354084440405-7584031.post@n2.nabble.com>
 <CAKkz8Q03X6vKcHHKHK6_PGcR2me7YfB6h9t8vG47xuzP7Wpr2Q@mail.gmail.com>
Subject: Re: counters + replication = awful performance?
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Well, that is sad news then. I don't think I can consider 20k increments
per second for a two-node cluster (with RF=2) reasonable performance (cost
vs. benefit).

I might have to look into other storage solutions or perhaps experiment with
duplicate clusters with RF=1 or replicate_on_write=false.

Although yes, I probably should try that row cache you mentioned -- I saw
that key cache was going unused (so saw no reason to try to enable row
cache), but I think that was with RF=1; it might be different with RF=2.


Sylvain Lebresne-3 wrote
> Counter replication works differently from that of "normal" writes.
> Namely, a counter update is written to a first replica, then a read is
> performed and the result of that is replicated to the other nodes. With
> RF=1, since there is only one replica, no read is involved, but in a way
> it's a degenerate case. So there are two reasons why RF>1 is much slower
> than RF=1:
> 1) it involves a read to replicate, and that read takes time. Especially
> if that read hits the disk, it may even dominate the insertion time.
> 2) the replication to the first replica and the one to the rest of the
> replicas are not done in parallel but sequentially. Note that this is only
> true for the first replica versus the others. In other words, from RF=2 to
> RF=3 you should not see a significant performance degradation.
> 
> Note that while there is nothing you can do for 2), you can try to speed
> up 1) by using row cache for instance (in case you weren't).
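
To make sure I read points 1) and 2) correctly, here is a toy latency model
of the counter write path as described above. It is only a sketch: the
function name, the parameters and all timings are made up for illustration,
nothing here is measured or taken from Cassandra's code.

```python
# Toy model of the counter write path described above. All names and
# timings are hypothetical placeholders, not Cassandra internals.

def counter_write_latency_ms(rf, local_write_ms, read_ms, remote_write_ms):
    """Estimate the time to fully replicate one counter increment.

    Step 1: the increment is written to the first replica.
    Step 2: (only if rf > 1) the first replica reads the resulting value.
    Step 3: (only if rf > 1) that value is sent to the remaining replicas,
            which are assumed to be written in parallel with each other.
    The steps run one after another, so their costs add up.
    """
    if rf < 1:
        raise ValueError("rf must be at least 1")
    latency = local_write_ms
    if rf > 1:
        latency += read_ms          # read before replicate
        latency += remote_write_ms  # remaining replicas, in parallel
    return latency


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(rf, counter_write_latency_ms(rf, local_write_ms=1,
                                           read_ms=5, remote_write_ms=2))
```

In this model the whole extra cost appears when going from RF=1 to RF=2,
since only the first replica is handled sequentially with respect to the
others. If read_ms is large (a disk read), it dominates everything else,
which is where row cache would come in.
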
> 
> In other words, with counters, it is expected that RF=1 be potentially
> much faster than RF>1. That is the way counters work.
> 
> And don't get me wrong, I'm not suggesting you should use RF=1 at all.
> What I am saying is that the performance you see with RF=2 is the
> performance of counters in Cassandra.
> 
> --
> Sylvain
> 
> 
> On Wed, Nov 28, 2012 at 7:34 AM, Sergey Olefir <solf.lists@> wrote:
> 
>> I think there might be a misunderstanding as to the nature of the
>> problem.
>>
>> Say, I have test set T. And I have two identical servers A and B.
>> - I tested that server A (singly) is able to handle load of T.
>> - I tested that server B (singly) is able to handle load of T.
>> - I then join A and B into a cluster and set replication=2 -- this means
>> that each server in effect has to handle the full test load individually
>> (because there are two servers and replication=2 it means that each
>> server effectively has to handle all the data written to the cluster).
>> Under these circumstances it is reasonable to assume that cluster A+B
>> shall be able to handle load T because each server is able to do so
>> individually.
>>
>> HOWEVER, this is not the case. In fact, A+B together are only able to
>> handle less than 1/3 of T DESPITE the fact that A and B individually are
>> able to handle T just fine.
>>
>> I think there's something wrong with Cassandra replication (possibly as
>> simple as me misconfiguring something) -- it shouldn't be three times
>> faster to write to two separate nodes in parallel as compared to writing
>> to a 2-node Cassandra cluster with replication=2.
>>
>>
>> Edward Capriolo wrote
>> > Say you are doing 100 inserts rf1 on two nodes. That is 50 inserts a
>> > node. If you go to rf2 that is 100 inserts a node. If you were at 75%
>> > capacity on each node you're now at 150%, which is not possible, so
>> > things bog down.
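
The arithmetic above can be written out as a small sketch. The helper names
are hypothetical, and it assumes writes are spread evenly across nodes and
that utilization scales linearly with replication factor, which is a
simplification.

```python
# Per-node write load for an evenly balanced cluster. Helper names are
# made up for illustration; linear scaling is an assumption.

def per_node_writes(total_writes, nodes, rf):
    """Writes each node must absorb when every write is stored rf times."""
    if not 1 <= rf <= nodes:
        raise ValueError("need 1 <= rf <= nodes")
    return total_writes * rf / nodes


def utilization_after_rf_change(current_utilization, old_rf, new_rf):
    """Scale per-node utilization linearly with the replication factor."""
    return current_utilization * new_rf / old_rf


if __name__ == "__main__":
    print(per_node_writes(100, nodes=2, rf=1))        # 50 inserts a node
    print(per_node_writes(100, nodes=2, rf=2))        # 100 inserts a node
    print(utilization_after_rf_change(0.75, 1, 2))    # 1.5, i.e. 150%
```

Note that this only accounts for the doubling of work per node; it says
nothing about any extra cost specific to counters.
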
>> >
>> > To figure out what is going on we would need to see tpstats, iostat,
>> > and top information.
>> >
>> > I think you're looking at the performance the wrong way. Starting off
>> > at rf 1 is not the way to understand cassandra performance.
>> >
>> > You do not get the benefits of "scale out" until you fix your rf and
>> > increase your node count. I.e. 5 nodes at rf 3 is fast, 10 nodes at
>> > rf 3 even better.
>> > On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
>> >> I already do a lot of in-memory aggregation before writing to
>> >> Cassandra.
>> >>
>> >> The question here is what is wrong with Cassandra (or its
>> >> configuration) that causes a huge performance drop when moving from
>> >> 1-replication to 2-replication for counters -- and more importantly
>> >> how to resolve the problem. A 2x-3x drop when moving from
>> >> 1-replication to 2-replication on two nodes is reasonable. 6x is not.
>> >> Like I said, with this kind of performance degradation it makes more
>> >> sense to run two clusters with replication=1 in parallel rather than
>> >> rely on Cassandra replication.
>> >>
>> >> And yes, Rainbird was the inspiration for what we are trying to do
>> >> here :)
>> >>
>> >>
>> >>
>> >> Edward Capriolo wrote
>> >>> Cassandra's counters read on increment. Additionally they are
>> >>> distributed, so there can be multiple reads on increment. If they
>> >>> are not fast enough and you have exhausted all tuning options, add
>> >>> more servers to handle the load.
>> >>>
>> >>> In many cases incrementing the same counter n times can be avoided.
>> >>>
>> >>> Twitter's Rainbird did just that. It avoided multiple counter
>> >>> increments by batching them.
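
A minimal sketch of the batching idea mentioned above, for anyone following
along. This is not Rainbird's or IronCount's actual code: the class name,
the flush callback and the max_pending threshold are all assumptions, and
the callback stands in for whatever actually writes to Cassandra.

```python
from collections import Counter


class IncrementBuffer:
    """Accumulates counter increments in memory and flushes them together.

    flush_fn is a hypothetical callback that receives a {key: delta} dict;
    in a real setup it would issue one counter update per key.
    """

    def __init__(self, flush_fn, max_pending=1000):
        self._flush_fn = flush_fn
        self._max_pending = max_pending
        self._pending = Counter()
        self._count = 0  # increments buffered since the last flush

    def increment(self, key, delta=1):
        # n increments of the same key collapse into one pending delta.
        self._pending[key] += delta
        self._count += 1
        if self._count >= self._max_pending:
            self.flush()

    def flush(self):
        # Skip keys whose deltas cancelled out to zero.
        batch = {k: d for k, d in self._pending.items() if d != 0}
        self._pending.clear()
        self._count = 0
        if batch:
            self._flush_fn(batch)


if __name__ == "__main__":
    buf = IncrementBuffer(print, max_pending=3)
    buf.increment("page:home")
    buf.increment("page:home")
    buf.increment("page:about", 5)  # third increment triggers a flush
```

The trade-off is that buffered increments are lost if the process dies
before a flush, so the threshold (or a timer, not shown here) bounds how
much can be lost.
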
>> >>>
>> >>> I have done a similar thing using Cassandra and Kafka.
>> >>>
>> >>>
>> >>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
>> >>>
>> >>>
>> >>> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
>> >>>> Hi, thanks for your suggestions.
>> >>>>
>> >>>> Regarding replicate=2 vs replicate=1 performance: I expected that
>> >>>> the configurations below would have similar performance:
>> >>>> - single node, replicate = 1
>> >>>> - two nodes, replicate = 2 (okay, this probably should be a bit
>> >>>>   slower due to additional overhead).
>> >>>>
>> >>>> However what I'm seeing is that the second option (replicate=2) is
>> >>>> about THREE times slower than single node.
>> >>>>
>> >>>>
>> >>>> Regarding replicate_on_write -- it is, in fact, a dangerous option.
>> >>>> As the JIRA discusses, if you make changes to your ring (moving
>> >>>> tokens and such) you will *silently* lose data. That is on top of
>> >>>> whatever data you might end up losing if you run
>> >>>> replicate_on_write=false and the only node that got the data fails.
>> >>>>
>> >>>> But what is much worse -- with replicate_on_write being false the
>> >>>> data will NOT be replicated (in my tests) ever unless you
>> >>>> explicitly request the cell. Then it will return the wrong result.
>> >>>> And only on subsequent reads will it return adequate results. I
>> >>>> haven't tested it, but the documentation states that a range query
>> >>>> will NOT do 'read repair' and thus will not force replication.
>> >>>> The test I did went like this:
>> >>>> - replicate_on_write = false
>> >>>> - write something to node A (which should in theory replicate to
>> >>>>   node B)
>> >>>> - wait for a long time (longest was on the order of 5 hours)
>> >>>> - read from node B (and here I was getting null / wrong result)
>> >>>> - read from node B again (here you get what you'd expect after read
>> >>>>   repair)
>> >>>>
>> >>>> In essence, using replicate_on_write=false with rarely read data
>> >>>> will practically defeat the purpose of having replication in the
>> >>>> first place (failover, data redundancy).
>> >>>>
>> >>>>
>> >>>> Or, in other words, this option doesn't look to be applicable to my
>> >>>> situation.
>> >>>>
>> >>>> It looks like I will get much better performance by simply writing
>> >>>> to two separate clusters rather than using a single cluster with
>> >>>> replicate=2. Which is kind of stupid :) I think something's fishy
>> >>>> with counters and replication.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Edward Capriolo wrote
>> >>>>> I misspoke really. It is not dangerous, you just have to
>> >>>>> understand what it means. This JIRA discusses it.
>> >>>>>
>> >>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
>> >>>>>
>> >>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@> wrote:
>> >>>>>
>> >>>>>>  We're having a similar performance problem.  Setting
>> >>>>>> 'replicate_on_write: false' fixes the performance issue in our tests.
>> >>>>>>
>> >>>>>> How dangerous is it?  What exactly could go wrong?
>> >>>>>>
>> >>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>> >>>>>>
>> >>>>>> The difference between replication factor = 1 and replication
>> >>>>>> factor > 1 is significant. Also it sounds like your cluster is 2
>> >>>>>> nodes, so going from RF=1 to RF=2 means double the load on both
>> >>>>>> nodes.
>> >>>>>>
>> >>>>>>  You may want to experiment with the very dangerous column family
>> >>>>>> attribute:
>> >>>>>>
>> >>>>>>  - replicate_on_write: Replicate every counter update from the
>> >>>>>> leader to the follower replicas. Accepts the values true and false.
>> >>>>>>
>> >>>>>>  Edward
>> >>>>>>  On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjellman@> wrote:
>> >>>>>>
>> >>>>>>> Are you writing with QUORUM consistency or ONE?
>> >>>>>>>
>> >>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.lists@> wrote:
>> >>>>>>>
>> >>>>>>> >Hi Juan,
>> >>
>> >> --
>> >> View this message in context:
>> >> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584014.html
>> >> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584031.html
>> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
>>


--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584052.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.