Subject: Re: ReplicateOnWrite issues
From: David Hawthorne
Date: Tue, 12 Jul 2011 14:42:35 -0700
To: user@cassandra.apache.org

Well, I was using a large number of clients: I tried configuring a hector pool of 20-200 to see what effect that had on throughput. There's definitely a point after which there's no gain, so I dialed it back down.

To clarify a few other things: when I say inserts I mean increments, as this test only involves counters. When I say a row is full, I mean I've written all of the data into it that belongs in it from our application's perspective. Not all rows are the same length, and the average number of subcolumns per supercolumn is higher for some rows than for others.

I'm curious: is it necessary for a read to happen for replication when using RF=1?

I am currently running a test against a single-node cluster with RF=1, CL.ONE, and 4 connections/concurrent increments, and graphing various data points. If something interesting happens, I'll send a graph.

When you say stagger writes, do you mean just lower the overall number of increments/sec my client is attempting, or do you mean I should stagger the load amongst many CFs/rows at once? Right now I'm writing heavily into a small set of rows, each going into 3 CFs, before moving on to the next small set of rows.
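
For reference, the write path on my side boils down to something like the sketch below (Hector 0.8.x API; the cluster name, "Counters" keyspace, "Counts" CF, and row/column names are placeholders rather than our real schema, and the pool size is the knob I was varying):

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class CounterLoad {
        public static void main(String[] args) {
            // Bounded connection pool -- this is the 20-200 knob I was varying.
            CassandraHostConfigurator hosts = new CassandraHostConfigurator("10.0.0.54:9160");
            hosts.setMaxActive(20);
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", hosts);

            // All reads/writes at CL.ONE for this run.
            ConfigurableConsistencyLevel cl = new ConfigurableConsistencyLevel();
            cl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
            cl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);

            // "Counters"/"Counts" are placeholder keyspace/CF names.
            Keyspace keyspace = HFactory.createKeyspace("Counters", cluster, cl);
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

            // Hammer one long row with counter increments before moving on to the next row.
            for (long i = 0; i < 1000000; i++) {
                mutator.incrementCounter("row-0001", "Counts", "col-" + (i % 1000), 1L);
            }

            HFactory.shutdownCluster(cluster);
        }
    }
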
On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:

> When you do a counter increment at CL.ONE, the write is acknowledged as
> soon as the first replica getting the write has pushed the increment
> into its memtable. However, there is a read happening for the
> replication to the other replicas (this is necessary to the counter
> design). What is happening is that your nodes are not able to do those
> reads fast enough. Or, another way to see it is that you are writing
> faster than your cluster can handle.
>
> I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
> something about that, but for now you'll have to stagger your writes.
> I'm more surprised that tasks on the replicate on write stage would
> stack up at QUORUM, unless you're using a very large number of clients.
>
> --
> Sylvain
>
> On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne wrote:
>
>> This is now my fourth attempt to get the message through. Apologies if
>> you see multiple copies.
>>
>> I've tried to give as much relevant data as I can think of, but please
>> let me know if you need any other info. I spent the day getting
>> jmxtrans to talk to statsd with the cassandra JMX data, so I can
>> provide graphs of just about anything you want. I can also re-run the
>> tests to replicate the problem if need be. As always, thanks for
>> taking a look!
>>
>> I have a 4-host test cluster running 0.8.1 that I'm writing counters
>> into with hector at CL.ONE or CL.QUORUM (the test was run with each,
>> and both gave the same results). Partitioner is the RandomPartitioner,
>> and RF=3. nodetool ring shows that the actual data is well balanced:
>>
>> Address    DC           Rack   Status  State   Load     Owns    Token
>>                                                                 127605887595351923798765477786913079296
>> 10.0.0.57  datacenter1  rack1  Up      Normal  1.59 GB  25.00%  0
>> 10.0.0.56  datacenter1  rack1  Up      Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
>> 10.0.0.55  datacenter1  rack1  Up      Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
>> 10.0.0.54  datacenter1  rack1  Up      Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>>
>> I also tested against a 1-box test cluster with RF=1.
>>
>> Both show the same results: high performance for a while, and then
>> ReplicateOnWrite backs way up (I've seen it as high as a million
>> pending tasks), the cassandra process becomes unresponsive, and the
>> hector client starts throwing exceptions. High performance in this
>> case equates to about 5-20k inserts/sec on both the 4-box and 1-box
>> clusters (I'm inserting one long row at a time, so performance is
>> bounded by the one box owning the row). The cassandra process never
>> seems to recover, even if left for 24 hours; it's still chewing
>> through those ReplicateOnWrite pending tasks.
>>
>> GC doesn't seem to be a factor. Logs show GCs completing in < 1s on
>> all 4 boxes in the test cluster, as well as on the 1 standalone server.
>>
>> I'm also seeing highly erratic performance from the cluster in
>> general: inserts/sec usually start at around 3k, then maybe creep up
>> towards 6k, then drop back down to below 1k for an extended period of
>> time; maybe 1% of the time they'll spike up to 8k or even 15k, but
>> that doesn't last more than 30 seconds.
>>
>> Here's the tpstats output from all 4 boxes:
>>
>> Slowest/most problematic/highest cpu usage one (this tpstats output is
>> also representative of the one-box cluster about 20 minutes into my
>> insert run):
>>
>> 4 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0       59095840
>> MutationStage                  0         0       31182942
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage         32    928042       28246716
>> GossipStage                    1         1         449464
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0            445
>> StreamStage                    0         0              0
>> FlushWriter                    0         0            445
>> FILEUTILS-DELETE-POOL          0         0             38
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
>>
>> Second highest cpu usage one:
>>
>> 3 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0       27413910
>> MutationStage                  0         0       27523094
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage          0         0       11111971
>> GossipStage                    0         0         335422
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0             66
>> StreamStage                    0         0              0
>> FlushWriter                    0         0             66
>> FILEUTILS-DELETE-POOL          0         0             31
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
>>
>> The other 2 that see minimal cpu usage:
>>
>> 2 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0        9213999
>> MutationStage                  0         0       33233114
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage          0         0              0
>> GossipStage                    0         0         336016
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0             77
>> StreamStage                    0         0              0
>> FlushWriter                    0         0             77
>> FILEUTILS-DELETE-POOL          0         0             40
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
>>
>> 1 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending      Completed
>> ReadStage                      0         0              0
>> RequestResponseStage           0         0       13796875
>> MutationStage                  0         0       36116662
>> ReadRepairStage                0         0              0
>> ReplicateOnWriteStage          0         0              0
>> GossipStage                    0         0         336496
>> AntiEntropyStage               0         0              0
>> MigrationStage                 0         0              4
>> MemtablePostFlusher            0         0             91
>> StreamStage                    0         0              0
>> FlushWriter                    0         0             91
>> FILEUTILS-DELETE-POOL          0         0             43
>> MiscStage                      0         0              0
>> FlushSorter                    0         0              0
>> InternalResponseStage          0         0              4
>> HintedHandoff                  0         0              0
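
In case it helps frame my question above: the simplest reading of "stagger your writes" is to cap the client-side increment rate. A rough sketch of what that could look like on the Hector side (the rate, batch size, and CF/row/column names are placeholders, and the right ceiling would have to be found by watching the ReplicateOnWriteStage pending count):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class StaggeredCounterLoad {

        // Issue increments in fixed-size batches and sleep between batches so the
        // client never exceeds targetPerSecond. Tune targetPerSecond down until
        // ReplicateOnWriteStage pending stops growing.
        public static void incrementAtFixedRate(Keyspace keyspace, String cf, String rowKey,
                                                long total, int targetPerSecond)
                throws InterruptedException {
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
            int batchSize = Math.min(targetPerSecond, 500);
            long sleepMillis = 1000L * batchSize / targetPerSecond;

            long done = 0;
            while (done < total) {
                for (int i = 0; i < batchSize && done < total; i++, done++) {
                    mutator.incrementCounter(rowKey, cf, "col-" + (done % 1000), 1L);
                }
                Thread.sleep(sleepMillis); // crude pacing: one batch per sleep interval
            }
        }
    }
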
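And since I mentioned jmxtrans/statsd: the pending count that nodetool tpstats prints can also be polled directly over JMX. A minimal sketch, assuming the stage MBean is named org.apache.cassandra.request:type=ReplicateOnWriteStage as in 0.8 and that JMX is on the default port 7199 (adjust both if your setup differs):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ReplicateOnWriteBacklog {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";

            // Default Cassandra JMX port in 0.8; adjust if cassandra-env.sh says otherwise.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Assumed MBean name for the replicate-on-write thread pool.
            ObjectName stage = new ObjectName(
                    "org.apache.cassandra.request:type=ReplicateOnWriteStage");

            // Poll the same counters that nodetool tpstats prints.
            while (true) {
                Object active = mbs.getAttribute(stage, "ActiveCount");
                Object pending = mbs.getAttribute(stage, "PendingTasks");
                Object completed = mbs.getAttribute(stage, "CompletedTasks");
                System.out.println("active=" + active
                        + " pending=" + pending
                        + " completed=" + completed);
                Thread.sleep(5000);
            }
        }
    }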