From: David Hawthorne
Subject: Re: ReplicateOnWrite issues
Date: Tue, 12 Jul 2011 00:37:42 -0700
To: user@cassandra.apache.org

It's definitely for counters, and some of the rows I'm inserting are long-ish, if 1.3MB is long.

Maybe it would help if I said I was using counter super columns. I'm also writing to only a handful of rows at a time, until they are full. It looks like the counter super column code in addReadCommandFromColumnFamily, in the ReplicateOnWrite handling, might not be optimized for this kind of load.

It looks like I may be able to turn off replicate_on_write for the CFs (as long as I use CL.ONE) to get around this in the short term, but I'd like to know what happens when replicate_on_write is true versus when it is false so I can make a more informed choice.
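If I do go that route, I assume it's just a schema change from cassandra-cli, something along these lines (keyspace and CF names are made up here; correct me if replicate_on_write isn't settable this way in 0.8.1):

    use MyKeyspace;
    update column family MyCounterCF with replicate_on_write = false;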

On Jul 12, 2011, at 12:20 AM, Yang wrote:

> interesting,
>
> first, just to make sure: since replicateOnWrite is for counters, you
> are using counters (you use the word "insert" instead of
> "add/increment"), right?
>
> if you are using counters, supposedly the leader runs
> replicateOnWrite; if somehow all your adds find the one box as leader,
> that's probably something worth investigating.
>
> finally, you could try debugging into those replicateOnWrite tasks and
> see what they are doing exactly; maybe there is something immediately
> wrong
>
>
> On Mon, Jul 11, 2011 at 11:57 PM, David Hawthorne wrote:
>> This is now my fourth attempt to get the message through. Apologies if you see multiple copies.
>>
>> I've tried to give as much relevant data as I can think of, but please let me know if you need any other info. I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want. I can also re-run the tests to replicate the problem if need be. As always, thanks for taking a look!
>>
>> I have a 4-host test cluster running 0.8.1 that I'm writing counters into with hector at CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results). The partitioner is the RandomPartitioner, and RF=3. nodetool ring shows that the actual data is well balanced:
>>
>> Address         DC          Rack        Status State   Load            Owns    Token
>>                                                                                127605887595351923798765477786913079296
>> 10.0.0.57       datacenter1 rack1       Up     Normal  1.59 GB         25.00%  0
>> 10.0.0.56       datacenter1 rack1       Up     Normal  1.39 GB         25.00%  42535295865117307932921825928971026432
>> 10.0.0.55       datacenter1 rack1       Up     Normal  1.24 GB         25.00%  85070591730234615865843651857942052864
>> 10.0.0.54       datacenter1 rack1       Up     Normal  1.34 GB         25.00%  127605887595351923798765477786913079296
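>>
>> In case it helps to see the shape of the load, the client is essentially a Hector mutator loop incrementing counters in a handful of rows. Here's a stripped-down sketch of that access pattern (not the actual test code: the cluster/keyspace/CF names are placeholders, and it uses plain counter columns for brevity even though the real schema uses counter super columns):
>>
>> import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
>> import me.prettyprint.cassandra.serializers.StringSerializer;
>> import me.prettyprint.hector.api.Cluster;
>> import me.prettyprint.hector.api.HConsistencyLevel;
>> import me.prettyprint.hector.api.Keyspace;
>> import me.prettyprint.hector.api.factory.HFactory;
>> import me.prettyprint.hector.api.mutation.Mutator;
>>
>> public class CounterLoadSketch {
>>     public static void main(String[] args) {
>>         // placeholder cluster name and contact host
>>         Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "10.0.0.57:9160");
>>
>>         // write at CL.ONE (the test was also run at QUORUM)
>>         ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
>>         ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
>>
>>         // placeholder keyspace and column family names
>>         Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster, ccl);
>>         Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());
>>
>>         // hammer a handful of long rows with counter increments
>>         for (long i = 0; i < 1000000; i++) {
>>             String rowKey = "row-" + (i % 4);
>>             mutator.insertCounter(rowKey, "MyCounterCF",
>>                     HFactory.createCounterColumn("col-" + i, 1L, StringSerializer.get()));
>>         }
>>
>>         HFactory.shutdownCluster(cluster);
>>     }
>> }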
>>
>> I also tested against a 1-box test cluster with RF=1.
>>
>> Both show the same results: high performance for a while, and then ReplicateOnWrite backs way up (I've seen it as high as a million pending tasks), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions. High performance in this case equates to about 5-20k inserts/sec on both the 4-box and 1-box clusters (I'm inserting one long row at a time, so performance is bounded by the one box owning that row). The cassandra process never seems to recover, even if left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks.
>>
>> GC doesn't seem to be a factor. Logs show GCs completing in < 1s on all 4 boxes in the test cluster, as well as on the 1 standalone server.
>>
>> I'm also seeing highly erratic performance from the cluster in general: inserts/sec usually start at around 3k, then maybe creep up towards 6k, then drop back down to below 1k for an extended period of time; maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't last more than 30 seconds.
>>
>> Here's the tpstats output from all 4 boxes:
>>
>> Slowest/most problematic/highest cpu usage one (this tpstats output is also representative of the one-box cluster about 20 minutes into my insert run):
>>
>> 4 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0       59095840
>> MutationStage                      0           0       31182942
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage             32      928042       28246716
>> GossipStage                        1           1         449464
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0            445
>> StreamStage                        0           0              0
>> FlushWriter                        0           0            445
>> FILEUTILS-DELETE-POOL              0           0             38
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
>>
>> Second highest cpu usage one:
>>
>> 3 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0       27413910
>> MutationStage                      0           0       27523094
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage              0           0       11111971
>> GossipStage                        0           0         335422
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0             66
>> StreamStage                        0           0              0
>> FlushWriter                        0           0             66
>> FILEUTILS-DELETE-POOL              0           0             31
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
>>
>> The other 2 that see minimal cpu usage:
>>
>> 2 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0        9213999
>> MutationStage                      0           0       33233114
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage              0           0              0
>> GossipStage                        0           0         336016
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0             77
>> StreamStage                        0           0              0
>> FlushWriter                        0           0             77
>> FILEUTILS-DELETE-POOL              0           0             40
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
>>
>> 1 $ ./nodetool -h localhost tpstats
>> Pool Name                     Active     Pending      Completed
>> ReadStage                          0           0              0
>> RequestResponseStage               0           0       13796875
>> MutationStage                      0           0       36116662
>> ReadRepairStage                    0           0              0
>> ReplicateOnWriteStage              0           0              0
>> GossipStage                        0           0         336496
>> AntiEntropyStage                   0           0              0
>> MigrationStage                     0           0              4
>> MemtablePostFlusher                0           0             91
>> StreamStage                        0           0              0
>> FlushWriter                        0           0             91
>> FILEUTILS-DELETE-POOL              0           0             43
>> MiscStage                          0           0              0
>> FlushSorter                        0           0              0
>> InternalResponseStage              0           0              4
>> HintedHandoff                      0           0              0
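>>
>> If it's useful for whoever looks at this, the ReplicateOnWriteStage backlog above is easy to watch over time with a quick loop over the same nodetool command (just a convenience one-liner, sampling every 10 seconds):
>>
>>     while true; do ./nodetool -h localhost tpstats | grep ReplicateOnWriteStage; sleep 10; done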