cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: ReplicateOnWrite issues
Date Tue, 12 Jul 2011 22:26:28 GMT
On Wed, Jul 13, 2011 at 12:18 AM, David Hawthorne <dhawth@gmx.3crowd.com> wrote:
>
> On Jul 12, 2011, at 3:02 PM, Sylvain Lebresne wrote:
>
>> On Tue, Jul 12, 2011 at 11:42 PM, David Hawthorne <dhawth@gmx.3crowd.com> wrote:
>>> Well, I was using a large number of clients: I tried configuring a hector pool
>>> of 20-200 to see what effect that had on throughput. There's definitely a point
>>> after which there's no gain, so I dialed it back down. To clarify a few other
>>> things, when I say inserts I mean increments, as this test only involves counters.
>>> When I say a row is full, I mean I've written all of the data into it that belongs
>>> in it from our application's perspective. Not all rows are the same length, and
>>> the average number of subcolumns per supercolumn is higher for some rows than for
>>> others.
>>>
>>
>> Oh yes, you're using supercolumns. That could help magnify the problem
>> a little bit, given that to read any subcolumn of a supercolumn, we
>> have to read the whole supercolumn. Depends how many counters per
>> supercolumn we're talking about.
>
> Does this imply that incrementing 50 counters in one supercolumn in one row in one
> CF requires 50 reads of the supercolumn?  One read of the entire supercolumn for
> each insert?

Every increment of one of the 50 counters means reading the whole
supercolumn, i.e., all 50 counters. That is, there are not more reads, but
the reads themselves have to read more and so can take longer.
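
A minimal sketch of that read amplification, assuming a hypothetical
SuperColumnCounters class rather than the actual Cassandra internals (the
replicateOnWrite method below is just a placeholder): the increment only changes
one subcolumn, but because a supercolumn can only be read as a unit, the read done
for replication has to materialize all of them.

import java.util.HashMap;
import java.util.Map;

public class SuperColumnCounters {
    // One supercolumn holding many counter subcolumns.
    private final Map<String, Long> subColumns = new HashMap<>();

    void increment(String subColumn, long delta) {
        // The increment itself touches only one subcolumn...
        subColumns.merge(subColumn, delta, Long::sum);

        // ...but the read done for replication pulls back the whole
        // supercolumn, so its cost grows with the number of subcolumns.
        Map<String, Long> wholeSuperColumn = new HashMap<>(subColumns);
        replicateOnWrite(wholeSuperColumn);
    }

    private void replicateOnWrite(Map<String, Long> snapshot) {
        // Placeholder for the read-then-forward-to-replicas step.
    }

    public static void main(String[] args) {
        SuperColumnCounters sc = new SuperColumnCounters();
        for (int i = 0; i < 50; i++) {
            sc.increment("counter-" + i, 1); // each call re-reads every existing subcolumn
        }
    }
}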

>
>
>>
>>> I'm curious, is it necessary for a read to happen for replication when using RF=1?
>>
>> No, it's not necessary with RF=1 ... but we still do it :(. Ok, it's basically a
>> bug, we'll fix that for 0.8.2.
>>
>>> I am currently running a test against a single node cluster, RF=1, CL.ONE, 4
>>> connections/concurrent increments, and graphing various data points.  If something
>>> interesting happens, I'll send a graph.
>>>
>>> When you say stagger writes, do you mean just lower the overall number of
>>> increments/sec my client is attempting,
>>
>> Yes, this is what I mean.
>>
>>> or do you mean I should stagger the load amongst many CFs/rows at once?  Right now
>>> I'm writing heavily into a small set of rows, each going into 3 CFs, before moving
>>> on to the next small set of rows.
>>>
>>>
>>> On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:
>>>
>>>> When you do a counter increment at CL.ONE, the write is acknowledged as
>>>> soon as the first replica getting the write has pushed the
>>>> increment into its memtable. However, there is a read happening for
>>>> the replication to the other replicas (this is necessary for the
>>>> counter design). What is happening is that our nodes are not able to
>>>> do those reads fast enough. Or, another way to see it is that you are
>>>> writing faster than your cluster can handle.
>>>>
>>>> I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
>>>> something about that, but for now you'll have to stagger your writes.
>>>> I'm more surprised that tasks on the replicate on write stage would
>>>> stack up at QUORUM, unless you're using a very large number of
>>>> clients.
>>>>
>>>> --
>>>> Sylvain
>>>>
>>>>> On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne <dhawth@gmx.3crowd.com> wrote:
>>>>> This is now my fourth attempt to get the message through.  Apologies if you see
>>>>> multiple copies.
>>>>>
>>>>> I've tried to give as much relevant data as I can think of, but please let me
>>>>> know if you need any other info.  I spent the day getting jmxtrans to talk to
>>>>> statsd with the cassandra JMX data, so I can provide graphs of just about
>>>>> anything you want.  I can also re-run the tests to replicate the problem if need
>>>>> be.  As always, thanks for taking a look!
>>>>>
>>>>> I have a 4-host test cluster running 0.8.1 that I'm writing counters into with
>>>>> hector at CL.ONE or CL.QUORUM (the test was run with each, and both gave the
>>>>> same results).  Partitioner is the RandomPartitioner, and RF=3.  nodetool ring
>>>>> shows that the actual data is well balanced:
>>>>>
>>>>> Address      DC          Rack   Status State   Load     Owns    Token
>>>>>                                                                 127605887595351923798765477786913079296
>>>>> 10.0.0.57    datacenter1 rack1  Up     Normal  1.59 GB  25.00%  0
>>>>> 10.0.0.56    datacenter1 rack1  Up     Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
>>>>> 10.0.0.55    datacenter1 rack1  Up     Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
>>>>> 10.0.0.54    datacenter1 rack1  Up     Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>>>>>
>>>>> I also tested against a 1-box test cluster with RF=1.
>>>>>
>>>>> Both show the same results:  high performance for a while, and then
>>>>> ReplicateOnWrite backs way up (I've seen it as high as a million), the cassandra
>>>>> process becomes unresponsive, and the hector client starts throwing exceptions.
>>>>> High performance in this case equates to about 5-20k inserts/sec on both the
>>>>> 4-box and 1-box cluster (I'm inserting one long row at a time, so the performance
>>>>> is bounded by the one box owning the row).  The cassandra process never seems to
>>>>> recover, even if left for 24 hours.  It's still chewing through those
>>>>> ReplicateOnWrite pending tasks.
>>>>>
>>>>> GC doesn't seem to be a factor.  Logs show GCs completing in < 1s on all 4 boxes
>>>>> in the test cluster, as well as on the 1 standalone server.
>>>>>
>>>>> I'm also seeing highly erratic performance from the cluster in general:
>>>>> inserts/sec usually start at around 3k, but then maybe creep up towards 6k, then
>>>>> drop back down to below 1k for an extended period of time, then maybe 1% of the
>>>>> time they'll spike up to 8k or even 15k, but that doesn't last more than 30
>>>>> seconds.
>>>>>
>>>>> Here's the tpstats output from all 4 boxes:
>>>>>
>>>>> Slowest/most problematic/highest cpu usage one:
>>>>>
>>>>> This tpstats output is also representative of the one box cluster about 20
>>>>> minutes into my insert run.
>>>>>
>>>>> 4 $ ./nodetool -h localhost tpstats
>>>>> Pool Name                    Active   Pending      Completed
>>>>> ReadStage                         0         0              0
>>>>> RequestResponseStage              0         0       59095840
>>>>> MutationStage                     0         0       31182942
>>>>> ReadRepairStage                   0         0              0
>>>>> ReplicateOnWriteStage            32    928042       28246716
>>>>> GossipStage                       1         1         449464
>>>>> AntiEntropyStage                  0         0              0
>>>>> MigrationStage                    0         0              4
>>>>> MemtablePostFlusher               0         0            445
>>>>> StreamStage                       0         0              0
>>>>> FlushWriter                       0         0            445
>>>>> FILEUTILS-DELETE-POOL             0         0             38
>>>>> MiscStage                         0         0              0
>>>>> FlushSorter                       0         0              0
>>>>> InternalResponseStage             0         0              4
>>>>> HintedHandoff                     0         0              0
>>>>>
>>>>> Second highest cpu usage one:
>>>>>
>>>>> 3 $ ./nodetool -h localhost tpstats
>>>>> Pool Name                    Active   Pending      Completed
>>>>> ReadStage                         0         0              0
>>>>> RequestResponseStage              0         0       27413910
>>>>> MutationStage                     0         0       27523094
>>>>> ReadRepairStage                   0         0              0
>>>>> ReplicateOnWriteStage             0         0       11111971
>>>>> GossipStage                       0         0         335422
>>>>> AntiEntropyStage                  0         0              0
>>>>> MigrationStage                    0         0              4
>>>>> MemtablePostFlusher               0         0             66
>>>>> StreamStage                       0         0              0
>>>>> FlushWriter                       0         0             66
>>>>> FILEUTILS-DELETE-POOL             0         0             31
>>>>> MiscStage                         0         0              0
>>>>> FlushSorter                       0         0              0
>>>>> InternalResponseStage             0         0              4
>>>>> HintedHandoff                     0         0              0
>>>>>
>>>>> The other 2 that see minimal cpu usage:
>>>>>
>>>>> 2 $ ./nodetool -h localhost tpstats
>>>>> Pool Name                    Active   Pending      Completed
>>>>> ReadStage                         0         0              0
>>>>> RequestResponseStage              0         0        9213999
>>>>> MutationStage                     0         0       33233114
>>>>> ReadRepairStage                   0         0              0
>>>>> ReplicateOnWriteStage             0         0              0
>>>>> GossipStage                       0         0         336016
>>>>> AntiEntropyStage                  0         0              0
>>>>> MigrationStage                    0         0              4
>>>>> MemtablePostFlusher               0         0             77
>>>>> StreamStage                       0         0              0
>>>>> FlushWriter                       0         0             77
>>>>> FILEUTILS-DELETE-POOL             0         0             40
>>>>> MiscStage                         0         0              0
>>>>> FlushSorter                       0         0              0
>>>>> InternalResponseStage             0         0              4
>>>>> HintedHandoff                     0         0              0
>>>>>
>>>>>
>>>>> 1 $ ./nodetool -h localhost tpstats
>>>>> Pool Name                    Active   Pending      Completed
>>>>> ReadStage                         0         0              0
>>>>> RequestResponseStage              0         0       13796875
>>>>> MutationStage                     0         0       36116662
>>>>> ReadRepairStage                   0         0              0
>>>>> ReplicateOnWriteStage             0         0              0
>>>>> GossipStage                       0         0         336496
>>>>> AntiEntropyStage                  0         0              0
>>>>> MigrationStage                    0         0              4
>>>>> MemtablePostFlusher               0         0             91
>>>>> StreamStage                       0         0              0
>>>>> FlushWriter                       0         0             91
>>>>> FILEUTILS-DELETE-POOL             0         0             43
>>>>> MiscStage                         0         0              0
>>>>> FlushSorter                       0         0              0
>>>>> InternalResponseStage             0         0              4
>>>>> HintedHandoff                     0         0              0
>>>
>>>
>
>

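To put rough numbers on the backlog mechanics described in the quoted explanation
above: at CL.ONE each increment is acked as soon as the memtable is updated, but it
still queues a replicate-on-write read, so whenever increments arrive faster than
those reads complete, the pending count only grows. The rates in this
back-of-the-envelope sketch are made-up assumptions, not measurements from this
cluster.

public class ReplicateOnWriteBacklog {
    public static void main(String[] args) {
        double incrementsPerSec = 10_000; // assumed client increment rate
        double readsPerSec = 4_000;       // assumed replicate-on-write read capacity
        long pending = 0;
        for (int second = 1; second <= 60; second++) {
            // Net queue growth each second while writes outpace the reads.
            pending += (long) (incrementsPerSec - readsPerSec);
            if (second % 15 == 0) {
                System.out.printf("after %ds: ~%d pending ReplicateOnWrite tasks%n",
                                  second, pending);
            }
        }
    }
}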