incubator-cassandra-user mailing list archives

From: David Hawthorne <dha...@gmx.3crowd.com>
Subject: Re: Replicate On Write behavior
Date: Fri, 09 Sep 2011 17:48:04 GMT
They are evenly distributed: 5 nodes * 40 connections each using hector. I can
confirm that all 200 were active when this happened (from hector's perspective,
by graphing hector's JMX data), that all 5 nodes saw roughly 40 connections, and
that all were receiving traffic over those connections (netstat + ntop +
trafshow, etc.).

I can also confirm that I changed my insert strategy to break up the rows using
composite row keys (sketched below), which reduced the row lengths and gave me
an almost perfectly even data distribution among the nodes. That was when I
started to really dig into why the ReplicateOnWrites (ROWs) were still backing
up on one node specifically, and why 2 nodes weren't seeing any. It was the
20%, 20%, 60% ROW distribution that really got me thinking. When I took the
60% node out of the cluster, the ROW load jumped to the node with the
next-lowest IP address, and the 2 nodes that weren't seeing any *still*
weren't seeing any ROWs. At that point I tore down the cluster and recreated
it as a 3-node cluster several times, using various permutations of the 5
nodes available, and the ROW load was *always* on the node with the lowest IP
address.
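
The bucketing is the usual composite-row-key pattern: fold a deterministic
bucket into the row key so one hot row becomes many smaller rows. A minimal
sketch of the idea (names and bucket count are illustrative, not my exact
schema):

    // Illustrative only -- not my exact schema. Spread one logical row
    // across NUM_BUCKETS physical rows so no single row gets hot.
    public class BucketedKeys
    {
        static final int NUM_BUCKETS = 16;

        static String bucketedRowKey(String logicalKey, String subKey)
        {
            // Mask to keep the hash non-negative before taking the modulus.
            int bucket = (subKey.hashCode() & 0x7fffffff) % NUM_BUCKETS;
            return logicalKey + ":" + bucket;   // e.g. "counter42:7"
        }
    }

Reads then fan out across the NUM_BUCKETS keys and merge, which is the price
you pay for the even distribution.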

The theory might not be right, but it certainly matches the behavior I saw.
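
Since Sylvain asks below how the inserts are distributed: the hector setup is
roughly the following (from memory; host list and cluster name are
illustrative, the pool size is the real one):

    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.factory.HFactory;

    // All 5 nodes in the host list, 40 connections per host = 200 total.
    CassandraHostConfigurator conf = new CassandraHostConfigurator(
        "10.0.0.54:9160,10.0.0.55:9160,10.0.0.56:9160,10.0.0.57:9160,10.0.0.72:9160");
    conf.setMaxActive(40);
    Cluster cluster = HFactory.getOrCreateCluster("TestCluster", conf);

hector then spreads requests over the pooled hosts, which is why all 5
coordinators were seeing traffic.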


On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:

> We'll solve #2890 and we should have done it sooner.
> 
> That being said, a quick question: how do you do your inserts from the
> clients?  Are you evenly distributing the inserts among the nodes, or are
> you always hitting the same coordinator?
> 
> Because provided the nodes are correctly distributed on the ring, if you
> distribute the insert (increment) requests across the nodes (again, I'm
> talking about client requests), you "should" not see the behavior you
> observe.
> 
> --
> Sylvain
> 
> On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne <dhawth@gmx.3crowd.com> wrote:
>> It was exactly due to 2890, and the fact that the first replica is always
>> the one with the lowest-value IP address.  I patched Cassandra to pick a
>> random node out of the replica set in StorageProxy.java's
>> findSuitableEndpoint:
>> 
>> Random rng = new Random();
>> 
>> return endpoints.get(rng.nextInt(endpoints.size()));  // instead of return endpoints.get(0);
>> 
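>> For context, the surrounding method (paraphrased from memory, not the
>> exact 0.8 source) looks roughly like this:
>> 
>> private static InetAddress findSuitableEndpoint(String table, ByteBuffer key)
>> throws UnavailableException
>> {
>>     List<InetAddress> endpoints = StorageService.instance.getLiveNaturalEndpoints(table, key);
>>     if (endpoints.isEmpty())
>>         throw new UnavailableException();
>>     // Stock code returns endpoints.get(0); combined with CASSANDRA-2890
>>     // that pins the replicate-on-write work to one node, so the patch
>>     // picks a random live replica instead.
>>     Random rng = new Random();
>>     return endpoints.get(rng.nextInt(endpoints.size()));
>> }
>> 
>> (A longer-lived patch would hoist the Random out into a static field
>> rather than allocating one per request, but it doesn't matter for this
>> test.)
>> 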
>> Now the workload is evenly balanced among all 5 nodes, and I'm getting
>> 2.5x the inserts/sec throughput.
>> 
>> Here's the behavior I saw, where "disk work" refers to the ReplicateOnWrite
>> load of a counter insert:
>> 
>> One node will get RF/n of the disk work.  Two nodes will always get 0
>> disk work.
>> 
>> In a 3-node cluster, 1 node gets its disk hit really hard; you get the
>> performance of a 1-node cluster.
>> In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving
>> you the performance of a ~2-node cluster.
>> In a 10-node cluster, 1 node gets 30% of the disk work, giving you the
>> performance of a ~3-node cluster.
>> 
>> I confirmed this behavior with 3-, 4-, and 5-node cluster sizes; the
>> arithmetic is spelled out below.
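>> 
>> The arithmetic behind those estimates, assuming RF=3: if the busiest node
>> handles an RF/n share of the total disk work, it saturates first and caps
>> the cluster at roughly n/RF times one node's throughput:
>> 
>>     n = 3:   3/3  = 100% on one node  ->  ~1x a single node
>>     n = 6:   3/6  = 50%  on one node  ->  ~2x a single node
>>     n = 10:  3/10 = 30%  on one node  ->  ~3.3x a single node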
>> 
>> 
>>> 
>>>> On another note, on a 5-node cluster, I'm only seeing 3 nodes with
>>>> ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
>>>> normal?  I'm using RandomPartitioner...
>>>> 
>>>> Address      DC           Rack    Status  State   Load       Owns     Token
>>>>                                                                       136112946768375385385349842972707284580
>>>> 10.0.0.57    datacenter1  rack1   Up      Normal  2.26 GB    20.00%   0
>>>> 10.0.0.56    datacenter1  rack1   Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
>>>> 10.0.0.55    datacenter1  rack1   Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
>>>> 10.0.0.54    datacenter1  rack1   Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
>>>> 10.0.0.72    datacenter1  rack1   Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
>>>> 
>>>> The nodes with ReplicateOnWrites are the 3 in the middle.  The first
>>>> node and the last node both have a count of 0.  This is a clean cluster,
>>>> and I've been doing 3k ... 2.5k (decaying performance) inserts/sec for
>>>> the last 12 hours.  The last time this test ran, it went all the way
>>>> down to 500 inserts/sec before I killed it.
>>> 
>>> Could be due to https://issues.apache.org/jira/browse/CASSANDRA-2890.
>>> 
>>> --
>>> Sylvain

