incubator-cassandra-user mailing list archives

From Rakesh Rajan <rakes...@gmail.com>
Subject Re: High loads only on one node in the cluster
Date Fri, 01 Nov 2013 10:07:23 GMT
@Tyler / @Rob,

As Ashish mentioned earlier, we have 9 nodes on AWS: 6 in East Coast and 3
in Singapore. All 9 nodes use EC2Snitch. The current ring (across all
nodes in both DCs) looks like this:

ip11 - East Coast - m1.xlarge / us-east-1b      - Size: 83 GB - Token: 0
ip21 - Singapore  - m1.xlarge / ap-southeast-1a - Size: 88 GB - Token: 1001
ip12 - East Coast - m1.xlarge / us-east-1b      - Size: 45 GB - Token: 28356863910078205288614550619314017621
ip13 - East Coast - m1.xlarge / us-east-1c      - Size: 93 GB - Token: 56713727820156410577229101238628035241
ip22 - Singapore  - m1.xlarge / ap-southeast-1b - Size: 73 GB - Token: 56713727820156410577229101238628036241
ip14 - East Coast - m1.xlarge / us-east-1c      - Size: 20 GB - Token: 85070591730234615865843651857942052863
ip15 - East Coast - m1.xlarge / us-east-1d      - Size: 89 GB - Token: 113427455640312821154458202477256070484
ip23 - Singapore  - m1.xlarge / ap-southeast-1b - Size: 56 GB - Token: 113427455640312821154458202477256071484
ip16 - East Coast - m1.xlarge / us-east-1d      - Size: 25 GB - Token: 141784319550391026443072753096570088105
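
As a sanity check on the ring above, here is a minimal Python sketch that recomputes the share of the token space each node owns. It assumes RandomPartitioner (a token space of 0 to 2^127, which the token values suggest but this thread does not state) and that ownership is simply the gap back to the previous token on the combined ring. The `ownership` helper is hypothetical, not a Cassandra API.

```python
# Assumption: RandomPartitioner, so tokens live on a ring of size 2**127.
RING_SIZE = 2 ** 127

# Tokens copied from the ring listing above.
TOKENS = {
    "ip11": 0,
    "ip21": 1001,
    "ip12": 28356863910078205288614550619314017621,
    "ip13": 56713727820156410577229101238628035241,
    "ip22": 56713727820156410577229101238628036241,
    "ip14": 85070591730234615865843651857942052863,
    "ip15": 113427455640312821154458202477256070484,
    "ip23": 113427455640312821154458202477256071484,
    "ip16": 141784319550391026443072753096570088105,
}

def ownership(tokens):
    """Return {node: fraction of the ring between the previous token and this one}."""
    ordered = sorted(tokens.items(), key=lambda kv: kv[1])
    owns = {}
    for idx, (node, tok) in enumerate(ordered):
        prev_tok = ordered[idx - 1][1]  # idx 0 wraps around to the last token
        owns[node] = ((tok - prev_tok) % RING_SIZE) / RING_SIZE
    return owns

if __name__ == "__main__":
    for node, frac in sorted(ownership(TOKENS).items()):
        print(f"{node}: {frac * 100:.2f}%")
```

Under those assumptions each East Coast node comes out at roughly one sixth of the ring and each Singapore node at essentially zero, because the Singapore tokens sit only about 1000 past an East Coast token. That is consistent with the Owns column in Ashish's original post and says nothing about where replicas actually land.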

Regarding the alternating-racks solution, I have the following questions:

1) By alternating racks, do you mean alternating racks among the nodes
within a single DC, or across both DCs? AWS East Coast has 4 AZs
and Singapore has 2 AZs. So would the final layout look something like this:
ip11 - East Coast - m1.xlarge / us-east-1b        - Token: 0
ip21 - Singapore  - m1.xlarge / ap-southeast-1a   - Token: 1001
ip12 - East Coast - m1.xlarge / us-east-*1c*      - Token: 28356863910078205288614550619314017621
ip13 - East Coast - m1.xlarge / us-east-*1d*      - Token: 56713727820156410577229101238628035241
ip22 - Singapore  - m1.xlarge / ap-southeast-1b   - Token: 56713727820156410577229101238628036241
ip14 - East Coast - m1.xlarge / us-east-*1a*      - Token: 85070591730234615865843651857942052863
ip15 - East Coast - m1.xlarge / us-east-*1b*      - Token: 113427455640312821154458202477256070484
ip23 - Singapore  - m1.xlarge / ap-southeast-*1a* - Token: 113427455640312821154458202477256071484
ip16 - East Coast - m1.xlarge / us-east-*1c*      - Token: 141784319550391026443072753096570088105

Is this what you had suggested?
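
To make question 1 concrete, here is a rough Python sketch of rack-aware replica placement for the East Coast DC only. It is a simplification under stated assumptions: for each range it takes the owning node, then walks the ring clockwise and prefers nodes in racks not yet used, falling back to same-rack nodes only if it runs out of racks. The `replica_counts` helper is hypothetical and is not Cassandra's actual implementation.

```python
from collections import Counter

def replica_counts(ring, rf):
    """ring: list of (node, rack) in token order for ONE data center.

    Returns {node: number of ranges the node holds a replica of}.
    """
    counts = Counter()
    n = len(ring)
    for i in range(n):
        replicas = [ring[i]]            # the range's owner is always a replica
        seen_racks = {ring[i][1]}
        skipped = []                    # same-rack nodes passed over on the walk
        j = (i + 1) % n
        while len(replicas) < rf and j != i:
            node = ring[j]
            if node[1] not in seen_racks:
                replicas.append(node)
                seen_racks.add(node[1])
            else:
                skipped.append(node)
            j = (j + 1) % n
        # Not enough distinct racks: fill up from the skipped nodes, in ring order.
        replicas.extend(skipped[: rf - len(replicas)])
        for name, _rack in replicas:
            counts[name] += 1
    return dict(counts)

# Racks as they are today (East Coast, token order).
CURRENT = [("ip11", "1b"), ("ip12", "1b"), ("ip13", "1c"),
           ("ip14", "1c"), ("ip15", "1d"), ("ip16", "1d")]
# Racks in the alternated layout proposed above.
PROPOSED = [("ip11", "1b"), ("ip12", "1c"), ("ip13", "1d"),
            ("ip14", "1a"), ("ip15", "1b"), ("ip16", "1c")]

if __name__ == "__main__":
    print("current :", replica_counts(CURRENT, rf=2))
    print("proposed:", replica_counts(PROPOSED, rf=2))
```

With the current racks and RF 2, this sketch gives 3 ranges each to ip11, ip13 and ip15 and 1 range each to ip12, ip14 and ip16, which lines up with the data sizes in the ring listing. With the alternated racks every node gets 2 ranges.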

2) How does dynamic_snitch_badness_threshold: 0.1 affect the CPU load? On
the node (ip11) that had high CPU (system load > 30), I checked the
score attribute (via the JMX
bean org.apache.cassandra.db:type=DynamicEndpointSnitch) and saw the
following:
EastCoast:
    *ip11 = 1.6813321647677475*
    ip12 = 1.0003505696757231
    ip13 = 1.1324160525509974
    ip14 = 1.000350569675723
    ip15 = 1.0007011393514456
    ip16 = 1.0005258545135842
Singapore:
    ip21 = 1.095880806310253
    ip22 = 1.4100000000000001
    ip23 = 1.0953549517966696

So ip11 does indeed have a higher score, but I'm not sure why traffic is
still going to that replica rather than to some other node.
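
For reference, this is how the threshold could be checked against the scores above. It is only a sketch: it assumes the snitch compares the relative difference between the preferred replica's score and another replica's score against the threshold, which is an assumption about the implementation rather than something confirmed in this thread. The `exceeds_badness` helper is hypothetical.

```python
# East Coast scores copied from the JMX output above.
SCORES = {
    "ip11": 1.6813321647677475,
    "ip12": 1.0003505696757231,
    "ip13": 1.1324160525509974,
    "ip14": 1.000350569675723,
    "ip15": 1.0007011393514456,
    "ip16": 1.0005258545135842,
}

def exceeds_badness(preferred_score, other_score, threshold=0.1):
    """Assumed rule: the preferred replica is 'bad enough' when it is worse
    than another replica by more than `threshold`, relative to its own score."""
    return (preferred_score - other_score) / preferred_score > threshold

if __name__ == "__main__":
    for node, score in SCORES.items():
        if node != "ip11":
            print(node, exceeds_badness(SCORES["ip11"], score))
```

If that assumed rule is right, ip11 is more than 10% worse than every other East Coast node, so the threshold alone would not explain why it keeps getting traffic.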

Thanks!



On Fri, Nov 1, 2013 at 3:13 PM, Ashish Tyagi <tyagi.iitr@gmail.com> wrote:

> Hi Evan,
>
> The clients connect to all nodes. We tried shutting the thrift server on
> the affected node. Loads did not come down.
>
>
>
> On Fri, Nov 1, 2013 at 12:59 AM, Evan Weaver <evan@fauna.org> wrote:
>
>> Are all your clients only connecting to your first node? I would
>> probably strace it and compare the trace to one from a lightly loaded
>> node.
>>
>> On Thu, Oct 31, 2013 at 7:12 PM, Ashish Tyagi <tyagi.iitr@gmail.com> wrote:
>> > We have a 9 node cluster. 6 nodes are in one data-center and 3 nodes in the
>> > other. All machines are Amazon M1.XLarge configuration.
>> >
>> > Datacenter: DC1
>> > ==========
>> > Address  Rack  Status  State   Load      Owns    Token
>> >
>> > ip11     1b    Up      Normal  76.46 GB  16.67%  0
>> > ip12     1b    Up      Normal  44.66 GB  16.67%  28356863910078205288614550619314017621
>> > ip13     1c    Up      Normal  85.94 GB  16.67%  56713727820156410577229101238628035241
>> > ip14     1c    Up      Normal  17.55 GB  16.67%  85070591730234615865843651857942052863
>> > ip15     1d    Up      Normal  80.74 GB  16.67%  113427455640312821154458202477256070484
>> > ip16     1d    Up      Normal  20.88 GB  16.67%  141784319550391026443072753096570088105
>> >
>> > Datacenter: DC2
>> > ==========
>> > Address  Rack  Status  State   Load      Owns    Token
>> >
>> > ip21     1a    Up      Normal  78.32 GB  0.00%   1001
>> > ip22     1b    Up      Normal  71.23 GB  0.00%   56713727820156410577229101238628036241
>> > ip23     1b    Up      Normal  53.49 GB  0.00%   113427455640312821154458202477256071484
>> >
>> > The problem is that the node with IP address ip11 often has 5-10 times more
>> > load than any other node. Most of the operations are on counters. The primary
>> > column family (which receives most writes) has a replication factor of 2 in
>> > DataCenter DC1 and also in DataCenter DC2. The traffic is write heavy (reads
>> > are less than 10% of total requests). We are using size-tiered compaction.
>> > Both writes and reads happen with a consistency level of LOCAL_QUORUM.
>> >
>> > More information:
>> >
>> > 1. cassandra.yaml - http://pastebin.com/u344fA6z
>> > 2. Jmap heap when node under high loads - http://pastebin.com/ib3D0Pa
>> > 3. Nodetool tpstats - http://pastebin.com/s0AS7bGd
>> > 4. Cassandra-env.sh - http://pastebin.com/ubp4cGUx
>> > 5. GC log lines -  http://pastebin.com/Y0TKphsm
>> >
>> > Am I doing anything wrong? Any pointers will be appreciated.
>> >
>> > Thanks in advance,
>> > Ashish
>>
>
>
