From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Peculiar imbalance affecting 2 machines in a 6 node cluster
Date: Wed, 10 Aug 2011 21:12:41 +1200
In-Reply-To: <8620A665-E834-43A2-864F-713C6B2055A0@bloomdigital.com>
Message-Id: <234ABF6E-FDC1-4001-891B-77BD3FF34B22@thelastpickle.com>

WRT the load imbalance, checking the basics: you've run cleanup after any token moves? Repair is running? Also, sometimes nodes get a bit bloated from repair and will settle down with compaction.

Your slightly odd tokens in the MTL DC are making it a little tricky to understand what's going on. But I want to check that you've followed the multi-DC token selection described here: http://wiki.apache.org/cassandra/Operations#Token_selection . Background on what can happen in a multi-DC deployment if the tokens are not right: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html

This is what you currently have…

DC: LA
IPLA1   Up   Normal   34.57 GB   11.11%   0
IPLA2   Up   Normal   17.55 GB   11.11%   56713727820156410577229101238628035242
IPLA3   Up   Normal   51.37 GB   11.11%   113427455640312821154458202477256070485

DC: MTL
IPMTL1  Up   Normal   34.43 GB   22.22%   37809151880104273718152734159085356828
IPMTL2  Up   Normal   34.56 GB   22.22%   94522879700260684295381835397713392071
IPMTL3  Up   Normal   34.71 GB   22.22%   151236607520417094872610936636341427313

Using the bump approach you would have:

IPLA1   0
IPLA2   56713727820156410577229101238628035242
IPLA3   113427455640312821154458202477256070484

IPMTL1  1
IPMTL2  56713727820156410577229101238628035243
IPMTL3  113427455640312821154458202477256070485

Using the interleaving approach you would have:

IPLA1   0
IPMTL1  28356863910078205288614550619314017621
IPLA2   56713727820156410577229101238628035242
IPMTL2  85070591730234615865843651857942052863
IPLA3   113427455640312821154458202477256070484
IPMTL3  141784319550391026443072753096570088105
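If it helps, here is a rough Python sketch of where those numbers come from. It assumes the RandomPartitioner's 0 to 2**127 token range and simply divides the ring evenly; the values it prints land within a token or so of the figures above, depending on how you round.

# Rough sketch only: evenly spaced RandomPartitioner tokens for two DCs of
# three nodes each. The node-to-token assignment mirrors the lists above.
RING = 2 ** 127  # RandomPartitioner token range

def evenly_spaced(n, offset=0):
    """Tokens for n evenly spaced nodes, shifted by a small per-DC offset."""
    return [(i * RING // n + offset) % RING for i in range(n)]

# "Bump" approach: the second DC reuses the first DC's tokens, bumped by 1.
la_bump  = evenly_spaced(3, offset=0)   # IPLA1..3
mtl_bump = evenly_spaced(3, offset=1)   # IPMTL1..3

# "Interleaving" approach: six evenly spaced tokens, alternating DCs.
interleaved = evenly_spaced(6)
la_ilv  = interleaved[0::2]             # IPLA1..3
mtl_ilv = interleaved[1::2]             # IPMTL1..3

print("bump LA :", la_bump)
print("bump MTL:", mtl_bump)
print("ilv  LA :", la_ilv)
print("ilv  MTL:", mtl_ilv)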
The current setup in LA gives each node in LA 33% of the LA-local ring, which should be right; just checking.

If cleanup / repair / compaction is all good and you are confident the tokens are right, try poking around with nodetool getendpoints to see which nodes keys are sent to. Like you, I cannot see anything obvious in NTS that would cause load to be imbalanced if they are all in the same rack.
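For what it's worth, below is a very rough toy model of the per-DC placement, not the real calculateNaturalEndpoints (it ignores the rack-preference pass entirely, and the node names are just the placeholders from your ring listing). Walking the ring this way with your current tokens hands each LA node the same share of keys, which is why the 2:1:3 split is so surprising.

import bisect
import random

# Toy model only: for each DC, take the first rf nodes clockwise from the
# key's token. This ignores NTS's rack-preference logic, so it is NOT the
# real calculateNaturalEndpoints; it just sanity-checks the expected shares.
RING = 2 ** 127

nodes = sorted([  # (token, name, dc) -- tokens from the current ring
    (0, "IPLA1", "DCLA"),
    (37809151880104273718152734159085356828, "IPMTL1", "DCMTL"),
    (56713727820156410577229101238628035242, "IPLA2", "DCLA"),
    (94522879700260684295381835397713392071, "IPMTL2", "DCMTL"),
    (113427455640312821154458202477256070485, "IPLA3", "DCLA"),
    (151236607520417094872610936636341427313, "IPMTL3", "DCMTL"),
])
tokens = [t for t, _, _ in nodes]

def replicas(key_token, rf_per_dc):
    """First rf nodes of each DC, walking clockwise from key_token."""
    start = bisect.bisect_right(tokens, key_token) % len(nodes)
    chosen, need = [], dict(rf_per_dc)
    for i in range(len(nodes)):
        _, name, dc = nodes[(start + i) % len(nodes)]
        if need.get(dc, 0) > 0:
            chosen.append(name)
            need[dc] -= 1
    return chosen

counts = {name: 0 for _, name, _ in nodes}
random.seed(42)
for _ in range(100000):
    for name in replicas(random.randrange(RING), {"DCLA": 2, "DCMTL": 2}):
        counts[name] += 1

for name in sorted(counts):
    print(name, counts[name])   # roughly equal for the three LA nodes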
Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Aug 2011, at 11:24, Mina Naguib wrote:

> Hi everyone
>
> I'm observing a very peculiar type of imbalance and I'd appreciate any help or ideas to try. This is on Cassandra 0.7.8.
>
> The original cluster was 3 machines in DCMTL, equally balanced at 33.33% each and each holding roughly 34G.
>
> Then I added 3 machines in the LA data center. The ring is currently as follows (IP addresses redacted for clarity):
>
> Address  Status  State   Load      Owns    Token
>                                            151236607520417094872610936636341427313
> IPLA1    Up      Normal  34.57 GB  11.11%  0
> IPMTL1   Up      Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
> IPLA2    Up      Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
> IPMTL2   Up      Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
> IPLA3    Up      Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
> IPMTL3   Up      Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
>
> The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines in yet another data center, but they're not ready to join the cluster yet. Once that third DC joins, all nodes will be at 11.11%. However, I don't think this is related.
>
> The problem I'm currently observing is visible in the LA machines, specifically IPLA2 and IPLA3. IPLA2 has 50% of the expected volume, and IPLA3 has 150% of the expected volume.
>
> Putting their load side by side shows the peculiar ratio of 2:1:3 between the 3 LA nodes:
> 34.57 17.55 51.37
> (the same 2:1:3 ratio is reflected in our internal tools trending reads/second and writes/second)
>
> I've tried several iterations of compactions/cleanups to no avail. In terms of config, this is the main keyspace:
>   Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>   Options: [DCMTL:2, DCLA:2]
> And this is the cassandra-topology.properties file (IPs again redacted for clarity):
>   IPMTL1:DCMTL:RAC1
>   IPMTL2:DCMTL:RAC1
>   IPMTL3:DCMTL:RAC1
>   IPLA1:DCLA:RAC1
>   IPLA2:DCLA:RAC1
>   IPLA3:DCLA::RAC1
>   IPLON1:DCLON:RAC1
>   IPLON2:DCLON:RAC1
>   IPLON3:DCLON:RAC1
>   # default for unknown nodes
>   default=DCBAD:RACBAD
>
> One thing that did occur to me while reading the source code for NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers placing data on different racks. Since all my machines are defined as being in the same rack, I believe that the 2-pass approach would still yield balanced placement.
>
> However, just to test, I modified the topology file live to specify that IPLA1, IPLA2 and IPLA3 are in 3 different racks, and sure enough I saw immediately that the reads/second and writes/second equalized to the expected fair volume (I quickly reverted that change).
>
> So it seems somehow related to rack awareness, but I've been racking my head and I can't figure out how or why, or why the three MTL machines are not affected the same way.
>
> If the solution is to specify them in different racks and run repair on everything, I'm okay with that, but I hate doing that without first understanding *why* the current behavior is the way it is.
>
> Any ideas would be hugely appreciated.
>
> Thank you.
