cassandra-user mailing list archives

From Anubhav Kale <Anubhav.K...@microsoft.com>
Subject RE: Rack aware question.
Date Wed, 23 Mar 2016 21:03:50 GMT
Thanks.

To test what happens when rack of a node changes in a running cluster without doing a decommission,
I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP address hack):

IP         Rack
127.0.0.1  R1
127.0.0.2  R1
127.0.0.3  R2

A table was created and a row inserted as follows:

cqlsh 127.0.0.1
> create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
> create table racktest.racktable(id int, PRIMARY KEY(id));
> insert into racktest.racktable(id) values(1);

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

nodetool ring > ring_1.txt (attached)

So far so good.

Then I changed the racks to below and restarted DSE with -Dcassandra.ignore_rack=true.
From what I can tell, this option simply skips the startup check that compares the rack in system.local
with the one in cassandra-rackdc.properties.
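To make sure we're talking about the same check, here is a toy model of what I believe the startup comparison does (the function name and message are illustrative, not the actual Cassandra code):

```python
# Toy model of the startup rack check that -Dcassandra.ignore_rack=true
# bypasses. Names and message text are illustrative only.
def check_rack(stored_rack, configured_rack, ignore_rack=False):
    """stored_rack: from system.local; configured_rack: from cassandra-rackdc.properties."""
    if stored_rack != configured_rack and not ignore_rack:
        raise RuntimeError(
            f"Cannot start node: snitch rack ({configured_rack}) differs "
            f"from previously saved rack ({stored_rack}); "
            "set -Dcassandra.ignore_rack=true to override.")
    return configured_rack
```

With ignore_rack=True the mismatch is simply accepted and the node comes up in its new rack; nothing else about replica placement is changed by the flag itself.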

IP         Rack
127.0.0.1  R1
127.0.0.2  R2
127.0.0.3  R1

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

So far so good, cqlsh returns the queries fine.

nodetool ring > ring_2.txt (attached)

Now comes the interesting part.

I changed the racks to below and restarted DSE.

IP         Rack
127.0.0.1  R2
127.0.0.2  R1
127.0.0.3  R1

nodetool getendpoints racktest racktable 1

127.0.0.1
127.0.0.3

This is very interesting: cqlsh returns the queries fine, and with tracing on it's clear that
127.0.0.1 is being asked for data as well.

nodetool ring > ring_3.txt (attached)

There is no change in token information across the ring_* files. The token in question for id=1
(from select token(id) from racktest.racktable) is -4069959284402364209.

So, a few questions, because things don't add up:


  1.  How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn't
contain it? Does "nodetool ring" show all token ranges for a node or just the primary
range? I am thinking it's only the primary. Can someone confirm?
  2.  How come queries contact 127.0.0.1?
  3.  Is "getendpoints" acting odd here, and is the data really on 127.0.0.2? To prove
/ disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back
just fine, meaning 127.0.0.1 indeed holds the data (the SSTables also show it).
  4.  So, does this mean that the data actually gets moved around when racks change?
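For what it's worth, the endpoint changes above look consistent with my understanding of how NetworkTopologyStrategy picks replicas: it walks the ring clockwise from the key's token and prefers nodes in racks it hasn't used yet, so relabeling racks changes the chosen endpoints even though no token moves. A toy single-DC simulation (the tokens below are made up, since the real ones are in the attached ring_* files; vnodes and multi-DC handling are ignored):

```python
# Toy sketch of single-DC NetworkTopologyStrategy replica selection.
# Tokens are invented; the clockwise walk order is what matters here.
from bisect import bisect_right

def replicas(key_token, ring, racks, rf):
    """ring: list of (token, node) sorted by token; racks: node -> rack."""
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, key_token) % len(ring)
    walk = [n for _, n in ring[start:] + ring[:start]]
    chosen, seen, skipped = [], set(), []
    for node in walk:                      # prefer racks not used yet
        if len(chosen) == rf:
            break
        if racks[node] in seen:
            skipped.append(node)
        else:
            chosen.append(node)
            seen.add(racks[node])
    chosen += skipped[:rf - len(chosen)]   # fall back to repeated racks
    return chosen

ring = [(-3000, "127.0.0.3"), (1000, "127.0.0.2"), (5000, "127.0.0.1")]
key = -4000                                # stands in for token(id=1)

before = {"127.0.0.1": "R1", "127.0.0.2": "R1", "127.0.0.3": "R2"}
after  = {"127.0.0.1": "R2", "127.0.0.2": "R1", "127.0.0.3": "R1"}
print(sorted(replicas(key, ring, before, rf=2)))  # ['127.0.0.2', '127.0.0.3']
print(sorted(replicas(key, ring, after,  rf=2)))  # ['127.0.0.1', '127.0.0.3']
```

With the rack labels swapped, the same token now resolves to a different pair of endpoints, which would explain both getendpoints outputs without any token movement, though it doesn't by itself explain how the data got onto 127.0.0.1.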

Thanks !


From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Wednesday, March 23, 2016 11:59 AM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <Anubhav.Kale@microsoft.com<mailto:Anubhav.Kale@microsoft.com>>
wrote:
Suppose we change the racks on VMs on a running cluster. (We need to do this while running
on Azure, because sometimes when the VM gets moved its rack changes).

In this situation, new writes will be laid out based on new rack info on appropriate replicas.
What happens for existing data ? Is that data moved around as well and does it happen if we
run repair or on its own ?

First, you should understand this ticket if relying on rack awareness :

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move between racks.

https://issues.apache.org/jira/browse/CASSANDRA-10242

That ticket has some detailed explanations of what blows up if they do.

Note that if you want to preserve any of the data on the node, you need to:

1) bring it up and have it join the ring in its new rack (during which time it will serve incorrect
reads due to missing data)
2) stop it
3) run cleanup
4) run repair
5) start it again

Can't really say that I recommend this practice, but it's better than "rebootstrap it", which
is the official advice. If you "rebootstrap it", you decrease the unique replica count by 1, which
carries a nonzero chance of data loss. The Coli Conjecture says that in practice you probably
don't care about this nonzero chance of data loss if you are running your application at CL.ONE,
which should be all cases where it matters.

=Rob
