cassandra-user mailing list archives

From Jonathan Haddad <...@jonhaddad.com>
Subject Re: Rack aware question.
Date Thu, 24 Mar 2016 01:19:23 GMT
Agreed with Jack.

I don't think there's ever a reason to use CL=ALL in an application in
production.  I would only use it if I was debugging certain types of
consistency problems.
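
Jack's availability point, quoted below, can be made concrete with a small sketch. This is a simplified model, not driver or server code: the function names are hypothetical, and only the ONE, QUORUM and ALL levels are modelled. It counts how many replicas must answer for a request to succeed and checks whether the request survives a number of replica failures.

```python
def required_replicas(consistency_level, replication_factor):
    """Number of replicas that must respond (simplified model)."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def can_serve(consistency_level, replication_factor, replicas_down):
    """True if enough replicas are still alive to satisfy the request."""
    alive = replication_factor - replicas_down
    return alive >= required_replicas(consistency_level, replication_factor)


if __name__ == "__main__":
    # RF=2 as in the racktest keyspace from this thread, one replica down.
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, can_serve(level, 2, 1))
```

With a replication factor of 2, as in the test keyspace in this thread, losing a single replica is enough to fail every request at ALL, while ONE keeps working.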

On Wed, Mar 23, 2016 at 4:56 PM Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> CL=ALL also means that you won't have HA (High Availability) - if even a
> single node goes down, you're out of business. I mean, HA is the
> fundamental reason for using the rack-aware policy - to assure that each
> replica is on a separate power supply and network connection so that data
> can be retrieved even when a rack-level failure occurs.
>
> In short, if CL=ALL is acceptable, then you might as well dump the
> rack-aware approach, which was how you got into this situation in the first
> place.
>
> -- Jack Krupansky
>
> On Wed, Mar 23, 2016 at 7:31 PM, Anubhav Kale <Anubhav.Kale@microsoft.com>
> wrote:
>
>> I ran into the following detail at:
>> https://wiki.apache.org/cassandra/ReadRepair
>>
>>
>>
>> “If a lower ConsistencyLevel than ALL was specified, this is done in the
>> background after returning the data from the closest replica to the client;
>> otherwise, it is done before returning the data.”
>>
>>
>>
>> I set consistency to ALL, and now I can get data all the time.
>>
>>
>>
>> *From:* Anubhav Kale [mailto:Anubhav.Kale@microsoft.com]
>> *Sent:* Wednesday, March 23, 2016 4:14 PM
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* RE: Rack aware question.
>>
>>
>>
>> Thanks. Read repair is what I thought must be causing this, so I
>> experimented some more by setting read_repair_chance and
>> dclocal_read_repair_chance on the table to 0, and then 1.
>>
>>
>>
>> Unfortunately, the results were somewhat random depending on which node I
>> ran the queries from. For example, when chance = 1, running the query from
>> 127.0.0.3 would sometimes return 0 results and sometimes 1. I do see
>> digest-mismatch-kicking-off-read-repair in traces in both cases, so I am
>> running out of ideas here. If you or someone else can shed light on why
>> this could be happening, that would be great!
>>
>>
>>
>> That said, is it expected that “read repair” or a regular “nodetool
>> repair” will shift the data around based on the new replica placement? And
>> if so, is the recommendation to “rebootstrap” mainly to avoid this massive
>> data movement?
>>
>>
>>
>> The rationale behind the ignore_rack flag makes sense, thanks. Maybe we
>> should document it better?
>>
>>
>>
>> Thanks !
>>
>>
>>
>> *From:* Paulo Motta [mailto:pauloricardomg@gmail.com]
>> *Sent:* Wednesday, March 23, 2016 3:40 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Rack aware question.
>>
>>
>>
>> > How come 127.0.0.1 is shown as an endpoint holding the ID when its
>> token range doesn’t contain it? Does “nodetool ring” show all
>> token ranges for a node or just the primary range? I am thinking it’s only
>> the primary. Can someone confirm?
>>
>> The primary replica of id=1 is always 127.0.0.3. What changes when you
>> change racks is that the secondary replica will move to the next replica
>> from a different rack, either 127.0.0.1 or 127.0.0.2.
>>
>> > How come queries contact 127.0.0.1 ?
>>
>> In the last case, 127.0.0.1 is the next node after the primary replica
>> that is in a different rack (R2), so it should be contacted.
>>
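
The placement rule Paulo describes can be sketched as a small simulation. This is a simplified model of rack-aware placement, not Cassandra's NetworkTopologyStrategy code. The walk order 127.0.0.3, 127.0.0.2, 127.0.0.1 is an assumption chosen to match the getendpoints output reported in this thread (the ring files are not reproduced here), and the function name is hypothetical.

```python
def pick_replicas(walk_order, rack_of, rf):
    """Walk the ring from the primary replica, preferring unseen racks."""
    chosen, seen_racks, skipped = [], set(), []
    for node in walk_order:
        if len(chosen) == rf:
            break
        if rack_of[node] not in seen_racks:
            chosen.append(node)
            seen_racks.add(rack_of[node])
        else:
            skipped.append(node)
    # Fewer racks than rf: fall back to nodes skipped for sharing a rack.
    for node in skipped:
        if len(chosen) == rf:
            break
        chosen.append(node)
    return chosen


# Assumed ring walk for the token of id=1, starting at the primary replica.
WALK = ["127.0.0.3", "127.0.0.2", "127.0.0.1"]

# The three rack layouts tried in the experiment quoted further down.
LAYOUTS = [
    {"127.0.0.1": "R1", "127.0.0.2": "R1", "127.0.0.3": "R2"},
    {"127.0.0.1": "R1", "127.0.0.2": "R2", "127.0.0.3": "R1"},
    {"127.0.0.1": "R2", "127.0.0.2": "R1", "127.0.0.3": "R1"},
]

if __name__ == "__main__":
    for layout in LAYOUTS:
        print(sorted(pick_replicas(WALK, layout, rf=2)))
```

Under these assumptions the first two layouts select 127.0.0.2 and 127.0.0.3, and the third selects 127.0.0.1 and 127.0.0.3, which matches the getendpoints results reported in the experiment. The primary replica never changes; only the choice of the second replica depends on the racks.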
>> > Is “getendpoints” acting odd here and the data really is on 127.0.0.2?
>> To prove / disprove that, I stopped 127.0.0.2 and ran a query with
>> CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1 indeed holds
>> the data (SSTables also show it). So, does this mean that the data
>> actually gets moved around when racks change?
>>
>> Probably during some of your queries 127.0.0.3 (the primary replica)
>> replicated data to 127.0.0.1 through read repair. Data is not moved
>> automatically when a rack changes (at least in OSS C*; not sure if DSE has
>> this ability).
>>
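
How a read can leave a copy of the row on a replica that never received the original write can be shown with a toy model of read repair. It is only a sketch: replicas are plain dictionaries, the newest version wins by timestamp, and the function name is hypothetical. Real read repair compares digests first and has more moving parts.

```python
def read_with_repair(replicas, key):
    """Read `key` from every contacted replica and repair stale copies.

    `replicas` maps a node name to a dict of key -> (timestamp, value).
    """
    versions = {name: store.get(key) for name, store in replicas.items()}
    found = [v for v in versions.values() if v is not None]
    if not found:
        return None
    newest = max(found, key=lambda version: version[0])
    for name, version in versions.items():
        if version != newest:
            # Mismatch: write the newest version back to this replica.
            replicas[name][key] = newest
    return newest[1]


if __name__ == "__main__":
    # 127.0.0.3 holds the row; 127.0.0.1 only became a replica after the
    # rack change and starts out empty.
    replicas = {
        "127.0.0.1": {},
        "127.0.0.3": {1: (100, "row-1")},
    }
    print(read_with_repair(replicas, 1))
    print(replicas["127.0.0.1"])
```

After the read, 127.0.0.1 holds the row even though nothing was streamed to it, which is consistent with the SSTables found on that node in the experiment.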
>> > If we don’t want to support this ever, I’d think the ignore_rack flag
>> should just be deprecated.
>>
>> The ignore_rack flag can be useful if you move your data manually, with
>> rsync or sstableloader.
>>
>>
>>
>> 2016-03-23 19:09 GMT-03:00 Anubhav Kale <Anubhav.Kale@microsoft.com>:
>>
>> Thanks for the pointer – appreciate it.
>>
>>
>>
>> My test is on the latest trunk and slightly different.
>>
>>
>>
>> I am not exactly sure if the behavior I see is expected (in which case,
>> is the recommendation to re-bootstrap just to avoid data movement?) or is
>> the behavior not expected and is a bug.
>>
>>
>>
>> If we don’t want to support this ever, I’d think the ignore_rack flag
>> should just be deprecated.
>>
>>
>>
>> *From:* Robert Coli [mailto:rcoli@eventbrite.com]
>> *Sent:* Wednesday, March 23, 2016 2:54 PM
>>
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Rack aware question.
>>
>>
>>
>> Actually, I believe you are seeing the behavior described in the ticket I
>> meant to link to, with the detailed exploration :
>>
>>
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-10238
>>
>>
>>
>> =Rob
>>
>>
>>
>>
>>
>> On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale <Anubhav.Kale@microsoft.com>
>> wrote:
>>
>> Oh, and the query I ran was “select * from racktest.racktable where id=1”
>>
>>
>>
>> *From:* Anubhav Kale [mailto:Anubhav.Kale@microsoft.com]
>> *Sent:* Wednesday, March 23, 2016 2:04 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* RE: Rack aware question.
>>
>>
>>
>> Thanks.
>>
>>
>>
>> To test what happens when the rack of a node changes in a running cluster
>> without doing a decommission, I did the following.
>>
>>
>>
>> The cluster looks like below (this was run through Eclipse, therefore the
>> IP address hack)
>>
>>
>>
>> *IP*        *Rack*
>> 127.0.0.1   R1
>> 127.0.0.2   R1
>> 127.0.0.3   R2
>>
>>
>>
>> A table was created and a row inserted as follows:
>>
>>
>>
>> cqlsh 127.0.0.1
>>
>> >create keyspace racktest with replication = { 'class' :
>> 'NetworkTopologyStrategy', 'datacenter1' : 2 };
>>
>> >create table racktest.racktable(id int, PRIMARY KEY(id));
>>
>> >insert into racktest.racktable(id) values(1);
>>
>>
>>
>> nodetool getendpoints racktest racktable 1
>>
>>
>>
>> 127.0.0.2
>>
>> 127.0.0.3
>>
>>
>>
>> nodetool ring > ring_1.txt (attached)
>>
>>
>>
>> So far so good.
>>
>>
>>
>> Then I changed the racks to below and restarted DSE with
>> -Dcassandra.ignore_rack=true.
>>
>> As far as I can tell, this option simply skips the startup check that
>> compares the rack in system.local with the one in
>> cassandra-rackdc.properties.
>>
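
The startup check described above can be pictured roughly as follows. This is a hedged sketch of the behaviour as reported in this message, not Cassandra's actual code; the function, exception and parameter names are all hypothetical.

```python
class RackMismatchError(Exception):
    """Raised when the configured rack differs from the stored one."""


def check_rack_on_startup(stored_rack, configured_rack, ignore_rack=False):
    """Return the rack the node will use, or refuse to start.

    stored_rack:     rack recorded from a previous run (None on first start)
    configured_rack: rack read from the snitch properties file
    ignore_rack:     models the -Dcassandra.ignore_rack=true switch
    """
    changed = stored_rack is not None and stored_rack != configured_rack
    if changed and not ignore_rack:
        raise RackMismatchError(
            f"rack changed from {stored_rack} to {configured_rack}"
        )
    return configured_rack


if __name__ == "__main__":
    print(check_rack_on_startup("R1", "R2", ignore_rack=True))
```

In this model the flag only suppresses the refusal to start. It does nothing about data that was placed according to the old rack layout.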
>>
>>
>> *IP*        *Rack*
>> 127.0.0.1   R1
>> 127.0.0.2   R2
>> 127.0.0.3   R1
>>
>>
>>
>> nodetool getendpoints racktest racktable 1
>>
>>
>>
>> 127.0.0.2
>>
>> 127.0.0.3
>>
>>
>>
>> So far so good, cqlsh returns the queries fine.
>>
>>
>>
>> nodetool ring > ring_2.txt (attached)
>>
>>
>>
>> Now comes the interesting part.
>>
>>
>>
>> I changed the racks to below and restarted DSE.
>>
>>
>>
>> *IP*        *Rack*
>> 127.0.0.1   R2
>> 127.0.0.2   R1
>> 127.0.0.3   R1
>>
>>
>>
>> nodetool getendpoints racktest racktable 1
>>
>>
>>
>> 127.0.0.*1*
>>
>> 127.0.0.3
>>
>>
>>
>> This is *very* interesting: cqlsh returns the queries fine. With tracing
>> on, it’s clear that 127.0.0.1 is being asked for data as well.
>>
>>
>>
>> nodetool ring > ring_3.txt (attached)
>>
>>
>>
>> There is no change in token information in the ring_* files. The token in
>> question for id=1 (from select token(id) from racktest.racktable) is
>> -4069959284402364209.
>>
>>
>>
>> So, a few questions, because things don’t add up:
>>
>>
>>
>>    1. How come 127.0.0.1 is shown as an endpoint holding the ID when its
>>    token range doesn’t contain it? Does “nodetool ring” show all
>>    token ranges for a node or just the primary range? I am thinking it’s
>>    only the primary. Can someone confirm?
>>    2. How come queries contact 127.0.0.1?
>>    3. Is “getendpoints” acting odd here and the data really is on
>>    127.0.0.2? To prove / disprove that, I stopped 127.0.0.2 and ran a query
>>    with CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1
>>    indeed holds the data (SSTables also show it).
>>    4. So, does this mean that the data actually gets moved around when
>>    racks change?
>>
>>
>>
>> Thanks !
>>
>>
>>
>>
>>
>> *From:* Robert Coli [mailto:rcoli@eventbrite.com]
>> *Sent:* Wednesday, March 23, 2016 11:59 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Rack aware question.
>>
>>
>>
>> On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <Anubhav.Kale@microsoft.com>
>> wrote:
>>
>> Suppose we change the racks on VMs on a running cluster. (We need to do
>> this while running on Azure, because sometimes when the VM gets moved its
>> rack changes).
>>
>>
>>
>> In this situation, new writes will be laid out based on new rack info on
>> appropriate replicas. What happens for existing data ? Is that data moved
>> around as well and does it happen if we run repair or on its own ?
>>
>>
>>
>> First, you should understand this ticket if relying on rack awareness :
>>
>>
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-3810
>>
>>
>>
>> Second, in general nodes cannot move between racks.
>>
>>
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-10242
>>
>>
>>
>> Has some detailed explanations of what blows up if they do.
>>
>>
>>
>> Note that if you want to preserve any of the data on the node, you need
>> to :
>>
>>
>>
>> 1) bring it up and have it join the ring in its new rack (during which
>> time it will serve incorrect reads due to missing data)
>>
>> 2) stop it
>>
>> 3) run cleanup
>>
>> 4) run repair
>>
>> 5) start it again
>>
>>
>>
>> Can't really say that I recommend this practice, but it's better than
>> "rebootstrap it" which is the official advice. If you "rebootstrap it" you
>> decrease unique replica count by 1, which has a nonzero chance of
>> data-loss. The Coli Conjecture says that in practice you probably don't
>> care about this nonzero chance of data loss if you are running your
>> application in CL.ONE, which should be all cases where it matters.
>>
>>
>>
>> =Rob
>>
>>
>>
>>
>>
>>
>>
>
>
