cassandra-user mailing list archives

From Yuji Ito <y...@imagine-orb.com>
Subject Re: failure node rejoin
Date Sun, 23 Oct 2016 23:29:06 GMT
Hi Ben,

The test without killing nodes has been working well without data loss.
I've repeated my test about 200 times after removing data and running
rebuild/repair.
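
For reference, each iteration recovered the node roughly like this (a
sketch, not my exact script; paths and service commands are assumptions
for a typical package install):

    # on the failed node: wipe the data and rejoin as a "new" node
    sudo service cassandra stop
    sudo rm -rf /var/lib/cassandra/*
    # append the replace flag (applies to the first boot only)
    echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<node2 IP>"' \
        | sudo tee -a /etc/cassandra/cassandra-env.sh
    sudo service cassandra start
    # once the node is back (UN in nodetool status):
    nodetool rebuild
    nodetool repair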

Regards,


On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yuji@imagine-orb.com> wrote:

> > Just to confirm, are you saying:
> > a) after operation 2, you select all and get 1000 rows
> > b) after operation 3 (which only does updates and reads) you select and
> only get 953 rows?
>
> That's right!
>
> I've started the test without killing nodes.
> I'll report the result to you next Monday.
>
> Thanks
>
>
> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <ben.slater@instaclustr.com>
> wrote:
>
>> Just to confirm, are you saying:
>> a) after operation 2, you select all and get 1000 rows
>> b) after operation 3 (which only does updates and reads) you select and
>> only get 953 rows?
>>
>> If so, that would be very unexpected. If you run your tests without
>> killing nodes do you get the expected (1,000) rows?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yuji@imagine-orb.com> wrote:
>>
>>> > Are you certain your tests don’t generate any overlapping inserts (by
>>> PK)?
>>>
>>> Yes. Operation 2) also checks the number of rows just after all the
>>> insertions.
>>>
>>>
>>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <ben.slater@instaclustr.com>
>>> wrote:
>>>
>>> OK. Are you certain your tests don’t generate any overlapping inserts
>>> (by PK)? Cassandra basically treats any inserts with the same primary key
>>> as updates (so 1000 insert operations may not necessarily result in 1000
>>> rows in the DB).
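>>>
>>> For example, a minimal illustration (with a made-up table, not your
>>> schema):
>>>
>>>     CREATE TABLE ks.t (id int PRIMARY KEY, val text);
>>>     INSERT INTO ks.t (id, val) VALUES (1, 'first');
>>>     INSERT INTO ks.t (id, val) VALUES (1, 'second');  -- same PK: an update
>>>     SELECT COUNT(*) FROM ks.t;                        -- returns 1, not 2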
>>>
>>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yuji@imagine-orb.com> wrote:
>>>
>>> Thanks Ben,
>>>
>>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at the end of operation (2) or
>>> after operation (3)?
>>>
>>> After operation 3), at operation 4), which reads all rows by cqlsh with
>>> CL.SERIAL.
>>>
>>> > 2) What replication factor and replication strategy is used by the
>>> test keyspace? What consistency level is used by your operations?
>>>
>>> - create keyspace testkeyspace WITH REPLICATION =
>>> {'class':'SimpleStrategy','replication_factor':3};
>>> - consistency level is SERIAL
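>>>
>>> (In cqlsh the check amounts to something like this; the table name is
>>> just an example:)
>>>
>>>     CONSISTENCY SERIAL;
>>>     SELECT COUNT(*) FROM testkeyspace.testtable;  -- expect 1000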
>>>
>>>
>>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <ben.slater@instaclustr.com> wrote:
>>>
>>>
>>> A couple of questions:
>>> 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at the end of operation (2)
>>> or after operation (3)?
>>> 2) What replication factor and replication strategy is used by the test
>>> keyspace? What consistency level is used by your operations?
>>>
>>>
>>> Cheers
>>> Ben
>>>
>>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yuji@imagine-orb.com> wrote:
>>>
>>> Thanks Ben,
>>>
>>> I tried running a rebuild and repair after the failed node rejoined the
>>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>>> The failed node could rejoin, and I could read all rows successfully.
>>> (Sometimes a repair failed because the node couldn't access another
>>> node. If it failed, I retried the repair.)
>>>
>>> But some rows were lost after my destructive test had run repeatedly
>>> (for about 5-6 hours).
>>> Although the test inserted 1000 rows, there were only 953 rows at the
>>> end of the test.
>>>
>>> My destructive test:
>>> - each C* node is killed & restarted at random intervals (within about
>>> 5 min) throughout this test
>>> 1) truncate all tables
>>> 2) insert initial rows (check if all rows are inserted successfully)
>>> 3) request a lot of read/write to random rows for about 30min
>>> 4) check all rows
>>> If operation 1), 2) or 4) fails due to a C* failure, the test retries
>>> the operation.
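>>>
>>> Roughly, as a shell sketch (not the actual harness; the two helper
>>> scripts are hypothetical):
>>>
>>>     #!/bin/bash
>>>     # in parallel, a loop on each node kills & restarts C* after a
>>>     # random sleep of up to ~300s, e.g.:
>>>     #   while true; do sleep $((RANDOM % 300)); <kill & restart C*>; done
>>>     retry() { until "$@"; do sleep 10; done; }  # retry on C* failure
>>>     retry cqlsh -e "TRUNCATE testkeyspace.testtable"    # 1) truncate
>>>     retry ./insert_initial_rows.sh                      # 2) 1000 rows + verify
>>>     timeout 1800 ./random_read_write_load.sh            # 3) ~30 min random load
>>>     retry cqlsh -e "CONSISTENCY SERIAL; SELECT COUNT(*) FROM testkeyspace.testtable"  # 4) check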
>>>
>>> Has anyone seen a similar problem?
>>> What causes the data loss?
>>> Does the test need any extra operation when a C* node is restarted?
>>> (Currently, I just restart the C* process.)
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <ben.slater@instaclustr.com>
>>> wrote:
>>>
>>> OK, that’s a bit more unexpected (to me at least), but I think the
>>> solution of running a rebuild or repair still applies.
>>>
>>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yuji@imagine-orb.com> wrote:
>>>
>>> Thanks Ben, Jeff
>>>
>>> Sorry that my explanation confused you.
>>>
>>> Only node1 is the seed node.
>>> Node2, whose C* data was deleted, is NOT a seed.
>>>
>>> I restarted the failed node (node2) after restarting the seed
>>> node (node1).
>>> Restarting node2 succeeded without the exception.
>>> (As expected, I couldn't restart node2 before restarting node1.)
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
>>> wrote:
>>>
>>> The unstated "problem" here is that node1 is a seed, which implies
>>> auto_bootstrap=false (you can't bootstrap a seed, so it was almost
>>> certainly set up to start without bootstrapping).
>>>
>>> That means once the data dir is wiped, it's going to start again without
>>> a bootstrap and either form a single-node cluster or join an existing
>>> cluster if the seed list is valid.
>>>
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Oct 17, 2016, at 8:51 PM, Ben Slater <ben.slater@instaclustr.com>
>>> wrote:
>>>
>>> OK, sorry - I think I understand what you are asking now.
>>>
>>> However, I’m still a little confused by your description. I think your
>>> scenario is:
>>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>>> 2) Delete all data from Node A
>>> 3) Restart Node A
>>> 4) Restart Node B,C
>>>
>>> Is this correct?
>>>
>>> If so, this isn’t a scenario I’ve tested/seen, but I’m not surprised
>>> Node A starts successfully, as there are no running nodes to tell it via
>>> gossip that it shouldn’t start up without the “replaces” flag.
>>>
>>> I think the right way to recover in this scenario is to run a nodetool
>>> rebuild on Node A after the other two nodes are running. You could
>>> theoretically also run a repair (which would be good practice after a
>>> weird failure scenario like this), but rebuild will probably be quicker
>>> given you know all the data needs to be re-streamed.
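>>>
>>> (Concretely, a sketch:)
>>>
>>>     # on Node A, once Nodes B and C are running:
>>>     nodetool rebuild   # re-streams the data, similar to a bootstrap
>>>     nodetool repair    # optionally, as good practice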
>>>
>>> Cheers
>>> Ben
>>>
>>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yuji@imagine-orb.com> wrote:
>>>
>>> Thank you Ben, Yabin
>>>
>>> I understood that the rejoin was illegal.
>>> I expected this rejoin to fail with the exception.
>>> But I could add the failed node to the cluster without the
>>> exception after 2) and 3).
>>> I want to know why the rejoin succeeds. Should the exception happen?
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <yabinmeng@gmail.com> wrote:
>>>
>>> The exception you run into is expected behavior. This is because, as Ben
>>> pointed out, when you delete everything (including system schemas), the
>>> C* cluster thinks you're bootstrapping a new node. However, node2's IP
>>> is still in gossip, and this is why you see the exception.
>>>
>>> I'm not clear on why you need to delete the C* data directory.
>>> That is a dangerous action, especially considering that you delete
>>> system schemas. If the failed node has been gone for a while, what you
>>> need to do is remove the node first before doing the "rejoin".
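>>>
>>> (For example, something like:)
>>>
>>>     nodetool status                  # note the Host ID of the dead node
>>>     nodetool removenode <host-id>    # remove it from the ring first
>>>     # then start the node and let it bootstrap as a genuinely new node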
>>>
>>> Cheers,
>>>
>>> Yabin
>>>
>>> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <ben.slater@instaclustr.com>
>>> wrote:
>>>
>>> To Cassandra, the node where you deleted the files looks like a brand
>>> new machine. It doesn’t automatically rebuild such machines, to prevent
>>> accidental replacement. You need to tell it to build the “new” machine
>>> as a replacement for the “old” machine with that IP by setting
>>> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
>>> http://cassandra.apache.org/doc/latest/operating/topo_changes.html
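>>>
>>> (Concretely, something like this on the replacement node before its
>>> first start, assuming a package install where JVM options live in
>>> cassandra-env.sh:)
>>>
>>>     # in /etc/cassandra/cassandra-env.sh (location varies by install):
>>>     JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"
>>>     # the _first_boot variant only applies to the node's first boot,
>>>     # so it is safe to leave in place afterwards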
>>>
>>> Cheers
>>> Ben
>>>
>>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yuji@imagine-orb.com> wrote:
>>>
>>> Hi all,
>>>
>>> A failed node can rejoin a cluster.
>>> On that node, all data in /var/lib/cassandra had been deleted.
>>> Is this normal?
>>>
>>> I can reproduce it as below.
>>>
>>> cluster:
>>> - C* 2.2.7
>>> - a cluster has node1, 2, 3
>>> - node1 is a seed
>>> - replication_factor: 3
>>>
>>> how to:
>>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>>> ($sudo rm -rf /var/lib/cassandra/*)
>>> 2) stop C* process on node1 and node3
>>> 3) restart C* on node1
>>> 4) restart C* on node2
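>>>
>>> (As shell commands, roughly; service commands assume a package
>>> install:)
>>>
>>>     # 1) on node2:
>>>     sudo service cassandra stop && sudo rm -rf /var/lib/cassandra/*
>>>     # 2) on node1 and node3:
>>>     sudo service cassandra stop
>>>     # 3) on node1:
>>>     sudo service cassandra start
>>>     # 4) on node2 -- starts and joins without any exception:
>>>     sudo service cassandra start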
>>>
>>> nodetool status after 4):
>>> Datacenter: datacenter1
>>> =======================
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
>>> DN  [node3 IP]  ?          256     100.0%            325553c6-3e05-41f6-a1f7-47436743816f  rack1
>>> UN  [node2 IP]  7.76 MB    256     100.0%            05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>>> UN  [node1 IP]  416.13 MB  256     100.0%            a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>>
>>> If I restart C* on node2 while C* on node1 and node3 are running
>>> (i.e., skipping steps 2) and 3)), a runtime exception happens:
>>> RuntimeException: "A node with address [node2 IP] already exists,
>>> cancelling join..."
>>>
>>> I'm not sure whether this causes data loss. All data can be read
>>> properly just after this rejoin.
>>> But some rows are lost when I kill & restart C* in destructive tests
>>> after this rejoin.
>>>
>>> Thanks.
>>>
>>> --
>>> ————————
>>> Ben Slater
>>> Chief Product Officer
>>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>>> +61 437 929 798
>>>
>>
>
>
