From: Yuji Ito
Date: Fri, 25 Nov 2016 11:33:00 +0900
Subject: Re: failure node rejoin
To: user@cassandra.apache.org

Thanks Ben,

I've reported this issue at https://issues.apache.org/jira/browse/CASSANDRA-12955.
I'll tell you if I find anything about the data loss issue.

Regards,
yuji

On Thu, Nov 24, 2016 at 1:37 PM, Ben Slater wrote:

> You could certainly log a JIRA for the "failure node rejoin" issue
> (https://issues.apache.org/jira/browse/cassandra). It sounds like
> unexpected behaviour to me. However, I'm not sure it will be viewed as a
> high priority to fix given there is a clear operational work-around.
>
> Cheers
> Ben
>
> On Thu, 24 Nov 2016 at 15:14 Yuji Ito wrote:
>
>> Hi Ben,
>>
>> I'm continuing to investigate the data loss issue.
>> I'm examining logs and source code and trying to reproduce the data loss
>> issue with a simple test.
>> I'm also trying my destructive test with DROP instead of TRUNCATE.
>>
>> BTW, I want to discuss the issue in the subject, "failure node rejoin",
>> again.
>>
>> Will this issue be fixed? Other nodes should refuse this unexpected
>> rejoin.
>> Or should I be more careful when adding failed nodes back to the existing
>> cluster?
>>
>> Thanks,
>> yuji
>>
>>
>> On Fri, Nov 11, 2016 at 1:00 PM, Ben Slater wrote:
>>
>> From a quick look I couldn't find any defects other than the ones you've
>> found that seem potentially relevant to your issue (if anyone else on the
>> list knows of one, please chime in). Maybe the next step, if you haven't
>> done so already, is to check your Cassandra logs for any signs of issues
>> (i.e. WARNING or ERROR logs) in the failing case.
>>
>> Cheers
>> Ben
>>
>> On Fri, 11 Nov 2016 at 13:07 Yuji Ito wrote:
>>
>> Thanks Ben,
>>
>> I tried 2.2.8 and could reproduce the problem.
>> So, I'm investigating some bug fixes to repair and the commitlog between
>> 2.2.8 and 3.0.9:
>>
>> - CASSANDRA-12508: "nodetool repair returns status code 0 for some errors"
>> - CASSANDRA-12436: "Under some races commit log may incorrectly think it
>>   has unflushed data"
>>   - related to CASSANDRA-9669 and CASSANDRA-11828 (is the fix for 2.2
>>     different from the one for 3.0?)
>>
>> Do you know of other bug fixes related to the commitlog?
>>
>> Regards,
>> yuji
>>
>> On Wed, Nov 9, 2016 at 11:34 AM, Ben Slater wrote:
>>
>> There have been a few commit log bugs around in the last couple of months,
>> so perhaps you've hit something that was fixed recently. It would be
>> interesting to know whether the problem still occurs in 2.2.8.
>>
>> I suspect what is happening is that when you do your initial read
>> (without flush) to check the number of rows, the data is in memtables and
>> theoretically the commitlogs, but not sstables. With the forced stop, the
>> memtables are lost, and Cassandra should read the commitlog from disk at
>> startup to reconstruct the memtables. However, it looks like that didn't
>> happen for some (bad) reason.
>>
>> Good news that 3.0.9 fixes the problem, so it's up to you whether you want
>> to investigate further and see if you can narrow it down enough to file a
>> JIRA (although the first step of that would be trying 2.2.9 to make sure
>> it's not already fixed there).
>>
>> Cheers
>> Ben
>>
>> On Wed, 9 Nov 2016 at 12:56 Yuji Ito wrote:
>>
>> I tried C* 3.0.9 instead of 2.2.
>> The data loss problem hasn't happened so far (without `nodetool flush`).
>>
>> Thanks
>>
>> On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito wrote:
>>
>> Thanks Ben,
>>
>> When I added `nodetool flush` on all nodes after step 2, the problem
>> didn't happen.
>> Did replay from old commit logs delete rows?
>>
>> Perhaps the flush operation just detected that some nodes were down in
>> step 2 (just after truncating tables).
>> (Insertion and the check in step 2 would succeed even if one node was
>> down, because the consistency level was SERIAL.
>> If the flush failed on more than one node, the test would retry step 2.)
>> However, if so, the problem would also happen without deleting Cassandra
>> data.
>>
>> Regards,
>> yuji
>>
>> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater wrote:
>>
>> Definitely sounds to me like something is not working as expected, but I
>> don't really have any idea what would cause that (other than the fairly
>> extreme failure scenario). A couple of things I can think of to try to
>> narrow it down:
>> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
>>    data is written to sstables rather than relying on commit logs
>> 2) Run the test with consistency level QUORUM rather than SERIAL
>>    (it shouldn't be any different, but QUORUM is more widely used, so
>>    maybe there is a bug that's specific to SERIAL)
>>
>> Cheers
>> Ben
>>
>> On Mon, 24 Oct 2016 at 10:29 Yuji Ito wrote:
>>
>> Hi Ben,
>>
>> The test without killing nodes has been working well without data loss.
>> I've repeated my test about 200 times after removing data and
>> rebuild/repair.
>>
>> Regards,
>>
>> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito wrote:
>>
>> > Just to confirm, are you saying:
>> > a) after operation 2, you select all and get 1000 rows
>> > b) after operation 3 (which only does updates and reads) you select and
>> > only get 953 rows?
>>
>> That's right!
>>
>> I've started the test without killing nodes.
>> I'll report the result to you next Monday.
>>
>> Thanks
>>
>> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater wrote:
>>
>> Just to confirm, are you saying:
>> a) after operation 2, you select all and get 1000 rows
>> b) after operation 3 (which only does updates and reads) you select and
>> only get 953 rows?
>>
>> If so, that would be very unexpected. If you run your tests without
>> killing nodes, do you get the expected (1,000) rows?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito wrote:
>>
>> > Are you certain your tests don't generate any overlapping inserts (by
>> > PK)?
>>
>> Yes. Operation 2) also checks the number of rows just after all
>> insertions.
>>
>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater wrote:
>>
>> OK. Are you certain your tests don't generate any overlapping inserts (by
>> PK)? Cassandra basically treats any inserts with the same primary key as
>> updates (so 1000 insert operations may not necessarily result in 1000
>> rows in the DB).
>>
>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito wrote:
>>
>> Thanks Ben,
>>
>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> > the mismatch between actual and expected) - at the end of operation (2)
>> > or after operation (3)?
>>
>> After operation 3), at operation 4), which reads all rows via cqlsh with
>> CL.SERIAL.
>>
>> > 2) What replication factor and replication strategy is used by the test
>> > keyspace? What consistency level is used by your operations?
>>
>> - create keyspace testkeyspace WITH REPLICATION =
>>   {'class':'SimpleStrategy','replication_factor':3};
>> - consistency level is SERIAL
>>
>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater wrote:
>>
>> A couple of questions:
>> 1) At what stage did you have (or expect to have) 1000 rows (and have the
>>    mismatch between actual and expected) - at the end of operation (2) or
>>    after operation (3)?
>> 2) What replication factor and replication strategy is used by the test
>>    keyspace? What consistency level is used by your operations?
>>
>> Cheers
>> Ben
>>
>> ...
>
> [Message clipped]
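[Editor's note] Ben's point about overlapping inserts — that Cassandra treats an INSERT with an existing primary key as an update (an "upsert"), so N insert operations need not produce N rows — can be illustrated with a minimal sketch. This is plain Python standing in for Cassandra; the table and key names are invented for illustration:

```python
import random

# A dict keyed by primary key models Cassandra's upsert semantics:
# inserting with an existing PK overwrites the row rather than
# adding a new one, so the final row count equals the number of
# DISTINCT primary keys, not the number of insert operations.
table = {}
random.seed(42)  # deterministic for the example

inserts = 1000
for i in range(inserts):
    pk = random.randrange(1200)   # overlapping keys are possible
    table[pk] = f"value-{i}"      # same PK => overwrite, not append

print(f"{inserts} inserts produced {len(table)} rows")
assert len(table) <= inserts
```

If the test's generated keys were truly unique, the two numbers would match; a shortfall like 1000 vs. 953 is exactly what key collisions would produce — which is why Ben asks Yuji to rule them out first.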
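[Editor's note] Ben's explanation of the suspected failure — writes live in memtables and the commitlog but not sstables, a forced stop discards memtables, and startup should replay the commitlog to rebuild them — can be sketched as a toy model. This is an illustration only, not Cassandra's actual implementation; every class and method name here is invented:

```python
# Toy model of the Cassandra write path described in the thread.
class ToyNode:
    def __init__(self):
        self.memtable = {}    # in-memory; discarded on a forced stop
        self.commitlog = []   # durable append-only log on disk
        self.sstables = {}    # on-disk tables, written only by flush

    def write(self, key, value):
        self.commitlog.append((key, value))  # log first for durability
        self.memtable[key] = value

    def flush(self):
        # models `nodetool flush`: persist memtable data to sstables,
        # after which the commitlog segments can be discarded
        self.sstables.update(self.memtable)
        self.memtable.clear()
        self.commitlog.clear()

    def forced_stop(self):
        self.memtable.clear()  # unflushed in-memory data is gone

    def restart(self):
        # a healthy startup replays the commitlog to rebuild memtables;
        # the reported data loss behaves as if this step were skipped
        for key, value in self.commitlog:
            self.memtable[key] = value

    def read(self, key):
        return self.memtable.get(key, self.sstables.get(key))

node = ToyNode()
node.write("row1", "v1")
node.forced_stop()            # memtable lost, commitlog survives
node.restart()                # replay restores the row
assert node.read("row1") == "v1"
```

In this model, skipping `restart()`'s replay loop loses `row1` entirely, while calling `flush()` before `forced_stop()` preserves it via sstables — matching why adding `nodetool flush` after step 2 made the problem disappear.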