From: Ben Slater <ben.slater@instaclustr.com>
Date: Fri, 11 Nov 2016 04:00:53 +0000
Subject: Re: failure node rejoin
To: user@cassandra.apache.org

From a quick look I couldn't find any defects other than the ones you've found that seem potentially relevant to your issue (if anyone else on the list knows of one, please chime in). Maybe the next step, if you haven't done so already, is to check your Cassandra logs for any signs of issues (i.e. WARNING or ERROR logs) in the failing case.

Cheers
Ben

On Fri, 11 Nov 2016 at 13:07 Yuji Ito <yuji@imagine-orb.com> wrote:

> Thanks Ben,
>
> I tried 2.2.8 and could reproduce the problem.
> So, I'm investigating some bug fixes of repair and commitlog between 2.2.8 and 3.0.9:
>
> - CASSANDRA-12508: "nodetool repair returns status code 0 for some errors"
> - CASSANDRA-12436: "Under some races commit log may incorrectly think it has unflushed data"
>   - related to CASSANDRA-9669 and CASSANDRA-11828 (the fix in 2.2 is different from the one in 3.0?)
>
> Do you know of other bug fixes related to the commitlog?
>
> Regards
> yuji
>
> On Wed, Nov 9, 2016 at 11:34 AM, Ben Slater <ben.slater@instaclustr.com> wrote:
>
> There have been a few commit log bugs around in the last couple of months, so perhaps you've hit something that was fixed recently. It would be interesting to know whether the problem still occurs in 2.2.8.
>
> I suspect what is happening is that when you do your initial read (without flush) to check the number of rows, the data is in memtables and, in theory, the commitlogs, but not in sstables. With the forced stop the memtables are lost, and Cassandra should read the commitlog from disk at startup to reconstruct the memtables. However, it looks like that didn't happen for some (bad) reason.
>
> Good news that 3.0.9 fixes the problem, so it's up to you whether you want to investigate further and narrow it down enough to file a JIRA (although the first step of that would be trying 2.2.9 to make sure it's not already fixed there).
>
> Cheers
> Ben
>
> On Wed, 9 Nov 2016 at 12:56 Yuji Ito <yuji@imagine-orb.com> wrote:
>
> I tried C* 3.0.9 instead of 2.2.
> The data loss problem hasn't happened so far (without `nodetool flush`).
>
> Thanks
>
> On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito <yuji@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> When I added `nodetool flush` on all nodes after step 2, the problem didn't happen.
> Did replay from old commit logs delete rows?
>
> Perhaps the flush operation just detected that some nodes were down in step 2 (just after truncating tables).
> (Insertion and the check in step 2 would succeed even if one node was down, because the consistency level was serial.
> If the flush failed on more than one node, the test would retry step 2.)
> However, if so, the problem would happen even without deleting Cassandra data.
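>
> For reference, the flush step I added after step 2 is essentially the following (just a rough sketch; the node names and the use of ssh are placeholders for however the test drives the cluster):
>
>     # flush memtables to sstables on every node once the initial rows are in
>     for host in node1 node2 node3; do
>         ssh "$host" nodetool flush testkeyspace
>     done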
>
> Regards,
> yuji
>
>
> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater <ben.slater@instaclustr.com> wrote:
>
> Definitely sounds to me like something is not working as expected, but I don't really have any idea what would cause that (other than the fairly extreme failure scenario). A couple of things I can think of to try to narrow it down:
> 1) Run nodetool flush on all nodes after step 2 - that will make sure all data is written to sstables rather than relying on commit logs
> 2) Run the test with consistency level QUORUM rather than SERIAL (it shouldn't be any different, but QUORUM is more widely used, so maybe there is a bug that's specific to SERIAL)
>
> Cheers
> Ben
>
> On Mon, 24 Oct 2016 at 10:29 Yuji Ito <yuji@imagine-orb.com> wrote:
>
> Hi Ben,
>
> The test without killing nodes has been working well without data loss.
> I've repeated my test about 200 times after removing data and rebuild/repair.
>
> Regards,
>
>
> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yuji@imagine-orb.com> wrote:
>
> > Just to confirm, are you saying:
> > a) after operation 2, you select all and get 1000 rows
> > b) after operation 3 (which only does updates and reads) you select and only get 953 rows?
>
> That's right!
>
> I've started the test without killing nodes.
> I'll report the result to you next Monday.
>
> Thanks
>
>
> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <ben.slater@instaclustr.com> wrote:
>
> Just to confirm, are you saying:
> a) after operation 2, you select all and get 1000 rows
> b) after operation 3 (which only does updates and reads) you select and only get 953 rows?
>
> If so, that would be very unexpected. If you run your tests without killing nodes, do you get the expected (1,000) rows?
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yuji@imagine-orb.com> wrote:
>
> > Are you certain your tests don't generate any overlapping inserts (by PK)?
>
> Yes. Operation 2) also checks the number of rows just after all insertions.
>
>
> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <ben.slater@instaclustr.com> wrote:
>
> OK. Are you certain your tests don't generate any overlapping inserts (by PK)? Cassandra basically treats any inserts with the same primary key as updates (so 1000 insert operations may not necessarily result in 1000 rows in the DB).
>
> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yuji@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> > 1) At what stage did you have (or expect to have) 1000 rows (and have the mismatch between actual and expected) - at the end of operation (2) or after operation (3)?
>
> After operation 3), at operation 4), which reads all rows by cqlsh with CL.SERIAL.
>
> > 2) What replication factor and replication strategy is used by the test keyspace? What consistency level is used by your operations?
>
> - create keyspace testkeyspace WITH REPLICATION = {'class':'SimpleStrategy','replication_factor':3};
> - consistency level is SERIAL
>
>
> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <ben.slater@instaclustr.com> wrote:
>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the mismatch between actual and expected) - at the end of operation (2) or after operation (3)?
> 2) What replication factor and replication strategy is used by the test keyspace? What consistency level is used by your operations?
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yuji@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> I tried to run a rebuild and repair after the failure node rejoined the cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failure node could rejoin and I could read all rows successfully.
> (Sometimes a repair failed because the node couldn't access another node. If it failed, I retried the repair.)
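>
> (The rejoin procedure is roughly the following - a sketch only; the data paths are the defaults and <dead_node_ip> is a placeholder for the failed node's own address:)
>
>     # on the failed node: wipe the old data before restarting Cassandra
>     rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
>
>     # in cassandra-env.sh, make the node rejoin as a replacement for itself
>     JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"
>
>     # after the node is back in the ring
>     nodetool rebuild
>     nodetool repair testkeyspace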
>
> But some rows were lost after my destructive test had run repeatedly (after about 5-6 hours).
> After the test inserted 1000 rows, there were only 953 rows at the end of the test.
>
> My destructive test:
> - each C* node is killed & restarted at a random interval (within about 5 min) throughout this test
> 1) truncate all tables
> 2) insert initial rows (check that all rows are inserted successfully)
> 3) request a lot of reads/writes to random rows for about 30 min
> 4) check all rows
> If operation 1), 2) or 4) fails due to a C* failure, the test retries the operation.
>
> Does anyone have a similar problem?
> What causes the data loss?
> Does the test need any operation when a C* node is restarted? (Currently, I just restart the C* process.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <ben.slater@instaclustr.com> wrote:
>
> OK, that's a bit more unexpected (to me at least) but I think the solution of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yuji@imagine-orb.com> wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.