From: Carl Mueller
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Date: Wed, 12 Jun 2019 16:11:03 -0500
Subject: Re: postmortem on 2.2.13 scale out difficulties
I posted a bug, CASSANDRA-15155:
https://issues.apache.org/jira/browse/CASSANDRA-15155?jql=project%20%3D%20CASSANDRA

It seems VERY similar to
https://issues.apache.org/jira/browse/CASSANDRA-6648

On Wed, Jun 12, 2019 at 12:14 PM Carl Mueller wrote:

> And once the cluster token map formation is done, it starts bootstrap and we get a ton of these:
>
> WARN  [MessagingService-Incoming-/2406:da14:95b:4503:910e:23fd:dafa:9983] 2019-06-12 15:22:04,760 IncomingTcpConnection.java:100 - UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=df425400-c331-11e8-8b96-4b7f4d58af68
>
> And then after LOTS of those:
>
> INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING: Starting to bootstrap...
> INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for Bootstrap
> INFO  [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap completed! for the tokens [-7314981925085449175, ... bunch of tokens ... 5499447097629838103]
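A "Bootstrap completed!" logged 11 milliseconds after "Starting to bootstrap..." means the node streamed nothing before declaring itself joined. A rough way to catch that on a newly added node before it takes reads is sketched below; it assumes a stock package install, so the log path may need adjusting:

```
# Did bootstrap actually stream, or did it "complete" instantly?
grep -E "JOINING: Starting to bootstrap|Bootstrap completed" /var/log/cassandra/system.log

# Does every node agree on a single schema version? A joiner stuck on its
# own schema version (as gossipinfo showed here) is a red flag.
nodetool describecluster

# Is the new node's Load plausible next to its neighbours, or near-empty?
nodetool status
```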
> On Wed, Jun 12, 2019 at 12:07 PM Carl Mueller <carl.mueller@smartthings.com> wrote:
>
>> One node at a time: yes, that is what we are doing.
>>
>> We have not tried the streaming_socket_timeout_in_ms. It is currently 24 hours (```streaming_socket_timeout_in_ms=86400000```), which would cover the bootstrap timeframe we have seen before (1-2 hours per node).
>>
>> Since it joins with no data, it is serving erroneous data. We may try bootstrap rejoin and the JVM_OPT.... The node appears to think it has bootstrapped even though gossipinfo shows the new node has a different schema version.
>>
>> We had scaled EU and US from 5 --> 25 without incident (one at a time); since we increased ring_delay_ms it has worked haphazardly, getting us four joins, and since then failure.
>>
>> The debug log shows:
>>
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 - New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 - New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 - Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 - Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 - Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token [-1028768087263234868, ...<bunch of tokens>..., 921670352349030554]
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1028768087263234868
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1045740236536355596
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1184422937682103096
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1201924032068728250
>>
>> All the nodes appear to be reporting "Not pulling schema because versions match or shouldPullSchemaFrom returned false". That code (MigrationManager.java) makes reference to a "gossip only" node; did we get stuck in that somehow?
>>
>> On Wed, Jun 12, 2019 at 11:45 AM ZAIDI, ASAD A wrote:
>>
>>> Adding one node at a time – is that successful?
>>>
>>> Check the value of the streaming_socket_timeout_in_ms parameter in cassandra.yaml and increase it if needed.
>>>
>>> Have you tried nodetool bootstrap resume and the JVM option, i.e. JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"?
>>>
>>> From: Carl Mueller [mailto:carl.mueller@smartthings.com.INVALID]
>>> Sent: Wednesday, June 12, 2019 11:35 AM
>>> To: user@cassandra.apache.org
>>> Subject: Re: postmortem on 2.2.13 scale out difficulties
>>>
>>> We were only able to scale out four nodes and then failures started occurring, including multiple instances of nodes joining a cluster without streaming.
>>>
>>> Sigh.
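For reference, the suggestions above translate to the joining node roughly as follows. This is a sketch: the cassandra.yaml path assumes a stock package layout, and the timeout value shown is simply the 24-hour setting already in place on this cluster rather than a recommendation:

```
# cassandra.yaml -- the stream timeout being asked about; here it is already 24h:
# streaming_socket_timeout_in_ms: 86400000
grep streaming_socket_timeout_in_ms /etc/cassandra/cassandra.yaml

# cassandra-env.sh -- relax the consistent-range-movement check for the next
# (re)bootstrap attempt, as suggested:
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"

# If a bootstrap stalled mid-stream (rather than "completing" with no data),
# it can be resumed instead of wiping and rejoining the node:
nodetool bootstrap resume
```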
>>> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller <carl.mueller@smartthings.com> wrote:
>>>
>>> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS, IPV6.
>>>
>>> We needed to scale out the asia datacenter, which was 5 nodes; europe and us were 25 nodes.
>>>
>>> We were running into bootstrapping issues where the new node failed to bootstrap/stream; it failed with
>>>
>>> "java.lang.RuntimeException: A node required to move the data consistently is down"
>>>
>>> ...even though they were all up based on nodetool status prior to adding the node.
>>>
>>> First we increased the phi_convict_threshold to 12, and that did not help.
>>>
>>> CASSANDRA-12281 appeared similar to what we had problems with, but I don't think we hit that. Somewhere in there someone wrote:
>>>
>>> "For us, the workaround is either deleting the data (then bootstrap again), or increasing the ring_delay_ms. And the larger the cluster is, the longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster, it requires ring_delay_ms to be >50 seconds. For a 70 nodes cluster, >100 seconds. Default is 30 seconds."
>>>
>>> Given the WAN nature of our DCs, we set ring_delay_ms to 100 seconds and it finally worked.
>>>
>>> Side note: during the rolling restarts for setting phi_convict_threshold we observed quite a lot of status map variance between nodes (we have a program that polls a datacenter's or cluster's view of gossipinfo and statuses). AWS appears to have variance in networking, per the phi_convict_threshold advice; I'm not sure whether our difficulties were typical in that regard, and/or whether our IPV6 and/or globally distributed datacenters were exacerbating factors.
>>>
>>> We could not reproduce this in loadtest, although loadtest is only eu and us (but is IPV6).
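The two knobs in play here live in different places: phi_convict_threshold is a cassandra.yaml setting, while ring_delay_ms is a startup system property. Roughly, as a sketch assuming a stock install layout, and using the values from this thread rather than general recommendations:

```
# cassandra.yaml -- raise the failure detector's conviction threshold for
# flaky / WAN links (12 was tried here and did not help on its own):
# phi_convict_threshold: 12

# cassandra-env.sh (or the startup command line) -- give gossip more time to
# settle before the joining node builds its streaming plan; 100 s is what
# finally worked for this cluster, the default is 30 s:
JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=100000"

# Both changes take effect on the next restart of the node being (re)added.
```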
