From: "dinesh.joshi@yahoo.com.INVALID"
Reply-To: "dinesh.joshi@yahoo.com"
To: user@cassandra.apache.org
Date: Thu, 7 Feb 2019 07:13:24 +0000 (UTC)
Subject: Re: Bootstrap keeps failing
Message-ID: <1832977851.196436.1549523604477@mail.yahoo.com>
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm

Would it be possible for you to take a thread dump & logs and share them?

Dinesh

On Wednesday, February 6, 2019, 10:09:11 AM PST, Léo FERLIN SUTTON <lferlin@mailjet.com.INVALID> wrote:

Hello!

I am having a recurrent problem when trying to bootstrap a few new nodes.

Some general info:

- I am running Cassandra 3.0.17
- We have about 30 nodes in our cluster
- All healthy nodes have between 60% and 90% used disk space on /var/lib/cassandra

So I create a new node and let auto_bootstrap do its job. After a few days the bootstrapping node stops streaming new data but is still not a member of the cluster.

`nodetool status` says the node is still joining.

When this happens I run `nodetool bootstrap resume`. This usually ends in one of two ways:

- The node fills up to 100% disk space and crashes.
- The bootstrap resume finishes with errors.

When I look at `nodetool netstats -H`, it looks like `bootstrap resume` does not resume but restarts a full transfer of all data from every node.

This is the output I get from `nodetool bootstrap resume`:

[2019-02-06 01:39:14,369] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-225-big-Data.db (progress: 2113%)
[2019-02-06 01:39:16,821] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-88-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,003] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-89-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,032] session with /10.16.XX.YYY complete (progress: 2113%)
[2019-02-06 01:41:15,160] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-220-big-Data.db (progress: 2113%)
[2019-02-06 01:42:02,864] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-226-big-Data.db (progress: 2113%)
[2019-02-06 01:42:09,284] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-227-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,522] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-228-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,622] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-229-big-Data.db (progress: 2113%)
[2019-02-06 01:42:11,925] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-90-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,887] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-91-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,980] session with /10.16.XX.ZZZ complete (progress: 2114%)
[2019-02-06 01:42:14,980] Stream failed
[2019-02-06 01:42:14,982] Error during bootstrap: Stream failed
[2019-02-06 01:42:14,982] Resume bootstrap complete

The bootstrap `progress` goes way over 100% and eventually fails.

Right now I have a node with this output from `nodetool status`:

`UJ  10.16.XX.YYY  2.93 TB  256  ?  5788f061-a3c0-46af-b712-ebeecd397bf7  c`

It is almost filled with data, yet if I look at `nodetool netstats`:

        Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
        Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
        Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
        Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
        Receiving 424 files, 281.25 GB total. Already received 1 files, 1.3 GB total
        Receiving 581 files, 349.26 GB total. Already received 8 files, 45.96 MB total
        Receiving 443 files, 337.26 GB total. Already received 6 files, 96.15 MB total
        Receiving 424 files, 275.23 GB total. Already received 5 files, 42.67 MB total

It is trying to pull all the data again.

Am I missing something about the way `nodetool bootstrap resume` is supposed to be used?

Regards,

Leo
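To put a number on the `nodetool netstats` observation in the quoted message, the "Receiving" lines can be totalled and compared with the free space left on /var/lib/cassandra. What follows is only a rough sketch: the regular expression is written against the lines quoted in this thread, the 1024-based unit table is an assumption about how the sizes are printed, and `summarize` is a made-up helper name, not part of any Cassandra tooling.

```python
import re

# Matches one "Receiving ..." summary line as quoted in this thread.
LINE_RE = re.compile(
    r"Receiving (\d+) files, ([\d.]+) (\w+) total\. "
    r"Already received (\d+) files, ([\d.]+) (\w+) total"
)

# Assumption: the human-readable units are 1024-based.
UNIT_BYTES = {"bytes": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(value, unit):
    """Convert a printed size such as ("325.39", "GB") to bytes."""
    return float(value) * UNIT_BYTES[unit]


def summarize(netstats_text):
    """Return (planned_bytes, received_bytes, session_count) over all matched lines."""
    planned = received = 0.0
    sessions = 0
    for m in LINE_RE.finditer(netstats_text):
        planned += to_bytes(m.group(2), m.group(3))
        received += to_bytes(m.group(5), m.group(6))
        sessions += 1
    return planned, received, sessions


# The eight lines quoted in the message.
SAMPLE = """
Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
Receiving 424 files, 281.25 GB total. Already received 1 files, 1.3 GB total
Receiving 581 files, 349.26 GB total. Already received 8 files, 45.96 MB total
Receiving 443 files, 337.26 GB total. Already received 6 files, 96.15 MB total
Receiving 424 files, 275.23 GB total. Already received 5 files, 42.67 MB total
"""

if __name__ == "__main__":
    planned, received, sessions = summarize(SAMPLE)
    gb = 1024**3
    print(f"{sessions} sessions: {planned / gb:.2f} GB planned, "
          f"{received / gb:.2f} GB received so far")
```

On the quoted figures this adds up to roughly 2.4 TB still planned across eight sessions, with about 3 GB received, on a node that already reports 2.93 TB of load.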
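For the thread dump and logs requested at the top of this message, one possible way to gather them on the joining node is sketched below. It is not an official procedure. It assumes `jstack` from the JDK is on the PATH, that the logs live in /var/log/cassandra under the names `system.log` and `debug.log` (package defaults, adjust as needed), and the function names are invented for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed default log file names; adjust to the local logback configuration.
LOG_NAMES = ("system.log", "debug.log")


def thread_dump_command(pid):
    """Build the argv for a JVM thread dump; jstack ships with the JDK."""
    return ["jstack", str(pid)]


def copy_logs(log_dir, out_dir, names=LOG_NAMES):
    """Copy whichever of the named log files exist and return the copied names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = Path(log_dir) / name
        if src.is_file():
            shutil.copy2(src, out / name)
            copied.append(name)
    return copied


def collect(pid, out_dir, log_dir="/var/log/cassandra"):
    """Write threaddump.txt into out_dir, then copy the logs next to it.

    Needs a running JVM with the given pid and jstack available.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        thread_dump_command(pid), capture_output=True, text=True, check=True
    )
    (out / "threaddump.txt").write_text(result.stdout)
    return copy_logs(log_dir, out)
```

Calling `collect(cassandra_pid, "/tmp/bootstrap-diag")` while the bootstrap appears stalled would leave the dump and logs in one directory; `jstack` usually has to run as the same user as the Cassandra process.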