Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 177AA10470 for ; Wed, 19 Feb 2014 01:17:00 +0000 (UTC) Received: (qmail 2082 invoked by uid 500); 19 Feb 2014 01:16:54 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 1981 invoked by uid 500); 19 Feb 2014 01:16:54 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 1969 invoked by uid 99); 19 Feb 2014 01:16:53 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 19 Feb 2014 01:16:53 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_HELO_PASS,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of abarua@247-inc.com designates 213.199.154.83 as permitted sender) Received: from [213.199.154.83] (HELO emea01-db3-obe.outbound.protection.outlook.com) (213.199.154.83) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 19 Feb 2014 01:16:48 +0000 Received: from SIXPR03MB176.apcprd03.prod.outlook.com (10.242.62.152) by SIXPR03MB174.apcprd03.prod.outlook.com (10.242.62.141) with Microsoft SMTP Server (TLS) id 15.0.878.16; Wed, 19 Feb 2014 01:16:20 +0000 Received: from SIXPR03MB176.apcprd03.prod.outlook.com ([169.254.15.139]) by SIXPR03MB176.apcprd03.prod.outlook.com ([169.254.15.139]) with mapi id 15.00.0878.008; Wed, 19 Feb 2014 01:16:19 +0000 From: Arindam Barua To: "user@cassandra.apache.org" Subject: RE: Bootstrap stuck: vnode enabled 1.2.12 Thread-Topic: Bootstrap stuck: vnode enabled 1.2.12 Thread-Index: Ac8pY7M6TBK7AEIyQue/LLVwiENa4QDc3qSgAARECAAACacKwA== Date: 
Wed, 19 Feb 2014 01:16:19 +0000 Message-ID: <47632f11daff4ac7a72656879120e9b9@SIXPR03MB176.apcprd03.prod.outlook.com> References: <9da66398b0f347f184cb09f10c96c4d8@HKXPR03MB343.apcprd03.prod.outlook.com> <300d5d4649d748eaa235ce5e0cc8e4e1@SIXPR03MB176.apcprd03.prod.outlook.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [66.170.122.248] x-forefront-prvs: 012792EC17 x-forefront-antispam-report: SFV:NSPM;SFS:(10019001)(6039001)(164054003)(24454002)(5423002)(189002)(199002)(377454003)(81342001)(83072002)(51856001)(15202345003)(66066001)(74706001)(19300405004)(76796001)(74316001)(15975445006)(94946001)(56816005)(83322001)(74366001)(81542001)(76576001)(74876001)(54356001)(53806001)(81686001)(94316002)(56776001)(81816001)(46102001)(47446002)(59766001)(90146001)(77982001)(65816001)(95416001)(4396001)(87936001)(49866001)(63696002)(80976001)(77096001)(93516002)(93136001)(92566001)(47736001)(69226001)(16236675002)(33646001)(31966008)(2656002)(95666001)(54316002)(50986001)(19580395003)(19580405001)(47976001)(85306002)(86362001)(87266001)(19609705001)(79102001)(24736002);DIR:OUT;SFP:1102;SCL:1;SRVR:SIXPR03MB174;H:SIXPR03MB176.apcprd03.prod.outlook.com;CLIP:66.170.122.248;FPR:AC1FF227.6CE213D8.7DD3BD7C.8FD7D3BD.2050E;MLV:sfv;PTR:InfoNoRecords;MX:3;A:1;LANG:en; Content-Type: multipart/alternative; boundary="_000_47632f11daff4ac7a72656879120e9b9SIXPR03MB176apcprd03pro_" MIME-Version: 1.0 X-OriginatorOrg: 247-inc.com X-Virus-Checked: Checked by ClamAV on apache.org --_000_47632f11daff4ac7a72656879120e9b9SIXPR03MB176apcprd03pro_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable

I believe you are talking about CASSANDRA-6685, which was introduced in 1.2.15.

I'm trying to add a node to a production ring. I have added nodes previously just fine. However, this node had hardware issues during a previous bootstrap, and now even a clean bootstrap seems to be having problems. Does the ring somehow remember this node, and if so, can I make it forget about it? Decommission/removenode does not work on a node that hasn't yet bootstrapped.

From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Tuesday, February 18, 2014 12:30 PM
To: user@cassandra.apache.org
Subject: Re: Bootstrap stuck: vnode enabled 1.2.12

There is a bug where a node without schema cannot bootstrap. Do you have schema?

On Tue, Feb 18, 2014 at 1:29 PM, Arindam Barua <abarua@247-inc.com> wrote:

The node is still out of the ring. Any suggestions on how to get it in will be very helpful.

From: Arindam Barua [mailto:abarua@247-inc.com]
Sent: Friday, February 14, 2014 1:04 AM
To: user@cassandra.apache.org
Subject: Bootstrap stuck: vnode enabled 1.2.12

After our otherwise successful upgrade procedure to enable vnodes, when adding back "new" hosts to our cluster, one non-seed host ran into a hardware issue during bootstrap. By the time the hardware issue was fixed a week later, all other nodes had been added successfully, cleaned, and repaired. The disks on this node were untouched, and when the node was started back up, it detected an interrupted bootstrap and attempted to bootstrap. However, after ~24 hrs it was still stuck in the 'JOINING' state according to nodetool netstats on that node, even though no streams were flowing to/from it. Also, it did not appear in nodetool status in any way/form (not even as JOINING).

From a couple of observed thread dumps, the stack of the thread blocked during bootstrap is at [1].

Since the node wasn't making any progress, I ended up stopping Cassandra, cleaning up the data and commitlog directories, and attempting a fresh bootstrap. Nodetool netstats immediately reported a whole bunch of streams queued up, and data started streaming to the node. The data directory quickly grew to 18 GB (the other nodes had ~25 GB, but we have a lot of data with low TTLs). However, the node ended up in the earlier reported state, i.e. nodetool netstats doesn't have anything queued, but still reports the JOINING state, even though it's been > 24 hrs. There are no other ERRORS in the logs, and new data being written to the cluster makes it to this node just fine, triggering compactions, etc. from time to time.

Any help is appreciated.

Thanks,
Arindam

[1] Thread dump
Thread 3708: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=811 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=55, line=969 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1281 (Interpreted frame)
 - java.util.concurrent.CountDownLatch.await() @bci=5, line=207 (Interpreted frame)
 - org.apache.cassandra.dht.RangeStreamer.fetch() @bci=209, line=256 (Interpreted frame)
 - org.apache.cassandra.dht.BootStrapper.bootstrap() @bci=120, line=84 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.bootstrap(java.util.Collection) @bci=172, line=978 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.joinTokenRing(int) @bci=827, line=744 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.initServer(int) @bci=363, line=585 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.initServer() @bci=4, line=482 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.setup() @bci=1069, line=348 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.activate() @bci=59, line=447 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.main(java.lang.String[]) @bci=3, line=490 (Interpreted frame)

--_000_47632f11daff4ac7a72656879120e9b9SIXPR03MB176apcprd03pro_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable
--_000_47632f11daff4ac7a72656879120e9b9SIXPR03MB176apcprd03pro_--
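
A note on the thread dump in [1]: the bootstrap thread is parked in CountDownLatch.await(), called from RangeStreamer.fetch(). A plain await() returns only once the latch count reaches zero, so a thread waiting this way can sit indefinitely if one expected count-down never happens. The sketch below is a rough illustration of that behaviour using only java.util.concurrent. It is not Cassandra's code: the class name LatchStallSketch, the method awaitStreams, and the idea that each latch count stands for one stream source are all assumptions made for the example, and the thread does not confirm what the latch in RangeStreamer actually counts.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from Cassandra.
public class LatchStallSketch {

    /**
     * Creates a latch expecting `sources` completions, lets only `completing`
     * of them count down, and waits at most `timeoutMillis` for the rest.
     * Returns how many completions are still outstanding afterwards.
     */
    public static long awaitStreams(int sources, int completing, long timeoutMillis) {
        CountDownLatch latch = new CountDownLatch(sources);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, completing));
        try {
            for (int i = 0; i < completing; i++) {
                // Each task stands in for one source reporting that it finished.
                pool.execute(latch::countDown);
            }
            // Bounded wait. A plain latch.await() would park here for as long
            // as the count stays above zero, as in the dump's BLOCKED thread.
            latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
            // Let every submitted task finish before reading the count.
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return latch.getCount();
    }

    public static void main(String[] args) {
        System.out.println("outstanding, all sources finish: " + awaitStreams(3, 3, 2000));
        System.out.println("outstanding, one source silent:  " + awaitStreams(3, 2, 200));
    }
}
```

If something like this is what happened here, it would be consistent with netstats showing nothing queued while the node still reports JOINING, but that is only one possible reading of the dump.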