Subject: Re: Bootstrap Timing
From: Phil Burress <philburresseme@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 21 Apr 2014 10:32:20 -0400

The new node has managed to stay up without dying for about 24 hours now, but it is still in the JOINING state. A new concern has popped up: disk usage is at 500GB on the new node, while the three original nodes have about 40GB each. Any ideas why this is happening?

On Sat, Apr 19, 2014 at 9:19 PM, Phil Burress wrote:

> Thank you all for your advice and good info. The node has died a couple of
> times with out-of-memory errors. I've restarted it each time, but it starts
> re-running compaction and then dies again.
>
> Is there a better way to do this?
>
> On Apr 18, 2014 6:06 PM, "Steven A Robenalt" <srobenal@stanford.edu> wrote:
>
>> That's what I'd be doing, but I wouldn't expect it to run for 3 days this
>> time. My guess is that whatever was going wrong with the bootstrap when you
>> had 3 nodes starting at once was interfering with the completion of the 1
>> remaining node of those 3. A clean bootstrap of a single node should
>> complete eventually, and I would think it'll be a lot less than 3 days. Our
>> database is much smaller than yours at the moment, so I can't really guide
>> you on how long it should take, but others on the list with similar
>> database sizes might be able to give you a better idea.
>>
>> Steve
>>
>> On Fri, Apr 18, 2014 at 1:43 PM, Phil Burress wrote:
>>
>>> First, I just stopped 2 of the nodes and left one running. But this
>>> morning, I stopped that third node, cleared out the data, restarted it, and
>>> let it rejoin again. It appears streaming is done (according to netstats);
>>> right now it appears to be running compaction and building the secondary
>>> index (according to compactionstats). Just sit and wait, I guess?
>>>
>>> On Fri, Apr 18, 2014 at 2:23 PM, Steven A Robenalt <srobenal@stanford.edu> wrote:
>>>
>>>> Looking back through this email chain, it looks like Phil said he
>>>> wasn't using vnodes.
>>>>
>>>> For the record, we have been using vnodes since we brought up our first
>>>> cluster, and have not seen any issues with bootstrapping new nodes, either
>>>> to replace existing nodes or to grow/shrink the cluster. We did adhere to
>>>> the caveats that new nodes should not be seed nodes and that we should
>>>> allow each node to join the cluster completely before making any other
>>>> changes.
>>>>
>>>> Phil, when you dropped to adding just the single node to your cluster,
>>>> did you start over with the newly added node (blowing away the database
>>>> created on the previous startup), or did you shut down the other 2 added
>>>> nodes and leave the remaining one in progress to continue?
>>>>
>>>> Steve
>>>>
>>>> On Fri, Apr 18, 2014 at 10:40 AM, Robert Coli <rcoli@eventbrite.com> wrote:
>>>>
>>>>> On Fri, Apr 18, 2014 at 5:05 AM, Phil Burress <philburresseme@gmail.com> wrote:
>>>>>
>>>>>> nodetool netstats shows 84 files. They are all at 100%. Nothing is
>>>>>> showing in Pending or Active for Read Repair Stats.
>>>>>>
>>>>>> I'm assuming this means it's done, but it still shows "JOINING". Is
>>>>>> there an undocumented step I'm missing here? This whole process seems
>>>>>> broken to me.
>>>>>
>>>>> Lately it seems like a lot more people than usual are:
>>>>>
>>>>> 1) using vnodes
>>>>> 2) unable to bootstrap new nodes
>>>>>
>>>>> If I were you, I would likely file a JIRA detailing your negative
>>>>> experience with this core functionality.
>>>>>
>>>>> =Rob
>>>>
>>>> --
>>>> Steve Robenalt
>>>> Software Architect
>>>> HighWire | Stanford University
>>>> 425 Broadway St, Redwood City, CA 94063
>>>> srobenal@stanford.edu
>>>> http://highwire.stanford.edu
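[Editor's note] Since the thread relies on nodetool netstats and nodetool compactionstats output to judge when a bootstrapping node has actually finished joining, here is a minimal polling sketch along those lines. It is not from the original thread: the host, the 60-second interval, and the matched output strings ("Mode: NORMAL", "pending tasks: 0") are assumptions and can differ between Cassandra versions, so adapt them to what your nodetool actually prints.

    #!/usr/bin/env python3
    # Sketch (assumptions noted): poll a joining Cassandra node until
    # nodetool reports it has left JOINING and its compaction/index-build
    # backlog has drained. Assumes `nodetool` is on PATH, that we poll the
    # local node, and that the matched strings below match this Cassandra
    # version's output.
    import subprocess
    import time

    NODETOOL = "nodetool"   # assumption: nodetool is on PATH
    HOST = "127.0.0.1"      # assumption: run on (or pointed at) the joining node

    def nodetool(*args):
        """Run a nodetool subcommand and return its stdout as text."""
        return subprocess.check_output([NODETOOL, "-h", HOST] + list(args),
                                       universal_newlines=True)

    while True:
        netstats = nodetool("netstats")            # streaming state + node mode
        compactions = nodetool("compactionstats")  # pending compactions / index builds
        joined = "Mode: NORMAL" in netstats        # still "Mode: JOINING" while bootstrapping
        drained = "pending tasks: 0" in compactions
        print(time.strftime("%H:%M:%S"),
              "joined:", joined, "| compaction backlog drained:", drained)
        if joined and drained:
            break
        time.sleep(60)

If the node never leaves JOINING even though netstats shows every stream at 100%, a log of this kind of polling output would be useful evidence for the JIRA ticket Rob suggests above.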