From user-return-63692-archive-asf-public=cust-asf.ponee.io@cassandra.apache.org Wed Apr 17 18:31:29 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 6551C180672 for ; Wed, 17 Apr 2019 20:31:29 +0200 (CEST) Received: (qmail 49062 invoked by uid 500); 17 Apr 2019 18:31:25 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 49051 invoked by uid 99); 17 Apr 2019 18:31:25 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Apr 2019 18:31:25 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 84AA1C256D for ; Wed, 17 Apr 2019 18:31:24 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.8 X-Spam-Level: * X-Spam-Status: No, score=1.8 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=smartthings.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id 5es5r_9dwABc for ; Wed, 17 Apr 2019 18:31:22 +0000 (UTC) Received: from mail-ed1-f44.google.com (mail-ed1-f44.google.com [209.85.208.44]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 99A285F4D3 for ; Wed, 17 Apr 2019 18:31:21 +0000 (UTC) Received: by mail-ed1-f44.google.com with SMTP id f53so20272322ede.4 for ; Wed, 17 Apr 2019 11:31:21 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=3t4Ukc0lWfOYTHhn94zPehIQpf4aynEHMKLZwo/nHJg=; b=AhFqJ4g9uUs5xQ4exDim7D9KtSsQXM0Prz1lwe0UWU55eu20K4FZSeDtK3xM/wvVH7 ghvhP9kDz0wzyqo32xc19I7SXwONXAbkbWwXMVR5jut8tTfIHHoaQO5tN69CrZ8uvIRd KbvjrgGr1V5qi1Ov24lqiA5bx6wdupIEc6vBjNlQjv1OJfvUDTqsh9nVt+8WDS3wJ8Tb byMTiZ6fm+rshKAK90uuBb/Asr8OdV6qFUJIR5sza2tbDNr/IibB+MuK+vwktGd3AW+A OgIz+MQe/6Hd9MuBTroCgVmdGYs2sCyZz/PW1Xb74kOQaR/LO/pFn3rYMwKbqsWBI4r3 kdug== X-Gm-Message-State: APjAAAX4MBoO/KDNsFS/wmKvA0NXI27KMYdavsPmkWgHOo0/57mFKw6A 6IWLGRQSiXlf0t8nSNwgZZHDTgewFFS1dPUnXASSQiUj X-Google-Smtp-Source: APXvYqwoIyZLln3aXzz0UlZHY8L1zaB2lYbezpBjC9lY7gti7b4ENOjZkRguoi6a46qPPDXmo0YAwmrlr0Sm4zajP3Y= X-Received: by 2002:a17:906:16ce:: with SMTP id t14mr48747444ejd.244.1555525881032; Wed, 17 Apr 2019 11:31:21 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Carl Mueller Date: Wed, 17 Apr 2019 13:31:10 -0500 Message-ID: Subject: Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary="00000000000094128f0586be175a" --00000000000094128f0586be175a Content-Type: text/plain; charset="UTF-8" No, we just did the package upgrade 2.1.9 --> 2.2.13 It definitely feels like some indexes are being recalculated or the entire sstables are being scanned due to suspected corruption. On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa wrote: > There was a time when changing some of the parameters (especially bloom > filter FP ratio) would cause the bloom filters to be rebuilt on startup if > the sstables didnt match what was in the schema, leading to a delay like > that and similar logs. Any chance you changed the schema on that table > since the last time you restarted it? > > > > On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller > wrote: > >> Oh, the table in question is SizeTiered, had about 10 sstables total, it >> was JBOD across two data directories. >> >> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller < >> carl.mueller@smartthings.com> wrote: >> >>> We are doing a ton of upgrades to get out of 2.1.x. We've done probably >>> 20-30 clusters so far and have not encountered anything like this yet. >>> >>> After upgrade of a node, the restart takes a long time. like 10 minutes >>> long. ALmost all of our other nodes took less than 2 minutes to upgrade >>> (aside from sstableupgrades). >>> >>> The startup stalls on a particular table, it is the largest table at >>> about 300GB, but we have upgraded other clusters with about that much data >>> without this 8-10 minute delay. We have the ability to roll back the node, >>> and the restart as a 2.1.x node is normal with no delays. >>> >>> Alas this is a prod cluster so we are going to try to sstable load the >>> data on a lower environment and try to replicate the delay. If we can, we >>> will turn on debug logging. >>> >>> This occurred on the first node we tried to upgrade. It is possible it >>> is limited to only this node, but we are gunshy to play around with >>> upgrades in prod. >>> >>> We have an automated upgrading program that flushes, snapshots, shuts >>> down gossip, drains before upgrade, suppressed autostart on upgrade, and >>> has worked about as flawlessly as one could hope for so far for 2.1->2.2 >>> and 2.2-> 3.11 upgrades. >>> >>> INFO [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - >>> Initializing zzzz.access_token >>> INFO [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - >>> Initializing zzzz.refresh_token >>> INFO [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - >>> Initializing zzzz.userid >>> INFO [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - >>> Initializing zzzz.access_token_by_auth >>> >>> You can see the 6:30 delay in the startup log above. All the other >>> keyspace/tables initialize in under a second. >>> >>> >>> --00000000000094128f0586be175a Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
No, we just did the package upgrade 2.1.9 --> 2.2.13
It definitely feels like some indexes are being recalculated or the en= tire sstables are being scanned due to suspected corruption.

=

= On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa <jjirsa@gmail.com> wrote:
There was a time when changing so= me of the parameters (especially bloom filter FP ratio) would cause the blo= om filters to be rebuilt on startup if the sstables didnt match what was in= the schema, leading to a delay like that and similar logs. Any chance you = changed the schema on that table since the last time you restarted it?=C2= =A0



On Wed, Apr 17, 2019 at 10:30 AM Carl Muel= ler <carl.mueller@smartthings.com.invalid> wrote:
Oh, the table in q= uestion is SizeTiered, had about 10 sstables total, it was JBOD across two = data directories.=C2=A0

On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <carl.muelle= r@smartthings.com> wrote:
We are doing a ton of up= grades to get out of 2.1.x. We've done probably 20-30 clusters so far a= nd have not encountered anything like this yet.

After upgrade of a n= ode, the restart takes a long time. like 10 minutes long. ALmost all of our= other nodes took less than 2 minutes to upgrade (aside from sstableupgrade= s).=C2=A0

The startup stalls on a particular table, it is the larges= t table at about 300GB, but we have upgraded other clusters with about that= much data without this 8-10 minute delay. We have the ability to roll back= the node, and the restart as a 2.1.x node is normal with no delays.
Alas this is a prod cluster so we are going to try to sstable load the dat= a on a lower environment and try to replicate the delay. If we can, we will= turn on debug logging.

This occurred on the first node we tried to = upgrade. It is possible it is limited to only this node, but we are gunshy = to play around with upgrades in prod.

We have an automated upgrading= program that flushes, snapshots, shuts down gossip, drains before upgrade,= suppressed autostart on upgrade, and has worked about as flawlessly as one= could hope for so far for 2.1->2.2 and 2.2-> 3.11 upgrades.=C2=A0
INFO=C2=A0 [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:= 389 - Initializing zzzz.access_token
INFO=C2=A0 [main] 2019-04-16= 17:22:17,096 ColumnFamilyStore.java:389 - Initializing zzzz.refresh_token<= /div>
INFO=C2=A0 [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:= 389 - Initializing zzzz.userid
INFO=C2=A0 [main] 2019-04-16 17:28= :52,930 ColumnFamilyStore.java:389 - Initializing zzzz.access_token_by_auth=

You can see the 6:30 delay in the startup log above. All the= other keyspace/tables initialize in under a second.

--00000000000094128f0586be175a--