From user-return-63691-archive-asf-public=cust-asf.ponee.io@cassandra.apache.org Wed Apr 17 17:32:40 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id C2490180672 for ; Wed, 17 Apr 2019 19:32:39 +0200 (CEST) Received: (qmail 65123 invoked by uid 500); 17 Apr 2019 17:32:24 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 65113 invoked by uid 99); 17 Apr 2019 17:32:24 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Apr 2019 17:32:24 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 7A5CAC23A4 for ; Wed, 17 Apr 2019 17:32:23 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.8 X-Spam-Level: * X-Spam-Status: No, score=1.8 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id 4vgxmxrHoD7Q for ; Wed, 17 Apr 2019 17:32:21 +0000 (UTC) Received: from mail-ua1-f52.google.com (mail-ua1-f52.google.com [209.85.222.52]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 4B54F5FDF0 for ; Wed, 17 Apr 2019 17:32:21 +0000 (UTC) Received: by mail-ua1-f52.google.com with SMTP id c6so8188668uan.1 for ; Wed, 17 Apr 2019 10:32:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=5f17zxCJJs8Dum3sMNEe4JX4wy2/bwOemQlvOXWKwnw=; b=ssypNaDsrkvoFNajyfbqJ+rO+zgaO56J49N6b5hYxwjd/3oBaNCIwOjt/pXDNXA9LK 0febjP2AKpfFgZe5kC3jrHgsg/8Wicpuu/DpvZnl4erkuKJs0XBfV0z1vrye8WxyXCWA I54/S0zj6RpJ9Grp4mPmYkA2fsabawvZTyNgbVKNLiYkhGDa+nSZPDZQ173p2Fe/KW6o bVSrKeQlKnowJpMlD/dimfowfLGEYvi9AM3zMTZZrL3JNPoqHcT/Kk4vUU4pu+djAcUS MmiB7QOHJqYN7M4PtvGm2uURxFUtSuXuzggWVME2TvnSHYCHim0aNEsn5kG3bDGNUHsY quSw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=5f17zxCJJs8Dum3sMNEe4JX4wy2/bwOemQlvOXWKwnw=; b=Py1dG3SFDfBKfDBue+YoWt315D+554mdWLcNXhTCwXMIbVv/dmc2cG1BxeGVRpSKBF 7rpkZleg63Xy/von5aAo7lSLFkZPD/jrVK6rxM5VKm1CslT9f0EMLgHO+qQ9fmNppoy2 Txe8ybrvXPvAe3OZgP2rvCUW/rjUU0WR8LJvQIIOhvE9WgpE0x/Pl7M8kHbxZ4Fnv4y6 7t9uhIU2rLPfgDvlxY1xCzDN2hAMxH6dU2jXaMFsvjXekiOL2nd/kSyhhl0Jo9Lm/tcV jjdHye4cjiTZKm8UUWlk21E07OmCYLbWAuwJtw9joElltZREaEolEc60/QcxDaDEfGjq 8D/A== X-Gm-Message-State: APjAAAVkxXXyohFOqKLR/BiI5EsJQ0H0xQO2itDZK85czgzYwjxiAeqS 0KZLoXlCJmRiPdAR72iQ7fXYd4LYuZs0CMM7NwP0OQ== X-Google-Smtp-Source: APXvYqwEYiRHGI8QlP8InjgEoAZReIPuigf8OqFDou9k1LxiP6zZYVZ1Yq9BgVzzq4CyubbZHA02t2xFdYmGMHRgZfg= X-Received: by 2002:ab0:3445:: with SMTP id a5mr686326uaq.136.1555522335188; Wed, 17 Apr 2019 10:32:15 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Jeff Jirsa Date: Wed, 17 Apr 2019 10:32:04 -0700 Message-ID: Subject: Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow To: cassandra Content-Type: multipart/alternative; boundary="0000000000003aab440586bd44ef" --0000000000003aab440586bd44ef Content-Type: text/plain; charset="UTF-8" There was a time when changing some of the parameters (especially bloom filter FP ratio) would cause the bloom filters to be rebuilt on startup if the sstables didnt match what was in the schema, leading to a delay like that and similar logs. Any chance you changed the schema on that table since the last time you restarted it? On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller wrote: > Oh, the table in question is SizeTiered, had about 10 sstables total, it > was JBOD across two data directories. > > On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller < > carl.mueller@smartthings.com> wrote: > >> We are doing a ton of upgrades to get out of 2.1.x. We've done probably >> 20-30 clusters so far and have not encountered anything like this yet. >> >> After upgrade of a node, the restart takes a long time. like 10 minutes >> long. ALmost all of our other nodes took less than 2 minutes to upgrade >> (aside from sstableupgrades). >> >> The startup stalls on a particular table, it is the largest table at >> about 300GB, but we have upgraded other clusters with about that much data >> without this 8-10 minute delay. We have the ability to roll back the node, >> and the restart as a 2.1.x node is normal with no delays. >> >> Alas this is a prod cluster so we are going to try to sstable load the >> data on a lower environment and try to replicate the delay. If we can, we >> will turn on debug logging. >> >> This occurred on the first node we tried to upgrade. It is possible it is >> limited to only this node, but we are gunshy to play around with upgrades >> in prod. >> >> We have an automated upgrading program that flushes, snapshots, shuts >> down gossip, drains before upgrade, suppressed autostart on upgrade, and >> has worked about as flawlessly as one could hope for so far for 2.1->2.2 >> and 2.2-> 3.11 upgrades. >> >> INFO [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - >> Initializing zzzz.access_token >> INFO [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - >> Initializing zzzz.refresh_token >> INFO [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - >> Initializing zzzz.userid >> INFO [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - >> Initializing zzzz.access_token_by_auth >> >> You can see the 6:30 delay in the startup log above. All the other >> keyspace/tables initialize in under a second. >> >> >> --0000000000003aab440586bd44ef Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
There was a time when changing some of the parameters (esp= ecially bloom filter FP ratio) would cause the bloom filters to be rebuilt = on startup if the sstables didnt match what was in the schema, leading to a= delay like that and similar logs. Any chance you changed the schema on tha= t table since the last time you restarted it?=C2=A0


=

On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller <carl.mueller@smartt= hings.com.invalid> wrote:
Oh, the table in question is SizeTiered, had = about 10 sstables total, it was JBOD across two data directories.=C2=A0
On W= ed, Apr 17, 2019 at 12:26 PM Carl Mueller <carl.mueller@smartthings.com> w= rote:
We are doing a ton of upgrades to get out of 2.1.= x. We've done probably 20-30 clusters so far and have not encountered a= nything like this yet.

After upgrade of a node, the restart takes a = long time. like 10 minutes long. ALmost all of our other nodes took less th= an 2 minutes to upgrade (aside from sstableupgrades).=C2=A0

The star= tup stalls on a particular table, it is the largest table at about 300GB, b= ut we have upgraded other clusters with about that much data without this 8= -10 minute delay. We have the ability to roll back the node, and the restar= t as a 2.1.x node is normal with no delays.

Alas this is a prod clus= ter so we are going to try to sstable load the data on a lower environment = and try to replicate the delay. If we can, we will turn on debug logging.
This occurred on the first node we tried to upgrade. It is possible i= t is limited to only this node, but we are gunshy to play around with upgra= des in prod.

We have an automated upgrading program that flushes, sn= apshots, shuts down gossip, drains before upgrade, suppressed autostart on = upgrade, and has worked about as flawlessly as one could hope for so far fo= r 2.1->2.2 and 2.2-> 3.11 upgrades.=C2=A0

INFO=C2=A0 [mai= n] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - Initializing zzzz.a= ccess_token
INFO=C2=A0 [main] 2019-04-16 17:22:17,096 ColumnFamil= yStore.java:389 - Initializing zzzz.refresh_token
INFO=C2=A0 [mai= n] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - Initializing zzzz.u= serid
INFO=C2=A0 [main] 2019-04-16 17:28:52,930 ColumnFamilyStore= .java:389 - Initializing zzzz.access_token_by_auth

You can se= e the 6:30 delay in the startup log above. All the other keyspace/tables in= itialize in under a second.


--0000000000003aab440586bd44ef--