From: Carl Mueller
Date: Wed, 17 Apr 2019 12:26:47 -0500
Subject: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow
To: user@cassandra.apache.org

We are doing a ton of upgrades to get out of 2.1.x. We've done probably 20-30 clusters so far and have not encountered anything like this yet.

After a node is upgraded, the restart takes a long time, around 10 minutes. Almost all of our other nodes took less than 2 minutes to upgrade (aside from sstableupgrades).

The startup stalls on a particular table. It is the largest table, at about 300GB, but we have upgraded other clusters with about that much data without this 8-10 minute delay. We have the ability to roll back the node, and the restart as a 2.1.x node is normal, with no delays.

Alas, this is a prod cluster, so we are going to try to sstable load the data on a lower environment and try to replicate the delay. If we can, we will turn on debug logging.

This occurred on the first node we tried to upgrade. It is possible it is limited to only this node, but we are gun-shy about playing around with upgrades in prod.

We have an automated upgrading program that flushes, snapshots, shuts down gossip, drains before upgrade, and suppresses autostart on upgrade. It has worked about as flawlessly as one could hope for so far for 2.1 -> 2.2 and 2.2 -> 3.11 upgrades.

INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - Initializing zzzz.access_token
INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - Initializing zzzz.refresh_token
INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - Initializing zzzz.userid
INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - Initializing zzzz.access_token_by_auth

You can see the roughly 6.5-minute delay in the startup log above, between the zzzz.refresh_token and zzzz.userid lines. All the other keyspace/tables initialize in under a second.
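For anyone wanting to spot this kind of stall in a longer startup log, here is a minimal sketch that measures the time between consecutive "Initializing" lines. The function names (init_gaps, slowest) are hypothetical, and it assumes that tables are initialized one after another on the main thread, so the gap before the next line is attributed to the table logged first. The sample input is the four log lines from the message.

```python
from datetime import datetime
import re

# Startup log lines copied from the message.
LOG = """\
INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - Initializing zzzz.access_token
INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - Initializing zzzz.refresh_token
INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - Initializing zzzz.userid
INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - Initializing zzzz.access_token_by_auth
"""

# Captures the timestamp and the keyspace.table name of each "Initializing" line.
LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .* - Initializing (\S+)"
)


def init_gaps(log_text):
    """Return (table, seconds until the next 'Initializing' line) pairs.

    Assumption: the time between two lines belongs to the table named on
    the earlier line. The last table has no following line and is omitted.
    """
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, match.group(2)))
    return [
        (table, (next_ts - ts).total_seconds())
        for (ts, table), (next_ts, _) in zip(events, events[1:])
    ]


def slowest(log_text):
    """Return the (table, seconds) pair with the largest gap."""
    return max(init_gaps(log_text), key=lambda pair: pair[1])


if __name__ == "__main__":
    for table, seconds in init_gaps(LOG):
        print(f"{table}: {seconds:.3f}s")
```

On the excerpt above, the largest gap is attributed to zzzz.refresh_token, at about 395.8 seconds (roughly 6 minutes 36 seconds); the other gaps are well under a second.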