From: daemeon reiydelle
Date: Fri, 19 Feb 2016 13:46:35 -0800
Subject: Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability
To: user@cassandra.apache.org, Sotirios Delimanolis
Cc: Alain RODRIGUEZ

FYI, my observations were with native, not thrift.

.......
Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Fri, Feb 19, 2016 at 10:12 AM, Sotirios Delimanolis wrote:

> Does your cluster contain 24+ nodes or fewer?
>
> We did the same upgrade on a smaller cluster of 5 nodes and we didn't see
> this behavior. On the 24 node cluster, the timeouts only started once
> ~5-7+ nodes had been upgraded.
>
> We're doing some more upgrades next week, trying different deployment
> plans. I'll report back with the results.
>
> Thanks for the reply (we absolutely want to move to CQL)
>
>
> On Friday, February 19, 2016 1:10 AM, Alain RODRIGUEZ wrote:
>
>
> I performed this exact update a few days ago, except clients were using
> the native protocol, and it went smoothly. So I think this might be Thrift
> related. No idea what is producing this though, just wanted to give the
> info fwiw.
>
> As a side note, unrelated to the issue, performance using native is a
> lot better than Thrift starting in C* 2.1. Drivers using native are also
> more modern, allowing you to do very interesting stuff. Updating to native
> now that you are using 2.1 is something you might want to do soon enough
> :-).
>
> C*heers,
> -----------------
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-19 3:07 GMT+01:00 Sotirios Delimanolis:
>
> We have a Cassandra cluster with 24 nodes. These nodes were running
> 2.0.16.
>
> While the nodes are in the ring and handling queries, we perform the
> upgrade to 2.1.12 as follows (more or less), one node at a time:
>
>    1. Stop the Cassandra process
>    2. Deploy jars, scripts, binaries, etc.
>    3. Start the Cassandra process
>
> A few nodes into the upgrade, we start noticing that the majority of
> queries (mostly through Thrift) time out or report unavailable. Looking at
> system information, Cassandra GC time goes through the roof, which is what
> we assume causes the timeouts.
>
> Once all nodes are upgraded, the cluster stabilizes and barely any
> timeouts occur.
>
> What could explain this? Does it have anything to do with how a 2.0 node
> communicates with a 2.1 node?
>
> Our Cassandra consumers haven't changed.