From: Paulo Motta
Date: Fri, 4 Oct 2013 15:25:27 -0300
Subject: Re: Increased read timeouts during rolling upgrade to C* 1.2
To: user@cassandra.apache.org

One more piece of information to help troubleshoot the issue:

During the "nodetool drain" operation just before the upgrade, instead of only ceasing to accept new writes, the node actually shuts itself down. This bug was also reported in this other thread:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201303.mbox/%3CCAFDWQMTrYm7hBxXKoW8+eVKfNE6zvjW2h8_BSVGmOL7=gRDtLw@mail.gmail.com%3E

Since I started Cassandra 1.2 only a few seconds before Cassandra 1.1 died (after the nodetool drain), I'm afraid there wasn't sufficient time for the remaining nodes to update their metadata about the "downed" node. When the upgraded node came up, the metadata on the other nodes was still referring to the previous version of the same node, which may have caused the handshake problem and, consequently, the read timeouts.

Does that theory make sense?

2013/10/4 Robert Coli

> On Fri, Oct 4, 2013 at 9:09 AM, Paulo Motta wrote:
>
>> I manually tried to insert and retrieve some data into both the newly
>> upgraded nodes and the old nodes, and the behavior was very unstable:
>> sometimes it worked, sometimes it didn't (TimedOutException), so I don't
>> think it was a network problem.
>>
>> The number of read timeouts diminished as the number of upgraded nodes
>> increased, until it reached stability. The logs were showing the following
>> messages periodically:
>>
>> ...
>
>> Two similar issues were reported, but without satisfactory responses:
>>
>> - http://stackoverflow.com/questions/15355115/rolling-upgrade-for-cassandra-1-0-9-cluster-to-1-2-1
>> - https://issues.apache.org/jira/browse/CASSANDRA-5740
>
> Both of these issues relate to upgrading from 1._0_.x to 1.2.x, which is
> not supported.
>
> Were I you, I would summarize the above experience in a JIRA ticket, as
> 1.1.x to 1.2.x should be a supported operation and should not unexpectedly
> result in decreased availability during the upgrade.
>
> =Rob

-- 
Paulo Ricardo

-- 
European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com
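
A minimal sketch of the workaround the theory in this message would suggest: after draining and stopping the old node, wait until the remaining nodes report it as down before starting the upgraded version. The names `wait_until`, `upgrade_node` and `peers_see_node_down` are hypothetical and do not come from the thread. How the peers are queried (for example by inspecting `nodetool ring` output on another node) is left to the caller, and whether waiting actually avoids the read timeouts is an untested assumption.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout_s: float = 120.0,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True if the predicate succeeded, False on timeout.
    The predicate is always evaluated at least once.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def upgrade_node(drain_node: Callable[[], None],
                 stop_old: Callable[[], None],
                 peers_see_node_down: Callable[[], bool],
                 start_new: Callable[[], None],
                 **wait_kwargs) -> None:
    """One step of a rolling upgrade, with an explicit wait in the middle.

    All four callables are supplied by the caller (assumptions, not a
    Cassandra API): `drain_node` would typically run `nodetool drain`,
    and `peers_see_node_down` would check another node's view of the ring.
    """
    drain_node()
    stop_old()
    # Give the rest of the cluster time to notice that the old node is gone.
    if not wait_until(peers_see_node_down, **wait_kwargs):
        raise RuntimeError(
            "peers never marked the node as down; "
            "not starting the upgraded version")
    start_new()
```

The wait is deliberately bounded: if the peers never report the node as down, the step fails loudly instead of starting the upgraded node into the same situation described above.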