From: "Hiller, Dean"
To: "user@cassandra.apache.org"
Date: Thu, 14 Mar 2013 08:27:43 -0600
Subject: Re: Failed migration from 1.1.6 to 1.2.2

You should really be testing this stuff in QA. We had the exact same issue from 1.1.4 to 1.2.2. In QA, we decided we could take an outage, so we tested taking every node down, upgrading every node, and bringing the cluster back online. This worked perfectly, so we rolled it into production... production took 45 minutes to start for us (especially one node under pressure)... that was only initially though... now everything seems fine. Another option in QA would have been to test upgrading to 1.1.9 first and then to 1.2.2. I have no idea if it will work, but I am sure they test closer release scenarios on upgrading more so than the big jump releases.

Aaron, it would be really neat if some releases were tagged with LT (long term) or something, so upgrades are tested from LT to LT releases and we know we can always safely first upgrade to an LT release and then upgrade to another LT release from that one... just a thought. This would also get more people using/testing the same upgrade paths, which would help everyone.

Dean

From: Alain RODRIGUEZ
Reply-To: "user@cassandra.apache.org"
Date: Thursday, March 14, 2013 5:31 AM
To: "user@cassandra.apache.org"
Subject: Re: Failed migration from 1.1.6 to 1.2.2

We have it set to 0.0.0.0 but anyway, as I said before, I don't think our problem comes from this bug.

2013/3/14 Michal Michalski

It will happen if your rpc_address is set to 0.0.0.0.

Oops, that's not what I meant ;-)
It will happen if your rpc_address is set to an IP that is not defined in your cluster's config (e.g. in cassandra-topology.properties for PropertyFileSnitch).

M.

On 14.03.2013 13:03, Alain RODRIGUEZ wrote:

Thanks for this pointer, but I don't think this is the source of our problem since we use 1 data center and Ec2Snitch.

2013/3/14 Jean-Armel Luce

Hi Alain,

Maybe it is due to https://issues.apache.org/jira/browse/CASSANDRA-5299

A patch is provided with this ticket.

Regards.

Jean Armel

2013/3/14 Alain RODRIGUEZ

Hi

We just tried to migrate our production cluster from C* 1.1.6 to 1.2.2.

This has been a disaster. I just switched one node to 1.2.2, updated its configuration (cassandra.yaml / cassandra-env.sh) and restarted it.

This resulted in errors on all 5 remaining 1.1.6 nodes:

ERROR [RequestResponseStage:2] 2013-03-14 09:53:25,750 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[RequestResponseStage:2,5,main]
java.io.IOError: java.io.EOFException
    at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
    at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:155)
    at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
    at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
    at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
    ... 6 more

I got this error many times, and the entire cluster was unreachable from our 4 clients (phpCassa, Hector, Cassie, Helenus).

I decommissioned the 1.2.2 node to get our cluster answering queries. It worked.

Then I tried to replace this node with a new C* 1.1.6 one, using the same token as the decommissioned node. The node joined the ring and switched to normal status before getting any data.

On all the other nodes I had:

ERROR [MutationStage:8] 2013-03-14 10:21:01,288 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[MutationStage:8,5,main]
java.lang.AssertionError
    at org.apache.cassandra.locator.TokenMetadata.getToken(TokenMetadata.java:304)
    at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:371)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

So I decommissioned this new 1.1.6 node, and we are now running with 5 servers, not balanced along the ring, with no possibility of adding nodes or upgrading the C* version.

We are quite desperate over here.

If someone has any idea of what could have happened and how to stabilize the cluster, it would be very appreciated.

It's quite an emergency since we can't add nodes and are under heavy load.
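
For anyone who does run PropertyFileSnitch, the condition Michal describes earlier in the thread (an rpc_address that is not listed in cassandra-topology.properties) could be checked before an upgrade with something like the sketch below. This is only an illustration, not a diagnosis of the errors reported here: it assumes a plain "rpc_address: value" line in cassandra.yaml and "ip=DC:RACK" entries in the properties file, and the function names are made up for the example.

```python
import re


def read_rpc_address(yaml_text):
    """Return the rpc_address value from cassandra.yaml text, or None if unset.

    Assumption: the setting appears as an uncommented top-level line of the
    form 'rpc_address: <value>'. This is not a full YAML parser.
    """
    for line in yaml_text.splitlines():
        match = re.match(r"^rpc_address:\s*(\S*)", line)
        if match:
            return match.group(1).strip("'\"") or None
    return None


def topology_hosts(properties_text):
    """Collect the host keys from cassandra-topology.properties text.

    Assumption: entries look like '<ip>=<dc>:<rack>'; comment lines start
    with '#', and the 'default' entry is not a host.
    """
    hosts = set()
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key != "default":
            hosts.add(key)
    return hosts


def rpc_address_is_listed(yaml_text, properties_text):
    """True if rpc_address is set and appears as a host in the topology file."""
    address = read_rpc_address(yaml_text)
    return address is not None and address in topology_hosts(properties_text)
```

With rpc_address set to 0.0.0.0 and a topology file listing only real node IPs, rpc_address_is_listed returns False, which is the situation Michal describes. It does not apply to clusters using Ec2Snitch, such as Alain's.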