From: Arindam Barua
To: user@cassandra.apache.org
Subject: Problems with node rejoining cluster
Date: Tue, 25 Jun 2013 06:19:08 +0000
Message-ID: <17C39FE466076C46B6E83F129C7B19CE2E7E6AD1@HKXPRD0310MB352.apcprd03.prod.outlook.com>

 

We need to do a rolling upgrade of our Cassandra cluster in production, since we are moving Cassandra from Solaris to CentOS.

(We went with Solaris initially since most of our other hosts in production are Solaris, but we were running into some lockup issues during perf tests and decided to switch to Linux.)

 

Here are the steps we are following to take a node out of service and get it back. Can someone comment on whether we are missing anything (e.g. is it recommended to specify tokens in cassandra.yaml, or to do something different with the seed hosts than mentioned below)?

1. Run nodetool decommission and wait for the data to be streamed out.

2. Re-image the host to CentOS (everything is wiped off the disks), with the same Cassandra version.

3. Get Cassandra back up.
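
The steps above could be scripted roughly as follows. This is only a sketch: the helper names (nodetool_argv, decommission_plan) and the host names are hypothetical, and the code only builds the nodetool command lines (decommission, netstats, ring) rather than running them. The re-image step is outside its scope.

```python
"""Sketch: build the nodetool invocations for taking one node out of the ring.

Assumptions (not from the original post): the helper names and example hosts
are hypothetical, and nodetool is on PATH wherever the commands are run.
"""
from typing import List


def nodetool_argv(host: str, command: str) -> List[str]:
    """Return the argv list for `nodetool -h <host> <command>`."""
    return ["nodetool", "-h", host, command]


def decommission_plan(host: str, peer: str) -> List[List[str]]:
    """Commands for step 1, plus two checks on its progress and result.

    `host` is the node being removed; `peer` is any node that stays up.
    """
    return [
        # Step 1: stream this node's data to the remaining nodes.
        nodetool_argv(host, "decommission"),
        # Can be run from a second shell to watch the outgoing streams.
        nodetool_argv(host, "netstats"),
        # Ask a live peer for the ring to confirm the node has left.
        nodetool_argv(peer, "ring"),
    ]


if __name__ == "__main__":
    for argv in decommission_plan("node4.example.com", "node1.example.com"):
        print(" ".join(argv))
```

The commands are returned as argv lists so that they could be handed to subprocess.run unchanged once the plan looks right.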

 

Other details:

- Using Cassandra 1.1.5.

- We do not specify any tokens in cassandra.yaml, relying on bootstrap to assign the tokens automatically.

- We are testing with a 4-node cluster with only one seed host. The seed host is specified in the cassandra.yaml of each node and is not changed at any point.
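
For reference on the token question: if explicit tokens were used instead of automatic assignment, evenly spaced values for a 4-node cluster could be computed as in the sketch below. It assumes the RandomPartitioner (token range 0 to 2**127), which is not stated above, and the function name is hypothetical. Each value would go into the initial_token setting in one node's cassandra.yaml.

```python
"""Sketch: evenly spaced initial tokens, assuming RandomPartitioner.

Assumption (not from the original post): the cluster uses RandomPartitioner,
whose token space is 0 .. 2**127. The function name is hypothetical.
"""
from typing import List

RANDOM_PARTITIONER_RANGE = 2 ** 127


def balanced_tokens(node_count: int) -> List[int]:
    """Return one token per node, spaced evenly over the token range."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = RANDOM_PARTITIONER_RANGE // node_count
    return [i * step for i in range(node_count)]


if __name__ == "__main__":
    # One line per node of the 4-node test cluster described above.
    for index, token in enumerate(balanced_tokens(4)):
        print(f"node {index}: initial_token: {token}")
```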

 

While testing the Solaris to Linux upgrade path, things seem to work smoothly. The data streams out fine, and streams back in when the node comes back up. However, while testing the Linux to Solaris path (in case we need to roll back), we are facing some issues with nodes rejoining the ring. nodetool indicates that the node has rejoined the ring, but no data streams in, the node doesn't know about the keyspaces/column families, etc. We see some errors in the logs of the newly added nodes, pasted below.

 

[17/06/2013:14:10:17 PDT] MutationStage:1: ERROR RowMutationVerbHandler.java (line 61) Error in row mutation
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1020
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:126)
        at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:439)
        at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:447)
        at org.apache.cassandra.db.RowMutation.fromBytes(RowMutation.java:395)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:42)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
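
To see which column family IDs the rejoined node fails to recognise, and how often, the log could be scanned for this exception. The following is a minimal sketch: the function name is hypothetical, and the pattern is based only on the single message format shown above.

```python
"""Sketch: tally the cfIds reported by UnknownColumnFamilyException lines.

Assumption: log lines match the message format pasted above; the function
name is hypothetical.
"""
import re
from collections import Counter
from typing import Iterable

CFID_PATTERN = re.compile(
    r"UnknownColumnFamilyException: Couldn't find cfId=(\d+)"
)


def count_unknown_cfids(lines: Iterable[str]) -> Counter:
    """Return a Counter mapping each unknown cfId to its number of occurrences."""
    counts: Counter = Counter()
    for line in lines:
        match = CFID_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1020",
        "        at org.apache.cassandra.db.RowMutation.fromBytes(RowMutation.java:395)",
    ]
    for cfid, hits in count_unknown_cfids(sample).items():
        print(f"cfId {cfid}: {hits} occurrence(s)")
```

If many different cfIds show up, that would be consistent with the node not having received the schema at all, rather than missing a single column family.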

 

Thanks,

Arindam
